[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22396


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-16 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217926918
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1897,7 +1897,8 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - In version 2.3 and earlier, CSV rows are considered as malformed if at 
least one column value in the row is malformed. CSV parser dropped such rows in 
the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 
2.4, CSV row is considered as malformed only when it contains malformed column 
values requested from CSV datasource, other values can be ignored. As an 
example, CSV file contains the "id,name" header and one row "1234". In Spark 
2.4, selection of the id column consists of a row with one column value 1234 
but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore 
the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to 
`false`.
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
-  - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.
+  - Since Spark 2.4, The LOAD DATA command supports wildcard characters ? 
and *, which match any one character, and zero or more characters, 
respectively. Example: LOAD DATA INPATH '/tmp/folder*/ or LOAD DATA INPATH 
/tmp/part-?. Special Characters like spaces also now work in paths. Example: 
LOAD DATA INPATH /tmp/folder name/.
--- End diff --

The commands and paths should be back-tick-quoted for readability. I think 
they may be interpreted as markdown otherwise.


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-16 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217920696
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in 
Older versions space in folder/file names has been represented using '%20'(e.g. 
LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be 
supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space 
character in folder/file names (e.g. LOAD DATA INPATH 
'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. 
(e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

@gatorsmile I just used a common encoding (%20) in our example.


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-16 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217920417
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in 
Older versions space in folder/file names has been represented using '%20'(e.g. 
LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be 
supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space 
character in folder/file names (e.g. LOAD DATA INPATH 
'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. 
(e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

@srowen Sorry Sean, I missed your suggested text; I've updated the message 
based on your suggestions. I got a bit confused because this PR is a 
combination of a bug fix and an improvement :) .


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217897208
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in 
Older versions space in folder/file names has been represented using '%20'(e.g. 
LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be 
supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space 
character in folder/file names (e.g. LOAD DATA INPATH 
'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. 
(e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

Agree, the text should be clear that this is only an example of a character 
that could work in a path now. It might be the most common one. @sujith71955 
see my suggested text above.


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217897114
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in 
Older versions space in folder/file names has been represented using '%20'(e.g. 
LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be 
supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space 
character in folder/file names (e.g. LOAD DATA INPATH 
'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. 
(e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

Why do we only mention the space character? 

```
import java.net.URI
import org.apache.hadoop.fs.Path

// Path built from a URI: percent-escapes in the string are URI-decoded.
val p2 = new Path(new URI("a%30b"))
println(p2)

// Path built from a raw string: the characters are taken literally.
val p = new Path("a%30b")
println(p)
```
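
The point of the snippet is that `%20` is not space-specific: any `%XX` sequence is a generic URI percent-escape (`%30` decodes to the digit `0`). A quick Python sketch of the decoding, independent of Spark or Hadoop:

```python
from urllib.parse import unquote

# Percent-escapes are generic URI encoding, not specific to spaces:
# %30 is the digit '0', %20 is a space.
print(unquote("a%30b"))                             # a0b
print(unquote("tmp/folderName/myFile%20Name.csv"))  # tmp/folderName/myFile Name.csv
```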


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-14 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217802972
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in 
Older versions space in folder/file names has been represented using '%20'(e.g. 
LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be 
supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space 
character in folder/file names (e.g. LOAD DATA INPATH 
'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. 
(e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

@cloud-fan We follow the same syntax as older versions for the LOAD command 
path, except that in older versions the user could not use wildcard characters 
at the folder level of the local filesystem. The new implementation supports 
this, and HDFS supports the same syntax, so the usage is now consistent: 
everything I mentioned applies to both the local and HDFS file systems.

For more details please refer to the PR below, and let me know if you need any 
clarification. Thanks.
https://github.com/apache/spark/pull/20611


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217758371
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in 
Older versions space in folder/file names has been represented using '%20'(e.g. 
LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be 
supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space 
character in folder/file names (e.g. LOAD DATA INPATH 
'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. 
(e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

Out of curiosity: do we now have the same path syntax for the local filesystem and HDFS?


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-14 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217722539
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now 
onwards normal space convention can be used in folder/file names (e.g. LOAD 
DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file 
names has been represented using '%20'(e.g. LOAD DATA INPATH 
'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 
2.4 version.
--- End diff --

I think this text doesn't describe the change then. The new functionality 
is that wildcards work at all levels in both local and remote file systems, 
right? This also says that `%20` escaping used to work, but your results show 
it didn't, so that doesn't seem like a change in behavior. This text also has 
typos and spacing problems. To be clear, here is the text I suggest:

Since Spark 2.4, the `LOAD DATA` command supports wildcard characters `?` 
and `*`, which match any one character, and zero or more characters, 
respectively. Example: `LOAD DATA INPATH '/tmp/folder*/'` or `LOAD DATA INPATH 
'/tmp/part-?'`. Characters like spaces also now work in paths: `LOAD DATA INPATH 
'/tmp/folder name/'`.
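
The suggested semantics match ordinary filesystem globbing (the linked PR #20611 routes these paths through Hadoop's path globbing). A minimal Python sketch — using the standard `glob` module as an analogy, not Spark's actual code path — of how `*` and `?` select paths:

```python
import glob
import os
import tempfile

# Build a throwaway layout: folder1/ and folder2/, each with one part file.
root = tempfile.mkdtemp()
for d in ("folder1", "folder2"):
    os.makedirs(os.path.join(root, d))
    open(os.path.join(root, d, "part-0"), "w").close()

# '*' matches zero or more characters within one path component...
folders = sorted(glob.glob(os.path.join(root, "folder*")))
print([os.path.basename(p) for p in folders])  # ['folder1', 'folder2']

# ...and '?' matches exactly one character, at any level of the path.
parts = glob.glob(os.path.join(root, "folder?", "part-?"))
print(len(parts))  # 2
```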


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-14 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217673140
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now 
onwards normal space convention can be used in folder/file names (e.g. LOAD 
DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file 
names has been represented using '%20'(e.g. LOAD DATA INPATH 
'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 
2.4 version.
--- End diff --

Is it specific to the local file system? << Yes, it's specific to the local 
file system: in HDFS the user could already provide wildcard characters at the 
folder level, but for the local file system folder-level support was missing 
and an error was thrown. >>
Can this text add a quick example of using ? too? << Yes, I added the same. >>



---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-14 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217672824
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now 
onwards normal space convention can be used in folder/file names (e.g. LOAD 
DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file 
names has been represented using '%20'(e.g. LOAD DATA INPATH 
'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 
2.4 version.
--- End diff --

@srowen I modified the statement; please recheck it and let me know if you 
have any suggestions.


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-12 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217209344
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now 
onwards normal space convention can be used in folder/file names (e.g. LOAD 
DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file 
names has been represented using '%20'(e.g. LOAD DATA INPATH 
'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 
2.4 version.
--- End diff --

Is it specific to the local file system?
Can this text add a quick example of using `?` too? This wildcard syntax is 
not regex syntax.
From your previous analysis, spaces in paths didn't work before even if 
escaped with `%20`. Shouldn't we just say that, additionally, special 
characters in paths like spaces now work, and give the example?
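
On the "not regex syntax" point: in glob patterns `?` matches exactly one character, whereas in a regex `?` makes the preceding atom optional. A small illustration using Python's `fnmatch` for glob-style matching — an analogy only, not Spark's implementation:

```python
import fnmatch
import re

# Glob: '?' matches exactly one character.
print(fnmatch.fnmatch("part-1", "part-?"))   # True
print(fnmatch.fnmatch("part-12", "part-?"))  # False: '?' is one char, not two
print(fnmatch.fnmatch("part-", "part-?"))    # False: '?' must match something

# Regex: '?' instead makes the preceding '-' optional.
print(bool(re.fullmatch("part-?", "part")))  # True
```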


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-12 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217151516
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now 
onwards normal space convention can be used in folder/file names (e.g. LOAD 
DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file 
names has been represented using '%20'(e.g. LOAD DATA INPATH 
'tmp/folderName/myFile%20Name.csv).
--- End diff --

Taken care of; please let me know if you have any further suggestions. Thanks 
for looking into this.


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22396#discussion_r217008531
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
+  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now 
onwards normal space convention can be used in folder/file names (e.g. LOAD 
DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file 
names has been represented using '%20'(e.g. LOAD DATA INPATH 
'tmp/folderName/myFile%20Name.csv).
--- End diff --

We should also mention that the old way of escaping special characters will 
not work in 2.4.


---




[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

2018-09-11 Thread sujith71955
GitHub user sujith71955 opened a pull request:

https://github.com/apache/spark/pull/22396

[SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS path for load table 
command.


What changes were proposed in this pull request
Updated the Migration guide for the behavior changes done in the JIRA issue 
SPARK-23425.

How was this patch tested?
Manually verified.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sujith71955/spark master_newtest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22396.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22396


commit c875b16e9ebb0ac1702227ca6d24afa9f9f2d1af
Author: s71955 
Date:   2018-09-11T14:11:55Z

[SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS path for load table 
command.

What changes were proposed in this pull request
Updated the Migration guide for the behavior changes done in the JIRA issue 
SPARK-23425.

How was this patch tested?
Manually verified.




---
