[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22396 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217926918

--- Diff: docs/sql-programming-guide.md ---
@@ -1897,7 +1897,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 - In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
 - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
 - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
- - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.
+ - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.
+ - Since Spark 2.4, The LOAD DATA command supports wildcard characters ? and *, which match any one character, and zero or more characters, respectively. Example: LOAD DATA INPATH '/tmp/folder*/ or LOAD DATA INPATH /tmp/part-?. Special Characters like spaces also now work in paths. Example: LOAD DATA INPATH /tmp/folder name/.
--- End diff --

The commands and paths should be back-tick-quoted for readability; otherwise they may be interpreted as Markdown.
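The wildcard semantics in the proposed doc text (`?` matches any one character, `*` zero or more) are ordinary shell-style globbing. They can be sanity-checked outside Spark with Python's `fnmatch` — an analogy to, not the actual, Hadoop glob implementation:

```python
from fnmatch import fnmatch

# Shell-style wildcards: '?' = exactly one character, '*' = zero or more.
print(fnmatch("folder1", "folder*"))   # True: '*' matches "1"
print(fnmatch("folder", "folder*"))    # True: '*' matches the empty string
print(fnmatch("part-0", "part-?"))     # True: '?' matches "0"
print(fnmatch("part-10", "part-?"))    # False: '?' cannot match two characters
```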
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217920696

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

@gatorsmile Just used a common encoder (%20) in our example.
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217920417

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

@srowen Sorry Sean, I missed your suggested text; I have updated the message based on your suggestions. I got a bit confused since this PR is a combination of a bug fix and an improvement :)
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217897208

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

Agree, the text should be clear that this is only an example of a character that could work in a path now. It might be the most common one. @sujith71955 see my suggested text above.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217897114

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

Why do we only mention the space character?

```
val p2 = new Path(new URI("a%30b"))
print(p2)
val p = new Path("a%30b")
print(p)
```
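The `Path` snippet above is getting at the fact that percent-escapes are a general URI mechanism — `%30` is the digit `0`, not just `%20` for the space — so doc text that singles out spaces undersells the change. A small illustration with Python's `urllib.parse` standing in for Hadoop's URI decoding:

```python
from urllib.parse import unquote

# Percent-escapes decode arbitrary characters, not just the space:
print(unquote("a%20b"))      # 'a b'     -- %20 is a space
print(unquote("a%30b"))      # 'a0b'     -- %30 is the digit '0'
print(unquote("my%2Afile"))  # 'my*file' -- even a wildcard char can be escaped
```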
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217802972

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

@cloud-fan We follow the same path syntax as older versions for the LOAD command, except that in older versions the user could not use wildcard characters at the folder level of the local filesystem. With the new implementation we now support this, and HDFS supports the same syntax, so behavior is now consistent. All the usages I mentioned apply to both local and HDFS filesystems, making them more consistent than in older versions. For more details please refer to the PR below; let me know if you need any clarifications. Thanks. https://github.com/apache/spark/pull/20611
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217758371

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
--- End diff --

Out of curiosity: do we now have the same path syntax for the local fs and HDFS?
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217722539

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version.
--- End diff --

I think this text doesn't describe the change then. The new functionality is that wildcards work at all levels in both local and remote file systems, right? This also says that `%20` escaping used to work, but your results show it didn't, so that doesn't seem like a change in behavior. The text also has typos and spacing problems.

To be clear, here is the text I suggest: Since Spark 2.4, the `LOAD DATA` command supports wildcard characters `?` and `*`, which match any one character, and zero or more characters, respectively. Example: `LOAD DATA INPATH '/tmp/folder*/'` or `LOAD DATA INPATH '/tmp/part-?'`. Characters like spaces also now work in paths: `LOAD DATA INPATH '/tmp/folder name/'`.
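The "spaces now work in paths" half of the suggested text can be sketched outside Spark with plain filesystem globbing; the directory name here is made up for the demonstration:

```python
import glob
import os
import tempfile

# A directory whose name contains a literal, unescaped space is still
# matched by an ordinary glob pattern -- no %20 encoding involved.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "folder name"))
    hits = glob.glob(os.path.join(root, "folder*"))
    print([os.path.basename(h) for h in hits])  # ['folder name']
```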
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217673140

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version.
--- End diff --

Is it specific to the local file system? << Yes, it's specific to the local filesystem: in HDFS the user could already provide wildcard characters at the folder level, whereas for the local filesystem folder-level support was missing and an error was thrown. >>
Can this text add a quick example of using `?` too? << Yes, I added the same. >>
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217672824

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version.
--- End diff --

@srowen I modified the statement; please recheck and let me know of any suggestions.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217209344

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version.
--- End diff --

Is it specific to the local file system? Can this text add a quick example of using `?` too? This wildcard syntax is not regex syntax. From your previous analysis, spaces in paths didn't work before even if escaped with `%20`. Shouldn't we just say that, additionally, special characters in paths like spaces should work now, and give the example?
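The point that the wildcard syntax is glob, not regex, is easy to make concrete: the same pattern string means different things in the two languages (Python here, purely as illustration):

```python
import fnmatch
import re

# As a glob, '?' means "any single character":
print(fnmatch.fnmatch("part-5", "part-?"))     # True
print(fnmatch.fnmatch("part", "part-?"))       # False: nothing for '?' to match

# As a regex, '?' instead makes the preceding '-' optional:
print(bool(re.fullmatch("part-?", "part")))    # True: the optional '-' is absent
print(bool(re.fullmatch("part-?", "part-5")))  # False: nothing matches the '5'
```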
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217151516

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv).
--- End diff --

Taken care of; please let me know of any further suggestions. Thanks for looking into this.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217008531

--- Diff: docs/sql-programming-guide.md ---
+ - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv).
--- End diff --

We should also mention that the old way of escaping special chars will not work in 2.4.
GitHub user sujith71955 opened a pull request: https://github.com/apache/spark/pull/22396

[SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS path for load table command.

## What changes were proposed in this pull request?
Updated the migration guide for the behavior changes made in JIRA issue SPARK-23425.

## How was this patch tested?
Manually verified.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sujith71955/spark master_newtest
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22396.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22396

commit c875b16e9ebb0ac1702227ca6d24afa9f9f2d1af
Author: s71955
Date: 2018-09-11T14:11:55Z
[SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS path for load table command. Updated the migration guide for the behavior changes made in the JIRA issue SPARK-23425. Manually verified.