[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19247 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user xysun commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r161932613 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala --- @@ -86,6 +86,18 @@ abstract class FileStreamSourceTest } } + case class AddTextFileDataWithSpaceInFileName(content: String, src: File, tmp: File) --- End diff -- sure will update and add the test for file sink. Thanks for the review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r161907322 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala --- @@ -408,6 +420,18 @@ class FileStreamSourceSuite extends FileStreamSourceTest { } } + test("SPARK-21996 read from text files -- file name has space") { --- End diff -- Not. I meant it's not an issue of file formats. There are not some special codes in file stream source. If there should be any tests for such issue, they should be inside file format tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r161906360 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala --- @@ -408,6 +420,18 @@ class FileStreamSourceSuite extends FileStreamSourceTest { } } + test("SPARK-21996 read from text files -- file name has space") { --- End diff -- yes, for this PR it is, but it would be great if we can ensure that all the data sources have the same behavior... Maybe we can do this is another PR if you think it is better --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r161893771 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala --- @@ -408,6 +420,18 @@ class FileStreamSourceSuite extends FileStreamSourceTest { } } + test("SPARK-21996 read from text files -- file name has space") { --- End diff -- this test should be enough. The issue is in file stream source. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r161892343 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala --- @@ -86,6 +86,18 @@ abstract class FileStreamSourceTest } } + case class AddTextFileDataWithSpaceInFileName(content: String, src: File, tmp: File) --- End diff -- I would suggest that adding a new parameter to AddTextFileData rather than introducing a new class, such as ``` case class AddTextFileData(content: String, src: File, tmp: File, tempFilePrefix: String = "text") ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r161452270 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala --- @@ -408,6 +420,18 @@ class FileStreamSourceSuite extends FileStreamSourceTest { } } + test("SPARK-21996 read from text files -- file name has space") { --- End diff -- can we run the same test also for the other input format, ie. parquet, orc, ... ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user xysun commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r161430294 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -233,7 +233,7 @@ class FileStreamSource( } val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map { status => - (status.getPath.toUri.toString, status.getModificationTime) + (status.getPath.toUri.getPath, status.getModificationTime) --- End diff -- Thanks. I'll update the commit. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r161387225 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -233,7 +233,7 @@ class FileStreamSource( } val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map { status => - (status.getPath.toUri.toString, status.getModificationTime) + (status.getPath.toUri.getPath, status.getModificationTime) --- End diff -- Ohaa, `getPath` from `java.net.URI` in the proposed fix drops them. Sure. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r161387193 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -233,7 +233,7 @@ class FileStreamSource( } val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map { status => - (status.getPath.toUri.toString, status.getModificationTime) + (status.getPath.toUri.getPath, status.getModificationTime) --- End diff -- Wait .. @zsxwing do you mean `getPath` (`org.apache.hadoop.fs.Path`) from `org.apache.hadoop.fs.FileStatus` drops the scheme and credentials? Here seems `Seq[org.apache.hadoop.fs.FileStatus]` and the code you pointed out looks `Array[org.apache.spark.sql.execution.streaming.FileStreamSource.FileEntry]`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r161353643 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -233,7 +233,7 @@ class FileStreamSource( } val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map { status => - (status.getPath.toUri.toString, status.getModificationTime) + (status.getPath.toUri.getPath, status.getModificationTime) --- End diff -- FYI, `getPath` will drop scheme and credentials. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r161352304 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -233,7 +233,7 @@ class FileStreamSource( } val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map { status => - (status.getPath.toUri.toString, status.getModificationTime) + (status.getPath.toUri.getPath, status.getModificationTime) --- End diff -- The correct fix is fixing this line: https://github.com/xysun/spark/blob/4f5979a72ce9cb36a3327e79b8592b9e42bdf5af/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L168 It should be `files.map(new Path(new URI(_.path)).toString)`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r159893158 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -233,7 +233,7 @@ class FileStreamSource( } val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map { status => - (status.getPath.toUri.toString, status.getModificationTime) + (status.getPath.toUri.getPath, status.getModificationTime) --- End diff -- ping @xysun ^ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r142136060 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -233,7 +233,7 @@ class FileStreamSource( } val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map { status => - (status.getPath.toUri.toString, status.getModificationTime) + (status.getPath.toUri.getPath, status.getModificationTime) --- End diff -- why not `status.getPath.toString`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
GitHub user xysun opened a pull request: https://github.com/apache/spark/pull/19247 [Spark-21996][SQL] read files with space in name for streaming ## What changes were proposed in this pull request? Structured streaming is now able to read files with space in file name (previously it would skip the file and output a warning) ## How was this patch tested? Added new unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/xysun/spark SPARK-21996 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19247.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19247 commit 38901ef03048d6173eabe1973897e8c556a51709 Author: Xiayun Sun Date: 2017-09-15T07:51:06Z add test to reproduce commit 4f5979a72ce9cb36a3327e79b8592b9e42bdf5af Author: Xiayun Sun Date: 2017-09-15T14:07:13Z use URI.getPath instead of toString --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org