[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19247


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-16 Thread xysun
Github user xysun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r161932613
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 ---
@@ -86,6 +86,18 @@ abstract class FileStreamSourceTest
 }
   }
 
+  case class AddTextFileDataWithSpaceInFileName(content: String, src: 
File, tmp: File)
--- End diff --

sure will update and add the test for file sink. Thanks for the review. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-16 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r161907322
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 ---
@@ -408,6 +420,18 @@ class FileStreamSourceSuite extends 
FileStreamSourceTest {
 }
   }
 
+  test("SPARK-21996 read from text files -- file name has space") {
--- End diff --

Not. I meant it's not an issue of file formats. There are not some special 
codes in file stream source. If there should be any tests for such issue, they 
should be inside file format tests.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-16 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r161906360
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 ---
@@ -408,6 +420,18 @@ class FileStreamSourceSuite extends 
FileStreamSourceTest {
 }
   }
 
+  test("SPARK-21996 read from text files -- file name has space") {
--- End diff --

yes, for this PR it is, but it would be great if we can ensure that all the 
data sources have the same behavior... Maybe we can do this is another PR if 
you think it is better


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-16 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r161893771
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 ---
@@ -408,6 +420,18 @@ class FileStreamSourceSuite extends 
FileStreamSourceTest {
 }
   }
 
+  test("SPARK-21996 read from text files -- file name has space") {
--- End diff --

this test should be enough. The issue is in file stream source.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-16 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r161892343
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 ---
@@ -86,6 +86,18 @@ abstract class FileStreamSourceTest
 }
   }
 
+  case class AddTextFileDataWithSpaceInFileName(content: String, src: 
File, tmp: File)
--- End diff --

I would suggest that adding a new parameter to AddTextFileData rather than 
introducing a new class, such as
```  case class AddTextFileData(content: String, src: File, tmp: File, 
tempFilePrefix: String = "text")
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-15 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r161452270
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 ---
@@ -408,6 +420,18 @@ class FileStreamSourceSuite extends 
FileStreamSourceTest {
 }
   }
 
+  test("SPARK-21996 read from text files -- file name has space") {
--- End diff --

can we run the same test also for the other input format, ie. parquet, orc, 
... ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-14 Thread xysun
Github user xysun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r161430294
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -233,7 +233,7 @@ class FileStreamSource(
 }
 
 val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map 
{ status =>
-  (status.getPath.toUri.toString, status.getModificationTime)
+  (status.getPath.toUri.getPath, status.getModificationTime)
--- End diff --

Thanks. I'll update the commit. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r161387225
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -233,7 +233,7 @@ class FileStreamSource(
 }
 
 val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map 
{ status =>
-  (status.getPath.toUri.toString, status.getModificationTime)
+  (status.getPath.toUri.getPath, status.getModificationTime)
--- End diff --

Ohaa, `getPath` from `java.net.URI` in the proposed fix drops them. Sure.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r161387193
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -233,7 +233,7 @@ class FileStreamSource(
 }
 
 val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map 
{ status =>
-  (status.getPath.toUri.toString, status.getModificationTime)
+  (status.getPath.toUri.getPath, status.getModificationTime)
--- End diff --

Wait .. @zsxwing do you mean `getPath` (`org.apache.hadoop.fs.Path`) from 
`org.apache.hadoop.fs.FileStatus` drops the scheme and credentials?

Here seems `Seq[org.apache.hadoop.fs.FileStatus]` and the code you pointed 
out looks 
`Array[org.apache.spark.sql.execution.streaming.FileStreamSource.FileEntry]`.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-12 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r161353643
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -233,7 +233,7 @@ class FileStreamSource(
 }
 
 val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map 
{ status =>
-  (status.getPath.toUri.toString, status.getModificationTime)
+  (status.getPath.toUri.getPath, status.getModificationTime)
--- End diff --

FYI, `getPath` will drop scheme and credentials.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-12 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r161352304
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -233,7 +233,7 @@ class FileStreamSource(
 }
 
 val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map 
{ status =>
-  (status.getPath.toUri.toString, status.getModificationTime)
+  (status.getPath.toUri.getPath, status.getModificationTime)
--- End diff --

The correct fix is fixing this line: 
https://github.com/xysun/spark/blob/4f5979a72ce9cb36a3327e79b8592b9e42bdf5af/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L168

It should be `files.map(new Path(new URI(_.path)).toString)`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r159893158
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -233,7 +233,7 @@ class FileStreamSource(
 }
 
 val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map 
{ status =>
-  (status.getPath.toUri.toString, status.getModificationTime)
+  (status.getPath.toUri.getPath, status.getModificationTime)
--- End diff --

ping @xysun ^


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2017-10-02 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r142136060
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -233,7 +233,7 @@ class FileStreamSource(
 }
 
 val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map 
{ status =>
-  (status.getPath.toUri.toString, status.getModificationTime)
+  (status.getPath.toUri.getPath, status.getModificationTime)
--- End diff --

why not `status.getPath.toString`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2017-09-15 Thread xysun
GitHub user xysun opened a pull request:

https://github.com/apache/spark/pull/19247

[Spark-21996][SQL] read files with space in name for streaming

## What changes were proposed in this pull request?

Structured streaming is now able to read files with space in file name 
(previously it would skip the file and output a warning)

## How was this patch tested?

Added new unit test. 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xysun/spark SPARK-21996

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19247


commit 38901ef03048d6173eabe1973897e8c556a51709
Author: Xiayun Sun 
Date:   2017-09-15T07:51:06Z

add test to reproduce

commit 4f5979a72ce9cb36a3327e79b8592b9e42bdf5af
Author: Xiayun Sun 
Date:   2017-09-15T14:07:13Z

use URI.getPath instead of toString




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org