[GitHub] spark pull request #18304: [SPARK-21098] Add lineseparator parameter to csv ...

2017-06-14 Thread danielvdende
GitHub user danielvdende opened a pull request:

https://github.com/apache/spark/pull/18304

[SPARK-21098] Add lineseparator parameter to csv options

This commit adds the lineSeparator option to the spark support for
reading/writing csv files.

## What changes were proposed in this pull request?

This PR adds support for specifying the `lineSeparator` option when 
reading/writing to/from csv. This functionality is supported by the Univocity 
parsers library in use by Spark.

## How was this patch tested?

A file with DOS (`\r\n`) line endings was added to the CsvSuite, which 
should produce the exact same results given the correct line separator


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/danielvdende/spark 
add-csv-line-separator-option

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18304.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18304


commit 9684a39394b1fd35636af6adc3e45394b01934cf
Author: Daniel van der Ende 
Date:   2017-06-14T18:19:50Z

[SPARK-21098] Add lineseparator parameter to csv options

This commit adds the lineSeparator option to the spark support for
reading/writing csv files.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18304: [SPARK-21098] Add lineseparator parameter to csv ...

2017-06-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122085991
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -90,6 +90,7 @@ class CSVOptions(
   val quote = getChar("quote", '\"')
   val escape = getChar("escape", '\\')
   val comment = getChar("comment", '\u')
+  val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
--- End diff --

Could we read back with this custom line separator?  For writing path, hard 
coded `\n` was removed but it looks not dependent on the library again - 
https://github.com/apache/spark/pull/16428#issuecomment-269756936


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18304: [SPARK-21098] Add lineseparator parameter to csv ...

2017-06-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122089891
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -90,6 +90,7 @@ class CSVOptions(
   val quote = getChar("quote", '\"')
   val escape = getChar("escape", '\\')
   val comment = getChar("comment", '\u')
+  val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
--- End diff --

Could we read back with this custom line separator?  For writing path, hard 
coded `\n` was removed before as it does not affect but it looks now dependent 
on the univocity library -
 https://github.com/apache/spark/pull/16428#issuecomment-269756936


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18304: [SPARK-21098] Add lineseparator parameter to csv ...

2017-06-15 Thread danielvdende
Github user danielvdende commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122135698
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -90,6 +90,7 @@ class CSVOptions(
   val quote = getChar("quote", '\"')
   val escape = getChar("escape", '\\')
   val comment = getChar("comment", '\u')
+  val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
--- End diff --

Hi, thanks for pointing this out, I wasn't aware of this. Looking at it 
though, how would you propose fixing this? We would need a `writerSettings` 
instance to fetch the univocity default I think. Also, looking at the other 
parameters, I also see hardcoded  values there (for instance `val quote = 
getChar("quote", '\"')`). Personally, I like the idea of being able to see the 
default used directly in Spark, to prevent changes in the univocity lib from 
impacting Spark usage (and requiring confusion when using). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18304: [SPARK-21098] Add lineseparator parameter to csv ...

2017-06-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122141063
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -90,6 +90,7 @@ class CSVOptions(
   val quote = getChar("quote", '\"')
   val escape = getChar("escape", '\\')
   val comment = getChar("comment", '\u')
+  val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
--- End diff --

Yea, that's why I haven't proposed this change so far (for both hard corded 
`\n` and configurable line separator).

For the hard corded `\n`, IMHO, reviving hard corded `\n` is probably fine 
for now as consistent with JSON/TEXT datasources because It uses, after the 
change I described above, OS-dependent newlline by default in Univocity - 
https://github.com/uniVocity/univocity-parsers/blob/e34a149077147de7cafcafbbaf8a4310e6297584/src/main/java/com/univocity/parsers/common/Format.java#L78,
 which is machine-dependent. If we are worried of changing this, we could use 
`System.getProperty("line.separator")` or just not set via `Option`. This 
probably does not block this PR.

For the configurable line separator, we are not able to read it back when 
`wholeFile` (or `miltiLine` now) option is disabled (it's default and this 
should be primary) because we are dependent of `LineRecordReader` which only 
deals with `\n` or `\r\n`. We really should be able to read it back.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18304: [SPARK-21098] Add lineseparator parameter to csv ...

2017-06-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122142421
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -90,6 +90,7 @@ class CSVOptions(
   val quote = getChar("quote", '\"')
   val escape = getChar("escape", '\\')
   val comment = getChar("comment", '\u')
+  val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
--- End diff --

To cut this short, up to my knowledge, for the current status (please 
correct me if I am wrong),

CSV read -> cover `\n` and `\r\n`
CSV read with `wholeFile` (`multiLine`) -> OS dependent newline (by 
Univocity)
CSV write -> OS dependent newline (by Univocity)

JSON read -> cover `\n` and `\r\n`
JSON read with `wholeFile` (`multiLine`) -> N/A (it reads the whole file as 
a single record for the current status)
JSON write -> `\n`

TEXT read -> cover `\n` and `\r\n`
TEXT write -> `\n`

For hardcorded newline, fixing it to `\n` is probably better. I wouldn't 
mind if this PR is turned to fix this for 'CSV read with wholeFile (multiLine)' 
and 'CSV write'.

For configurable line separator, it would be not configurable in this way 
for 'CSV read'. We should be able to read it back what we write out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18304: [SPARK-21098] Add lineseparator parameter to csv ...

2017-06-15 Thread danielvdende
Github user danielvdende commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122144727
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -90,6 +90,7 @@ class CSVOptions(
   val quote = getChar("quote", '\"')
   val escape = getChar("escape", '\\')
   val comment = getChar("comment", '\u')
+  val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
--- End diff --

Ah, I think I understand now, thanks. 

Maybe I'm overlooking it, but I don't see the `wholeFile` option being used 
in `CsvOptions` at all at the moment (nor `multiline`) ?

I'll take a further look at fixing this up considering the dependency on 
`LineRecordReader` (which only supports the two line separators you mentioned)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18304: [SPARK-21098] Add lineseparator parameter to csv ...

2017-06-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122156785
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -90,6 +90,7 @@ class CSVOptions(
   val quote = getChar("quote", '\"')
   val escape = getChar("escape", '\\')
   val comment = getChar("comment", '\u')
+  val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
--- End diff --

Ah, it was `wholeFile` but just renamed 
https://github.com/apache/spark/commit/2051428173d8cd548702eb1a2e1c1ca76b8f2fd5 
to `multiLine` four hours ago.

Yes, it would be wonderful if we can support it in reading too. If it is 
difficult, it is fine (to me) to turn this PR for hard corded one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18304: [SPARK-21098] Add lineseparator parameter to csv ...

2017-06-15 Thread danielvdende
Github user danielvdende commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122189644
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -90,6 +90,7 @@ class CSVOptions(
   val quote = getChar("quote", '\"')
   val escape = getChar("escape", '\\')
   val comment = getChar("comment", '\u')
+  val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
--- End diff --

Yeah, I did some digging, but that will probably be a lot trickier, because 
it would involve making changes in the Hadoop repo as well. For now, I've 
changed the content of this PR to only contain the hardcoding to `\n` as you 
suggested. I may take a look at the Hadoop code that would require changing as 
well, but that would require a lot more work than the scope of this story 
suggests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18304: [SPARK-21098] Add lineseparator parameter to csv ...

2017-06-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122191554
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -90,6 +90,7 @@ class CSVOptions(
   val quote = getChar("quote", '\"')
   val escape = getChar("escape", '\\')
   val comment = getChar("comment", '\u')
+  val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
--- End diff --

(Just FYI, about a year and one and a half year ago, I proposed this - 
https://github.com/apache/spark/pull/11016/files#diff-6a14f6bb643b1474139027d72a17f41aR141.
 It might probably be helpful if you are willing to take a look further.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org