[GitHub] spark pull request #18304: [SPARK-21098] Add lineseparator parameter to csv ...
GitHub user danielvdende opened a pull request:

https://github.com/apache/spark/pull/18304

[SPARK-21098] Add lineseparator parameter to csv options

This commit adds the `lineSeparator` option to Spark's support for reading/writing CSV files.

## What changes were proposed in this pull request?

This PR adds support for specifying the `lineSeparator` option when reading/writing to/from CSV. This functionality is supported by the Univocity parsers library used by Spark.

## How was this patch tested?

A file with DOS (`\r\n`) line endings was added to the CsvSuite, which should produce exactly the same results given the correct line separator.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/danielvdende/spark add-csv-line-separator-option

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18304.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18304

commit 9684a39394b1fd35636af6adc3e45394b01934cf
Author: Daniel van der Ende
Date: 2017-06-14T18:19:50Z

    [SPARK-21098] Add lineseparator parameter to csv options

    This commit adds the lineSeparator option to the spark support for reading/writing csv files.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
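The option proposed here matters because the same CSV bytes parse differently depending on the line terminator. A minimal Python sketch (illustrative only, not Spark code) showing how DOS (`\r\n`) endings trip up a parser that is fixed to `\n`:

```python
import csv
import io

# A CSV payload with DOS (\r\n) line endings, as in the test file
# this PR adds to the CsvSuite.
data = "name,age\r\nalice,30\r\nbob,25\r\n"

# csv.reader understands both Unix (\n) and DOS (\r\n) terminators,
# so a DOS-terminated file yields the same records as a Unix one.
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['name', 'age'], ['alice', '30'], ['bob', '25']]

# A naive splitter hard-wired to "\n" leaves a stray "\r" in the
# last field of every record instead.
naive = [line.split(",") for line in data.split("\n") if line]
print(naive[1])  # ['alice', '30\r']
```

This is the behavior the configurable `lineSeparator` option is meant to make explicit rather than implicit.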
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122085991

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -90,6 +90,7 @@ class CSVOptions(
     val quote = getChar("quote", '\"')
     val escape = getChar("escape", '\\')
     val comment = getChar("comment", '\u0000')
+    val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
--- End diff --

Could we read it back with this custom line separator? For the writing path, the hard-coded `\n` was removed, but it looks not dependent on the library again - https://github.com/apache/spark/pull/16428#issuecomment-269756936
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122089891

(same hunk in CSVOptions.scala: `val lineSeparator = parameters.getOrElse("lineSeparator", "\n")`)

Could we read it back with this custom line separator? For the writing path, the hard-coded `\n` was removed before because it did not have an effect, but it now looks dependent on the Univocity library - https://github.com/apache/spark/pull/16428#issuecomment-269756936
Github user danielvdende commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122135698

(same hunk in CSVOptions.scala: `val lineSeparator = parameters.getOrElse("lineSeparator", "\n")`)

Hi, thanks for pointing this out, I wasn't aware of this. Looking at it though, how would you propose fixing it? We would need a `writerSettings` instance to fetch the Univocity default, I think. Also, looking at the other parameters, I see hardcoded values there too (for instance `val quote = getChar("quote", '\"')`). Personally, I like the idea of being able to see the default directly in Spark, to prevent changes in the Univocity lib from impacting Spark usage (and causing confusion for users).
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122141063

(same hunk in CSVOptions.scala: `val lineSeparator = parameters.getOrElse("lineSeparator", "\n")`)

Yea, that's why I haven't proposed this change so far (for both the hard-coded `\n` and a configurable line separator).

For the hard-coded `\n`, IMHO, reviving it is probably fine for now, consistent with the JSON/TEXT datasources, because after the change I described above Univocity uses the OS-dependent newline by default - https://github.com/uniVocity/univocity-parsers/blob/e34a149077147de7cafcafbbaf8a4310e6297584/src/main/java/com/univocity/parsers/common/Format.java#L78 - which is machine-dependent. If we are worried about changing this, we could use `System.getProperty("line.separator")` or just not set it via `Option`. This probably does not block this PR.

For the configurable line separator, we are not able to read it back when the `wholeFile` (now `multiLine`) option is disabled (it is the default and should be the primary path), because we depend on `LineRecordReader`, which only deals with `\n` or `\r\n`. We really should be able to read back what we write out.
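The constraint described above can be sketched outside Hadoop: a splitter that, like `LineRecordReader`, only recognizes `\n` and `\r\n` cannot recover records written with any other separator. A hedged Python sketch (the helper name is invented for illustration; this is not the Hadoop implementation):

```python
def split_records(data: str) -> list:
    """Split on \\n or \\r\\n only, roughly what Hadoop's
    LineRecordReader does; any other separator is not recognized."""
    return [line for line in data.replace("\r\n", "\n").split("\n") if line]

# Records written with \r\n are read back correctly...
print(split_records("a,1\r\nb,2\r\n"))  # ['a,1', 'b,2']

# ...but records written with a custom separator (e.g. '|') come
# back as a single record, so such a write is not round-trippable
# on the default (non-multiLine) read path.
print(split_records("a,1|b,2|"))  # ['a,1|b,2|']
```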
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122142421

(same hunk in CSVOptions.scala: `val lineSeparator = parameters.getOrElse("lineSeparator", "\n")`)

To cut this short, to my knowledge, the current status is (please correct me if I am wrong):

- CSV read -> covers `\n` and `\r\n`
- CSV read with `wholeFile` (`multiLine`) -> OS-dependent newline (by Univocity)
- CSV write -> OS-dependent newline (by Univocity)
- JSON read -> covers `\n` and `\r\n`
- JSON read with `wholeFile` (`multiLine`) -> N/A (it currently reads the whole file as a single record)
- JSON write -> `\n`
- TEXT read -> covers `\n` and `\r\n`
- TEXT write -> `\n`

For the hardcoded newline, fixing it to `\n` is probably better. I wouldn't mind if this PR is turned into fixing this for 'CSV read with `wholeFile` (`multiLine`)' and 'CSV write'.

For the configurable line separator, it would not be configurable this way for 'CSV read'. We should be able to read back what we write out.
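The 'OS-dependent newline' entries above are the crux of the hard-coding question: a writer that falls back to the platform separator produces different bytes on Windows and Linux, while fixing `\n` keeps output identical everywhere. A small Python sketch of the two policies (illustrative only, not the Spark or Univocity implementation):

```python
import os

def write_csv(rows, line_sep=None):
    # None mimics a library default that falls back to the
    # OS-dependent separator, so output differs per platform.
    sep = line_sep if line_sep is not None else os.linesep
    return sep.join(",".join(row) for row in rows) + sep

rows = [["a", "1"], ["b", "2"]]

# Hard-coding "\n" makes the output identical on every OS:
print(repr(write_csv(rows, line_sep="\n")))  # 'a,1\nb,2\n'
```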
Github user danielvdende commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122144727

(same hunk in CSVOptions.scala: `val lineSeparator = parameters.getOrElse("lineSeparator", "\n")`)

Ah, I think I understand now, thanks. Maybe I'm overlooking it, but I don't see the `wholeFile` option (nor `multiLine`) being used in `CSVOptions` at all at the moment? I'll take a further look at fixing this up, considering the dependency on `LineRecordReader` (which only supports the two line separators you mentioned).
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122156785

(same hunk in CSVOptions.scala: `val lineSeparator = parameters.getOrElse("lineSeparator", "\n")`)

Ah, it was `wholeFile` but was just renamed to `multiLine` four hours ago: https://github.com/apache/spark/commit/2051428173d8cd548702eb1a2e1c1ca76b8f2fd5. Yes, it would be wonderful if we can support it in reading too. If that is difficult, it is fine (to me) to turn this PR into the hard-coded one.
Github user danielvdende commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122189644

(same hunk in CSVOptions.scala: `val lineSeparator = parameters.getOrElse("lineSeparator", "\n")`)

Yeah, I did some digging, but that will probably be a lot trickier, because it would involve making changes in the Hadoop repo as well. For now, I've changed the content of this PR to only contain the hardcoding to `\n`, as you suggested. I may take a look at the Hadoop code that would need changing as well, but that would require a lot more work than the scope of this story suggests.
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18304#discussion_r122191554

(same hunk in CSVOptions.scala: `val lineSeparator = parameters.getOrElse("lineSeparator", "\n")`)

(Just FYI, about a year and a half ago I proposed this - https://github.com/apache/spark/pull/11016/files#diff-6a14f6bb643b1474139027d72a17f41aR141. It might be helpful if you are willing to take a look further.)