[GitHub] spark pull request #22503: [SPARK-25493] [SQL] Fix multiline crlf
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22503#discussion_r219688971 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala --- @@ -212,6 +212,7 @@ class CSVOptions( settings.setEmptyValue(emptyValueInRead) settings.setMaxCharsPerColumn(maxCharsPerColumn) settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER) +settings.setLineSeparatorDetectionEnabled(true) --- End diff -- Yup, I would rather enable this only for multiline mode. Also, please add what this configuration does in the PR description. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22503: [SPARK-25493] [SQL] Fix multiline crlf
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22503#discussion_r219495737 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala --- @@ -212,6 +212,7 @@ class CSVOptions( settings.setEmptyValue(emptyValueInRead) settings.setMaxCharsPerColumn(maxCharsPerColumn) settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER) +settings.setLineSeparatorDetectionEnabled(true) --- End diff -- The auto-detection mechanism is enabled for both - multi-line and per-line mode. I guess it has some overhead on detection of new lines which is not needed in per-line mode. I would benchmark it in both modes (see `CSVBenchmarks`), and if the overhead in per-line mode is significant, I would not enable the option when `multiLine` is set to `false`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22503: [SPARK-25493] [SQL] Fix multiline crlf
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/22503 [SPARK-25493] [SQL] Fix multiline crlf ## What changes were proposed in this pull request? CSVs with windows style crlf (carriage return line feed) don't work in multiline mode. They work fine in single line mode because the line separation is done by Hadoop, which can handle all the different types of line separators. This fixes it by enabling Univocity's line separator detection. ## How was this patch tested? Unit test with a file with crlf line endings. You can merge this pull request into a Git repository by running: $ git pull https://github.com/justinuang/spark fix-clrf-multiline Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22503.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22503 commit 5ce9de9f789ce108f6afb65e38bab44acc77a4e8 Author: Justin Uang Date: 2018-09-20T20:41:35Z Fix multiline crlf --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org