[GitHub] spark pull request #22503: [SPARK-25493] [SQL] Fix multiline crlf

2018-09-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22503#discussion_r219688971
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -212,6 +212,7 @@ class CSVOptions(
 settings.setEmptyValue(emptyValueInRead)
 settings.setMaxCharsPerColumn(maxCharsPerColumn)
 
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
+settings.setLineSeparatorDetectionEnabled(true)
--- End diff --

Yup, I would rather enable this only for multiline mode. Also, please add 
what this configuration does in the PR description.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22503: [SPARK-25493] [SQL] Fix multiline crlf

2018-09-21 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22503#discussion_r219495737
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -212,6 +212,7 @@ class CSVOptions(
 settings.setEmptyValue(emptyValueInRead)
 settings.setMaxCharsPerColumn(maxCharsPerColumn)
 
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
+settings.setLineSeparatorDetectionEnabled(true)
--- End diff --

The auto-detection mechanism is enabled for both - multi-line and per-line 
mode. I guess it has some overhead on detection of new lines which is not 
needed in per-line mode. I would benchmark it in both modes (see 
`CSVBenchmarks`), and if the overhead in per-line mode is significant, I would 
not enable the option when `multiLine` is set to `false`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22503: [SPARK-25493] [SQL] Fix multiline crlf

2018-09-20 Thread justinuang
GitHub user justinuang opened a pull request:

https://github.com/apache/spark/pull/22503

[SPARK-25493] [SQL] Fix multiline crlf

## What changes were proposed in this pull request?

CSVs with windows style crlf (carriage return line feed) don't work in 
multiline mode. They work fine in single line mode because the line separation 
is done by Hadoop, which can handle all the different types of line separators. 
This fixes it by enabling Univocity's line separator detection.

## How was this patch tested?

Unit test with a file with crlf line endings.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/justinuang/spark fix-clrf-multiline

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22503.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22503


commit 5ce9de9f789ce108f6afb65e38bab44acc77a4e8
Author: Justin Uang 
Date:   2018-09-20T20:41:35Z

Fix multiline crlf




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org