GitHub user justinuang opened a pull request:

    https://github.com/apache/spark/pull/22680

    [SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode

    ## Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)
    
    This went through review upstream, but upstream is not merging it. Discussed 
offline with @vinooganesh that we will merge here first.
    
    https://github.com/apache/spark/pull/22503/files
    
    ## What changes were proposed in this pull request?
    
    CSV files with Windows-style CRLF ('\r\n') line endings don't work in 
multiline mode. They work fine in single-line mode because line separation is 
done by Hadoop, which handles all the different line separators. This PR fixes 
that by enabling Univocity's line separator detection in multiline mode, which 
detects '\r\n', '\r', or '\n' automatically, as Hadoop does in single-line 
mode.
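
    The idea can be illustrated outside Spark with a minimal Python sketch. It is not Univocity's detection code and not the change in this PR: `detect_line_separator` and `parse_multiline_csv` are hypothetical names, and Python's standard `csv` module stands in for the parser. It only shows why a parser that sees the raw line endings can handle quoted fields spanning lines, whatever the separator.

```python
import csv
import io


def detect_line_separator(sample):
    # Hypothetical helper, for illustration only: return the first line
    # separator found in the sample (CRLF, CR, or LF). It ignores quoting
    # and is not Univocity's algorithm.
    for i, ch in enumerate(sample):
        if ch == "\r":
            # A CR immediately followed by LF is a Windows-style CRLF.
            return "\r\n" if sample[i + 1:i + 2] == "\n" else "\r"
        if ch == "\n":
            return "\n"
    return "\n"  # No separator seen; fall back to LF.


def parse_multiline_csv(text):
    # newline="" keeps the stream from translating line endings, so the CSV
    # parser receives the raw separators and decides itself where a record
    # ends, including inside quoted fields that span several lines.
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    crlf_text = 'id,comment\r\n1,"first line\r\nsecond line"\r\n2,plain\r\n'
    print(repr(detect_line_separator(crlf_text)))
    for row in parse_multiline_csv(crlf_text):
        print(row)
```

    In this sketch the quoted field keeps its embedded line break while the record boundaries are still found, which is the behavior the PR is after in multiline mode.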
    
    ## How was this patch tested?
    
    Unit test with a file that has CRLF line endings.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/palantir/spark palantirspark/fix-clrf-multiline

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22680.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22680
    
----
commit 8bc932a49a76d482510242a7d040fdf7e888c895
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-18T15:15:59Z

    Delete commented out code that's no longer applicable

commit cd4afe2e13a7140023cb50a8b9be798bb7b86e61
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-18T15:16:22Z

    Bump build-sbt cache to v1-build-sbt.. think old cache causes the OOM 
somehow

commit 047e65a0b916a756d2dc7b8106acfd94f530f07a
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-18T23:31:39Z

    Move all-project exclusion and global setting to DefaultSparkPlugin, nuke 
excludeDependencies

commit 53d6f5aa9f8627388a732ae6dd3ced0b6192fcf3
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-18T23:32:26Z

    Make enable() accept any DslEntry allowing enablePlugins etc not just 
Seq[Setting[_]]

commit e154185f7e4c94da78375de4f590f2fd20610e6b
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T00:56:57Z

    Exclude com.sun.jersey crap but only from copyJarsProjects (assembly, 
examples)

commit ecd06e96825220bdf8dbec3c5fa8725aae89eb7c
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T00:59:03Z

    revert unnecessary exclusions in hadoop-palantir/pom.xml

commit bde3a2af44e9def680615aa47d82bc4decf43a18
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T01:04:09Z

    ensure we update sbt before getting externalDependencyClasspath, prevent 
badly cached resolution results!!

commit cae7f8cc4381d01074815d9c9c29bf115b855cac
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T12:25:13Z

    delete unnecessary sbt cache restore in deploy

commit f4af82f99e0f14843c762a7f98d0f952cfb57d52
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T12:29:44Z

    make home-sbt cache depend on project update inputs

commit 35bebba53c5c9c7f9427334bab8a47bb9129be1c
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T12:48:59Z

    python / R tests also don't use SBT or maven

commit bec5c1eb9cd4489662b49770c0f77d0202987479
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T12:51:09Z

    fix v2-home-sbt cache, I guess it doesn't need escaping $

commit 82197f8198b44a3beea0430274e43e2d2b7509a5
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T15:27:05Z

    Log which tests (per project) didn't have timings

commit 00f28de1684632afcc43dd3e21781c584915509d
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T15:36:46Z

    Log which tests (per project) didn't have timings

commit 8b7fd7ff2be699e8aebcb014e1033e12e72b585e
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T16:43:47Z

    parallelize python tests, and feed the right versions into packaging tests 
(run-pip-tests)

commit 886f496483eaf673fda58a7fde932dfedd514bcd
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T17:10:47Z

    I guess we need to set up logging too

commit 0047e3165c973b11ea6f2e63c90fcb1888e93c3a
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T18:26:10Z

    disable cached resolution

commit 0e749a60656c7574529f5aee4d8785697f5af668
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T23:48:43Z

    run all python tests before giving up, don't stop early

commit 302b8351c44882c84ce1eb5fd6fb0454b4e1c276
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-20T00:20:29Z

    Explicitly calling `update` seems to be very slow without cached 
resolution. If we just call externalDependencyClasspath though that might be 
enough

commit 636bef690d799d3e60dcabfaef783d2636f84878
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T01:15:28Z

    downgrade yer numpys. newer numpy breaks a bunch of tests because of 
different array formatting & more

commit 7aaf1c6d3abdc3093f73e5f25b8f9bbc5d60b756
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T01:25:50Z

    Remove commented out miniconda installation since it's now in base, add 
comment about numpy

commit ed79717bfe77f60678f78f67d7c8c089d5d47f2a
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T01:57:20Z

    try parallelism = 8 since we have 8 cores

commit 3e3e74d67e8951081ae95b8fee4ca645c941653d
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T09:59:59Z

    oh python

commit d46c9206d73e2ad1d5b1dea0d1d0926241287dc5
Author: Dan Sănduleac <dansanduleac@...>
Date:   2018-03-22T12:19:47Z

    Merge pull request #326 from palantir/hy/circle-2.0
    
    Use circle 2.0

commit 383dab2c7fd1cff0d6c7763cc689beb8a19b8a42
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T18:19:19Z

    deploy step needs long form

commit 084f66226cf4a5d4e6427726f7afbe58979cb5ed
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T21:32:55Z

    Cache and merge test results for scala...

commit b1643b900e00693d3eba1e4b1125867d4617e2db
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T21:36:18Z

    Fix deploy failing if there's no R

commit 2d2799cca35cc2adf4e2f5a96e1811055e420d2d
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-23T00:28:22Z

    Fix versions to use --first-parent

commit 08c5e0f388d084b681944557ed9cbd1275bfb531
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-23T00:31:44Z

    Move other curl inside publish.sh

commit fc44d4f4a02e9707d478a4e8a1639636de94d902
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-23T02:19:29Z

    try parallelism 12 for run-scala-tests

commit fc092cb4c5b2a2462499f77db0f00a4b592c358b
Author: Dan Sănduleac <dansanduleac@...>
Date:   2018-03-23T10:09:43Z

    Merge pull request #330 from palantir/ds/merge-test-results-scala
    
    Cache & merge scala test results, bump parallelism

----

