GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/22680
[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode

## Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

Went through review, but upstream is not merging. Discussed offline with @vinooganesh that we will merge here first.

https://github.com/apache/spark/pull/22503/files

## What changes were proposed in this pull request?

CSVs with Windows-style CRLF line endings ('\r\n') don't work in multiline mode. They work fine in single-line mode because the line separation is done by Hadoop, which can handle all the different types of line separators. This PR fixes it by enabling Univocity's line separator detection in multiline mode, which detects '\r\n', '\r', or '\n' automatically, as Hadoop does in single-line mode.

## How was this patch tested?

Unit test with a file with CRLF line endings.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/palantir/spark palantirspark/fix-clrf-multiline

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22680.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22680

----
commit 8bc932a49a76d482510242a7d040fdf7e888c895
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-18T15:15:59Z

    Delete commented out code that's no longer applicable

commit cd4afe2e13a7140023cb50a8b9be798bb7b86e61
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-18T15:16:22Z

    Bump build-sbt cache to v1-build-sbt..
    think old cache causes the OOM somehow

commit 047e65a0b916a756d2dc7b8106acfd94f530f07a
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-18T23:31:39Z

    Move all-project exclusion and global setting to DefaultSparkPlugin, nuke excludeDependencies

commit 53d6f5aa9f8627388a732ae6dd3ced0b6192fcf3
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-18T23:32:26Z

    Make enable() accept any DslEntry allowing enablePlugins etc not just Seq[Setting[_]]

commit e154185f7e4c94da78375de4f590f2fd20610e6b
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T00:56:57Z

    Exclude com.sun.jersey crap but only from copyJarsProjects (assembly, examples)

commit ecd06e96825220bdf8dbec3c5fa8725aae89eb7c
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T00:59:03Z

    revert unnecessary exclusions in hadoop-palantir/pom.xml

commit bde3a2af44e9def680615aa47d82bc4decf43a18
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T01:04:09Z

    ensure we update sbt before getting externalDependencyClasspath, prevent badly cached resolution results!!
commit cae7f8cc4381d01074815d9c9c29bf115b855cac
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T12:25:13Z

    delete unnecesary sbt cache restore in deploy

commit f4af82f99e0f14843c762a7f98d0f952cfb57d52
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T12:29:44Z

    make home-sbt cache depend on project update inputs

commit 35bebba53c5c9c7f9427334bab8a47bb9129be1c
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T12:48:59Z

    python / R tests also don't use SBT or maven

commit bec5c1eb9cd4489662b49770c0f77d0202987479
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T12:51:09Z

    fix v2-home-sbt cache, I guess it doesn't need escaping $

commit 82197f8198b44a3beea0430274e43e2d2b7509a5
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T15:27:05Z

    Log which tests (per project) didn't have timings

commit 00f28de1684632afcc43dd3e21781c584915509d
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T15:36:46Z

    Log which tests (per project) didn't have timings

commit 8b7fd7ff2be699e8aebcb014e1033e12e72b585e
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T16:43:47Z

    parallelize python tests, and feed the right versions into packaging tests (run-pip-tests)

commit 886f496483eaf673fda58a7fde932dfedd514bcd
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T17:10:47Z

    I guess we need to set up logging too

commit 0047e3165c973b11ea6f2e63c90fcb1888e93c3a
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T18:26:10Z

    disable cached resolution

commit 0e749a60656c7574529f5aee4d8785697f5af668
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-19T23:48:43Z

    run all python tests before giving up, don't stop early

commit 302b8351c44882c84ce1eb5fd6fb0454b4e1c276
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-20T00:20:29Z

    Explicitly calling `update` seems to be very slow without cached resolution.
    If we just call externalDependencyClasspath though that might be enough

commit 636bef690d799d3e60dcabfaef783d2636f84878
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T01:15:28Z

    downgrade yer numpys. newer numpy breaks a bunch of tests because of different array formatting & more

commit 7aaf1c6d3abdc3093f73e5f25b8f9bbc5d60b756
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T01:25:50Z

    Remove commented out miniconda installation since it's now in base, add comment about numpy

commit ed79717bfe77f60678f78f67d7c8c089d5d47f2a
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T01:57:20Z

    try parallelism = 8 since we have 8 cores

commit 3e3e74d67e8951081ae95b8fee4ca645c941653d
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T09:59:59Z

    oh python

commit d46c9206d73e2ad1d5b1dea0d1d0926241287dc5
Author: Dan Sănduleac <dansanduleac@...>
Date:   2018-03-22T12:19:47Z

    Merge pull request #326 from palantir/hy/circle-2.0

    Use circle 2.0

commit 383dab2c7fd1cff0d6c7763cc689beb8a19b8a42
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T18:19:19Z

    deploy step needs long form

commit 084f66226cf4a5d4e6427726f7afbe58979cb5ed
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T21:32:55Z

    Cache and merge test results for scala...
commit b1643b900e00693d3eba1e4b1125867d4617e2db
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-22T21:36:18Z

    Fix deploy failing if there's no R

commit 2d2799cca35cc2adf4e2f5a96e1811055e420d2d
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-23T00:28:22Z

    Fix versions to use --first-parent

commit 08c5e0f388d084b681944557ed9cbd1275bfb531
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-23T00:31:44Z

    Move other curl inside publish.sh

commit fc44d4f4a02e9707d478a4e8a1639636de94d902
Author: Dan Sanduleac <dsanduleac@...>
Date:   2018-03-23T02:19:29Z

    try parallelism 12 for run-scala-tests

commit fc092cb4c5b2a2462499f77db0f00a4b592c358b
Author: Dan Sănduleac <dansanduleac@...>
Date:   2018-03-23T10:09:43Z

    Merge pull request #330 from palantir/ds/merge-test-results-scala

    Cache & merge scala test results, bump parallelism

----
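
For readers unfamiliar with the problem the pull request summary describes, here is a minimal sketch in plain Python of what line-separator auto-detection means for a multiline CSV reader. It is not Univocity's algorithm and not Spark code; `detect_line_separator` and `split_records` are hypothetical names invented for this illustration. The idea is to look for the first '\r\n', '\r', or '\n' that occurs outside a quoted field and then split records on that separator only, so a line break inside quotes stays part of its field.

```python
from typing import List


def detect_line_separator(sample: str) -> str:
    """Return the first line separator found outside quoted fields.

    Hypothetical helper for illustration; not Univocity's implementation.
    """
    in_quotes = False
    for i, ch in enumerate(sample):
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state is unchanged.
            in_quotes = not in_quotes
        elif not in_quotes and ch == "\r":
            return "\r\n" if sample[i + 1 : i + 2] == "\n" else "\r"
        elif not in_quotes and ch == "\n":
            return "\n"
    return "\n"  # no separator seen; fall back to an assumed default


def split_records(text: str) -> List[str]:
    """Split CSV text into records, keeping quoted line breaks intact."""
    sep = detect_line_separator(text)
    records: List[str] = []
    current: List[str] = []
    in_quotes = False
    i = 0
    while i < len(text):
        if text[i] == '"':
            in_quotes = not in_quotes
        if not in_quotes and text.startswith(sep, i):
            records.append("".join(current))
            current = []
            i += len(sep)
            continue
        current.append(text[i])
        i += 1
    if current:
        records.append("".join(current))
    return records


if __name__ == "__main__":
    crlf_csv = 'a,b\r\n1,"x\r\ny"\r\n2,z\r\n'
    for record in split_records(crlf_csv):
        print(repr(record))
```

With a fixed '\n' separator, the same input would leave a stray '\r' at the end of every record, which is roughly the class of failure the PR reports for CRLF files in multiline mode.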