[GitHub] spark issue #21892: [SPARK-24945][SQL] Switching to uniVocity 2.7.2
Github user jbax commented on the issue: https://github.com/apache/spark/pull/21892 univocity-parsers-2.7.3 released. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jbax commented on the issue: https://github.com/apache/spark/pull/21892 Thanks @MaxGekk, I've fixed the error and also made the parser generally faster than before when processing fields that are not selected. Can you please retest with the latest SNAPSHOT build and let me know how it goes?
Github user jbax commented on the issue: https://github.com/apache/spark/pull/21892 Has anyone had a chance to test with the 2.7.3-SNAPSHOT build I released, to see if the performance issue has been addressed? If it has, let me know and I'll release the final 2.7.3 build.
[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape
Github user jbax commented on the issue: https://github.com/apache/spark/pull/17177 2.4.0 released, thank you guys! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jbax commented on the issue: https://github.com/apache/spark/pull/17177 Doesn't seem correct to me. All test cases use broken CSV and trigger the parser's handling of unescaped quotes, where it tries to rescue the data and produce something sensible. See my test case here: https://github.com/uniVocity/univocity-parsers/issues/143
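The "rescue" behavior described above can be sketched roughly as follows (an illustration of the idea only, not univocity-parsers' actual algorithm; proper escaped-quote handling is deliberately omitted):

```java
public class UnescapedQuoteSketch {
    // Inside a quoted field, a quote that is NOT followed by a delimiter or
    // line ending cannot be the closing quote, so a lenient parser can keep
    // it as literal data instead of failing on the broken input.
    static String parseQuotedField(String input) {
        StringBuilder out = new StringBuilder();
        int i = 1; // skip the opening quote
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '"') {
                boolean atEnd = i + 1 >= input.length();
                if (atEnd || input.charAt(i + 1) == ',' || input.charAt(i + 1) == '\n') {
                    break; // a real closing quote
                }
                out.append('"'); // unescaped quote: rescued as data
            } else {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Broken CSV field with unescaped quotes around "quote":
        System.out.println(parseQuotedField("\"broken \"quote\" value\""));
        // prints: broken "quote" value
    }
}
```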
[GitHub] spark pull request: [SPARK-15493][SQL] Allow setting the quoteEsca...
Github user jbax commented on the pull request: https://github.com/apache/spark/pull/13267#issuecomment-221454486 @rxin In your case I think it's better to have this turned on by default. Regarding your other questions:

1 - There's no timeline. 2.2.x will come out when new features are requested by our users and implemented. Currently there's nothing in the pipeline, so we'll stay on 2.1.x, adding fixes and minor internal improvements over time. We have no open bugs either.
2 - Yes. It fixes a couple of bugs you guys probably won't come across, but it also improves the performance of the parser with whitespace trimming enabled (which is enabled by default, by the way).
3 - It's OK and I don't see why it would be a problem, other than for some client with a very uncommon use case (they are out there; that's why the library has so many configuration options).
Github user jbax commented on the pull request: https://github.com/apache/spark/pull/13267#issuecomment-221408197 It's disabled by default because earlier versions were slower when writing CSV and it helped a little bit. Also because parsing unquoted values is faster. With version 2.1.0 the new algorithm improved writing performance a lot, and having quoteEscaping enabled now makes writing faster. I found this out after testing version 2.1.1 (a maintenance release), so I didn't change the default behavior. Versions 2.2.x and up will have this enabled by default.

On 25 May 2016 6:33 AM, "Reynold Xin" <notificati...@github.com> wrote:
> @jbax <https://github.com/jbax> can we get a 2nd opinion here about quoteEscapingEnabled?
> <https://github.com/apache/spark/pull/13267#issuecomment-221390796>
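For context, a rough sketch of what the quoteEscaping setting governs when writing (illustrative only, assuming the common CSV convention of escaping a quote by doubling it; the library's real writer is considerably more involved):

```java
public class QuoteEscapingSketch {
    // With quote escaping enabled, a value containing the quote character is
    // wrapped in quotes and each embedded quote is doubled.
    static String writeField(String value, boolean quoteEscapingEnabled) {
        if (quoteEscapingEnabled && value.indexOf('"') >= 0) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value; // written as-is when escaping is disabled
    }

    public static void main(String[] args) {
        System.out.println(writeField("say \"hi\"", true));  // prints: "say ""hi"""
        System.out.println(writeField("say \"hi\"", false)); // prints: say "hi"
    }
}
```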
[GitHub] spark pull request: [MINOR][SQL] Remove not affected settings for ...
Github user jbax commented on the pull request: https://github.com/apache/spark/pull/12818#issuecomment-216770729 By the way, may I suggest you upgrade to version 2.1.0? It comes with substantial performance improvements for both parsing and writing CSV.
Github user jbax commented on the pull request: https://github.com/apache/spark/pull/12818#issuecomment-216747285 Foo and bar are part of the same value; they just happen to have a line ending in between. And yes, `setLineSeparator()` is related to the values themselves when writing, unless `normalizeLineEndingsWithinQuotes=false`.
Github user jbax commented on the pull request: https://github.com/apache/spark/pull/12818#issuecomment-216743260 What happens if you do this:

```
scala> "foo\r\nbar\r\n".stripLineEnd
```

Shouldn't the result be this?

```
res0: String = foo\r\n bar
```
Github user jbax commented on the pull request: https://github.com/apache/spark/pull/12818#issuecomment-216738292 I just read the rest of this ticket. Be careful with `setLineSeparator()`. It uses the default OS line separator, but that's not always desired. By default, this is used to transform the `normalizedLineSeparator` when writing. If you are running on Windows this will be:

```
normalizedLineSeparator = '\n'
lineSeparator = '\r\n'
```

Then write a value such as `"my \n multi line \n value"` and you will end up with `"my \r\n multi line \r\n value"`. If you actually want to have `"my \n multi line \n value"` you must either:
- set the line separator explicitly to `\n`
- set `normalizeLineEndingsWithinQuotes` to `false`, so whatever is in the input will be written to the output, without any line-ending transformation.
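The transformation described above can be sketched like this (a minimal illustration of the behavior, not univocity-parsers' implementation):

```java
public class LineEndingSketch {
    // When normalization is on, the internal normalizedLineSeparator ('\n')
    // is replaced by the configured lineSeparator on write; when it is off,
    // the value is written exactly as given.
    static String writeValue(String value, String lineSeparator, boolean normalize) {
        return normalize ? value.replace("\n", lineSeparator) : value;
    }

    public static void main(String[] args) {
        String v = "my \n multi line \n value";
        // Windows default (lineSeparator = "\r\n"): every '\n' becomes "\r\n"
        System.out.println(writeValue(v, "\r\n", true).contains("\r\n")); // prints: true
        // Explicit '\n' separator, or normalization disabled: value unchanged
        System.out.println(writeValue(v, "\n", true).equals(v));    // prints: true
        System.out.println(writeValue(v, "\r\n", false).equals(v)); // prints: true
    }
}
```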
Github user jbax commented on the pull request: https://github.com/apache/spark/pull/12818#issuecomment-216737346 Confirmed. It is only used if you call `CsvWriter.commentRow()` or `CsvWriter.commentRowToString()` to write comments to the output.