[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849 The last case seems to work depending on Jackson's detection (UTF-16 for the first line and UTF-16LE for the second) if we don't set `encoding`, but Jackson parses both lines as `UTF-16LE` if we set `encoding`
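A hedged illustration of the Jackson behaviour described above (this is not Spark's code; the payload and charsets are only assumptions for the sake of the example): a parser built from raw bytes lets Jackson auto-detect the encoding, while a parser built on a `Reader` with a fixed charset always decodes with that charset.

```scala
// Hedged sketch (spark-shell style), assuming jackson-core is on the classpath.
import java.io.{ByteArrayInputStream, InputStreamReader}
import com.fasterxml.jackson.core.JsonFactory

val factory = new JsonFactory()
val bytes = """{"a": 1}""".getBytes("UTF-16LE")

// Auto-detection path: Jackson inspects the raw bytes and picks an encoding itself.
val autoDetected = factory.createParser(new ByteArrayInputStream(bytes))

// Fixed-charset path: the Reader decodes before Jackson sees anything,
// so the chosen charset wins even if the bytes suggest otherwise.
val fixedCharset = factory.createParser(
  new InputStreamReader(new ByteArrayInputStream(bytes), "UTF-16LE"))
```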

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849 Thanks for testing this out thoroughly, but I believe we can still go with https://github.com/apache/spark/pull/20937 if we whitelist supported encodings for now? If that's right and I understood
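A rough sketch of what whitelisting supported encodings could look like (the names are hypothetical, not Spark's actual code): accept only a fixed set of encodings and fail fast for anything else.

```scala
// Hypothetical helper, not Spark's code.
val allowedEncodings = Set("UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE")

def checkEncoding(enc: String): String = {
  val normalized = enc.toUpperCase(java.util.Locale.ROOT)
  require(allowedEncodings.contains(normalized),
    s"Encoding $normalized is not supported. Supported encodings: ${allowedEncodings.mkString(", ")}")
  normalized
}
```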

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread MaxGekk
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849 @HyukjinKwon I did an experiment on https://github.com/MaxGekk/spark-1/pull/2 and modified [the test](https://github.com/MaxGekk/spark-1/blob/f94d846b39ade89da24ef3e85f9721fb34e48154/sql/core/sr

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849 Let's make the point clear. There are two things: _1. line-by-line parsing_ and _2. JSON parsing via Jackson_. The test you pointed out still looks a bit weird because Jackson is

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849 From a quick look and a wild guess, only the `UTF-16` case would be problematic because we are going to build the delimiter with a BOM in it: `0xFF 0xFE 0x0D 0x00 0x0A 0x00`.
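A minimal sketch (spark-shell style, not Spark's code) of why the `UTF-16` case is special: encoding a separator string with the `UTF-16` charset prepends a BOM, so a naive string-to-bytes conversion of the delimiter can yield a byte sequence like the one quoted above. The exact byte order depends on the charset's default endianness.

```scala
val withBom    = "\r\n".getBytes("UTF-16")    // BOM + encoded CR LF
val withoutBom = "\r\n".getBytes("UTF-16LE")  // no BOM, just the encoded characters

println(withBom.map(b => f"0x${b & 0xFF}%02X").mkString(" "))
println(withoutBom.map(b => f"0x${b & 0xFF}%02X").mkString(" "))
```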

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849 @MaxGekk, so to make it clear: it parses line by line correctly regardless of the BOM if we set `lineSep` + `encoding`, but it fails to parse each line as JSON via Jackson since we explicitly set
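For reference, a hedged sketch of the workaround under discussion, with a hypothetical file path; the option names follow this PR's proposal (`encoding` plus an explicit `lineSep` in the same charset):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-encoding-demo").getOrCreate()

// Decode records as UTF-16LE and split them on an explicitly given delimiter,
// so Spark does not have to guess how newlines are encoded.
val df = spark.read
  .option("encoding", "UTF-16LE")
  .option("lineSep", "\r\n")
  .json("/path/to/file-utf16le.json")
```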

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-30 Thread MaxGekk
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849 @cloud-fan It is a regular file in UTF-16 with BOM = `0xFF 0xFE`, which indicates little-endian byte order. When we slice the file by lines, the first line is still in UTF-16 with a BOM, the rest of the lines

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-29 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20849 @MaxGekk are you talking about a malformed json file which has multiple encodings inside it?

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-29 Thread MaxGekk
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849 Please look at https://github.com/apache/spark/pull/20937

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849 Please give me a few days to check your comments. I happen to be super busy for a personal reason.

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-28 Thread MaxGekk
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849 Ironically, this file came from a customer: https://issues.apache.org/jira/browse/SPARK-23410. And that's why we reverted Jackson's charset auto-detection: https://github.com/apache/spark/commit/12

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-28 Thread MaxGekk
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849 When I was trying to remove the flexible format for lineSep (recordDelimiter), I ran into a problem. I cannot fix the test: https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-28 Thread MaxGekk
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849
> Shall we expose encoding and add an alias for charset?

It works for me too.

> Is this flexible option also a part of your public release?

No, it is not. Only `charset` was

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849
> The PR doesn't solve any practical use cases

It does. It allows many workarounds; for example, we can intentionally add a custom delimiter so that it can support multiple-line-ish JSON
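A hedged sketch of that workaround (the separator string and path are hypothetical): records are written with a custom delimiter, so each JSON document may itself contain newlines, and the same delimiter is passed back on read.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("custom-delimiter-demo").getOrCreate()

// Each record is separated by "|^|" instead of a newline, so a single JSON
// document can be pretty-printed across several lines.
val df = spark.read
  .option("lineSep", "|^|")
  .json("/path/to/records-with-custom-sep.json")
```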

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-28 Thread MaxGekk
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849 @HyukjinKwon
> How about we go this way with separate PRs?

I agree with that, but only to unblock https://github.com/apache/spark/pull/20849, because it solves a real problem of our custom

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849 I think the flexible format needs more feedback and review. How about we go this way?
1. https://github.com/apache/spark/pull/20877 to support line separator in json datasource
2. j

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-27 Thread MaxGekk
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849 @HyukjinKwon I am working on a PR which includes the changes of this PR, recordDelimiter (flexible format), plus forcing a user to set the recordDelimiter option if charset is specified, as @cloud-fan suggested
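A minimal sketch of that rule (hypothetical names, not the actual PR code): when a non-default charset is given, require an explicit record delimiter.

```scala
// Hypothetical validation helper, not Spark's code.
def validateOptions(charset: Option[String], recordDelimiter: Option[String]): Unit = {
  charset match {
    case Some(enc) if !enc.equalsIgnoreCase("UTF-8") && recordDelimiter.isEmpty =>
      throw new IllegalArgumentException(
        s"The recordDelimiter option must be specified when charset is set to $enc")
    case _ => // default charset or explicit delimiter: nothing to enforce
  }
}
```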

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849 I am against this, mainly because of https://github.com/MaxGekk/spark-1/pull/1#discussion_r175444502, if there isn't a better way than rewriting it. Also, I think we should support the `charset` option f

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-25 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20849 @MaxGekk @HyukjinKwon What is the status of this PR?

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849 Does charset work with newlines?

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849 Shall we add non-ASCII-compatible characters in the test resource files?

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20849 Merged build finished. Test PASSed.

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20849 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88341/

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20849 **[Test build #88341 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88341/testReport)** for PR 20849 at commit [`961b482`](https://github.com/apache/spark/commit/9

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20849 **[Test build #88341 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88341/testReport)** for PR 20849 at commit [`961b482`](https://github.com/apache/spark/commit/96

[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20849 Can one of the admins verify this patch?