[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849

The last case seems to work because Jackson detects the encoding per line (UTF-16 for the first line and UTF-16LE for the second) if we don't set `encoding`, but Jackson parses both lines as `UTF-16LE` if we do set `encoding`. Did I understand correctly? To be clear, I am not against the flexible format yet; I just want to solve the problem bit by bit.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849

Thanks for testing this out so thoroughly, but I believe we can still go with https://github.com/apache/spark/pull/20937 if we whitelist supported encodings for now. If that's right and I understood correctly, let's move the discussion to https://github.com/apache/spark/pull/20937#discussion_r178426866.
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849

@HyukjinKwon I did an experiment on https://github.com/MaxGekk/spark-1/pull/2 and modified [the test](https://github.com/MaxGekk/spark-1/blob/f94d846b39ade89da24ef3e85f9721fb34e48154/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2072-L2083). If `UTF-16LE` is set explicitly:

```
val jsonDF = spark.read.schema(schema)
  .option("lineSep", "x0d 00 0a 00")
  .option("encoding", "UTF-16LE")
  .json(testFile(fileName))
```

only the second line is returned correctly:

```
+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     Doug|    Rood|
+---------+--------+
```

In the case of `UTF-16`, only the first row is returned, as with the CSV file:

```
+---------+--------+
|firstName|lastName|
+---------+--------+
|    Chris|   Baird|
|     null|    null|
+---------+--------+
```

And you are right: if the encoding is `UTF-16`, a BOM is added to the delimiter:

```
val jsonDF = spark.read.schema(schema)
  .option("lineSep", "\r\n")
  .option("encoding", "UTF-16")
  .json(testFile(fileName))
```

The `lineSeparator` parameter of `HadoopFileLinesReader` becomes `0xFE 0xFF 0x00 0x0D 0x00 0x0A` - BOM + UTF-16BE (in the CSV case it was BOM + UTF-16LE). Even if we cut the BOM from `lineSep`, the result is still not correct. So there are actually 2 (or 3) problems. Just in case:

```
val jsonDF = spark.read.schema(schema)
  .option("lineSep", "\r\n")
  .option("encoding", "UTF-16LE")
  .json(testFile(fileName))
```

```
+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     Doug|    Rood|
+---------+--------+
```
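The BOM-in-the-delimiter effect described above can be reproduced outside Spark. A minimal Python sketch (not Spark code; Python's `utf-16` codec writes a little-endian BOM `FF FE`, whereas Java's `UTF-16` charset writes a big-endian BOM `FE FF`, which is why `HadoopFileLinesReader` saw `0xFE 0xFF 0x00 0x0D 0x00 0x0A`):

```python
# Encoding "\r\n" with an endianness-unspecified UTF-16 codec prepends a BOM;
# the endian-specific codecs do not.
sep = "\r\n"

with_bom = sep.encode("utf-16")     # BOM + encoded bytes
le_only = sep.encode("utf-16-le")   # raw little-endian bytes, no BOM
be_only = sep.encode("utf-16-be")   # raw big-endian bytes, no BOM

print(with_bom.hex(" "))  # ff fe 0d 00 0a 00
print(le_only.hex(" "))   # 0d 00 0a 00
print(be_only.hex(" "))   # 00 0d 00 0a
```

A delimiter built this way never matches the actual record separators in the file body, because the BOM occurs only once, at the start of the file.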
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849

Let's make the point clear. There are two things: _1. line-by-line parsing_ and _2. JSON parsing via Jackson_. The test you pointed out still looks a bit weird because Jackson is going to detect the encoding for each line, not for the whole file.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849

From a quick look and a wild guess, the `UTF-16` case alone would be problematic because we end up building the delimiter with BOM bytes: `0xFF 0xFE 0x0D 0x00 0x0A 0x00`.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849

@MaxGekk, so to make it clear: line-by-line parsing works correctly regardless of the BOM if we set `lineSep` + `encoding`, but parsing each line as JSON via Jackson fails because we explicitly set `UTF-16LE` or `UTF-16` for the JSON parsing?
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849

@cloud-fan It is a regular UTF-16 file with BOM = `0xFF 0xFE`, which indicates little-endian byte order. When we slice the file into lines, the first line is still UTF-16 with a BOM, while the remaining lines become UTF-16LE. To read the lines with the same Jackson settings, I used the charset auto-detection mechanism of the Jackson library. [To do so I didn't specify any charset](https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a1b0876/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2074-L2076) for the input stream, but after removing the hexadecimal representation of `lineSep` I must set a charset for the lineSep (`\r\n` or `\u000d\u000a`), otherwise it would not be possible to convert it to the array of bytes needed by Hadoop's LineReader. That way, if I set `UTF-16`, I am able to read only the first line, but if I set `UTF-16LE`, the first line cannot be read because it contains a BOM (a `UTF-16LE` string must not contain any BOMs). So the problem is that the lineSep option doesn't define the actual delimiter required to split the input text into lines. It just defines a string which needs a charset to be converted to the real delimiter (an array of bytes). The hex format proposed in my [first PR](https://github.com/MaxGekk/spark-1/pull/1) solves the problem.
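The slicing effect described here is easy to demonstrate outside Spark. A small Python sketch (hypothetical file contents) of what splitting a BOM'd UTF-16 file by a byte delimiter produces:

```python
# A two-record UTF-16 file: the BOM appears once, at the start of the file.
data = "Chris,Baird\r\nDoug,Rood\r\n".encode("utf-16")  # LE BOM + LE bytes

# Splitting by the UTF-16LE-encoded delimiter leaves the BOM attached
# to the first slice only; the remaining slices are plain UTF-16LE.
delim = "\r\n".encode("utf-16-le")
lines = data.split(delim)[:-1]  # drop the empty trailing slice

print(lines[0].startswith(b"\xff\xfe"))  # True  - first line carries the BOM
print(lines[1].startswith(b"\xff\xfe"))  # False - the rest are bare UTF-16LE
print(lines[0].decode("utf-16"))         # BOM consumed: Chris,Baird
print(lines[1].decode("utf-16-le"))      # Doug,Rood
```

Decoding the first slice as strict `UTF-16LE` instead would keep the BOM as a stray `\ufeff` character at the start of the record, which is the failure mode described above.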
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20849

@MaxGekk are you talking about a malformed json file which has multiple encodings inside it?
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849

Please look at https://github.com/apache/spark/pull/20937
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849

Please give me a few days to check your comments. I happen to be super busy for personal reasons.
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849

Ironically, this file came from a customer: https://issues.apache.org/jira/browse/SPARK-23410. And that's why we reverted Jackson's charset auto-detection: https://github.com/apache/spark/commit/129fd45efb418c6afa95aa26e5b96f03a39dcdd0. After all the changes (without lineSep in hex) we are not able to read it properly.
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849

When I was trying to remove the flexible format for lineSep (recordDelimiter), I faced a problem: I cannot fix the test https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a1b0876/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2071-L2081. There is no combination of charset and lineSep that allows me to read the file. Here is the structure of the file:

```
BOM json_record1 delimiter
json_record2 delimiter
```

The delimiter in hex is **x0d 00 0a 00**, which is basically `\r\n` in UTF-16LE. If I set:

```
.option("charset", "UTF-16LE").option("lineSep", "\r\n")
```

the first record is ignored because it contains a BOM, which `UTF-16LE` must not contain. As a result I get only the second record. If I set `UTF-16`, I get the first record because it contains a BOM (as a UTF-16 string must), but the second is rejected.

How does it work in the case of `.option("recordDelimiter", "x0d 00 0a 00")` with no charset specified? The answer is the charset auto-detection of jackson-json. Hadoop's LineRecordReader just splits the json by the delimiter and we have:

```
Seq("BOM json_record1", "json_record2")
```

The first string's charset is detected from its BOM, and the BOM is removed from the result by Jackson. The second string's charset is detected from its chars as `UTF-16LE`. And we get the correct result. So if we don't support a lineSep format in which the sequence of bytes is expressed explicitly, we cannot read unicode jsons with a BOM in per-line mode.
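The per-chunk auto-detection described above can be sketched in Python. This is a hypothetical stand-in for Jackson's BOM sniffing, not its actual implementation (Jackson's real detection also inspects the zero-byte pattern of the first characters; here the no-BOM fallback is hard-coded to `UTF-16LE` to mirror the file in question):

```python
import json

def detect_and_decode(chunk: bytes) -> str:
    """Pick a charset per chunk, consuming the BOM when present."""
    if chunk.startswith(b"\xff\xfe") or chunk.startswith(b"\xfe\xff"):
        return chunk.decode("utf-16")   # the codec reads and strips the BOM
    # No BOM: assume UTF-16LE (assumption for this sketch).
    return chunk.decode("utf-16-le")

# File layout from the comment above: BOM + record1 + delim + record2 + delim.
delim = "\r\n".encode("utf-16-le")
data = '{"f1": "Chris"}\r\n{"f1": "Doug"}\r\n'.encode("utf-16")

# Split by raw delimiter bytes (as Hadoop's LineRecordReader does),
# then decode each chunk independently before JSON parsing.
records = [json.loads(detect_and_decode(c)) for c in data.split(delim)[:-1]]
print(records)  # [{'f1': 'Chris'}, {'f1': 'Doug'}]
```

The key point survives the simplification: the byte-level split must use an explicitly specified byte sequence, because no single charset can render `"\r\n"` into a delimiter that matches both the BOM'd first line and the bare lines after it.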
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849

> Shall we expose encoding and add an alias for charset?

It works for me too.

> Is this flexible option also a part of your public release?

No, it is not. Only `charset` was exposed.

As a summary, let's merge your PR https://github.com/apache/spark/pull/20877. I will prepare a PR on top of your changes: remove the flexible format of lineSep + force users to set a line separator if charset is specified + expose `encoding` with `charset` as an alias. The flexible format of lineSep for text, csv and json will come as a separate PR. I hope it works for you.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849

> The PR doesn't solve any practical use cases

It does. It allows many workarounds; for example, we can intentionally add a custom delimiter so that it can support multiple-line-ish JSONs:

```
{ "a": 1 }
|^|
{ "b": 2 }
```

Go and google CSV's case too.

> `encoding`? Only as an alias for `charset`.

Yes, `encoding`. This has higher priority over `charset`. See `CSVOptions`. Also, that's what we use in PySpark's CSV, isn't it? https://github.com/apache/spark/blob/a9350d7095b79c8374fb4a06fd3f1a1a67615f6f/python/pyspark/sql/readwriter.py#L333 Shall we expose `encoding` and add an alias for `charset`?

> I would definitely discuss how are you going to extend lineSep in your PR: #20877 in the future to support Json Streaming for example. If you don't have such vision, I would prefer to block your PR.

Why are you dragging an orthogonal thing into #20877? I don't think we would fail to make a decision on the flexible option; I guess we have much time until 2.4.0. Even if we fail to make a decision on the flexible option, we can expose another option that supports the flexibility and forces unsetting `lineSep`, can't we? Is this flexible option also a part of your public release?
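The workaround sketched above is independent of Spark's internals; the delimiter `|^|` is arbitrary, and a plain Python version (hypothetical data) shows why it enables records that span multiple physical lines:

```python
import json

# Two pretty-printed JSON records separated by a custom delimiter;
# a newline-based line reader could not split these correctly.
text = """{
  "a": 1
}
|^|
{
  "b": 2
}"""

# Split on the custom delimiter, then parse each chunk as one JSON document
# (json.loads tolerates the surrounding whitespace and newlines).
records = [json.loads(part) for part in text.split("|^|")]
print(records)  # [{'a': 1}, {'b': 2}]
```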
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849

@HyukjinKwon

> How about we go this way with separate PRs?

I agree with that, but only to unblock https://github.com/apache/spark/pull/20849, because it solves a real problem for our customers: reading a folder with many json files in UTF-16BE (without BOM) in multiline mode. In this case recordDelimiter (lineSep) is not required.

> #20877 to support line separator in json datasource

The PR doesn't solve any practical use cases because it doesn't address [Json Streaming](https://en.wikipedia.org/wiki/JSON_streaming) and https://github.com/apache/spark/pull/20877#issuecomment-375622342. Also, it is useless for reading jsons in a charset different from UTF-8 in per-line mode without the PR https://github.com/apache/spark/pull/20849. I don't know what practical problem it actually solves. In your tests you check these delimiters: https://github.com/apache/spark/pull/20877/files#diff-fde14032b0e6ef8086461edf79a27c5dR2112. Are those delimiters from real jsons?

> json datasource with `encoding` option (forcing lineSep)

`encoding`? Only as an alias for `charset`. We have already been using `charset` in our public release: https://docs.azuredatabricks.net/spark/latest/data-sources/read-json.html#charset-auto-detection. I will insist on the `charset` name for the option.

> flexible format PR with another review

OK. It could come as a separate PR. The flexible format just leaves room for future extensions - nothing more. I would definitely like to discuss how you are going to extend lineSep in your PR https://github.com/apache/spark/pull/20877 in the future to support Json Streaming, for example. If you don't have such a vision, I would prefer to block your PR.

/cc @gatorsmile @cloud-fan @hvanhovell @rxin
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849

I think the flexible format needs more feedback and review. How about we go this way?

1. https://github.com/apache/spark/pull/20877 to support a line separator in the json datasource
2. json datasource with an `encoding` option (forcing `lineSep`)
3. flexible format PR with another review
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20849

@HyukjinKwon I am working on a PR which includes the changes of this PR, recordDelimiter (flexible format) + forcing a user to set the recordDelimiter option if charset is specified, as @cloud-fan suggested. Does that work for you?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849

I am against this, mainly because of https://github.com/MaxGekk/spark-1/pull/1#discussion_r175444502, if there isn't a better way than rewriting it. Also, I think we should support the `charset` option for the text datasource too, since the current option is incomplete.
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20849

@MaxGekk @HyukjinKwon What is the status of this PR?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849

Does charset work with newlines?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20849

Shall we add non-ASCII characters to the test resource files?
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20849

Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20849

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88341/
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20849

**[Test build #88341 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88341/testReport)** for PR 20849 at commit [`961b482`](https://github.com/apache/spark/commit/961b48225b450f9053681405f594911896e3a7ff).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Rec(f1: String, f2: Int)`
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20849

**[Test build #88341 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88341/testReport)** for PR 20849 at commit [`961b482`](https://github.com/apache/spark/commit/961b48225b450f9053681405f594911896e3a7ff).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20849

Can one of the admins verify this patch?