[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20849
  
In the last case, the behavior seems to depend on Jackson's detection (UTF-16 for the first 
line and UTF-16LE for the second) if we don't set `encoding`, but Jackson parses 
both lines as `UTF-16LE` if we do set `encoding`. Did I understand correctly?

To be clear, I am not against the flexible format yet. I just want to 
solve the problem bit by bit.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20849
  
Thanks for testing this out so thoroughly, but I believe we can still go with 
https://github.com/apache/spark/pull/20937 if we whitelist supported encodings 
for now?
If that's right and I understood correctly, let's move the discussion to 
https://github.com/apache/spark/pull/20937#discussion_r178426866.


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/20849
  
@HyukjinKwon I did an experiment on the 
https://github.com/MaxGekk/spark-1/pull/2 and modified [the 
test](https://github.com/MaxGekk/spark-1/blob/f94d846b39ade89da24ef3e85f9721fb34e48154/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2072-L2083):
 
If `UTF-16LE` is set explicitly:
```
val jsonDF = spark.read.schema(schema)
  .option("lineSep", "x0d 00 0a 00")
  .option("encoding", "UTF-16LE")
  .json(testFile(fileName))
```
only the second line is returned correctly:
```
+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     Doug|    Rood|
+---------+--------+
```
In the case of `UTF-16`, the first row is returned from the CSV file:
```
+---------+--------+
|firstName|lastName|
+---------+--------+
|    Chris|   Baird|
|     null|    null|
+---------+--------+
```

And you are right: in the case where the encoding is `UTF-16`, a BOM is added to the 
delimiter:

```
val jsonDF = spark.read.schema(schema)
  .option("lineSep", "\r\n")
  .option("encoding", "UTF-16")
  .json(testFile(fileName))
```
The `lineSeparator` parameter of `HadoopFileLinesReader` is `0xFE 0xFF 0x00 
0x0D 0x00 0x0A`: BOM + UTF-16BE (in the CSV file it is BOM + UTF-16LE). Even if we cut 
the BOM from lineSep, it will still be incorrect.

So, there are actually 2 (or 3) problems.

Just in case:

```
val jsonDF = spark.read.schema(schema)
  .option("lineSep", "\r\n")
  .option("encoding", "UTF-16LE")
  .json(testFile(fileName))
```
```
+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     Doug|    Rood|
+---------+--------+
```


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20849
  
Let's make the point clear. There are two things: _1. line-by-line 
parsing_ and _2. JSON parsing via Jackson_.

The test you pointed out still looks a bit weird because Jackson detects 
the encoding for each line, not for the whole file.



---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20849
  
From a quick look and a wild guess, the `UTF-16` case alone would be problematic 
because we end up building the delimiter with a BOM: `0xFF 0xFE 0x0D 0x00 
0x0A 0x00`.


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20849
  
@MaxGekk, so to make it clear: it parses line by line correctly, regardless 
of BOM, if we set `lineSep` + `encoding`, but it fails to parse each line as 
JSON via Jackson because we explicitly set `UTF-16LE` or `UTF-16` for JSON 
parsing?


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-30 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/20849
  
@cloud-fan It is a regular file in UTF-16 with BOM = `0xFF 0xFE`, which 
indicates little-endian byte order. When we slice the file into lines, the 
first line is still UTF-16 with a BOM, while the remaining lines become UTF-16LE. To read 
the lines using the same settings for Jackson, I used the charset auto-detection 
mechanism of the Jackson library. [To do so I didn't specify any 
charset](https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a1b0876/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2074-L2076)
 for the input stream, but after removing the hexadecimal representation of `lineSep` 
I must set a charset for the lineSep (`\r\n` or `\u000d\u000a`), otherwise it 
would not be possible to convert it to the byte array needed by Hadoop's 
LineReader. 

This way, if I set `UTF-16`, I am able to read only the first line, but 
if I set `UTF-16LE`, the first line cannot be read because it contains a BOM (a 
`UTF-16LE` string must not contain any BOMs).

So, the problem is that the lineSep option doesn't define the actual delimiter 
required to split the input text into lines. It just defines a string which requires 
a charset to be converted to the real delimiter (an array of bytes). The hex format 
proposed in my [first PR](https://github.com/MaxGekk/spark-1/pull/1) solves the 
problem.
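
For illustration, here is a minimal, self-contained sketch (plain Scala, not the Spark code; the object name is made up) showing how the same separator string maps to different byte sequences under different charsets, and how Java's `UTF-16` charset prepends a BOM when encoding:

```scala
// Sketch: the same lineSep string produces different delimiter bytes
// depending on the charset used for the conversion. Note that Java's
// "UTF-16" charset prepends a big-endian BOM (0xFE 0xFF) when encoding.
object LineSepBytes {
  def toHex(bytes: Array[Byte]): String =
    bytes.map(b => f"0x${b & 0xff}%02X").mkString(" ")

  def main(args: Array[String]): Unit = {
    val lineSep = "\r\n"
    println(toHex(lineSep.getBytes("UTF-8")))    // 0x0D 0x0A
    println(toHex(lineSep.getBytes("UTF-16LE"))) // 0x0D 0x00 0x0A 0x00
    println(toHex(lineSep.getBytes("UTF-16")))   // 0xFE 0xFF 0x00 0x0D 0x00 0x0A
  }
}
```

The last line reproduces the BOM-prefixed delimiter bytes discussed above.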


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-29 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20849
  
@MaxGekk are you talking about a malformed JSON file which has multiple 
encodings inside it?


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-29 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/20849
  
Please, look at https://github.com/apache/spark/pull/20937


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20849
  
Please give me a few days to check your comments. I happen to be super busy 
for personal reasons.


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-28 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/20849
  
Ironically, this file came from a customer: 
https://issues.apache.org/jira/browse/SPARK-23410 . And that's why we reverted 
Jackson's charset auto-detection: 
https://github.com/apache/spark/commit/129fd45efb418c6afa95aa26e5b96f03a39dcdd0 
. After all the changes (without lineSep in hex), we are not able to read it 
properly.


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-28 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/20849
  
When I was trying to remove the flexible format for lineSep 
(recordDelimiter), I ran into a problem. I cannot fix this test: 
https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a1b0876/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2071-L2081

There is no combination of charset and lineSep that allows me to read 
the file. Here is the structure of the file:
```
BOM json_record1 delimiter json_record2 delimiter
```
The delimiter in hex: **x0d 00 0a 00** . Basically it is `\r\n` in 
UTF-16LE. If I set:
```
.option("charset", "UTF-16LE").option("lineSep", "\r\n")
```
The first record is ignored because it contains a BOM, which a `UTF-16LE` string must 
not contain. As a result, I get only the second record. If I set 
`UTF-16`, I get the first record because it contains a BOM (as a UTF-16 
string must) but the second is rejected.

How does it work in the case of `.option("recordDelimiter", "x0d 00 0a 
00")` with no charset specified? The answer is the charset auto-detection of 
Jackson. Hadoop's LineRecordReader just splits the JSON by the delimiter 
and we have:
```
Seq("BOM json_record1", "json_record2")
```
The encoding of the first string is detected from the BOM, and the BOM is removed from the 
result by Jackson. The encoding of the second string is detected from its characters as 
`UTF-16LE`. And we get the correct result.

So, if we don't support a lineSep format in which sequences of bytes are 
expressed explicitly, we cannot read Unicode JSON files with a BOM in per-line mode.
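
To reproduce such a file, here is a minimal sketch (the file path and the record contents are made up for illustration) that writes the BOM explicitly and encodes each record and delimiter as UTF-16LE, matching the `BOM json_record1 delimiter json_record2 delimiter` structure above:

```scala
import java.io.FileOutputStream

// Sketch: build a file with the structure
//   BOM json_record1 delimiter json_record2 delimiter
// where the delimiter is "\r\n" encoded as UTF-16LE (0x0D 0x00 0x0A 0x00).
object WriteBomFile {
  val path = "/tmp/utf16le-with-bom.json" // hypothetical location

  def main(args: Array[String]): Unit = {
    val out = new FileOutputStream(path)
    try {
      out.write(Array(0xFF, 0xFE).map(_.toByte)) // UTF-16LE BOM
      val delim = "\r\n".getBytes("UTF-16LE")
      out.write("""{"firstName":"Chris","lastName":"Baird"}""".getBytes("UTF-16LE"))
      out.write(delim)
      out.write("""{"firstName":"Doug","lastName":"Rood"}""".getBytes("UTF-16LE"))
      out.write(delim)
    } finally out.close()
  }
}
```

Only the file's first record carries the BOM; every record after the first split is pure UTF-16LE, which is exactly the mixed situation described above.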



---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-28 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/20849
  
> Shall we expose encoding and add an alias for charset?

It works for me too. 

> Is this flexible option also a part of your public release?

No, it is not. Only `charset` was exposed.

As a summary, let's merge your PR 
https://github.com/apache/spark/pull/20877 . I will prepare a PR on top of your 
changes that removes the flexible format of lineSep, forces users to set a line separator 
if charset is specified, and adds `encoding` with `charset` as an alias. The flexible format 
of lineSep for text, csv and json will come as a separate PR. I hope it works 
for you. 



---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20849
  
> The PR doesn't solve any practical use cases 

It does. It allows many workarounds. For example, we can intentionally add 
a custom delimiter so that it supports multiple-line-ish JSONs:

```
{
  "a": 1
}
|^|
{
  "b": 2
}
```

Go and google CSV's case too.

> `encoding`? Only as an alias for `charset`.

Yes, `encoding`. This has higher priority over `charset`. See `CSVOptions`. 
Also, that's what we use in PySpark's CSV, isn't it?


https://github.com/apache/spark/blob/a9350d7095b79c8374fb4a06fd3f1a1a67615f6f/python/pyspark/sql/readwriter.py#L333

Shall we expose `encoding` and add an alias for `charset`?

> I would definitely discuss how are you going to extend lineSep in your 
PR: #20877 in the future to support Json Streaming for example. If you don't 
have such vision, I would prefer to block your PR.

Why are you dragging an orthogonal thing into #20877? I don't think we 
will fail to make a decision on the flexible option; I guess we have plenty of time 
until 2.4.0. 

Even if we do fail to make a decision on the flexible option, we can expose 
another option that supports the flexibility and forces `lineSep` to be unset, 
can't we?

Is this flexible option also a part of your public release?


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-28 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/20849
  
@HyukjinKwon 
> How about we go this way with separate PRs?

I agree with that, but only to unblock 
https://github.com/apache/spark/pull/20849, because it solves a real problem of 
our customers: reading a folder with many JSON files in UTF-16BE (without BOM) 
in multiline mode. In this case, recordDelimiter (lineSep) is not required.

 > #20877 to support line separator in json datasource

The PR doesn't solve any practical use cases because it doesn't address 
[JSON Streaming](https://en.wikipedia.org/wiki/JSON_streaming) and 
https://github.com/apache/spark/pull/20877#issuecomment-375622342 . Also, without 
the PR https://github.com/apache/spark/pull/20849, it is useless for reading JSON 
in a charset other than UTF-8 in per-line mode. I don't know what practical problem 
it actually solves. In your tests you check these delimiters: 
https://github.com/apache/spark/pull/20877/files#diff-fde14032b0e6ef8086461edf79a27c5dR2112
 . Are those delimiters from real JSON files?

> json datasource with `encoding` option (forcing lineSep)

`encoding`? Only as an alias for `charset`. We are already using 
`charset` in our public release: 
https://docs.azuredatabricks.net/spark/latest/data-sources/read-json.html#charset-auto-detection
 . I will insist on the `charset` name for the option.

> flexible format PR with another review

ok. It could come as a separate PR. The flexible format just leaves room 
for future extensions, nothing more. I would definitely like to discuss how you are 
going to extend lineSep in your PR https://github.com/apache/spark/pull/20877 
in the future, to support JSON Streaming for example. If you don't have such a 
vision, I would prefer to block your PR.

/cc @gatorsmile @cloud-fan @hvanhovell @rxin 



---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20849
  
I think the flexible format needs more feedback and review. How about we go 
this way?

1. https://github.com/apache/spark/pull/20877 to support line separator in 
json datasource
2. json datasource with `encoding` option (forcing `lineSep`)
3. flexible format PR with another review



---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-27 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/20849
  
@HyukjinKwon I am working on a PR which includes the changes of this PR, 
recordDelimiter (flexible format), and forcing a user to set the recordDelimiter 
option if charset is specified, as @cloud-fan suggested. Does it work for you?


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20849
  
I am against this, mainly because of 
https://github.com/MaxGekk/spark-1/pull/1#discussion_r175444502, unless there is a 
better way than rewriting it.
Also, I think we should support the `charset` option for the text datasource too, 
since the current option is incomplete.


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-25 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20849
  
@MaxGekk @HyukjinKwon What is the status of this PR?


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20849
  
Does charset work with newlines?


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20849
  
Shall we add non-ASCII-compatible characters to the test resource files?


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20849
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20849
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88341/
Test PASSed.


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20849
  
**[Test build #88341 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88341/testReport)**
 for PR 20849 at commit 
[`961b482`](https://github.com/apache/spark/commit/961b48225b450f9053681405f594911896e3a7ff).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class Rec(f1: String, f2: Int)`


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20849
  
**[Test build #88341 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88341/testReport)**
 for PR 20849 at commit 
[`961b482`](https://github.com/apache/spark/commit/961b48225b450f9053681405f594911896e3a7ff).


---




[GitHub] spark issue #20849: [SPARK-23723] New charset option for json datasource

2018-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20849
  
Can one of the admins verify this patch?


---
