Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
The last case seems to depend on Jackson: it works (UTF-16 for the first
line and UTF-16LE for the second) if we don't set `encoding`, but Jackson parses
both lines as `UTF-16LE` if we set `enco
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
Thanks for testing this so thoroughly, but I believe we can still go with
https://github.com/apache/spark/pull/20937 if we whitelist supported encodings
for now?
If that's right and I understoo
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
@HyukjinKwon I did an experiment on the
https://github.com/MaxGekk/spark-1/pull/2 and modified [the
test](https://github.com/MaxGekk/spark-1/blob/f94d846b39ade89da24ef3e85f9721fb34e48154/sql/core/sr
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
Let's make the point clear. There are two separate things: _1. line-by-line
parsing_ and _2. JSON parsing via Jackson_.
The test you pointed out still looks a bit weird because Jackson is
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
From a quick look and a wild guess, the `UTF-16` case alone would be problematic
because we would build the delimiter with a BOM prefix: `0xFF 0xFE 0x0D 0x00
0x0A 0x00`.
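To make the byte sequence above concrete, here is a minimal sketch in Python (used purely for illustration; it is not code from the PR, and the sample JSON is made up). It compares the `\r\n` separator encoded as UTF-16LE with and without a BOM in front, and shows why a BOM-prefixed delimiter would presumably never match inside the data.

```python
import codecs

# "\r\n" encoded as little-endian UTF-16, without a byte order mark.
sep_no_bom = "\r\n".encode("utf-16-le")

# The same separator with the little-endian BOM (0xFF 0xFE) prepended,
# which is the byte sequence quoted in the comment above.
sep_with_bom = codecs.BOM_UTF16_LE + sep_no_bom

# Python's generic "utf-16" codec also prepends a BOM when encoding
# (in the machine's native byte order).
sep_generic = "\r\n".encode("utf-16")


def hex_bytes(b):
    """Render bytes as space-separated 0xNN values."""
    return " ".join(f"0x{x:02X}" for x in b)


print(hex_bytes(sep_no_bom))    # 0x0D 0x00 0x0A 0x00
print(hex_bytes(sep_with_bom))  # 0xFF 0xFE 0x0D 0x00 0x0A 0x00

# Made-up sample: two JSON records separated by "\r\n", no BOM in the body.
body = '{"a":1}\r\n{"a":2}'.encode("utf-16-le")

# The BOM-prefixed delimiter does not occur inside the body, because a BOM
# normally appears only at the very start of a file.
found_with_bom = sep_with_bom in body
found_no_bom = sep_no_bom in body
```

Under these assumptions, only the BOM-free separator is found in the body, which matches the concern raised in the comment.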
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
@MaxGekk, so to make it clear: it splits the input line by line correctly,
regardless of the BOM, if we set `lineSep` + `encoding`, but it fails to parse each
line as JSON via Jackson since we explicitly se
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
@cloud-fan It is a regular file in UTF-16 with BOM=`0xFF 0xFE`, which
indicates little-endian byte order. When we slice the file by lines, the
first line is still in UTF-16 with BOM, the remaining lines b
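The slicing effect described above can be reproduced with a short Python sketch (an illustration only, with made-up sample data; it does not model Spark's line reader or Jackson). After splitting a BOM-prefixed UTF-16LE byte stream on the encoded newline, only the first slice still carries the BOM.

```python
import codecs

# A small UTF-16 file: little-endian BOM followed by two JSON lines.
# The sample content is made up for illustration.
data = codecs.BOM_UTF16_LE + '{"a":1}\n{"a":2}'.encode("utf-16-le")

# Split on the newline encoded as UTF-16LE. A plain byte search is enough
# for this ASCII-only sample; a robust splitter would have to respect
# 2-byte code unit boundaries.
sep = "\n".encode("utf-16-le")
first, second = data.split(sep)

first_has_bom = first.startswith(codecs.BOM_UTF16_LE)
second_has_bom = second.startswith(codecs.BOM_UTF16_LE)

# The first slice decodes as BOM-marked UTF-16; the second has no BOM and
# is decoded here with the byte order given explicitly.
line1 = first.decode("utf-16")
line2 = second.decode("utf-16-le")

# Decoding the first slice as plain UTF-16LE keeps the BOM as U+FEFF.
line1_as_le = first.decode("utf-16-le")
```

This is why a single encoding applied uniformly to every slice can behave differently on the first line than on the others.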
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/20849
@MaxGekk are you talking about a malformed JSON file which has multiple
encodings inside it?
---
-
To unsubscribe, e-mail: rev
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
Please look at https://github.com/apache/spark/pull/20937
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
Please give me a few days to check your comments. I happen to be super busy
for personal reasons.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
Ironically, this file came from a customer:
https://issues.apache.org/jira/browse/SPARK-23410. And that's why we reverted
Jackson's charset auto-detection:
https://github.com/apache/spark/commit/12
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
When I was trying to remove the flexible format for lineSep
(recordDelimiter), I faced a problem. I cannot fix the test:
https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
> Shall we expose encoding and add an alias for charset?
It works for me too.
> Is this flexible option also a part of your public release?
No, it is not. Only `charset` wa
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
> The PR doesn't solve any practical use cases
It does. It allows many workarounds. For example, we can intentionally add
a custom delimiter so that it can support multiple-line-ish JSO
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
@HyukjinKwon
> How about we go this way with separate PRs?
I agree with that only to unblock
https://github.com/apache/spark/pull/20849 because it solves a real problem of
our custom
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
I think the flexible format needs more feedback and review. How about we go
this way?
1. https://github.com/apache/spark/pull/20877 to support line separator in
json datasource
2. j
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
@HyukjinKwon I am working on a PR which includes the changes of this PR,
recordDelimiter (flexible format) + forcing a user to set the recordDelimiter
option if charset is specified as @cloud-fan suggest
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
I am against this, mainly because of
https://github.com/MaxGekk/spark-1/pull/1#discussion_r175444502, if there isn't
a better way than rewriting it.
Also, I think we should support the `charset` option f
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/20849
@MaxGekk @HyukjinKwon What is the status of this PR?
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
Does charset work with newlines?
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
Shall we add non-ASCII-compatible characters to the test resource files?
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20849
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20849
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88341/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20849
**[Test build #88341 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88341/testReport)**
for PR 20849 at commit
[`961b482`](https://github.com/apache/spark/commit/9
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20849
**[Test build #88341 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88341/testReport)**
for PR 20849 at commit
[`961b482`](https://github.com/apache/spark/commit/96
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20849
Can one of the admins verify this patch?