[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364916#comment-16364916 ]

Bruce Robbins commented on SPARK-23410:
---------------------------------------

On Spark 2.2.1, I got the same result as you. But with those extraneous null 
rows, it still doesn't look right.

When I converted your file to UTF-8, Spark 2.2.1 gave me:
{noformat}
+---------+--------+
|firstName|lastName|
+---------+--------+
|    Chris|   Baird|
|     Doug|    Rood|
+---------+--------+
{noformat}
No extraneous null rows.
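
For completeness, here is roughly how I did the conversion and re-read (a minimal sketch; the paths are placeholders for wherever the attachment was saved):
{noformat}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Decode the attached UTF-16 file (the UTF-16 charset consumes the BOM),
// then write it back out as UTF-8 before handing it to Spark.
val text = new String(
  Files.readAllBytes(Paths.get("/tmp/utf16WithBOM.json")),
  StandardCharsets.UTF_16)
Files.write(Paths.get("/tmp/utf8.json"), text.getBytes(StandardCharsets.UTF_8))

spark.read.json("/tmp/utf8.json").show()
{noformat}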

On a previous version (Spark 2.1.2), I got
{noformat}
18/02/14 14:51:47 WARN JacksonParser: Found at least one malformed records 
(sample: ��{^@"^@f^@i^@r^@s^@t^@N^@a^@m^@e^@"^@:^@"^@C^@h^@r^@i^@s^@"^@,^@ 
^@"^@l^@a^@s^@t^@N^@a^@m^@e^@"^@:^@"^@B^@a^@i^@r^@d^@"^@}^@). The JSON reader will replace
all malformed records with placeholder null in current PERMISSIVE parser mode.
To find out which corrupted records have been replaced with null, please use the
default inferred schema instead of providing a custom schema.

Code example to print all malformed records (scala):
===================================================
// The corrupted record exists in column _corrupt_record.
val parsedJson = spark.read.json("/path/to/json/file/test.json")


+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     null|    null|
+---------+--------+
{noformat}
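
To see what 2.1.2 actually parsed there, the warning's own advice works: drop the custom schema and let Spark infer one, which surfaces the raw bytes in _corrupt_record (a sketch; the path is a placeholder):
{noformat}
// With no user-provided schema, records that fail to parse land in the
// inferred _corrupt_record column instead of becoming all-null rows.
val parsed = spark.read.json("/tmp/utf16WithBOM.json")
parsed.printSchema()            // only _corrupt_record when nothing parses
parsed.show(truncate = false)
{noformat}
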
On a 2.4 snapshot, I got:
{noformat}
+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     null|    null|
|     null|    null|
|     null|    null|
|     null|    null|
+---------+--------+
{noformat}
It worked *best* on Spark 2.2.1, but even there it still wasn't right.
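
For reference, the repro I ran on each version was the same (a sketch; the path is a placeholder, and the explicit schema matches the attachment's fields, which is what makes PERMISSIVE mode emit placeholder nulls rather than a _corrupt_record column):
{noformat}
import org.apache.spark.sql.types._

// User-supplied schema, as in the original report.
val schema = new StructType()
  .add("firstName", StringType)
  .add("lastName", StringType)

spark.read.schema(schema).json("/tmp/utf16WithBOM.json").show()
{noformat}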

 

> Unable to read jsons in charset different from UTF-8
> ----------------------------------------------------
>
>                 Key: SPARK-23410
>                 URL: https://issues.apache.org/jira/browse/SPARK-23410
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This 
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, 
> which could read JSON files in UTF-16, UTF-32 and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the ability to read JSON files in a specified charset and/or to detect 
> the charset automatically, as before.
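
The auto-detection the description refers to is Jackson's own: it sniffs UTF-8/16/32 from the leading bytes of the input, so no charset hint is needed. A minimal standalone sketch:
{noformat}
import com.fasterxml.jackson.databind.ObjectMapper
import java.nio.charset.StandardCharsets

// UTF-16 bytes (with BOM) parse without any explicit charset, because
// Jackson infers the encoding from the first bytes of the stream.
val utf16Bytes =
  """{"firstName":"Chris","lastName":"Baird"}""".getBytes(StandardCharsets.UTF_16)
val node = new ObjectMapper().readTree(utf16Bytes)
println(node.get("firstName").asText())  // prints: Chris
{noformat}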


