[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

Reynold Xin (JIRA) Wed, 30 Dec 2015 10:50:20 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075307#comment-15075307
 ]


Reynold Xin commented on SPARK-12537:
-------------------------------------

Note that there is really no "standard" to deal with non-standard JSON files. 
Each library does its own thing along the permissiveness spectrum. Python and 
Jackson make different decisions for different settings.

In general I think for big data, we'd want to err on the permissive side, 
because it is fairly annoying to find an error later on and have to rerun the 
pipelines with different settings.




> Add option to accept quoting of all character backslash quoting mechanism
> -------------------------------------------------------------------------
>
>                 Key: SPARK-12537
>                 URL: https://issues.apache.org/jira/browse/SPARK-12537
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Cazen Lee
>            Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> +--------------------+---------+-----+
> |     _corrupt_record|     name|price|
> +--------------------+---------+-----+
> |                null|Cazen Lee|  $10|
> |{"name": "John Do...|     null| null|
> |                null|    Tracy|  $10|
> +--------------------+---------+-----+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +---------+-----+
> |     name|price|
> +---------+-----+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |    Tracy|  $10|
> +---------+-----+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

Reply via email to