[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15072779#comment-15072779
 ] 

Cazen Lee commented on SPARK-12537:
-----------------------------------

Hmm, I hope this example is a good description.

Assume we are gathering a large amount of search-word logs from customers' devices, 
and the logs accumulate on each device until a Wi-Fi connection is available.

The logs have a nested JSON structure like below:

{code}
{
        "deviceId": "Hcd8sdId8sdfC",
        "searchingWord": [{
                "timestamp": 134053453,
                "search": "Cazen Lee"
        }, {
                "timestamp": 134053455,
                "search": "John D\oe"
        }, {
                "timestamp": 134053457,
                "search": "wordwordword"
        }]
}
{code}

In this situation, distinct(deviceId) returns 0 instead of 1, because the user 
mistyped "John Doe" as "John D\oe" and the invalid escape makes the whole record 
unparseable. And if the device only connects to Wi-Fi two weeks later (say, after 
a vacation), a single row can grow to 100MB, and the entire log is lost to a 
parse error.

HIVE-11825 describes a similar situation: users can put arbitrary keywords into the log.

I agree it is fine for the Jackson parser to reject this example by default; "\o" 
is not a legal escape per the JSON specification, and this is not a common situation.
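The rejection is easy to reproduce with any strict parser, not just Jackson. As an illustration only (Python's standard json module, not what Spark uses), an escape outside the spec's list fails the whole record:

```python
import json

# A record with no invalid escape parses fine.
ok = json.loads('{"name": "John Doe", "price": "$20"}')
assert ok["price"] == "$20"

# "\$" is not among the escapes the JSON spec allows
# (\" \\ \/ \b \f \n \r \t \uXXXX), so a strict parser
# rejects the entire record, just as Jackson does by default.
try:
    json.loads('{"name": "John Doe", "price": "\\$20"}')
    raised = False
except json.JSONDecodeError:
    raised = True
assert raised
```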

But losing the whole record seems a little harsh. It would be helpful if users 
could set an option to handle this case.

> Add option to accept quoting of all character backslash quoting mechanism
> -------------------------------------------------------------------------
>
>                 Key: SPARK-12537
>                 URL: https://issues.apache.org/jira/browse/SPARK-12537
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Cazen Lee
>            Assignee: Apache Spark
>
> We can provide an option so that the JSON parser can be configured to accept 
> backslash-quoting of any character or not.
> For example, if a JSON file includes an escape not listed in the JSON 
> backslash-quoting specification, the record is returned as a corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record (the other columns return null)
> {code}
> scala> df.show
> +--------------------+---------+-----+
> |     _corrupt_record|     name|price|
> +--------------------+---------+-----+
> |                null|Cazen Lee|  $10|
> |{"name": "John Do...|     null| null|
> |                null|    Tracy|  $10|
> +--------------------+---------+-----+
> {code}
> After applying this patch, we can enable the allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +---------+-----+
> |     name|price|
> +---------+-----+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |    Tracy|  $10|
> +---------+-----+
> {code}
> This issue is similar to HIVE-11825 and HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
