Egor Pahomov created SPARK-13654:
------------------------------------

             Summary: get_json_object fails with java.io.CharConversionException
                 Key: SPARK-13654
                 URL: https://issues.apache.org/jira/browse/SPARK-13654
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1
            Reporter: Egor Pahomov


I execute the following query on my data:
{code}
select count(distinct get_json_object(regexp_extract(line, "^\\p{ASCII}*$", 0), 
'$.event')) from 
(select line from logs.raw_client_log where year=2016 and month=2 and day>28 
and line rlike "^\\p{ASCII}*$" and line is not null) a 
{code}

It fails with:
{code}
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
420 in stage 168.0 failed 4 times, most recent failure: Lost task 420.3 in 
stage 168.0 (TID 13064, nod5-2-hadoop.anchorfree.net): 
java.io.CharConversionException: Invalid UTF-32 character 0x6576656e(above 
10ffff)  at char #47, byte #191)
        at 
com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
        at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
        at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
        at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855)
        at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571)
        at 
org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142)
        at 
org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:141)
        at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2202)
        at 
org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2.apply(jsonExpressions.scala:141)
        at 
org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2.apply(jsonExpressions.scala:138)
        at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2202)
        at 
org.apache.spark.sql.catalyst.expressions.GetJsonObject.eval(jsonExpressions.scala:138)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
        at 
org.apache.spark.sql.execution.Expand$$anonfun$doExecute$1$$anonfun$3$$anon$1.next(Expand.scala:76)
        at 
org.apache.spark.sql.execution.Expand$$anonfun$doExecute$1$$anonfun$3$$anon$1.next(Expand.scala:62)
        at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
        at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
        at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
        at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
        at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
{code}

Basically, Spark is selling me the idea that the character 敮 appears in my 
data. But the query
{code}
select line from logs.raw_client_log where year=2016 and month=2 and day>27 and 
line rlike "敮"
{code} 
returns nothing.
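For what it's worth, the reported code point itself hints at an encoding misdetection rather than bad data: 0x6576656e is exactly the ASCII bytes "even" (as in "event") read as a single big-endian UTF-32 unit, and its low two bytes "en" decode in UTF-16 as the CJK character 敮 from the query above. A minimal sketch of the arithmetic (plain Python, not Spark; that the parser saw the bytes b"even" is my assumption):

```python
# Plain-Python illustration (assumption: Jackson auto-detected the ASCII
# input as UTF-32). The ASCII bytes "even" (e.g. from the word "event"),
# read as one big-endian UTF-32 code unit, yield the value in the error:
data = b"even"
cp = int.from_bytes(data, "big")
assert cp == 0x6576656E   # the code point from the CharConversionException
assert cp > 0x10FFFF      # beyond Unicode's range -> "Invalid UTF-32 character"

# The last two bytes "en", read as a single UTF-16 code unit, give the
# CJK character the rlike query searched for:
assert chr(cp & 0xFFFF) == "敮"   # U+656E
```

If that reading is right, the character 敮 is an artifact of decoding, not something present in the data, which would explain why the rlike query finds nothing.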



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
