Egor Pahomov created SPARK-13654:
------------------------------------

             Summary: get_json_object fails with java.io.CharConversionException
                 Key: SPARK-13654
                 URL: https://issues.apache.org/jira/browse/SPARK-13654
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1
            Reporter: Egor Pahomov
I execute the following query on my data:

{code}
select count(distinct get_json_object(regexp_extract(line, "^\\p{ASCII}*$", 0), '$.event'))
from (select line
      from logs.raw_client_log
      where year=2016 and month=2 and day>28
        and line rlike "^\\p{ASCII}*$"
        and line is not null) a
{code}

And it fails with:

{code}
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 420 in stage 168.0 failed 4 times, most recent failure: Lost task 420.3 in stage 168.0 (TID 13064, nod5-2-hadoop.anchorfree.net): java.io.CharConversionException: Invalid UTF-32 character 0x6576656e(above 10ffff) at char #47, byte #191)
	at com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
	at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855)
	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571)
	at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142)
	at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:141)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2202)
	at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2.apply(jsonExpressions.scala:141)
	at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2.apply(jsonExpressions.scala:138)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2202)
	at org.apache.spark.sql.catalyst.expressions.GetJsonObject.eval(jsonExpressions.scala:138)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.Expand$$anonfun$doExecute$1$$anonfun$3$$anon$1.next(Expand.scala:76)
	at org.apache.spark.sql.execution.Expand$$anonfun$doExecute$1$$anonfun$3$$anon$1.next(Expand.scala:62)
	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
{code}

Basically, Spark is telling me that I have the character 敮 in my data.
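A note on the error itself: 0x6576656e is just the four ASCII bytes 'e' 'v' 'e' 'n' read as one UTF-32 code point, and "char #47, byte #191" works out to exactly four bytes per character (47 × 4 + 3 = 191). That strongly suggests Jackson's encoding auto-detection decided the byte stream was UTF-32 rather than UTF-8. One plausible trigger (an assumption on my side, not confirmed from the data) is a log line beginning with NUL bytes: NUL is 0x00, which \p{ASCII} matches, so the rlike filter above would not exclude it, and three leading NULs make the first four-byte quad look like UTF-32BE to Jackson. A minimal sketch, assuming Jackson 2.x on the classpath, that reproduces the same exception under that assumption:

{code}
import com.fasterxml.jackson.core.JsonFactory

// Three leading NUL bytes make the first four-byte quad 0x0000007B ('{'),
// which Jackson's byte-source bootstrapper interprets as UTF-32BE.
// The rest of the stream is then decoded four bytes at a time, and the
// next quad 0x22657665 ('"', 'e', 'v', 'e') exceeds U+10FFFF.
val bytes = ("\u0000\u0000\u0000" + """{"event": "click"}""").getBytes("UTF-8")
val parser = new JsonFactory().createParser(bytes)
parser.nextToken() // throws java.io.CharConversionException: Invalid UTF-32 character ...
{code}

Note that constructing the parser succeeds; the exception only surfaces on the first nextToken(), which matches the stack trace above.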
But the query

{code}
select line from logs.raw_client_log where year=2016 and month=2 and day>27 and line rlike "敮"
{code}

returns nothing.
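That empty result is actually expected: 敮 is U+656E, the low 16 bits of the reported 0x6576656e, so it is an artifact of the misdecoding rather than a character that occurs in the data. If the leading-NUL theory above is right, a possible workaround (hypothetical and untested) is to strip NUL bytes from the line before get_json_object sees it, so Jackson's auto-detection stays on UTF-8:

{code}
// Hypothetical workaround sketch, assuming a Spark 1.6 sqlContext;
// table and column names are taken from the report. regexp_replace
// removes NUL bytes (Java regex \x00) before the JSON is parsed.
sqlContext.sql("""
  select count(distinct get_json_object(regexp_replace(line, '\\x00', ''), '$.event'))
  from logs.raw_client_log
  where year=2016 and month=2 and day>28 and line is not null
""")
{code}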