Hi, Spark users:
We are currently using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production
cluster, which has 42 data/task nodes.
One dataset, about 3 TB, is stored as Avro files. Our business runs a complex
query against this dataset, which uses a nested structure with an Array of
Struct in both Avro and Hive.
We can query it with Hive without any problem, but we like Spark SQL's
performance, so we ran the same query in Spark SQL and found that it is indeed
much faster than Hive.
But when we run it, we get the following error randomly from the Spark
executors, sometimes seriously enough to fail the whole Spark job.
The stack trace is below. I think this is a bug in Spark, because:
1) The error appears inconsistently; sometimes we don't see it at all for this
job (we run it daily).
2) Sometimes it does not fail the job, because the task recovers after a retry.
3) Sometimes it does fail the job, as shown below.
4) Could this be caused by multithreading in Spark? The NullPointerException
indicates that Hive got a null ObjectInspector for a child of a
StructObjectInspector (based on my reading of the Hive source code), but I know
there is no null ObjectInspector among the children of our
StructObjectInspector. Googling this error didn't turn up any hints. Has anyone
seen anything like this?
Project
[HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L,
StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL
tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L,
StringType),CAST(active_contact_count_a#16L,
StringType),CAST(other_api_contact_count_a#6L,
StringType),CAST(fb_api_contact_count_a#7L,
StringType),CAST(evm_contact_count_a#8L,
StringType),CAST(loyalty_contact_count_a#9L,
StringType),CAST(mobile_jmml_contact_count_a#10L,
StringType),CAST(savelocal_contact_count_a#11L,
StringType),CAST(siteowner_contact_count_a#12L,
StringType),CAST(socialcamp_service_contact_count_a#13L,
S...

org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in stage 1.0 (TID 257, 10.20.95.146): java.lang.NullPointerException
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.<init>(AvroObjectInspectorGenerator.java:55)
    at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:198)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
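To make point 4 above concrete: the trace shows AvroSerDe.initialize being called from within a task, so several task threads in one executor JVM may be initializing the SerDe concurrently. If any shared state on that path is mutated without synchronization, one thread can observe a half-published (null) entry. The sketch below is a minimal, self-contained illustration of that failure mode; it is NOT Hive's actual code, and the cache, keys, and method names are invented for the example:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class Main {

    // Unsafe variant (hypothetical): a plain HashMap used as a shared,
    // lazily-filled cache. Under concurrent get/put, a reader can observe
    // null (or a corrupted table) for a key another thread is in the middle
    // of inserting -- exactly the kind of intermittent NPE described above.
    static final Map<String, Object> unsafeCache = new HashMap<>();

    static Object unsafeLookup(String key) {
        Object v = unsafeCache.get(key);   // races with concurrent writers
        if (v == null) {
            v = new Object();              // expensive build step
            unsafeCache.put(key, v);       // unsynchronized publication
        }
        return v;
    }

    // Safe variant: ConcurrentHashMap guarantees a value is fully published
    // before any other thread can see its key, so lookups never return null.
    static final ConcurrentHashMap<String, Object> safeCache =
            new ConcurrentHashMap<>();

    static Object safeLookup(String key) {
        return safeCache.computeIfAbsent(key, k -> new Object());
    }

    public static void main(String[] args) throws InterruptedException {
        // Hammer the safe cache from several threads, the way multiple
        // Spark tasks would hammer SerDe initialization in one executor.
        final boolean[] sawNull = {false};
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 10000; j++) {
                    if (safeLookup("type:" + (j % 16)) == null) {
                        sawNull[0] = true;
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("safe cache returned null: " + sawNull[0]);
    }
}
```

If something like the unsafe variant is what is happening inside the SerDe setup, both symptoms fall out naturally: the NPE only fires when two tasks hit initialization at the same moment (hence "random"), and a retried task usually finds the cache already populated (hence the recovery on retry).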