I caught exceptions in the Python UDF code, flushed the exceptions into a single file, and made sure the number of columns in the output lines is the same as in the SQL schema.
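Roughly, the error handling looks like this (a minimal sketch only, assuming a Hive-style TRANSFORM script that reads tab-delimited lines from stdin; the function name, log file path, and 10-column schema are placeholders, not the actual job code):

import sys
import traceback

NUM_COLUMNS = 10  # must match the number of columns in the SQL schema

def process(fields):
    # real transformation logic would go here
    return fields

with open("udf_errors.log", "a") as error_log:
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        try:
            out = process(fields)
            # pad or truncate so every output line has exactly NUM_COLUMNS columns
            out = (out + [""] * NUM_COLUMNS)[:NUM_COLUMNS]
            print("\t".join(out))
        except Exception:
            # flush every exception into a single file instead of crashing
            error_log.write(traceback.format_exc())
            error_log.flush()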
Something interesting: the output lines of my UDF have exactly 10 columns, yet the exception above is java.lang.ArrayIndexOutOfBoundsException: 9. Any ideas?

2015-03-23 16:24 GMT+08:00 Cheng Lian <lian.cs....@gmail.com>:
> Could you elaborate on the UDF code?
>
> On 3/23/15 3:43 PM, lonely Feb wrote:
>
>> Hi all, I tried to migrate some Hive jobs to Spark SQL. When I ran a
>> SQL job with a Python UDF I got the following exception:
>>
>> java.lang.ArrayIndexOutOfBoundsException: 9
>>     at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
>>     at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
>>     at org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:166)
>>     at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>>     at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>>     at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>     at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:156)
>>     at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
>>     at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
>>     at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:744)
>>
>> I suspected there was an odd line in the input file, but the input file
>> is so large that I could not find any abnormal lines even after running
>> several checking jobs. How can I get the abnormal line here?
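For completeness, this is roughly how I would scan the raw input for lines whose column count does not match the schema (a sketch only; the HDFS path and tab delimiter are assumptions):

from pyspark import SparkContext

EXPECTED_COLUMNS = 10

sc = SparkContext(appName="find-bad-lines")
bad_lines = (sc.textFile("hdfs:///path/to/input")  # hypothetical input path
               .filter(lambda l: len(l.split("\t")) != EXPECTED_COLUMNS))

for line in bad_lines.take(20):  # show a handful of offending lines
    print(repr(line))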