I suspect there is a malformed row in your input dataset. Could you try something like this to confirm:

sql("SELECT * FROM <your-table>").foreach(println)

If a malformed line does exist, you should see a similar exception, and the rows printed just before it should help you track it down. Note that the messages are printed to stdout on the executor side, so look in the executor logs rather than the driver console.
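If the full dump is too noisy to read through, you could also print only the suspicious rows. Since your UDFs are in Python, here is a minimal PySpark sketch; the table name is a placeholder and the expected column count of 10 is an assumption based on your description:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="find-malformed-rows")
    sqlContext = SQLContext(sc)

    EXPECTED_COLUMNS = 10  # assumption: your schema's column count

    def report_if_malformed(row):
        # Rows backed by fewer values than the schema declares are the
        # ones that trigger ArrayIndexOutOfBoundsException downstream.
        if len(row) != EXPECTED_COLUMNS:
            # This goes to stdout on the executor, not the driver.
            print("Malformed row (%d columns): %s" % (len(row), str(row)))

    sqlContext.sql("SELECT * FROM your_table").foreach(report_if_malformed)

Comparing each row's length against the schema this way avoids having to scan the full output by eye.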

On 3/23/15 4:36 PM, lonely Feb wrote:

I caught exceptions in the Python UDF code, flushed them all into a single file, and made sure the number of columns in each output line matches the SQL schema.
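Roughly, the error handling in the UDF is shaped like this (a simplified sketch, not the actual job code; the split logic, log path, and column count are stand-ins):

    import traceback

    EXPECTED_COLUMNS = 10  # stand-in: matches the SQL schema

    def transform(line):
        # Stand-in for the real per-line transformation.
        return line.split("\t")

    def my_udf(line):
        try:
            fields = transform(line)
            assert len(fields) == EXPECTED_COLUMNS
            return fields
        except Exception:
            # Flush every exception into a single file for inspection.
            with open("/tmp/udf_errors.log", "a") as f:
                f.write(traceback.format_exc())
            return None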

Something interesting: each output line of my UDF has exactly 10 columns, yet the exception above is java.lang.ArrayIndexOutOfBoundsException: 9 (index 9 would be the tenth, i.e. the last, column of a 10-column row). Does that suggest anything?

2015-03-23 16:24 GMT+08:00 Cheng Lian <lian.cs....@gmail.com>:

    Could you elaborate on the UDF code?


    On 3/23/15 3:43 PM, lonely Feb wrote:

        Hi all, I tried to migrate some Hive jobs to Spark SQL.
        When I ran a SQL job with a Python UDF I got an exception:

        java.lang.ArrayIndexOutOfBoundsException: 9
                at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
                at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
                at org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:166)
                at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$anonfun$apply$1.apply(predicates.scala:30)
                at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$anonfun$apply$1.apply(predicates.scala:30)
                at scala.collection.Iterator$anon$14.hasNext(Iterator.scala:390)
                at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
                at org.apache.spark.sql.execution.Aggregate$anonfun$execute$1$anonfun$7.apply(Aggregate.scala:156)
                at org.apache.spark.sql.execution.Aggregate$anonfun$execute$1$anonfun$7.apply(Aggregate.scala:151)
                at org.apache.spark.rdd.RDD$anonfun$13.apply(RDD.scala:601)
                at org.apache.spark.rdd.RDD$anonfun$13.apply(RDD.scala:601)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
                at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
                at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
                at org.apache.spark.scheduler.Task.run(Task.scala:56)
                at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                at java.lang.Thread.run(Thread.java:744)

        I suspected there was an odd line in the input file, but the
        input file is so large that I could not find any abnormal
        lines even after running several checking jobs. How can I
        locate the abnormal line here?
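        For reference, a sketch of the kind of checking job meant here
        (the delimiter, input path, and expected column count are all
        placeholders, not my actual setup):

            from pyspark import SparkContext

            sc = SparkContext(appName="scan-for-bad-lines")

            EXPECTED_COLUMNS = 10   # placeholder: the schema's column count
            DELIMITER = "\t"        # placeholder: assuming tab-delimited input

            # Keep only lines whose field count disagrees with the schema.
            bad = (sc.textFile("hdfs:///path/to/input")  # placeholder path
                     .filter(lambda l: len(l.split(DELIMITER)) != EXPECTED_COLUMNS))

            # Pull a small sample of offending lines back to the driver.
            for line in bad.take(20):
                print(line)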


