Hi,

When I try to write my dataset to parquet, or call show(1, false), my job throws an ArrayIndexOutOfBoundsException.
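For reference, here is a minimal sketch of what the job does; the table, column, and path names are placeholders for my actual ones:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.unbase64

object Unbase64ToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unbase64-to-parquet")
      .getOrCreate()

    // The source table has a longblob column stored as base64 text.
    val ds = spark.read.table("my_db.my_table")

    // Decode the base64 column back to binary. Writing without this
    // conversion succeeds; with it, the job fails as shown below.
    val decoded = ds.withColumn("blob_col", unbase64(ds("blob_col")))

    decoded.write.parquet("hdfs:///tmp/decoded_out")  // fails here
    // decoded.show(1, false)                         // also fails
  }
}

The failure looks like this: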
16/12/21 12:38:50 WARN TaskSetManager: Lost task 7.0 in stage 36.0 (TID 81, ip-10-95-36-69.dev): java.lang.ArrayIndexOutOfBoundsException: 63
    at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:156)
    at org.apache.spark.unsafe.types.UTF8String.indexOf(UTF8String.java:565)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/12/21 12:38:50 INFO TaskSetManager: Starting task 7.1 in stage 36.0 (TID 106, ip-10-95-36-70.dev, partition 7, RACK_LOCAL, 6020 bytes)
16/12/21 12:38:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 106 on executor id: 4 hostname: ip-10-95-36-70.dev.
16/12/21 12:38:50 WARN TaskSetManager: Lost task 4.0 in stage 36.0 (TID 78, ip-10-95-36-70.dev): java.lang.ArrayIndexOutOfBoundsException: 62
    [same stack trace as above]

My dataset has one column that is a longblob. If I convert it with unbase64, I face this problem; without the conversion, I am able to write to parquet fine. So is there some limit on the number of bytes per line? Please give me your suggestions.

--
Selvam Raman
"Shun bribery; stand tall"