Hi,

When I try to write my dataset to parquet, or call show(1, false), my job throws an ArrayIndexOutOfBoundsException.
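For reference, here is a minimal sketch of what the job does; the table, column, and path names are placeholders for my actual ones:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.unbase64

object Unbase64ToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unbase64-to-parquet")
      .getOrCreate()

    // The source table has a longblob column stored as base64 text.
    val ds = spark.read.table("my_db.my_table")

    // Decode the base64 column back to binary. Writing without this
    // conversion succeeds; with it, the job fails as shown below.
    val decoded = ds.withColumn("blob_col", unbase64(ds("blob_col")))

    decoded.write.parquet("hdfs:///tmp/decoded_out")  // fails here
    // decoded.show(1, false)                         // also fails
  }
}

The failure looks like this: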
16/12/21 12:38:50 WARN TaskSetManager: Lost task 7.0 in stage 36.0 (TID 81, ip-10-95-36-69.dev): java.lang.ArrayIndexOutOfBoundsException: 63
    at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:156)
    at org.apache.spark.unsafe.types.UTF8String.indexOf(UTF8String.java:565)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/12/21 12:38:50 INFO TaskSetManager: Starting task 7.1 in stage 36.0 (TID 106, ip-10-95-36-70.dev, partition 7, RACK_LOCAL, 6020 bytes)
16/12/21 12:38:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 106 on executor id: 4 hostname: ip-10-95-36-70.dev.
16/12/21 12:38:50 WARN TaskSetManager: Lost task 4.0 in stage 36.0 (TID 78, ip-10-95-36-70.dev): java.lang.ArrayIndexOutOfBoundsException: 62
    [same stack trace as above]

My dataset has one column that is a longblob. If I convert it with unbase64, I face this problem; without the conversion, I am able to write to parquet fine. So is there some limit on the number of bytes per line? Please give me your suggestions.

--
Selvam Raman
"Shun bribery; stand tall"