Hello Spark Community!
I am new to Spark. I tried to read a multiline JSON file (around 2M records, roughly 2 GB gzipped) and ran into an exception. It works if I first convert the same file to JSONL and read that instead. Unfortunately the file is private and I cannot share it. Is there any information I can provide that would help narrow this down?
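For reference, the conversion to JSONL was done outside Spark, roughly along these lines (a minimal sketch assuming the ijson streaming parser; file names are placeholders):

import gzip
import json
import ijson  # streaming JSON parser

# stream the top-level "objects" array and write one element per line (JSONL)
with gzip.open("file.json.gz", "rb") as fin, gzip.open("file.jsonl.gz", "wt") as fout:
    for obj in ijson.items(fin, "objects.item"):
        # ijson yields Decimal for numbers; coerce them so json.dumps can serialize
        fout.write(json.dumps(obj, default=float) + "\n")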
I tried writing to both Parquet and JSON; both fail with the same exception. This is a shortened form of the code:

from pyspark.sql.functions import explode

df = spark.read.option('multiline', 'true').json("file.json.gz") \
    .select(explode("objects")) \
    .select("col.*")  # "col" is the default column name produced by explode
df.write.mode('overwrite').json(p)
# this crashes too: df.write.mode('overwrite').parquet(p)
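Reading the converted JSONL file and writing it out works fine, roughly like this (sketch, assuming the conversion wrote one element of "objects" per line, so neither the multiline option nor explode is needed):

df = spark.read.json("file.jsonl.gz")  # one record per line
df.write.mode('overwrite').json(p)     # no exception here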
The JSON file has the form:
{
  "version": "xyz",
  "objects": [
    {
      "id": "abc",
      ....
    },
    {
      "id": "def",
      ....
    },
    ....
  ]
}
This is the exception:
> 2019-09-12 14:02:01,515 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 172.21.0.3, executor 0): org.apache.spark.SparkException: Task failed while writing rows.
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:121)
>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 16 because the size after growing exceeds size limitation 2147483632
>     at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:71)
>     at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.grow(UnsafeWriter.java:62)
>     at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.writeUnalignedBytes(UnsafeWriter.java:126)
>     at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_1_6$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>     at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1$$anonfun$apply$1.apply(FileFormat.scala:149)
>     at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1$$anonfun$apply$1.apply(FileFormat.scala:148)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>     at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:244)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
>     at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
>     ... 10 more
>
> 2019-09-12 14:02:01,518 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 1.0 (TID 2, 172.21.0.3, executor 0, partition 0, PROCESS_LOCAL, 8335 bytes)
> 2019-09-12 14:03:59,016 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 1.0 (TID 2, 172.21.0.3, executor 0): org.apache.spark.SparkException: Task failed while writing rows.

The retry (task 0.1) fails with the same "Cannot grow BufferHolder" exception and an identical stack trace.