[jira] [Commented] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression
[ https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535854#comment-16535854 ] Jean-Francis Roy commented on SPARK-24018:
-------------------------------------------
I don't think this is only related to the Spark shell: in my case I don't use any user classpath, and snappy-1.1.2.6 is nowhere to be found in Spark's or Hadoop's classpath. I first got this issue using spark-submit. The parquet library version provided by Spark is incompatible with the snappy-1.0.4.1 found in Hadoop's classpath.

[~pclay], did you start spark-shell with any arguments? Maybe snappy is shaded in one of the JARs you have in your classpath?
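As an aside, a quick way to see which snappy-java actually wins on the driver classpath is to ask the JVM directly from a spark-shell session. This is a diagnostic sketch, not part of the ticket; it relies on the standard JDK CodeSource API and on the getNativeLibraryVersion helper shipped with snappy-java:

{code:java}
// Which jar did the org.xerial.snappy.Snappy class come from?
println(classOf[org.xerial.snappy.Snappy].getProtectionDomain.getCodeSource.getLocation)

// Which snappy-java version was picked up? On the broken setup described in this ticket,
// this either reports Hadoop's old 1.0.4.1 or fails while loading the native bindings.
println(org.xerial.snappy.Snappy.getNativeLibraryVersion)
{code}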
[jira] [Commented] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression
[ https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537815#comment-16537815 ] Jean-Francis Roy commented on SPARK-24018:
-------------------------------------------
Oh, indeed you are right! I was mistaken when I thought that I had the error using spark-submit; I just verified again and it works. Thanks for the explanation!
[jira] [Created] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression
Jean-Francis Roy created SPARK-24018:
-------------------------------------

Summary: Spark-without-hadoop package fails to create or read parquet files with snappy compression
Key: SPARK-24018
URL: https://issues.apache.org/jira/browse/SPARK-24018
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 2.3.0
Reporter: Jean-Francis Roy

On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, Spark fails to read or write dataframes in parquet format with snappy compression.

This is due to an incompatibility between the snappy-java version that is required by parquet (parquet is provided in Spark jars but snappy isn't) and the version that is available from hadoop-2.8.3.

Steps to reproduce:
* Download and extract hadoop-2.8.3
* Download and extract spark-2.3.0-without-hadoop
* export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
* Following the instructions from https://spark.apache.org/docs/latest/hadoop-provided.html, set SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
* Start a spark-shell and enter the following:

{code:java}
import spark.implicits._

val df = List(1, 2, 3, 4).toDF
df.write
  .format("parquet")
  .option("compression", "snappy")
  .mode("overwrite")
  .save("test.parquet")
{code}

This fails with the following:

{noformat}
java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
 at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
 at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
 at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
 at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
 at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
 at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
 at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
 at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
 at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
 at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
 at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
 at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
 at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
 at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
 at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:109)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
{noformat}

Downloading snappy-java-1.1.2.6.jar and placing it in Spark's jars folder solves the issue.
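For completeness, here is a sketch of verifying the workaround mentioned above (a compatible snappy-java jar dropped into Spark's jars folder): re-run the reporter's snippet and read the file back. It assumes a fresh spark-shell session and reuses the same test.parquet path as the reproduction:

{code:java}
import spark.implicits._

val df = List(1, 2, 3, 4).toDF
df.write
  .format("parquet")
  .option("compression", "snappy")
  .mode("overwrite")
  .save("test.parquet")

// With a compatible snappy-java on the classpath the write succeeds and the
// round trip returns the original four rows.
spark.read.parquet("test.parquet").show()
{code}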
[jira] [Updated] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression
[ https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Francis Roy updated SPARK-24018:
--------------------------------------
Description: updated (the full description appears in the creation message above).
[jira] [Resolved] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression
[ https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Francis Roy resolved SPARK-24018.
---------------------------------------
Resolution: Fixed
Fix Version/s: 2.3.2

I confirm the fix that appeared in Spark 2.3.2.
[jira] [Commented] (SPARK-25243) Use FailureSafeParser in from_json
[ https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284951#comment-17284951 ] Jean-Francis Roy commented on SPARK-25243:
-------------------------------------------
The documentation still states that `from_json` will return `null` if the JSON is malformed, which is no longer the case by default.

> Use FailureSafeParser in from_json
> ----------------------------------
>
>                 Key: SPARK-25243
>                 URL: https://issues.apache.org/jira/browse/SPARK-25243
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Minor
>             Fix For: 3.0.0
>
> The [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28] is used in parsing JSON, CSV files and datasets of strings. It supports the [PERMISSIVE, DROPMALFORMED and FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44] modes. The ticket aims to make the from_json function compatible with regular parsing via FailureSafeParser and to support the above modes.
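To make the modes discussed above concrete, here is a small sketch (not from the ticket) of how they surface through from_json once this change is in. The dataset and column names are made up for illustration, and the PERMISSIVE result described is the behaviour reported in the comments on SPARK-34441:

{code:java}
import spark.implicits._
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val ds = List("""{"a": "ok"}""", "{oops").toDF("value")
val schema = StructType(Array(StructField("a", StringType)))

// Default (PERMISSIVE) mode: the malformed row parses to a struct with null fields.
ds.select(from_json($"value", schema).as("converted")).show()

// FAILFAST mode: the same malformed row makes the query throw instead.
ds.select(from_json($"value", schema, Map("mode" -> "FAILFAST")).as("converted")).show()
{code}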
[jira] [Created] (SPARK-34441) from_json documentation is wrong about malformed JSONs output
Jean-Francis Roy created SPARK-34441:
-------------------------------------

Summary: from_json documentation is wrong about malformed JSONs output
Key: SPARK-34441
URL: https://issues.apache.org/jira/browse/SPARK-34441
Project: Spark
Issue Type: Documentation
Components: Documentation
Affects Versions: 3.0.1, 3.0.0
Reporter: Jean-Francis Roy

The documentation of the `from_json` function states that malformed JSON will return a `null` value, which is no longer the case after https://issues.apache.org/jira/browse/SPARK-25243.
[jira] [Comment Edited] (SPARK-25243) Use FailureSafeParser in from_json
[ https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284954#comment-17284954 ] Jean-Francis Roy edited comment on SPARK-25243 at 2/15/21, 9:25 PM:
--------------------------------------------------------------------
{code:java}
from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.{code}
It seems like the previous behavior cannot be reproduced anymore?

was (Author: jeanfrancisroy):
`from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.`
It seems like the previous behavior cannot be reproduced anymore?
[jira] [Commented] (SPARK-25243) Use FailureSafeParser in from_json
[ https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284954#comment-17284954 ] Jean-Francis Roy commented on SPARK-25243:
-------------------------------------------
```
from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.
```
It seems like the previous behavior cannot be reproduced anymore?
[jira] [Comment Edited] (SPARK-25243) Use FailureSafeParser in from_json
[ https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284954#comment-17284954 ] Jean-Francis Roy edited comment on SPARK-25243 at 2/15/21, 9:25 PM:
--------------------------------------------------------------------
`from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.`
It seems like the previous behavior cannot be reproduced anymore?

was (Author: jeanfrancisroy):
```
from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.
```
It seems like the previous behavior cannot be reproduced anymore?
[jira] [Commented] (SPARK-34441) from_json documentation is wrong about malformed JSONs output
[ https://issues.apache.org/jira/browse/SPARK-34441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284955#comment-17284955 ] Jean-Francis Roy commented on SPARK-34441:
-------------------------------------------
It even seems we cannot reproduce the previous behavior anymore:
{code:java}
from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.{code}
[jira] [Commented] (SPARK-34441) from_json documentation is wrong about malformed JSONs output
[ https://issues.apache.org/jira/browse/SPARK-34441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285618#comment-17285618 ] Jean-Francis Roy commented on SPARK-34441:
-------------------------------------------
[~hyukjin.kwon] of course, here is an example:
{code:java}
scala> case class Foo(a: String)
scala> val ds = List("", "{", "{}", """{"a"}""", """{"a": "bar"}""", """{"a": 42}""").toDS
scala> import org.apache.spark.sql.types._
scala> ds.withColumn("converted", from_json($"value", StructType(Array(StructField("a", StringType))))).show()
+------------+---------+
|       value|converted|
+------------+---------+
|            |     null|
|           {|       []|
|          {}|       []|
|       {"a"}|       []|
|{"a": "bar"}|    [bar]|
|   {"a": 42}|     [42]|
+------------+---------+
{code}
We see above that faulty JSON will often result in a structure with `null` fields instead of a `null` directly, which is a big change of behavior between Spark 2 and Spark 3. The documentation still describes Spark 2's behavior.

Moreover, I cannot reproduce Spark 2's behavior, and I do want faulty input to be converted to null. I can make the code throw using the `FAILFAST` mode:
{code:java}
scala> ds.withColumn("converted", from_json($"value", StructType(Array(StructField("a", StringType))), Map("mode" -> "FAILFAST"))).show()
{code}
But I cannot use the `DROPMALFORMED` mode as it is not supported:
{code:java}
scala> ds.withColumn("converted", from_json($"value", StructType(Array(StructField("a", StringType))), Map("mode" -> "DROPMALFORMED"))).show()
java.lang.IllegalArgumentException: from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.
{code}
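Since DROPMALFORMED is rejected, one possible approximation of the old null-on-malformed behaviour (a hypothetical workaround sketch, not something proposed in this ticket) is to null out the parsed struct whenever every field came back null. The obvious caveat is that a genuinely valid input such as {"a": null} then becomes indistinguishable from malformed input:

{code:java}
import spark.implicits._
import org.apache.spark.sql.functions.{col, from_json, lit, when}
import org.apache.spark.sql.types._

val ds = List("", "{", "{}", """{"a"}""", """{"a": "bar"}""", """{"a": 42}""").toDF("value")
val schema = StructType(Array(StructField("a", StringType)))

val parsed = ds.withColumn("converted", from_json(col("value"), schema))
// Treat an all-null struct as malformed and replace it with a plain null.
val nullOnMalformed = parsed.withColumn(
  "converted",
  when(col("converted.a").isNull, lit(null).cast(schema)).otherwise(col("converted")))
nullOnMalformed.show()
{code}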
[jira] [Comment Edited] (SPARK-34441) from_json documentation is wrong about malformed JSONs output
[ https://issues.apache.org/jira/browse/SPARK-34441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285618#comment-17285618 ] Jean-Francis Roy edited comment on SPARK-34441 at 2/17/21, 3:17 AM (formatting-only edit of the comment above).
[jira] [Created] (SPARK-45094) isEmpty on union of RDDs sharing the same trait crash when the first RDD is empty
Jean-Francis Roy created SPARK-45094:
-------------------------------------

Summary: isEmpty on union of RDDs sharing the same trait crash when the first RDD is empty
Key: SPARK-45094
URL: https://issues.apache.org/jira/browse/SPARK-45094
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.4.0, 3.2.1
Reporter: Jean-Francis Roy

Given two RDDs of different types that share a common trait, one can obtain a union of the two RDDs by doing the following:

{code:java}
> import org.apache.spark.rdd.RDD
> trait Foo { val a: Int }
> case class Bar(a: Int) extends Foo
> case class Baz(a: Int, b: Int) extends Foo
> val bars = spark.sparkContext.parallelize(List(Bar(1), Bar(2))).asInstanceOf[RDD[Foo]]
> val bazs = spark.sparkContext.parallelize(List(Baz(1, 42), Baz(2, 42))).asInstanceOf[RDD[Foo]]
> val union = bars.union(bazs)
{code}

When doing so, `count()` and `isEmpty()` behave as expected:

{code:java}
> union.count()
4
> union.isEmpty()
false
{code}

However, if the first RDD is empty, `count()` will still behave as expected, but `isEmpty()` will throw a `java.lang.ArrayStoreException`:

{code:java}
> val bars = spark.sparkContext.parallelize(List.empty[Bar]).asInstanceOf[RDD[Foo]]
> val union = bars.union(bazs)
> union.count()
2
> union.isEmpty()
BOOM
{code}

Full stack trace:

{code:java}
ERROR Executor: Exception in task 4.0 in stage 8.0 (TID 134)
java.lang.ArrayStoreException: $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Baz
 at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:74)
 at scala.Array$.slowcopy(Array.scala:157)
 at scala.Array$.copy(Array.scala:183)
 at scala.collection.mutable.ResizableArray.copyToArray(ResizableArray.scala:80)
 at scala.collection.mutable.ResizableArray.copyToArray$(ResizableArray.scala:78)
 at scala.collection.mutable.ArrayBuffer.copyToArray(ArrayBuffer.scala:49)
 at scala.collection.TraversableOnce.copyToArray(TraversableOnce.scala:334)
 at scala.collection.TraversableOnce.copyToArray$(TraversableOnce.scala:333)
 at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:108)
 at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:342)
 at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
 at scala.collection.AbstractTraversable.toArray(Traversable.scala:108)
 at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
 at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
 at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1462)
 at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2303)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
 at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
 at org.apache.spark.scheduler.Task.run(Task.scala:139)
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:829)
{code}
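A workaround that avoids the exception, under the assumption that the problem comes from the element type the unchecked asInstanceOf cast leaves on the first (empty) RDD: build both RDDs as RDD[Foo] from the start instead of casting. This is a sketch against the reproduction above, in a spark-shell session:

{code:java}
import org.apache.spark.rdd.RDD

trait Foo { val a: Int }
case class Bar(a: Int) extends Foo
case class Baz(a: Int, b: Int) extends Foo

// Declare the element type up front instead of casting with asInstanceOf.
val bars: RDD[Foo] = spark.sparkContext.parallelize(List.empty[Foo])
val bazs: RDD[Foo] = spark.sparkContext.parallelize(List[Foo](Baz(1, 42), Baz(2, 42)))

val union = bars.union(bazs)
union.count()    // 2
union.isEmpty()  // false, no ArrayStoreException
{code}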
[jira] [Updated] (SPARK-45094) isEmpty on union of RDDs sharing the same trait crash when the first RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-45094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Francis Roy updated SPARK-45094:
--------------------------------------
Environment: Tested on Spark 3.2.1 and Spark 3.4.0, using Scala 2.12.17 and OpenJDK 11.0.20.1.