[jira] [Commented] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression

2018-07-07 Thread Jean-Francis Roy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535854#comment-16535854
 ] 

Jean-Francis Roy commented on SPARK-24018:
--

I don't think this is only related to the Spark shell: in my case I don't use 
any user classpath, and snappy-1.1.2.6 is simply nowhere to be found on Spark's or 
Hadoop's classpath. I first got this issue using spark-submit. The parquet library 
version provided by Spark is incompatible with the snappy-1.0.4.1 found on Hadoop's 
classpath.

[~pclay] did you start spark-shell with any arguments? Maybe snappy is shaded 
in one of the JARs you have in your classpath?
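For anyone who wants to confirm which snappy-java actually ends up on the driver's 
classpath, a quick diagnostic (my own sketch, not part of the original report) is to 
ask the JVM where the class was loaded from:
{code:java}
// Paste into spark-shell: prints the jar that provides the org.xerial.snappy
// classes the driver actually loaded, which makes a shaded or Hadoop-provided
// copy visible. Diagnostic sketch only.
val snappySource = classOf[org.xerial.snappy.Snappy]
  .getProtectionDomain.getCodeSource.getLocation
println(snappySource)
{code}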

> Spark-without-hadoop package fails to create or read parquet files with 
> snappy compression
> --
>
> Key: SPARK-24018
> URL: https://issues.apache.org/jira/browse/SPARK-24018
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0
>Reporter: Jean-Francis Roy
>Priority: Minor
>
> On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, 
> Spark fails to read or write dataframes in parquet format with snappy 
> compression.
> This is due to an incompatibility between the snappy-java version that is 
> required by parquet (parquet is provided in Spark jars but snappy isn't) and 
> the version that is available from hadoop-2.8.3.
>  
> Steps to reproduce:
>  * Download and extract hadoop-2.8.3
>  * Download and extract spark-2.3.0-without-hadoop
>  * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
>  * Following instructions from 
> [https://spark.apache.org/docs/latest/hadoop-provided.html], set 
> SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
>  * Start a spark-shell, enter the following:
>  
> {code:java}
> import spark.implicits._
> val df = List(1, 2, 3, 4).toDF
> df.write
>   .format("parquet")
>   .option("compression", "snappy")
>   .mode("overwrite")
>   .save("test.parquet")
> {code}
>  
>  
> This fails with the following:
> {noformat}
> java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
> at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
> at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
> at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
> at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
> at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
> at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
> at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
> at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
> org.apache.spark.scheduler.Task.run(Task.scal

[jira] [Commented] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression

2018-07-09 Thread Jean-Francis Roy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537815#comment-16537815
 ] 

Jean-Francis Roy commented on SPARK-24018:
--

Oh, indeed you are right! I was mistaken in thinking that I had the error with 
spark-submit; I just verified again and it works. Thanks for the explanation!

> Spark-without-hadoop package fails to create or read parquet files with 
> snappy compression
> --
>
> Key: SPARK-24018
> URL: https://issues.apache.org/jira/browse/SPARK-24018
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0
>Reporter: Jean-Francis Roy
>Priority: Minor
>
> On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, 
> Spark fails to read or write dataframes in parquet format with snappy 
> compression.
> This is due to an incompatibility between the snappy-java version that is 
> required by parquet (parquet is provided in Spark jars but snappy isn't) and 
> the version that is available from hadoop-2.8.3.
>  
> Steps to reproduce:
>  * Download and extract hadoop-2.8.3
>  * Download and extract spark-2.3.0-without-hadoop
>  * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
>  * Following instructions from 
> [https://spark.apache.org/docs/latest/hadoop-provided.html], set 
> SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
>  * Start a spark-shell, enter the following:
>  
> {code:java}
> import spark.implicits._
> val df = List(1, 2, 3, 4).toDF
> df.write
>   .format("parquet")
>   .option("compression", "snappy")
>   .mode("overwrite")
>   .save("test.parquet")
> {code}
>  
>  
> This fails with the following:
> {noformat}
> java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
> at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
> at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
> at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
> at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
> at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
> at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
> at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
> at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
> org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thre

[jira] [Created] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression

2018-04-18 Thread Jean-Francis Roy (JIRA)
Jean-Francis Roy created SPARK-24018:


 Summary: Spark-without-hadoop package fails to create or read 
parquet files with snappy compression
 Key: SPARK-24018
 URL: https://issues.apache.org/jira/browse/SPARK-24018
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.3.0
Reporter: Jean-Francis Roy


On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, 
Spark fails to read or write dataframes in parquet format with snappy 
compression.

This is due to an incompatibility between the snappy-java version that is 
required by parquet (parquet is provided in Spark jars but snappy isn't) and 
the version that is available from hadoop-2.8.3.

 

Steps to reproduce:
 * Download and extract hadoop-2.8.3
 * Download and extract spark-2.3.0-without-hadoop
 * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
 * Following instructions from 
https://spark.apache.org/docs/latest/hadoop-provided.html, set 
SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
 * Start a spark-shell, enter the following:

 
{code:java}
import spark.implicits._
val df = List(1, 2, 3, 4).toDF
df.write
  .format("parquet")
  .option("compression", "snappy")
  .mode("overwrite")
  .save("test.parquet")
{code}
 

 

This fails with the following:
{noformat}
java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I at 
org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) at 
org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) at 
org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
 at 
org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
 at 
org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92) 
at 
org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
 at 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
 at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
 at 
org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238) at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
 at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
 at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
 at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
org.apache.spark.scheduler.Task.run(Task.scala:109) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)
{noformat}
Downloading snappy-java-1.1.2.6.jar and placing it in Spark's jars folder 
solves the issue.
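A quick way to verify that the replacement jar is the one being used (a sketch of my 
own, not part of the original report) is to exercise the snappy-java native bindings 
directly from spark-shell, the same bindings the stack trace above fails in:
{code:java}
// Round-trip a small payload through snappy-java. With the incompatible
// snappy jar this fails with the UnsatisfiedLinkError shown above; with
// snappy-java 1.1.2.6 in Spark's jars folder it prints "ok".
import org.xerial.snappy.Snappy
val compressed = Snappy.compress("ok".getBytes("UTF-8"))
println(new String(Snappy.uncompress(compressed), "UTF-8"))
{code}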

 






[jira] [Updated] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression

2018-04-18 Thread Jean-Francis Roy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Francis Roy updated SPARK-24018:
-
Description: 
On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, 
Spark fails to read or write dataframes in parquet format with snappy 
compression.

This is due to an incompatibility between the snappy-java version that is 
required by parquet (parquet is provided in Spark jars but snappy isn't) and 
the version that is available from hadoop-2.8.3.

 

Steps to reproduce:
 * Download and extract hadoop-2.8.3
 * Download and extract spark-2.3.0-without-hadoop
 * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
 * Following instructions from 
[https://spark.apache.org/docs/latest/hadoop-provided.html], set 
SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
 * Start a spark-shell, enter the following:

 
{code:java}
import spark.implicits._
val df = List(1, 2, 3, 4).toDF
df.write
  .format("parquet")
  .option("compression", "snappy")
  .mode("overwrite")
  .save("test.parquet")
{code}
 

 

This fails with the following:
{noformat}
java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
at 
org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
at 
org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
at 
org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
at 
org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
at 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
at 
org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
 

Downloading snappy-java-1.1.2.6.jar and placing it in Spark's jars folder 
solves the issue.

  was:
On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, 
Spark fails to read or write dataframes in parquet format with snappy 
compression.

This is due to an incompatibility between the snappy-java version that is 
required by parquet (parquet is provided in Spark jars but snappy isn't) and 
the version that is available from hadoop-2.8.3.

 

Steps to reproduce:
 * Download and extract hadoop-2.8.3
 * Download and extract spark-2.3.0-without-hadoop
 * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
 * Following instructions from 
https://spark.apache.org/docs/latest/hadoop-provided.html, set 
SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
 * Start a spark-shell, 

[jira] [Resolved] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression

2018-12-10 Thread Jean-Francis Roy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Francis Roy resolved SPARK-24018.
--
   Resolution: Fixed
Fix Version/s: 2.3.2

I confirm that this is fixed in Spark 2.3.2.

> Spark-without-hadoop package fails to create or read parquet files with 
> snappy compression
> --
>
> Key: SPARK-24018
> URL: https://issues.apache.org/jira/browse/SPARK-24018
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0
>Reporter: Jean-Francis Roy
>Priority: Minor
> Fix For: 2.3.2
>
>
> On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, 
> Spark fails to read or write dataframes in parquet format with snappy 
> compression.
> This is due to an incompatibility between the snappy-java version that is 
> required by parquet (parquet is provided in Spark jars but snappy isn't) and 
> the version that is available from hadoop-2.8.3.
>  
> Steps to reproduce:
>  * Download and extract hadoop-2.8.3
>  * Download and extract spark-2.3.0-without-hadoop
>  * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
>  * Following instructions from 
> [https://spark.apache.org/docs/latest/hadoop-provided.html], set 
> SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
>  * Start a spark-shell, enter the following:
>  
> {code:java}
> import spark.implicits._
> val df = List(1, 2, 3, 4).toDF
> df.write
>   .format("parquet")
>   .option("compression", "snappy")
>   .mode("overwrite")
>   .save("test.parquet")
> {code}
>  
>  
> This fails with the following:
> {noformat}
> java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
> at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
> at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
> at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
> at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
> at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
> at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
> at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
> at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
> org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){noformat}
>  
>   Downloading snappy-java-1.1.2.6.jar and placing 

[jira] [Commented] (SPARK-25243) Use FailureSafeParser in from_json

2021-02-15 Thread Jean-Francis Roy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284951#comment-17284951
 ] 

Jean-Francis Roy commented on SPARK-25243:
--

The documentation still states that `from_json` will return `null` if the JSON 
is malformed, which is no longer the case by default.

> Use FailureSafeParser in from_json
> --
>
> Key: SPARK-25243
> URL: https://issues.apache.org/jira/browse/SPARK-25243
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The 
> [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28]
>   is used in parsing JSON, CSV files and dataset of strings. It supports the 
> [PERMISSIVE, DROPMALFORMED and 
> FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44]
>  modes. The ticket aims to make the from_json function compatible with regular 
>  parsing via FailureSafeParser and to support the above modes.






[jira] [Created] (SPARK-34441) from_json documentation is wrong about malformed JSONs output

2021-02-15 Thread Jean-Francis Roy (Jira)
Jean-Francis Roy created SPARK-34441:


 Summary: from_json documentation is wrong about malformed JSONs 
output
 Key: SPARK-34441
 URL: https://issues.apache.org/jira/browse/SPARK-34441
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.0.1, 3.0.0
Reporter: Jean-Francis Roy


The documentation of the `from_json` function states that malformed JSON will 
return a `null` value, which is no longer the case after 
https://issues.apache.org/jira/browse/SPARK-25243.
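A minimal illustration of the mismatch (my own sketch, not from the report, run in 
spark-shell with the default PERMISSIVE mode):
{code:java}
import spark.implicits._
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val schema = StructType(Array(StructField("a", StringType)))

// The documentation says malformed JSON yields null; with the default
// PERMISSIVE mode in Spark 3 the malformed row instead yields a struct
// whose fields are null.
Seq("""{"a": "bar"}""", "{not json").toDF("value")
  .select($"value", from_json($"value", schema).as("parsed"))
  .show()
{code}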

 






[jira] [Comment Edited] (SPARK-25243) Use FailureSafeParser in from_json

2021-02-15 Thread Jean-Francis Roy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284954#comment-17284954
 ] 

Jean-Francis Roy edited comment on SPARK-25243 at 2/15/21, 9:25 PM:


{code:java}
from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are 
PERMISSIVE and FAILFAST.{code}
It seems like the previous behavior cannot be reproduced anymore?


was (Author: jeanfrancisroy):
`from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are 
PERMISSIVE and FAILFAST.`

It seems like the previous behavior cannot be reproduced anymore?

> Use FailureSafeParser in from_json
> --
>
> Key: SPARK-25243
> URL: https://issues.apache.org/jira/browse/SPARK-25243
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The 
> [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28]
>   is used in parsing JSON, CSV files and dataset of strings. It supports the 
> [PERMISSIVE, DROPMALFORMED and 
> FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44]
>  modes. The ticket aims to make the from_json function compatible with regular 
>  parsing via FailureSafeParser and to support the above modes.






[jira] [Commented] (SPARK-25243) Use FailureSafeParser in from_json

2021-02-15 Thread Jean-Francis Roy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284954#comment-17284954
 ] 

Jean-Francis Roy commented on SPARK-25243:
--

```
from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are 
PERMISSIVE and FAILFAST.

```

It seems like the previous behavior cannot be reproduced anymore?

> Use FailureSafeParser in from_json
> --
>
> Key: SPARK-25243
> URL: https://issues.apache.org/jira/browse/SPARK-25243
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The 
> [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28]
>   is used in parsing JSON, CSV files and dataset of strings. It supports the 
> [PERMISSIVE, DROPMALFORMED and 
> FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44]
>  modes. The ticket aims to make the from_json function compatible with regular 
>  parsing via FailureSafeParser and to support the above modes.






[jira] [Comment Edited] (SPARK-25243) Use FailureSafeParser in from_json

2021-02-15 Thread Jean-Francis Roy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284954#comment-17284954
 ] 

Jean-Francis Roy edited comment on SPARK-25243 at 2/15/21, 9:25 PM:


`from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are 
PERMISSIVE and FAILFAST.`

It seems like the previous behavior cannot be reproduced anymore?


was (Author: jeanfrancisroy):
```
from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are 
PERMISSIVE and FAILFAST.

```

It seems like the previous behavior cannot be reproduced anymore?

> Use FailureSafeParser in from_json
> --
>
> Key: SPARK-25243
> URL: https://issues.apache.org/jira/browse/SPARK-25243
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The 
> [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28]
>   is used in parsing JSON, CSV files and dataset of strings. It supports the 
> [PERMISSIVE, DROPMALFORMED and 
> FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44]
>  modes. The ticket aims to make the from_json function compatible with regular 
>  parsing via FailureSafeParser and to support the above modes.






[jira] [Commented] (SPARK-34441) from_json documentation is wrong about malformed JSONs output

2021-02-15 Thread Jean-Francis Roy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284955#comment-17284955
 ] 

Jean-Francis Roy commented on SPARK-34441:
--

It even seems we cannot reproduce the previous behavior anymore:
{code:java}
from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are 
PERMISSIVE and FAILFAST.{code}

> from_json documentation is wrong about malformed JSONs output
> -
>
> Key: SPARK-34441
> URL: https://issues.apache.org/jira/browse/SPARK-34441
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Jean-Francis Roy
>Priority: Minor
>
> The documentation of the `from_json` function states that malformed json will 
> return a `null` value, which is not the case anymore after 
> https://issues.apache.org/jira/browse/SPARK-25243.
>  






[jira] [Commented] (SPARK-34441) from_json documentation is wrong about malformed JSONs output

2021-02-16 Thread Jean-Francis Roy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285618#comment-17285618
 ] 

Jean-Francis Roy commented on SPARK-34441:
--

[~hyukjin.kwon] of course, here is an example:

 

 
{code:java}
scala> case class Foo(a: String)
scala> val ds = List("", "{", "{}", """{"a"}""", """{"a": "bar"}""", """{"a": 42}""").toDS
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.functions.from_json
scala> ds.withColumn("converted", from_json($"value", StructType(Array(StructField("a", StringType))))).show()
+------------+---------+
|       value|converted|
+------------+---------+
|            |     null|
|           {|       []|
|          {}|       []|
|       {"a"}|       []|
|{"a": "bar"}|    [bar]|
|   {"a": 42}|     [42]|
+------------+---------+{code}
We see above that faulty JSON often results in a struct with `null` fields 
instead of a `null` value directly, which is a significant behavior change between 
Spark 2 and Spark 3. The documentation still describes Spark 2's behavior.

Moreover, I cannot reproduce Spark 2's behavior. I do want faulty input to be 
converted to null.

I can make the code throw using the `FAILFAST` mode:

 
{code:java}
scala> ds.withColumn("converted", from_json($"value", 
StructType(Array(StructField("a", StringType))), Map("mode" -> 
"FAILFAST"))).show()
{code}
 

 

But I cannot use the `DROPMALFORMED` mode as it is not supported:
{code:java}
scala> ds.withColumn("converted", from_json($"value", StructType(Array(StructField("a", StringType))), Map("mode" -> "DROPMALFORMED"))).show()
java.lang.IllegalArgumentException: from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.
{code}
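For what it's worth, the closest approximation of the old null-on-malformed behaviour 
I could come up with (my own workaround sketch, not something this ticket proposes, 
and assuming from_json honours the JSON data source's columnNameOfCorruptRecord option 
when that field is present in the schema) is to stay in PERMISSIVE mode, capture the 
corrupt input in a dedicated field, and null the struct out whenever that field is 
populated:
{code:java}
// Workaround sketch, not verified against every 3.x release: add a
// corrupt-record field to the schema, then keep the parsed struct only for
// rows where that field stayed null (when(...) without otherwise(...) yields null).
import org.apache.spark.sql.functions.{from_json, when}
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("a", StringType),
  StructField("_corrupt", StringType)))  // receives the malformed input

val parsed = from_json($"value", schema, Map("columnNameOfCorruptRecord" -> "_corrupt"))

ds.withColumn("converted", when(parsed.getField("_corrupt").isNull, parsed)).show()
{code}
The `_corrupt` field name is arbitrary, and the extra field still has to be dropped 
from the resulting struct afterwards, so this is only an approximation of the Spark 2 
behaviour.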
 

> from_json documentation is wrong about malformed JSONs output
> -
>
> Key: SPARK-34441
> URL: https://issues.apache.org/jira/browse/SPARK-34441
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Jean-Francis Roy
>Priority: Minor
>
> The documentation of the `from_json` function states that malformed json will 
> return a `null` value, which is not the case anymore after 
> https://issues.apache.org/jira/browse/SPARK-25243.
>  






[jira] [Comment Edited] (SPARK-34441) from_json documentation is wrong about malformed JSONs output

2021-02-16 Thread Jean-Francis Roy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285618#comment-17285618
 ] 

Jean-Francis Roy edited comment on SPARK-34441 at 2/17/21, 3:17 AM:


[~hyukjin.kwon] of course, here is an example:

 

 
{code:java}
scala> case class Foo(a: String)
scala> val ds = List("", "{", "{}", """{"a"}""", """{"a": "bar"}""", """{"a": 42}""").toDS
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.functions.from_json
scala> ds.withColumn("converted", from_json($"value", StructType(Array(StructField("a", StringType))))).show()
+------------+---------+
|       value|converted|
+------------+---------+
|            |     null|
|           {|       []|
|          {}|       []|
|       {"a"}|       []|
|{"a": "bar"}|    [bar]|
|   {"a": 42}|     [42]|
+------------+---------+{code}
We see above that faulty JSON often results in a struct with `null` fields 
instead of a `null` value directly, which is a significant behavior change between 
Spark 2 and Spark 3. The documentation still describes Spark 2's behavior.

Moreover, I cannot reproduce Spark 2's behavior. I do want faulty input to be 
converted to null.

I can make the code throw using the `FAILFAST` mode:

 
{code:java}
scala> ds.withColumn("converted", from_json($"value", 
StructType(Array(StructField("a", StringType))), Map("mode" -> 
"FAILFAST"))).show()
{code}
 

 

But I cannot use the `DROPMALFORMED` mode as it is not supported:
{code:java}
scala> ds.withColumn("converted", from_json($"value", 
StructType(Array(StructField("a", StringType))), Map("mode" -> 
"DROPMALFORMED"))).show()
 java.lang.IllegalArgumentException: from_json() doesn't support the 
DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.
{code}


  


was (Author: jeanfrancisroy):
[~hyukjin.kwon] of course, here is an example :

 

 
{code:java}
scala> case class Foo(a: String)
scala> val ds = List("", "{", "{}", """{"a"}""", """{"a": "bar"}""", """{"a": 42}""").toDS
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.functions.from_json
scala> ds.withColumn("converted", from_json($"value", StructType(Array(StructField("a", StringType))))).show()
+------------+---------+
|       value|converted|
+------------+---------+
|            |     null|
|           {|       []|
|          {}|       []|
|       {"a"}|       []|
|{"a": "bar"}|    [bar]|
|   {"a": 42}|     [42]|
+------------+---------+{code}
We see above that faulty JSON will often result in a structure with `null` 
fields instead of a `null` directly, which is a big change of behavior between 
Spark 2 and Spark 3. The documentation still states that the behavior is Spark 
2's.

Moreover, I cannot reproduce Spark 2's behavior. I do want faulty input to be 
converted to null.

I can make the code throw using the `FAILFAST` mode:

 
{code:java}
scala> ds.withColumn("converted", from_json($"value", 
StructType(Array(StructField("a", StringType))), Map("mode" -> 
"FAILFAST"))).show()
{code}
 

 

But I cannot use the `DROPMALFORMED` mode as it is not supported:
scala> ds.withColumn("converted", from_json($"value", 
StructType(Array(StructField("a", StringType))), Map("mode" -> 
"DROPMALFORMED"))).show()
java.lang.IllegalArgumentException: from_json() doesn't support the 
DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.
 

> from_json documentation is wrong about malformed JSONs output
> -
>
> Key: SPARK-34441
> URL: https://issues.apache.org/jira/browse/SPARK-34441
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Jean-Francis Roy
>Priority: Minor
>
> The documentation of the `from_json` function states that malformed json will 
> return a `null` value, which is not the case anymore after 
> https://issues.apache.org/jira/browse/SPARK-25243.
>  






[jira] [Created] (SPARK-45094) isEmpty on union of RDDs sharing the same trait crash when the first RDD is empty

2023-09-06 Thread Jean-Francis Roy (Jira)
Jean-Francis Roy created SPARK-45094:


 Summary: isEmpty on union of RDDs sharing the same trait crash 
when the first RDD is empty
 Key: SPARK-45094
 URL: https://issues.apache.org/jira/browse/SPARK-45094
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0, 3.2.1
Reporter: Jean-Francis Roy


Given two RDDs of different types, but sharing a common trait, one can obtain a 
union of the two RDDs by doing the following:
{code:java}
> import org.apache.spark.rdd.RDD

> trait Foo { val a: Int }
> case class Bar(a: Int) extends Foo
> case class Baz(a: Int, b: Int) extends Foo

> val bars = spark.sparkContext.parallelize(List(Bar(1), 
> Bar(2))).asInstanceOf[RDD[Foo]]
> val bazs = spark.sparkContext.parallelize(List(Baz(1, 42), Baz(2, 
> 42))).asInstanceOf[RDD[Foo]]

> val union = bars.union(bazs){code}
 

When doing so, `count()` and `isEmpty()` behave as expected:
{code:java}
> union.count()
4

> union.isEmpty()
false{code}
However, if the first RDD is empty, `count()` will behave as expected, but 
`isEmpty()` will throw a `java.lang.ArrayStoreException`:
{code:java}
> val bars = 
> spark.sparkContext.parallelize(List.empty[Bar]).asInstanceOf[RDD[Foo]]
> val union = bars.union(bazs)

> union.count()
2

> union.isEmpty()
BOOM{code}
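My reading of the failure (an assumption on my part, not confirmed in this ticket): 
`isEmpty()` is implemented on top of `take(1)`, and `take` materialises rows into an 
array allocated from the RDD's ClassTag. The `asInstanceOf[RDD[Foo]]` casts only 
change the static type, so the union keeps the `Bar` ClassTag of the first RDD, and as 
soon as `take` has to scan into the `Baz` partitions it tries to store a `Baz` into an 
`Array[Bar]`. Building the RDDs with the trait as the element type should avoid both 
the cast and the crash, as in the sketch below:
{code:java}
// Workaround sketch: give the RDDs the trait element type up front so the
// ClassTag is Foo rather than Bar/Baz. This sidesteps the ArrayStoreException
// described above; it is not a fix of the underlying bug.
import org.apache.spark.rdd.RDD

trait Foo { val a: Int }
case class Bar(a: Int) extends Foo
case class Baz(a: Int, b: Int) extends Foo

val bars: RDD[Foo] = spark.sparkContext.parallelize(List.empty[Foo])
val bazs: RDD[Foo] = spark.sparkContext.parallelize(List[Foo](Baz(1, 42), Baz(2, 42)))

val union = bars.union(bazs)
println(union.isEmpty())   // false, and no ArrayStoreException
{code}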
Full stack trace:
{code:java}
ERROR Executor: Exception in task 4.0 in stage 8.0 (TID 134)
java.lang.ArrayStoreException: $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Baz
    at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:74)
    at scala.Array$.slowcopy(Array.scala:157)
    at scala.Array$.copy(Array.scala:183)
    at 
scala.collection.mutable.ResizableArray.copyToArray(ResizableArray.scala:80)
    at 
scala.collection.mutable.ResizableArray.copyToArray$(ResizableArray.scala:78)
    at scala.collection.mutable.ArrayBuffer.copyToArray(ArrayBuffer.scala:49)
    at scala.collection.TraversableOnce.copyToArray(TraversableOnce.scala:334)
    at scala.collection.TraversableOnce.copyToArray$(TraversableOnce.scala:333)
    at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:108)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:342)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
    at scala.collection.AbstractTraversable.toArray(Traversable.scala:108)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
    at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1462)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2303)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
23/09/06 14:55:18 ERROR Executor: Exception in task 14.0 in stage 8.0 (TID 144)
java.lang.ArrayStoreException: $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Baz
    at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:74)
    at scala.Array$.slowcopy(Array.scala:157)
    at scala.Array$.copy(Array.scala:183)
    at 
scala.collection.mutable.ResizableArray.copyToArray(ResizableArray.scala:80)
    at 
scala.collection.mutable.ResizableArray.copyToArray$(ResizableArray.scala:78)
    at scala.collection.mutable.ArrayBuffer.copyToArray(ArrayBuffer.scala:49)
    at scala.collection.TraversableOnce.copyToArray(TraversableOnce.scala:334)
    at scala.collection.TraversableOnce.copyToArray$(TraversableOnce.scala:333)
    at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:108)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:342)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
    at scala.collection.AbstractTraversable.toArray(Traversable.scala:108)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
    at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1462)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2303)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
   

[jira] [Updated] (SPARK-45094) isEmpty on union of RDDs sharing the same trait crash when the first RDD is empty

2023-09-06 Thread Jean-Francis Roy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Francis Roy updated SPARK-45094:
-
Environment: Tested on Spark 3.2.1 and Spark 3.4.0, using Scala 2.12.17 and 
OpenJDK 11.0.20.1.

> isEmpty on union of RDDs sharing the same trait crash when the first RDD is 
> empty
> -
>
> Key: SPARK-45094
> URL: https://issues.apache.org/jira/browse/SPARK-45094
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1, 3.4.0
> Environment: Tested on Spark 3.2.1 and Spark 3.4.0, using Scala 
> 2.12.17 and OpenJDK 11.0.20.1.
>Reporter: Jean-Francis Roy
>Priority: Minor
>
> Given two RDDs of different types, but sharing a common trait, one can obtain 
> a union of the two RDDs by doing the following:
> {code:java}
> > import org.apache.spark.rdd.RDD
> > trait Foo { val a: Int }
> > case class Bar(a: Int) extends Foo
> > case class Baz(a: Int, b: Int) extends Foo
> > val bars = spark.sparkContext.parallelize(List(Bar(1), 
> > Bar(2))).asInstanceOf[RDD[Foo]]
> > val bazs = spark.sparkContext.parallelize(List(Baz(1, 42), Baz(2, 
> > 42))).asInstanceOf[RDD[Foo]]
> > val union = bars.union(bazs){code}
>  
> When doing so, `count()` and `isEmpty()` are behaving as expected:
> {code:java}
> > union.count()
> 4
> > union.isEmpty()
> false{code}
> However, if the first RDD is empty, `count()` will behave as expected, but 
> `isEmpty()` will throw a `java.lang.ArrayStoreException`:
> {code:java}
> > val bars = 
> > spark.sparkContext.parallelize(List.empty[Bar]).asInstanceOf[RDD[Foo]]
> > val union = bars.union(bazs)
> > union.count()
> 2
> > union.isEmpty()
> BOOM{code}
> Full stack trace:
> {code:java}
> ERROR Executor: Exception in task 4.0 in stage 8.0 (TID 134)
> java.lang.ArrayStoreException: 
> $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Baz
>     at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:74)
>     at scala.Array$.slowcopy(Array.scala:157)
>     at scala.Array$.copy(Array.scala:183)
>     at 
> scala.collection.mutable.ResizableArray.copyToArray(ResizableArray.scala:80)
>     at 
> scala.collection.mutable.ResizableArray.copyToArray$(ResizableArray.scala:78)
>     at scala.collection.mutable.ArrayBuffer.copyToArray(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.copyToArray(TraversableOnce.scala:334)
>     at 
> scala.collection.TraversableOnce.copyToArray$(TraversableOnce.scala:333)
>     at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:108)
>     at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:342)
>     at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>     at scala.collection.AbstractTraversable.toArray(Traversable.scala:108)
>     at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
>     at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
>     at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1462)
>     at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2303)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> 23/09/06 14:55:18 ERROR Executor: Exception in task 14.0 in stage 8.0 (TID 
> 144)
> java.lang.ArrayStoreException: 
> $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Baz
>     at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:74)
>     at scala.Array$.slowcopy(Array.scala:157)
>     at scala.Array$.copy(Array.scala:183)
>     at 
> scala.collection.mutable.ResizableArray.copyToArray(ResizableArray.scala:80)
>     at 
> scala.collection.mutable.ResizableArray.copyToArray$(ResizableArray.scala:78)
>     at scala.collection.mutable.ArrayBuffer.copyToArray(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.copyToArray(TraversableOnce.scala:334)
>     at 
> scala.collection.TraversableOnce.copyToArray$(TraversableOnce.scala:333)
>     at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:108)
>     at scala.collection.TraversableOnce.toArray(Traversable