[ https://issues.apache.org/jira/browse/SPARK-28981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved SPARK-28981.
-----------------------------------
    Resolution: Cannot Reproduce

Since this is fixed in 2.4.4, this issue seems to have been reported against the wrong affected version. The following is the Apache Spark 2.4.4 result. I'll close this as `Cannot Reproduce` according to the current `Affected Versions`. In addition, I linked SPARK-26995 as a duplicate for future reference.

{code}
$ docker build -t spark:2.4.4 -f kubernetes/dockerfiles/spark/Dockerfile .
$ docker run --rm -it spark:2.4.4 /opt/spark/bin/spark-shell
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ SPARK_K8S_CMD=/opt/spark/bin/spark-shell
+ case "$SPARK_K8S_CMD" in
+ echo 'Non-spark-on-k8s command provided, proceeding in pass-through mode...'
Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ exec /sbin/tini -s -- /opt/spark/bin/spark-shell
19/09/05 17:39:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://454a817f8cee:4040
Spark context available as 'sc' (master = local[*], app id = local-1567705163260).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.range(10).write.parquet("/tmp/p")
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory Scaling row group sizes to 96.54% for 7 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory Scaling row group sizes to 84.47% for 8 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory Scaling row group sizes to 75.08% for 9 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory Scaling row group sizes to 67.58% for 10 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory Scaling row group sizes to 61.43% for 11 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory Scaling row group sizes to 67.58% for 10 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory Scaling row group sizes to 75.08% for 9 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory Scaling row group sizes to 84.47% for 8 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory Scaling row group sizes to 96.54% for 7 writers

scala> spark.read.parquet("/tmp/p").count
res1: Long = 10
{code}

> Missing library for reading/writing Snappy-compressed files
> -----------------------------------------------------------
>
>                 Key: SPARK-28981
>                 URL: https://issues.apache.org/jira/browse/SPARK-28981
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.4
>            Reporter: Paul Schweigert
>            Priority: Minor
>
> The current Dockerfile for Spark on Kubernetes is missing the "ld-linux-x86-64.so.2" library needed to read / write Snappy-compressed files.
>
> Sample error message when trying to read a parquet file compressed with snappy:
>
> {code:java}
> 19/09/02 05:33:19 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 172.30.189.77, executor 2): org.apache.spark.SparkException: Task failed while writing rows.
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:121)
> 	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.7-04145e2f-cc82-4217-99b8-641cdd755a87-libsnappyjava.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/snappy-1.1.7-04145e2f-cc82-4217-99b8-641cdd755a87-libsnappyjava.so)
> 	at java.lang.ClassLoader$NativeLibrary.load(Native Method)
> 	at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
> 	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
> 	at java.lang.Runtime.load0(Runtime.java:809)
> 	at java.lang.System.load(System.java:1086)
> 	at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:179)
> 	at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:154)
> 	at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)
> 	at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
> 	at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
> 	at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
> 	at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.compress(CodecFactory.java:165)
> 	at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:95)
> 	at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
> 	at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235)
> 	at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
> 	at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)
> 	at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
> 	at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
> 	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> 	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
> 	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
> 	... 10 more
> {code}
> The relevant library is in the Alpine Linux "gcompat" package ([https://pkgs.alpinelinux.org/package/edge/community/x86/gcompat]). Adding this library to the Dockerfile enables the reading/writing of Snappy-compressed files.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
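The reporter's suggested workaround can be sketched as a Dockerfile fragment. This is only an illustration, assuming an Alpine-based image such as the one built from `kubernetes/dockerfiles/spark/Dockerfile`; the exact `RUN` line is hypothetical and not taken from the upstream Dockerfile:

{code}
# Hypothetical addition for an Alpine-based Spark image. gcompat supplies the
# glibc dynamic loader ld-linux-x86-64.so.2 that the extracted
# libsnappyjava.so native library links against, which musl-based Alpine
# does not provide by default.
RUN apk add --no-cache gcompat
{code}

After rebuilding the image with this line, the `spark.range(10).write.parquet(...)` repro above should succeed instead of failing with the `UnsatisfiedLinkError`.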