jonbelanger-ns edited a comment on issue #20826: [SPARK-2489][SQL] Support Parquet's optional fixed_len_byte_array
URL: https://github.com/apache/spark/pull/20826#issuecomment-575728511
 
 
   If it helps, I have a fairly complex Parquet file with a few nested fields stored as FIXED_LEN_BYTE_ARRAY, so this bug is a showstopper for Spark on this dataset.
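   
   For reference, here is a rough sketch of how a file with the same kind of schema can be generated (a toy example using pyarrow, not our actual pipeline; pa.binary(16) should end up in Parquet as FIXED_LEN_BYTE_ARRAY(16)):
   
   # Hypothetical repro file, not the real dataset.
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Nested struct with a fixed-length binary field.
   nested = pa.struct([("uuid", pa.binary(16)), ("label", pa.string())])
   table = pa.table({
       "record": pa.array(
           [{"uuid": b"0123456789abcdef", "label": "a"},
            {"uuid": b"fedcba9876543210", "label": "b"}],
           type=nested,
       )
   })
   pq.write_table(table, "/tmp/fixed_len_repro.parquet")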
   
   I tried to fix this by cloning the repo containing this PR (https://github.com/aws-awinstan/spark.git) to my local machine and compiling it.
   
   I did the same with the apache/spark master repo, which worked fine with a few of the columns (to test without parsing the FIXED_LEN_BYTE_ARRAY columns).
   
   However, the aws-awinstan build fails on the same test columns:
   
   [Stage 0:>                                                          (0 + 1) / 1]
   20/01/17 12:37:13 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.42.107, executor 0): java.io.StreamCorruptedException: invalid stream header: 0000000F
        at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:866)
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
        at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
        at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63)
        at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:126)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:113)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:313)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   
   I'm using the following in my client environment, with HDFS and Spark running remotely in a VM, standalone with a single worker.
   
   $ pip freeze | grep spark
   pyspark==2.4.4
   spark==0.2.1
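   
   As a quick sanity check on the "invalid stream header" error (which I gather can also come from a version mismatch between the pyspark client and the compiled Spark build), something like this can compare the two; rough sketch, assuming an existing SparkSession named spark:
   
   import pyspark
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.getOrCreate()
   print("client pyspark:", pyspark.__version__)  # version installed via pip
   print("cluster spark :", spark.version)        # version the session reports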
   
   I'm surprised this bug has been allowed to languish for as long as it has; it's not possible for us to re-serialize the upstream data, so we either need this feature or have to move on...
   
   Edit: further troubleshooting showed that it is the toPandas() call that is failing.
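   
   For reference, the rough shape of the failing path (the path and column names are placeholders matching the toy file above, not our real data):
   
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.getOrCreate()
   df = spark.read.parquet("/tmp/fixed_len_repro.parquet")
   df.printSchema()                            # schema resolution works fine
   pdf = df.select("record.uuid").toPandas()   # this is the call that fails
   print(pdf.head())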
   
