[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207568#comment-14207568 ]

Ryan Williams commented on SPARK-3630:
--------------------------------------

I'm seeing many Snappy {{FAILED_TO_UNCOMPRESS(5)}} and {{PARSING_ERROR(2)}} 
errors. I just built Spark yesterday off of 
[227488d|https://github.com/apache/spark/commit/227488d], so I expected that to 
have picked up some of the fixes detailed in this thread. I am running on a 
YARN cluster whose 100 nodes have kernel 2.6.32, so in a few of these attempts 
I used {{spark.file.transferTo=false}}, and I still saw these errors.
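
(For reference, a minimal sketch of how the flag can be set programmatically; 
the app name is a placeholder, and passing {{--conf spark.file.transferTo=false}} 
to {{spark-submit}} is equivalent.)

{noformat}
import org.apache.spark.{SparkConf, SparkContext}

// Avoid NIO transferTo() when Spark copies shuffle file segments
// (the old-kernel workaround discussed in this thread); app name is a placeholder.
val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.file.transferTo", "false")
val sc = new SparkContext(conf)
{noformat}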

Here are some notes about some of my runs, along with the stdout I got:
* 1000 partitions, {{spark.file.transferTo=false}}: 
[stdout|https://www.dropbox.com/s/141keqpojucfbai/logs.1000?dl=0]. This was my 
latest run; it took a while to get to my {{reduceByKeyLocally}} stage, and 
immediately upon finishing the preceding stage it emitted ~190K 
{{FetchFailure}}s over ~200 attempts of that stage in about one minute, followed 
by some Snappy errors and the job shutting down.
* 2000 partitions, {{spark.file.transferTo=false}}: 
[stdout|https://www.dropbox.com/s/jr1dsldodq4rvbz/logs.2000?dl=0]. This one had 
~150 {{FetchFailure}}s out of the gate, 
seemingly ran fine for ~8 minutes, then had a futures timeout, seemingly ran 
fine for another ~17 minutes, then got to my {{reduceByKeyLocally}} stage and 
died from Snappy errors.
* 2000 partitions, {{spark.file.transferTo=true}}: 
[stdout|https://www.dropbox.com/s/9n24ffcdq0j43ue/logs.2000.tt?dl=0]. Before 
running the above two, I was hoping that {{spark.file.transferTo=false}} was 
going to fix my problems, so I ran this to see whether >2000 partitions was the 
determining factor in the Snappy errors happening, as [~joshrosen] suggested in 
this thread. No such luck! ~15 {{FetchFailure}}s right away, it ran fine for ~24 
minutes, got to the {{reduceByKeyLocally}} phase, Snappy-failed, and died.
* these and other stdout logs can be found 
[here|https://www.dropbox.com/sh/pn0bik3tvy73wfi/AAByFlQVJ3QUOqiKYKXt31RGa?dl=0]

In all of these I was running on a dataset (~170GB) that should be easily 
handled by my cluster (5TB RAM total), and in fact I successfully ran this job 
against this dataset last night using a Spark 1.1 build. That job was dying of 
{{FetchFailure}}s when I tried to run against a larger dataset (~300GB), and I 
thought maybe I needed the sort-based shuffle or the external shuffle service, 
or other 1.2.0 goodies, so I've been trying to run with 1.2.0 but can't get 
anything to finish.
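
(For what it's worth, the 1.2.0 settings I had in mind are roughly the 
following; these are illustrative only, and I haven't verified they change 
anything here.)

{noformat}
import org.apache.spark.SparkConf

// Hypothetical SparkConf for the 1.2.0 features mentioned above.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")          // sort-based shuffle
  .set("spark.shuffle.service.enabled", "true")  // external shuffle service; requires the YARN aux-service to be deployed
{noformat}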

This job reads a file in from Hadoop, coalesces to the number of partitions 
I've asked for, and does a {{flatMap}}, a {{reduceByKey}}, a {{map}}, and a 
{{reduceByKeyLocally}}. I am pretty confident that the {{Map}} I'm 
materializing onto the driver in the {{reduceByKeyLocally}} is a reasonable 
size; it's a {{Map[Long, Long]}} with about 40K entries, and I've actually 
successfully run this job on this data to materialize that exact map at 
different points this week, as I mentioned before. Something causes this job to 
die almost immediately upon starting the {{reduceByKeyLocally}} phase, however, 
usually just with Snappy errors, but with a preponderance of {{FetchFailure}}s 
preceding them in my last attempt.
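
For concreteness, the shape of the job is roughly the following (a sketch only; 
the input path, parsing, and keying are placeholders, not my actual code):

{noformat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits for reduceByKey

val sc = new SparkContext(new SparkConf().setAppName("repro-sketch"))
val numPartitions = 2000  // 1000 or 2000 in the runs above

val result: collection.Map[Long, Long] =
  sc.textFile("hdfs:///path/to/input")                             // ~170GB input
    .coalesce(numPartitions)
    .flatMap(line => line.split("\\s+").map(t => (t.toLong, 1L)))  // placeholder parsing
    .reduceByKey(_ + _)                                            // shuffle boundary
    .map { case (k, v) => (k % 40000L, v) }                        // placeholder map step
    .reduceByKeyLocally(_ + _)  // materializes a Map[Long, Long] (~40K entries) on the driver
{noformat}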

Let me know what other information I can provide that might be useful. Thanks!

> Identify cause of Kryo+Snappy PARSING_ERROR
> -------------------------------------------
>
>                 Key: SPARK-3630
>                 URL: https://issues.apache.org/jira/browse/SPARK-3630
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Andrew Ash
>            Assignee: Josh Rosen
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so 
> it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an 
> application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
> com.esotericsoftware.kryo.io.Input.require(Input.java:169)
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325)
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624)
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so 
> the faulty commit can be fixed and merged back into master.



