Hi Ankur,

We’ve built it from the git link you sent, and we no longer get the exception.
However, we’ve been seeing strange nondeterministic behavior from GraphX.

We compute connected components on a graph of ~900K edges. We ran the Spark job 
several times on the same input graph and got back different components each 
time.
Furthermore, we construct the graph from an edge list, so there should be no 
“singleton” components. Yet in the output, the vast majority of the components 
(roughly 80%) contain only a single vertex.
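
For reference, here is roughly what our job does and how we check the output. 
This is only a sketch; the SparkContext setup and HDFS path are placeholders, 
not our exact code:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // pair-RDD ops like reduceByKey (needed in 0.9.x)
    import org.apache.spark.graphx.GraphLoader

    val sc = new SparkContext("local[4]", "cc-check") // placeholder master/app name

    // Every vertex in the file appears in at least one edge, so no
    // component should be a singleton.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges").cache()

    // connectedComponents() labels each vertex with the smallest
    // vertex id in its component.
    val cc = graph.connectedComponents().vertices

    // Histogram of component sizes; singletons are components of size 1.
    val sizes = cc.map { case (_, comp) => (comp, 1L) }.reduceByKey(_ + _)
    val singletons = sizes.filter { case (_, n) => n == 1L }.count()
    println("singleton components: " + singletons)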

Does that have something to do with the bugfix below? Can you advise on how to 
solve this issue?

Thanks,
Alex

From: Ankur Dave [mailto:ankurd...@gmail.com]
Sent: Thursday, May 22, 2014 6:59 PM
To: user@spark.apache.org
Subject: Re: GraphX partition problem

The fix will be included in Spark 1.0, but if you just want to apply the fix to 
0.9.1, here's a hotfixed version of 0.9.1 that includes only PR #367: 
https://github.com/ankurdave/spark/tree/v0.9.1-handle-empty-partitions. You can 
clone and build that branch.

Ankur <http://www.ankurdave.com/>

On Thu, May 22, 2014 at 4:53 AM, Zhicharevich, Alex 
<azhicharev...@ebay.com> wrote:
Hi,

I’m running a simple connected components job using GraphX (version 0.9.1).

My input comes from an HDFS text file partitioned into 400 parts. When I run 
the code on a single part or a small number of files (say 20), it runs fine. 
As soon as I try to read more files (more than 30), I get an error and the job 
fails.
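
The code is essentially the following; the master URL, app name, and paths are 
placeholders, not our actual configuration:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.GraphLoader

    val sc = new SparkContext("spark://master:7077", "connected-components")

    // The input directory holds ~400 part files; reading more than ~30
    // of them is what triggers the failure below.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/input")
    val cc = graph.connectedComponents()

    // Write (vertexId, componentId) pairs back to HDFS.
    cc.vertices.saveAsTextFile("hdfs:///path/to/output")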
From looking at the logs, I see the following exception:
    java.util.NoSuchElementException: End of stream
        at org.apache.spark.util.NextIterator.next(NextIterator.scala:83)
        at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:29)
        at org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:52)
        at org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:51)
        at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:456)

From searching the web, I see it’s a known issue with GraphX, reported here:
https://github.com/apache/spark/pull/367
and here:
https://github.com/apache/spark/pull/497

Is there a stable release that includes this fix? Should I clone the git repo 
and build it myself? How would you advise me to deal with this issue?

Thanks,
Alex



