I am having trouble with my standalone Spark cluster and have not been able to
find a solution anywhere. I hope someone can help me figure out what is going
wrong so this issue can be resolved and I can continue with my work.

I am currently using Python and the pyspark library to do distributed
computing. I have two virtual machines set up for this cluster: one machine
serves as both the master and one of the slaves (*spark-mastr-1*, IP address
*xx.xx.xx.248*), and the other machine is just a slave (*spark-wrkr-1*, IP
address *xx.xx.xx.247*). Each machine has 8GB of memory and 2 virtual sockets
with 2 cores per socket (4 CPU cores per machine, for a total of 8 cores in
the cluster). Both machines have passwordless SSH set up to each other (the
master also has passwordless SSH to itself, since it is being used as one of
the slaves as well).

At first I thought that *247* was simply unable to connect to *248*, so I ran
a simple test with the Spark shell to check that the slaves can talk to the
master, and they appear to communicate just fine (a pyspark version of that
check is sketched after the error code below). However, when I run my pyspark
application, *247* still has trouble connecting to *248*. I then thought it
was a memory issue, so I allocated 6GB of memory on each machine for Spark to
use, but this did not resolve the issue. Finally, I gave the pyspark
application more time before it times out, as well as more retry attempts,
but I still get the same error. The error code that stands out to me is:

*org.apache.spark.shuffle.FetchFailedException: Failed to connect to
spark-mastr-1:xxxxxx*


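For reference, the Spark-shell connectivity check mentioned above was just a trivial job run against the master, and it completed fine. A pyspark sketch of the same kind of check is below (it assumes the standalone master is on the default port 7077; everything else is illustrative only):

    # Run inside a pyspark shell started against the master, e.g.
    #   pyspark --master spark://spark-mastr-1:7077
    # A trivial distributed count to confirm the workers can reach the master.
    data = sc.parallelize(range(1000))
    print(data.count())   # expect 1000 if the executors come up and respond
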
The following is the error that I receive on my most recent attempted run of
the application:

Traceback (most recent call last):
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 413, in <module>
    main(sc,sw,sw_set)
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 391, in main
    run_engine(submission_type,inputSub,mdb_collection,mdb_collectionType,sw_set,sc,weighted=False)
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 332, in run_engine
    similarities_recRDD,recommendations = recommend(subRDD,mdb_collection,query_format,sw_set,sc)
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 204, in recommend
    idfsCorpusWeightsBroadcast = core.idfsRDD(corpus,sc)
  File "/home/spark/enigma_analytics/rec_engine/core.py", line 38, in idfsRDD
    idfsInputRDD = ta.inverseDocumentFrequency(corpusRDD)
  File "/home/spark/enigma_analytics/rec_engine/textAnalyzer.py", line 106, in inverseDocumentFrequency
    N = corpus.map(lambda doc: doc[0]).distinct().count()
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in count
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in fold
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
  File "/usr/local/spark/current/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/spark/current/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 5 (count at /home/spark/enigma_analytics/rec_engine/textAnalyzer.py:106) has failed the maximum allowable number of times: 4. Most recent failure reason: *org.apache.spark.shuffle.FetchFailedException: Failed to connect to spark-mastr-1:44642*
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
Caused by: java.io.IOException: Failed to connect to spark-mastr-1:44642
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:101)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
        at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
        at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
        at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:438)
        at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:908)
        at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:203)
        at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        ... 1 more

        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1258)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

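In case it helps with diagnosis: since the innermost cause above is a java.nio.channels.UnresolvedAddressException, a quick check like the following can be run from *247* to see whether the master's hostname even resolves there (a minimal sketch; only the hostname is taken from the error message):

    # Minimal sketch: can this machine resolve the master's hostname?
    import socket

    try:
        print(socket.gethostbyname("spark-mastr-1"))
    except socket.gaierror as err:
        # If this triggers, the hostname is not resolvable from this machine
        # (e.g. no /etc/hosts entry or DNS record for spark-mastr-1).
        print("cannot resolve spark-mastr-1: %s" % err)
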
At that point the application kills itself without completing the rest of the
job. The following are the configurations that I give to the Python Spark
application (named *submission.py*) before setting up the SparkContext
variable:

*submission.py:*
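
In outline, this block builds a SparkConf and passes it to the SparkContext. A simplified sketch of its shape follows; the option names and values shown (executor memory, network timeout, shuffle retries) are placeholders standing in for the memory, timeout, and retry settings described above, not my exact configuration:

    # Simplified sketch only -- placeholder option names and values,
    # not the exact contents of submission.py.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("rec_engine")                   # placeholder app name
            .setMaster("spark://spark-mastr-1:7077")    # assumes the default standalone port
            .set("spark.executor.memory", "6g")         # the 6GB per machine mentioned above
            .set("spark.network.timeout", "600s")       # longer timeout (placeholder value)
            .set("spark.shuffle.io.maxRetries", "10"))  # more retry attempts (placeholder value)
    sc = SparkContext(conf=conf)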

I will also include my Spark configuration files from each machine below:

On the master machine (xx.xx.xx.248)

*spark/conf/slaves:*
*spark/conf/log4j.properties:*
*spark/conf/spark-defaults.conf:*
*spark/conf/spark-env.sh:*

*spark/logs/SparkOut.log:*

On the slave machine (xx.xx.xx.247)

*spark/conf/log4j.properties:*
*spark/conf/spark-env.sh:*


I'll also attach an image below of what my Spark WebUI looks like while the
application is running:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26511/spark_cluster_overview.png>
 

Finally, I will attach the Spark connection output logs from both machines to
this post (*SparkOut_248.log* and *SparkOut_247.log*):

SparkOut_248.log
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26511/SparkOut_248.log>
  
SparkOut_247.log
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26511/SparkOut_247.log>
  

Hopefully this is all of the information needed for this issue to be
resolved. If not, let me know what additional information to include in this
post and I will add it. Thank you to anyone taking the time to read this and
help me out; I'm at a loss as to what to do at the moment and I am getting
very frustrated.

-Craig Ignatowski


