Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-05-04 Thread HLee
I had the same problem. A forum post elsewhere suggested that too much
network communication might be exhausting the available ports. I reduced
the number of partitions via repartition(int) and it solved the problem.
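For reference, a minimal sketch of that fix (the partition counts below are
made-up examples; tune them to your own cluster):

    from pyspark import SparkContext

    sc = SparkContext(appName="repartition-sketch")

    # During a shuffle every reducer opens connections to every mapper,
    # so fewer partitions means fewer simultaneous connections and ports.
    rdd = sc.parallelize(range(1000000), 400)   # deliberately over-partitioned
    rdd = rdd.repartition(50)                   # cut the partition count
    print(rdd.map(lambda x: (x % 10, 1)).reduceByKey(lambda a, b: a + b).collect())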



Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-03-20 Thread craigiggy
Also, this is the command I use to submit the Spark application:

**

where *recommendation_engine-0.1-py2.7.egg* is a Python egg of a library I
wrote for this application, and *'file'* and
*'/home/spark/enigma_analytics/tests/msg-epims0730_small.json'* are input
arguments to the application.
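(The literal command did not survive the archive. Roughly, a submission of
this shape is what is being described; the master URL, port 7077, and the
egg's location are placeholders, not the exact original invocation. The
driver script path is taken from the traceback further down the thread:)

    spark-submit \
      --master spark://spark-mastr-1:7077 \
      --py-files recommendation_engine-0.1-py2.7.egg \
      /home/spark/enigma_analytics/rec_engine/submission.py \
      file /home/spark/enigma_analytics/tests/msg-epims0730_small.json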



Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-03-19 Thread craigiggy
Slight update, I suppose: sometimes it connects, continues, and the job
completes, but most of the time I still hit this error, the job is killed,
and the application doesn't finish.

Still have no idea why this is happening. I could really use some help here.



PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-03-15 Thread craigiggy
I am having trouble with my standalone Spark cluster and can't seem to find
a solution anywhere. I hope someone can spot what is going wrong so I can
get this resolved and continue with my work.

I am currently attempting to use Python and the pyspark library to do
distributed computing. I have two virtual machines set up for this cluster,
one machine is being used as both the master and one of the slaves
(*spark-mastr-1* with ip address: *xx.xx.xx.248*) and the other machine is
being used as just a slave (*spark-wrkr-1* with ip address: *xx.xx.xx.247*).
Each machine has 8GB of memory and 2 virtual sockets with 2 cores per socket
(4 CPU cores per machine, 8 cores in the cluster). Both have passwordless
SSH set up to each other, and the master also has passwordless SSH to
itself, since it doubles as one of the slaves.

At first I thought that *247* was simply unable to connect to *248*, but a
quick test with the Spark shell showed that the slaves can talk to the
master just fine. When I run my pyspark application, however, *247* still
has trouble connecting to *248*. Next I suspected a memory problem, so I
allocated 6GB of memory on each machine for Spark, but that did not resolve
the issue. Finally, I gave the application a longer timeout and more retry
attempts, but I still get the same error. The error message that stands out
to me is:

*org.apache.spark.shuffle.FetchFailedException: Failed to connect to
spark-mastr-1:xx*
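
For reference, the timeout and retry bumps I mentioned were along these
lines (the values shown are examples, not the exact ones I used; 7077 is
only the standalone master's default port):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://spark-mastr-1:7077")
            .setAppName("rec_engine")
            .set("spark.network.timeout", "600s")       # default is 120s
            .set("spark.shuffle.io.maxRetries", "10")   # default is 3
            .set("spark.shuffle.io.retryWait", "30s"))  # default is 5s
    sc = SparkContext(conf=conf)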


The following is the error that I receive on my most recent attempted run of
the application:

Traceback (most recent call last):
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 413, in <module>
    main(sc,sw,sw_set)
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 391, in main
    run_engine(submission_type,inputSub,mdb_collection,mdb_collectionType,sw_set,sc,weighted=False)
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 332, in run_engine
    similarities_recRDD,recommendations = recommend(subRDD,mdb_collection,query_format,sw_set,sc)
  File "/home/spark/enigma_analytics/rec_engine/submission.py", line 204, in recommend
    idfsCorpusWeightsBroadcast = core.idfsRDD(corpus,sc)
  File "/home/spark/enigma_analytics/rec_engine/core.py", line 38, in idfsRDD
    idfsInputRDD = ta.inverseDocumentFrequency(corpusRDD)
  File "/home/spark/enigma_analytics/rec_engine/textAnalyzer.py", line 106, in inverseDocumentFrequency
    N = corpus.map(lambda doc: doc[0]).distinct().count()
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in count
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in fold
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
  File "/usr/local/spark/current/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/spark/current/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 5 (count at /home/spark/enigma_analytics/rec_engine/textAnalyzer.py:106) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to connect to spark-mastr-1:44642
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
    at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
    at