Thanks, Krishna. I use a small cluster; each compute node has 16 GB of RAM and eight 2.66 GHz CPU cores.
On Sat, Jun 21, 2014 at 3:16 PM, Krishna Sankar [via Apache Spark User List] <ml-node+s1001560n8077...@n3.nabble.com> wrote:

> Hi,
>
> - I have seen similar behavior before. As far as I can tell, the root
>   cause is an out-of-memory error; I verified this by monitoring memory use.
> - I had a 30 GB file and was running on a single machine with 16 GB,
>   so I knew it would fail. But instead of raising an exception, some
>   part of the system kept churning.
> - My suggestion is to review the JVM memory settings (try larger
>   values), make sure the settings are propagated to all the workers,
>   and monitor memory while the job is running.
> - Another approach is to split the file and try progressively larger sizes.
> - I also see symptoms of failed connections. While I can't say for
>   certain that this is the problem, check your topology and network
>   connectivity.
> - Out of curiosity, what kind of machines are you running? Bare metal?
>   EC2? How much memory? 64-bit OS?
> - I assume these are big machines, so the resources themselves might
>   not be the problem.
>
> Cheers
> <k/>
>
>
> On Sat, Jun 21, 2014 at 12:55 PM, yxzhao <[hidden email]> wrote:
>
>> I ran the PageRank example on a large data set (5 GB) using 48
>> machines. The job got stuck at 14/05/20 21:32:17, as the attached log
>> shows. It stayed stuck for more than 10 hours, so I finally killed it,
>> but I did not find any information explaining why it was stuck. Any
>> suggestions? Thanks.
>>
>> Spark_OK_48_pagerank.log
>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n8075/Spark_OK_48_pagerank.log>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Processing-Large-Data-Stuck-tp8075.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
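Krishna's memory advice can be sketched concretely. For the Spark versions current at the time (0.9/1.0), per-executor and driver memory are set on the spark-submit command line or in conf/spark-defaults.conf, and spark-submit propagates them to the workers. Everything below is illustrative, not a recommendation: the master URL, jar path, input path, iteration count, and the 12g/4g values are placeholders to be tuned to your own nodes.

```shell
# Illustrative sizing for nodes with 16 GB RAM and 8 cores each;
# leave headroom for the OS and any other daemons on the node.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPageRank \
  --master spark://your-master:7077 \
  --executor-memory 12g \
  --driver-memory 4g \
  examples/target/spark-examples_2.10-1.0.0.jar \
  hdfs:///path/to/edges.txt 10

# Equivalent settings in conf/spark-defaults.conf
# (picked up by spark-submit for every job):
#   spark.executor.memory   12g
#   spark.driver.memory     4g
```

While the job runs, watch per-node memory (e.g. with top or the executor pages of the Spark web UI at port 4040) to confirm the settings actually took effect on the workers.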
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Processing-Large-Data-Stuck-tp8075p8080.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
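The "split the file, try progressively increasing size" suggestion can be done with the standard split(1) utility; the -C option caps each piece at a byte size without breaking lines, so every piece remains a valid text input for the PageRank example. The file names below are toy stand-ins for the real 5 GB data set, and the tiny -C 8 limit is only so the example is self-contained; on the real file you would use something like -C 1G.

```shell
# Toy stand-in for the real input file (hypothetical name).
printf '1 2\n2 3\n3 1\n' > edges.txt

# Split into pieces of at most 8 bytes each, on line boundaries,
# with numeric suffixes (edges-part-00, edges-part-01, ...).
split -C 8 -d edges.txt edges-part-

# The pieces concatenate back to the original, so no records are lost.
cat edges-part-* > rejoined.txt
cmp edges.txt rejoined.txt && echo "pieces are complete"
```

Running the job first on one piece, then on two, and so on, narrows down the input size at which the hang appears.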