I'll comment on the JIRA. Thanks for reporting this; let's get to the bottom of it.

On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
> I've created an issue for this but if anyone has any advice, please let me
> know.
>
> Basically, on about 10 GBs of data, saveAsTextFile() to HDFS hangs on two
> remaining tasks (out of 320). Those tasks seem to be waiting on data from
> another task on another node. Eventually (about 2 hours later) they time out
> with a connection reset by peer.
>
> All the data actually seems to be on HDFS as the expected part files. It
> just seems like the remaining tasks have corrupted "metadata", so that they
> do not realize that they are done. Just a guess though.
>
> https://issues.apache.org/jira/browse/SPARK-2202
>
> -Suren
>
> On Wed, Jun 18, 2014 at 8:35 PM, Surendranauth Hiraman
> <suren.hira...@velos.io> wrote:
>>
>> Looks like eventually there was some type of reset or timeout and the
>> tasks have been reassigned. I'm guessing they'll keep failing until max
>> failure count.
>>
>> The machine it disconnected from was a remote machine, though I've seen
>> such failures from connections to itself with other problems. The log lines
>> from the remote machine are also below.
>>
>> Any thoughts or guesses would be appreciated!
>>
>> "HUNG" WORKER
>>
>> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.103,57626)
>> java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>     at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
>>     at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>     at java.lang.Thread.run(Thread.java:679)
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.103,57626)
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.103,57626)
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
>> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
>>
>> REMOTE WORKER
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
>> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
>>
>> On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman
>> <suren.hira...@velos.io> wrote:
>>>
>>> I have a flow that ends with saveAsTextFile() to HDFS.
>>>
>>> It seems all the expected files per partition have been written out,
>>> based on the number of part files and the file sizes.
>>>
>>> But the driver logs show 2 tasks still not completed and has no activity
>>> and the worker logs show no activity for those two tasks for a while now.
>>>
>>> Has anyone run into this situation? It's happened to me a couple of times
>>> now.
>>>
>>> Thanks.
>>>
>>> -- Suren
>>>
>>> SUREN HIRAMAN, VP TECHNOLOGY
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR
>>> NEW YORK, NY 10001
>>> O: (917) 525-2466 ext. 105
>>> F: 646.349.4063
>>> E: suren.hira...@velos.io
>>> W: www.velos.io