I've created an issue for this but if anyone has any advice, please let me know.

Basically, on about 10 GB of data, saveAsTextFile() to HDFS hangs on the two remaining tasks (out of 320). Those tasks seem to be waiting on data from another task on another node. Eventually (about 2 hours later) they time out with a "connection reset by peer" error.

All the data actually seems to be on HDFS as the expected part files. It just seems like the remaining tasks have corrupted "metadata", so that they do not realize that they are done. Just a guess though.

https://issues.apache.org/jira/browse/SPARK-2202

-Suren

On Wed, Jun 18, 2014 at 8:35 PM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:

> Looks like eventually there was some type of reset or timeout and the
> tasks have been reassigned. I'm guessing they'll keep failing until max
> failure count.
>
> The machine it disconnected from was a remote machine, though I've seen
> such failures from connections to itself with other problems. The log lines
> from the remote machine are also below.
>
> Any thoughts or guesses would be appreciated!
>
> *"HUNG" WORKER*
>
> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.103,57626)
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>     at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
>     at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:679)
> 14/06/18 19:41:18 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.103,57626)
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.103,57626)
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
>
> *REMOTE WORKER*
>
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
>
> On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
>
>> I have a flow that ends with saveAsTextFile() to HDFS.
>>
>> It seems all the expected files per partition have been written out,
>> based on the number of part files and the file sizes.
>>
>> But the driver logs show 2 tasks still not completed and has no activity
>> and the worker logs show no activity for those two tasks for a while now.
>>
>> Has anyone run into this situation? It's happened to me a couple of times
>> now.
>>
>> Thanks.
>>
>> -- Suren
>>
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hiraman@velos.io
>> W: www.velos.io
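
For anyone who wants to double-check the observation above that every expected part file made it to HDFS, here is a minimal sketch in Python. It assumes the output files follow the usual part-NNNNN naming with no compression suffix, and that the list of paths has already been collected some other way (for example from an `hdfs dfs -ls` listing). The function name `missing_partitions` and the sample paths are hypothetical and not taken from the job in this thread.

```python
import re

# Assumption: output files are named part-00000, part-00001, ... with no
# extension (a compressed output such as part-00000.gz would not match).
PART_FILE = re.compile(r"part-(\d+)$")


def missing_partitions(paths, num_partitions):
    """Return the sorted partition indices that have no part file in `paths`."""
    seen = set()
    for path in paths:
        match = PART_FILE.search(path)
        if match:
            seen.add(int(match.group(1)))
    return [i for i in range(num_partitions) if i not in seen]


if __name__ == "__main__":
    # Hypothetical listing: 320 part files plus a leftover temporary directory.
    listing = ["/out/run1/part-%05d" % i for i in range(320)]
    listing.append("/out/run1/_temporary")
    print(missing_partitions(listing, 320))  # prints []
```

An empty result only says that a file exists for each partition. It does not prove that each file is complete, so comparing file sizes as described above is still worthwhile.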