I'll make a comment on the JIRA - thanks for reporting this; let's get
to the bottom of it.
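
To make sure we're looking at the same shape of job, here is a minimal sketch of
the kind of flow you describe. The input path, the placeholder map() step, and
how the 320 partitions come about are my assumptions, not details from your
report; and raising spark.task.maxFailures only buys more retries, it does not
address the hang itself.

    // Minimal sketch (Scala): a flow ending in saveAsTextFile() on HDFS.
    // Paths, the map() step, and the partition count are illustrative only.
    import org.apache.spark.{SparkConf, SparkContext}

    object SaveAsTextFileRepro {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("saveAsTextFile-hang-repro")
          // Default is 4; raising it delays job failure but does not fix the hang.
          .set("spark.task.maxFailures", "8")
        val sc = new SparkContext(conf)

        sc.textFile("hdfs:///data/input")         // ~10 GB of input
          .map(_.toLowerCase)                     // placeholder transformation
          .repartition(320)                       // matches the 320 tasks reported
          .saveAsTextFile("hdfs:///data/output")  // hangs on the last 2 of 320 tasks

        sc.stop()
      }
    }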

On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman
<suren.hira...@velos.io> wrote:
> I've created an issue for this but if anyone has any advice, please let me
> know.
>
> Basically, on about 10 GB of data, saveAsTextFile() to HDFS hangs on the two
> remaining tasks (out of 320). Those tasks seem to be waiting on data from
> another task on another node. Eventually (about 2 hours later) they time out
> with a "connection reset by peer" error.
>
> All the data actually seems to be on HDFS as the expected part files. It
> just seems like the remaining tasks have corrupted "metadata", so that they
> do not realize that they are done. Just a guess though.
>
> https://issues.apache.org/jira/browse/SPARK-2202
>
> -Suren
>
>
>
>
> On Wed, Jun 18, 2014 at 8:35 PM, Surendranauth Hiraman
> <suren.hira...@velos.io> wrote:
>>
>> Looks like eventually there was some type of reset or timeout and the
>> tasks have been reassigned. I'm guessing they'll keep failing until they hit
>> the max failure count.
>>
>> The machine the hung worker disconnected from was a remote machine, though
>> I've seen similar failures on connections from a machine to itself alongside
>> other problems. The log lines from the remote machine are also below.
>>
>> Any thoughts or guesses would be appreciated!
>>
>> "HUNG" WORKER
>>
>> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.103,57626)
>> java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>     at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
>>     at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>     at java.lang.Thread.run(Thread.java:679)
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.103,57626)
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.103,57626)
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
>> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
>>
>>
>> REMOTE WORKER
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
>> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
>>
>>
>>
>>
>> On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman
>> <suren.hira...@velos.io> wrote:
>>>
>>> I have a flow that ends with saveAsTextFile() to HDFS.
>>>
>>> It seems all the expected files per partition have been written out,
>>> based on the number of part files and the file sizes.
>>>
>>> But the driver logs show 2 tasks still not completed, with no activity,
>>> and the worker logs show no activity for those two tasks for a while now.
>>>
>>> Has anyone run into this situation? It's happened to me a couple of times
>>> now.
>>>
>>> Thanks.
>>>
>>> -- Suren
>>>
>>
>>
>>
>
>
>
> --
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hira...@velos.io
> W: www.velos.io
>
