Re: Missing shuffle files

2015-02-28 Thread Corey Nolet
... On Mon, Feb 23, 2015 at 9:54 PM, Corey Nolet cjno...@gmail.com wrote: I'm looking @ my yarn container logs for some of the executors which appear to be failing (with the missing shuffle files). I see exceptions that say client.TransportClientFactory: Found inactive connection to host

Re: Missing shuffle files

2015-02-24 Thread Anders Arpteg
before the job complete but it's looking better... On Mon, Feb 23, 2015 at 9:54 PM, Corey Nolet cjno...@gmail.com wrote: I'm looking @ my yarn container logs for some of the executors which appear to be failing (with the missing shuffle files). I see exceptions that say

Re: Missing shuffle files

2015-02-23 Thread Anders Arpteg
No, unfortunately we're not making use of dynamic allocation or the external shuffle service. Hoping that we could reconfigure our cluster to make use of it, but since it requires changes to the cluster itself (and not just the Spark app), it could take some time. Unsure if task 450 was acting as

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
I'm looking @ my yarn container logs for some of the executors which appear to be failing (with the missing shuffle files). I see exceptions that say client.TransportClientFactory: Found inactive connection to host/ip:port, closing it. Right after that I see shuffle.RetryingBlockFetcher: Exception

Re: Missing shuffle files

2015-02-23 Thread Anders Arpteg
... On Mon, Feb 23, 2015 at 9:54 PM, Corey Nolet cjno...@gmail.com wrote: I'm looking @ my yarn container logs for some of the executors which appear to be failing (with the missing shuffle files). I see exceptions that say client.TransportClientFactory: Found inactive connection to host/ip:port

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
Nolet cjno...@gmail.com wrote: I'm looking @ my yarn container logs for some of the executors which appear to be failing (with the missing shuffle files). I see exceptions that say client.TransportClientFactory: Found inactive connection to host/ip:port, closing it. Right after that I see

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
to be failing (with the missing shuffle files). I see exceptions that say client.TransportClientFactory: Found inactive connection to host/ip:port, closing it. Right after that I see shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks. java.io.IOException: Failed

Re: Missing shuffle files

2015-02-22 Thread Sameer Farooqui
Do you guys have dynamic allocation turned on for YARN? Anders, was Task 450 in your job acting like a Reducer and fetching the Map spill output data from a different node? If a Reducer task can't read the remote data it needs, that could cause the stage to fail. Sometimes this forces the

Missing shuffle files

2015-02-21 Thread Anders Arpteg
For large jobs, the following error message is shown that seems to indicate that shuffle files for some reason are missing. It's a rather large job with many partitions. If the data size is reduced, the problem disappears. I'm running a build from Spark master post 1.2 (build at 2015-01-16) and

Re: Missing shuffle files

2015-02-21 Thread Corey Nolet
I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory allocated for the application. I was thinking perhaps it was possible that a

Re: Missing shuffle files

2015-02-21 Thread Petar Zecevic
Could you try to turn on the external shuffle service? spark.shuffle.service.enabled=true On 21.2.2015. 17:50, Corey Nolet wrote: I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm
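
For anyone trying the suggestion above, here is a minimal sketch of passing the property on the spark-submit command line. The property name spark.shuffle.service.enabled is Spark's own; the helper to_submit_args is hypothetical and exists only for illustration. As noted elsewhere in the thread, the external shuffle service also has to be set up on the cluster itself, so setting this property in the application alone is probably not sufficient.

```python
def to_submit_args(props):
    """Render a dict of Spark properties as spark-submit --conf arguments.

    Hypothetical helper for illustration; not part of Spark.
    """
    args = []
    for key, value in sorted(props.items()):
        # Spark expects boolean properties as lowercase "true"/"false".
        if isinstance(value, bool):
            value = "true" if value else "false"
        args += ["--conf", f"{key}={value}"]
    return args


# The property suggested in the thread for the external shuffle service.
shuffle_props = {"spark.shuffle.service.enabled": True}

print(" ".join(to_submit_args(shuffle_props)))
# prints: --conf spark.shuffle.service.enabled=true
```

The same key/value pair could instead go into spark-defaults.conf or be set on a SparkConf in the application; the command-line form is shown only because it needs no code changes.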