On Mon, Feb 23, 2015 at 9:54 PM, Corey Nolet cjno...@gmail.com wrote:
I'm looking @ my yarn container logs for some of the executors which
appear to be failing (with the missing shuffle files). I see exceptions
that say client.TransportClientFactory: Found inactive connection to
host ... before the job completed, but it's looking better...
No, unfortunately we're not making use of dynamic allocation or the
external shuffle service. Hoping that we could reconfigure our cluster to
make use of it, but since it requires changes to the cluster itself (and
not just the Spark app), it could take some time.
Unsure if task 450 was acting as ...
I'm looking @ my yarn container logs for some of the executors which appear
to be failing (with the missing shuffle files). I see exceptions that say
client.TransportClientFactory: Found inactive connection to host/ip:port,
closing it.
Right after that I see shuffle.RetryingBlockFetcher: Exception
...
... appear to be failing (with the missing shuffle files). I see exceptions
that say client.TransportClientFactory: Found inactive connection to
host/ip:port, closing it.
Right after that I see shuffle.RetryingBlockFetcher: Exception while
beginning fetch of 1 outstanding blocks. java.io.IOException: Failed ...
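For context, Spark's RetryingBlockFetcher retries failed block fetches a bounded number of times (governed by spark.shuffle.io.maxRetries and spark.shuffle.io.retryWait) and only gives up, surfacing that IOException, once the retries are exhausted. A minimal sketch of that retry pattern in plain Python, with hypothetical names (fetch_blocks, fetch_with_retries) standing in for the real Netty-based implementation:

```python
import time

def fetch_with_retries(fetch_blocks, block_ids, max_retries=3, retry_wait=0.0):
    """Retry outstanding block fetches, mimicking the semantics of
    Spark's RetryingBlockFetcher: on an IOError, wait and retry
    only the blocks that have not yet been fetched successfully."""
    outstanding = list(block_ids)
    fetched = {}
    attempt = 0
    while outstanding:
        try:
            for block_id in list(outstanding):
                fetched[block_id] = fetch_blocks(block_id)
                outstanding.remove(block_id)
        except IOError:
            attempt += 1
            if attempt > max_retries:
                # Mirrors the fatal path: the reducer reports a fetch
                # failure and the stage may be resubmitted.
                raise
            time.sleep(retry_wait)
    return fetched
```

So a single "Exception while beginning fetch" line is not necessarily fatal; it becomes fatal when the remote executor stays dead across all retries, which matches the lost-executor symptom in this thread.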
Do you guys have dynamic allocation turned on for YARN?
Anders, was Task 450 in your job acting like a Reducer and fetching the Map
spill output data from a different node?
If a Reducer task can't read the remote data it needs, that could cause the
stage to fail. Sometimes this forces the
For large jobs, the following error message is shown, which seems to indicate
that shuffle files are missing for some reason. It's a rather large job
with many partitions. If the data size is reduced, the problem disappears.
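One reason partition count matters here: a shuffle between M map partitions and R reduce partitions produces on the order of M x R shuffle blocks that reducers must fetch, so shrinking the data (and hence the partition counts) reduces the pressure quadratically. A back-of-the-envelope helper (hypothetical name, not a Spark API):

```python
def shuffle_block_count(map_partitions, reduce_partitions):
    # Each map task writes one output slice per reduce partition, so
    # reducers collectively fetch map_partitions * reduce_partitions blocks.
    return map_partitions * reduce_partitions

# 10,000 partitions on both sides implies 100 million shuffle blocks;
# halving both sides cuts the block count by 4x.
```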
I'm running a build from Spark master post 1.2 (built 2015-01-16) and
I'm experiencing the same issue. Upon closer inspection I'm noticing that
executors are being lost as well. Thing is, I can't figure out how they are
dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory
allocated for the application. I was thinking perhaps it was possible that
a ...
Could you try to turn on the external shuffle service?
spark.shuffle.service.enabled=true
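For reference, that Spark property is only half of the setup: the shuffle service also has to be registered as a YARN auxiliary service on every NodeManager, which is why it cannot be enabled from the Spark application alone. Per the Spark-on-YARN docs of that era, roughly:

```xml
<!-- yarn-site.xml on every NodeManager -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```

The spark-&lt;version&gt;-yarn-shuffle.jar must also be on each NodeManager's classpath, and the NodeManagers restarted, before executors can register with the service.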
On 21.2.2015. 17:50, Corey Nolet wrote:
I'm experiencing the same issue. Upon closer inspection I'm noticing
that executors are being lost as well. Thing is, I can't figure out
how they are dying. I'm ...
11 matches