Okay, maybe these errors are more helpful:
WARN server.TransportChannelHandler: Exception in connection from ip-10-0-0-138.ec2.internal/10.0.0.138:39723
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
15/12/07 21:25:42 ERROR client.TransportResponseHandler: Still have 5 requests outstanding when connection from ip-10-0-0-138.ec2.internal/10.0.0.138:39723 is closed
15/12/07 21:25:42 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
15/12/07 21:25:42 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 39 outstanding blocks after 5000 ms
15/12/07 21:25:42 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
That continues for a while.
There is also this error on the stage status page in the Spark History Server:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:456)
    at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:183)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:47)
    at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:71)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
I am not sure which occurred first.
On Dec 7, 2015, at 2:03 PM, Cramblit, Ross (Reuters News) <[email protected]> wrote:
Here is the trace I get from the command line:
[Stage 4:================>                                        (60 + 60) / 200]
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:33822] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/12/07 18:59:41 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:54951] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/12/07 18:59:41 ERROR YarnScheduler: Lost executor 3 on ip-10-0-0-138.ec2.internal: remote Rpc client disassociated
15/12/07 18:59:41 WARN TaskSetManager: Lost task 62.0 in stage 4.0 (TID 2003, ip-10-0-0-138.ec2.internal): ExecutorLostFailure (executor 3 lost)
15/12/07 18:59:41 WARN TaskSetManager: Lost task 65.0 in stage 4.0 (TID 2006, ip-10-0-0-138.ec2.internal): ExecutorLostFailure (executor 3 lost)
…
…
On Dec 7, 2015, at 1:33 PM, Cramblit, Ross (Reuters News) <[email protected]> wrote:
I have looked through the logs and do not see any WARNINGs or ERRORs; the executors just seem to stop logging.
I am running Spark 1.5.2 on YARN.
On Dec 7, 2015, at 1:20 PM, Ted Yu <[email protected]> wrote:
bq. complete a shuffle stage due to lost executors
Have you taken a look at the log for the lost executor(s)?
Which release of Spark are you using ?
Cheers
On Mon, Dec 7, 2015 at 10:12 AM, <[email protected]> wrote:
I have a pyspark app loading a large-ish (100 GB) dataframe from JSON files, and it turns out there are a number of duplicate JSON objects in the source data. I am trying to find the best way to remove these duplicates before using the dataframe.
With both df.dropDuplicates() and df.sqlContext.sql('''SELECT DISTINCT * ...'''), the application is not able to complete a shuffle stage due to lost executors.
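For reference, here is roughly what the two attempts look like (the paths, app name, and table name below are placeholders, not the real job):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="dedup-json")            # placeholder app name
    sqlContext = SQLContext(sc)

    # Load the ~100 GB of JSON files into a DataFrame (placeholder path)
    df = sqlContext.read.json("s3://my-bucket/raw-json/")

    # Attempt 1: DataFrame API
    deduped = df.dropDuplicates()

    # Attempt 2: SQL DISTINCT over a temp table (placeholder table name)
    df.registerTempTable("raw_events")
    deduped = sqlContext.sql("SELECT DISTINCT * FROM raw_events")

    # Writing (or otherwise materializing) the result kicks off the shuffle
    # stage that fails with the lost executors above (placeholder output path).
    deduped.write.parquet("s3://my-bucket/deduped/")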
Is there a more efficient way to remove these duplicate rows? If not, what settings can I tweak to help this succeed? I have tried both decreasing and increasing the default number of shuffle partitions (to 100 and 500, respectively), but neither changes the behavior.
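In case it helps, this is where I am setting those. Only the shuffle-partitions change is something I have actually tried; the memory-overhead and retry values below are examples of settings I am considering, not something I have verified:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Example values only -- not yet confirmed to help with the lost executors.
    conf = (SparkConf()
            .set("spark.yarn.executor.memoryOverhead", "2048")   # MB, example value
            .set("spark.shuffle.io.maxRetries", "10"))           # example value

    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # Number of post-shuffle partitions used by DISTINCT / dropDuplicates
    # (default is 200; I have tried 100 and 500 here).
    sqlContext.setConf("spark.sql.shuffle.partitions", "500")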