Re: TaskCommitDenied (Driver denied task commit)
... at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    ... 1 more

...

Not sure if this reveals anything at all.

On Thu, Jan 21, 2016 at 2:58 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
> My hunch is that the TaskCommitDenied is perhaps a red herring and the
> problem is groupByKey - but I've also just seen a lot of people be bitten
> by it, so that might not be the issue. If you just do a count at the point
> of the groupByKey, does the pipeline succeed?
>
> ...
Re: TaskCommitDenied (Driver denied task commit)
16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache broadcast_4 in memory! (computed 60.2 MB so far)
16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to disk instead.
[Stage 1:> (2260 + 7) / 2262] 16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage 1.0 (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job: 1, partition: 1440, attempt: 4530
[Stage 1:> (2260 + 6) / 2262] 16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage 1.0 (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job: 1, partition: 1488, attempt: 4531
[Stage 1:> (2261 + 4) / 2262] 16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage 1.0 (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job: 1, partition: 1982, attempt: 4532
16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0 (TID 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1, partition: 2214, attempt: 4482
16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1, partition: 2168, attempt: 4436

I am running with:

spark-submit --class "myclass" \
  --num-executors 90 \
  --driver-memory 1g \
  --executor-memory 60g \
  --executor-cores 8 \
  --master yarn-client \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my.jar

There are 2262 input files totaling just 98.6G. The DAG is basically textFile().map().filter().groupByKey().saveAsTextFile().
On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
> Can you post more of your log? How big are the partitions? What is the
> action you are performing?
>
> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
>> Example warning:
>>
>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
>> 4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 2168, attempt: 4436
>>
>> Is there a solution for this? Increase driver memory? I'm using just 1G
>> driver memory but ideally I won't have to increase it.
>>
>> The RDD being processed has 2262 partitions.
>>
>> Arun
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
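[For reference, the job above can be sketched as follows. Only the operator chain textFile().map().filter().groupByKey().saveAsTextFile() comes from this thread; the paths and the parse/filter bodies are hypothetical stand-ins.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline-sketch"))

    sc.textFile("hdfs:///path/to/input")                      // 2262 files, ~98.6G total
      .map { line => val f = line.split('\t'); (f(0), f(1)) } // hypothetical parse into (key, value)
      .filter { case (_, v) => v.nonEmpty }                   // hypothetical filter
      .groupByKey()                 // shuffles every value for a key to a single task
      .saveAsTextFile("hdfs:///path/to/output") // the write step, where commits are coordinated
    sc.stop()
  }
}
```

[With speculation enabled, the saveAsTextFile step is where two attempts of the same partition can race to commit, which is where the TaskCommitDenied warnings in the log above are emitted.]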
Re: TaskCommitDenied (Driver denied task commit)
Is speculation enabled? This TaskCommitDenied by driver error is thrown by writers who lost the race to commit an output partition. I don't think this has anything to do with key skew etc. Replacing the groupByKey with a count will mask this exception because the coordination does not get triggered in non-save/write operations.

On Thu, Jan 21, 2016 at 2:46 PM Holden Karau <hol...@pigscanfly.ca> wrote:
> Before we dig too far into this, the thing which most quickly jumps out to
> me is groupByKey, which could be causing some problems - what's the
> distribution of keys like? Try replacing the groupByKey with a count() and
> see if the pipeline works up until that stage. Also 1G of driver memory is
> a bit small for something with 90 executors...
>
> On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
>> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> ...
>> There are 2262 input files totaling just 98.6G. The DAG is basically
>> textFile().map().filter().groupByKey().saveAsTextFile().
>>
>> ...
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
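[Josh's explanation suggests a cheap experiment: rerun the job with speculation turned off and see whether the TaskCommitDenied warnings disappear. A minimal sketch, assuming the usual SparkConf setup; the app name is illustrative.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// With spark.speculation=true, a second attempt of a slow task can finish
// around the same time as the first; both then ask the driver for
// permission to commit the same output partition, and the loser is failed
// with TaskCommitDenied. The denial itself is benign (the partition was
// committed exactly once), but it shows up as a lost task in the logs.
val conf = new SparkConf()
  .setAppName("commit-denied-check")  // hypothetical app name
  .set("spark.speculation", "false")  // the default in Spark 1.5
val sc = new SparkContext(conf)
```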
Re: TaskCommitDenied (Driver denied task commit)
My hunch is that the TaskCommitDenied is perhaps a red herring and the problem is groupByKey - but I've also just seen a lot of people be bitten by it, so that might not be the issue. If you just do a count at the point of the groupByKey, does the pipeline succeed?

On Thu, Jan 21, 2016 at 2:56 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
> Usually the pipeline works, it just failed on this particular input data.
> The other data it has run on is of similar size.
>
> Speculation is enabled.
>
> I'm using Spark 1.5.0.
>
> Here is the config. Many of these may not be needed anymore; they are from
> trying to get things working in Spark 1.2 and 1.3.
>
> ...
TaskCommitDenied (Driver denied task commit)
Example warning:

16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID 4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1, partition: 2168, attempt: 4436

Is there a solution for this? Increase driver memory? I'm using just 1G driver memory but ideally I won't have to increase it.

The RDD being processed has 2262 partitions.

Arun
Re: TaskCommitDenied (Driver denied task commit)
Usually the pipeline works, it just failed on this particular input data. The other data it has run on is of similar size.

Speculation is enabled.

I'm using Spark 1.5.0.

Here is the config. Many of these may not be needed anymore; they are from trying to get things working in Spark 1.2 and 1.3.

    .set("spark.storage.memoryFraction","0.2") // default 0.6
    .set("spark.shuffle.memoryFraction","0.2") // default 0.2
    .set("spark.shuffle.manager","SORT") // preferred setting for optimized joins
    .set("spark.shuffle.consolidateFiles","true") // helpful for "too many files open"
    .set("spark.mesos.coarse", "true") // helpful for MapOutputTracker errors?
    .set("spark.akka.frameSize","300") // helpful when using consolidateFiles=true
    .set("spark.shuffle.compress","false") // http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
    .set("spark.file.transferTo","false") // http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
    .set("spark.core.connection.ack.wait.timeout","600") // http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
    .set("spark.speculation","true")
    .set("spark.worker.timeout","600") // http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
    .set("spark.akka.timeout","300") // http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
    .set("spark.storage.blockManagerSlaveTimeoutMs","12")
    .set("spark.driver.maxResultSize","2048") // in response to error: Total size of serialized results of 39901 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator","--.MyRegistrator")
    .set("spark.kryo.registrationRequired", "true")
    .set("spark.yarn.executor.memoryOverhead","600")

On Thu, Jan 21, 2016 at 2:50 PM, Josh Rosen <joshro...@databricks.com> wrote:
> Is speculation enabled? This TaskCommitDenied by driver error is thrown by
> writers who lost the race to commit an output partition. I don't think this
> has anything to do with key skew etc. Replacing the groupByKey with a count
> will mask this exception because the coordination does not get triggered in
> non-save/write operations.
>
> On Thu, Jan 21, 2016 at 2:46 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>> ...
Re: TaskCommitDenied (Driver denied task commit)
Can you post more of your log? How big are the partitions? What is the action you are performing?

On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
> Example warning:
>
> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
> 4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 2168, attempt: 4436
>
> Is there a solution for this? Increase driver memory? I'm using just 1G
> driver memory but ideally I won't have to increase it.
>
> The RDD being processed has 2262 partitions.
>
> Arun

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Re: TaskCommitDenied (Driver denied task commit)
Before we dig too far into this, the thing which most quickly jumps out to me is groupByKey, which could be causing some problems - what's the distribution of keys like? Try replacing the groupByKey with a count() and see if the pipeline works up until that stage. Also 1G of driver memory is a bit small for something with 90 executors...

On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> ...
> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
> 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 2168, attempt: 4436
>
> There are 2262 input files totaling just 98.6G. The DAG is basically
> textFile().map().filter().groupByKey().saveAsTextFile().
>
> ...

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
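[Holden's diagnostic - swap the save for a count() to see whether the pipeline survives through the groupByKey - can be sketched like this; the input path and the parse/filter logic are hypothetical stand-ins for the real job.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("count-diagnostic"))

// count() is an action but not a save/write, so the output-commit
// coordination that raises TaskCommitDenied never runs. If this count
// succeeds, the groupByKey shuffle itself completed and the failure is
// specific to the write step; if it fails, suspect key skew in groupByKey.
val groups = sc.textFile("hdfs:///path/to/input")         // hypothetical path
  .map { line => val f = line.split('\t'); (f(0), f(1)) } // hypothetical parse
  .filter { case (_, v) => v.nonEmpty }                   // hypothetical filter
  .groupByKey()
  .count()
println(s"distinct keys: $groups")
```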