Moving this back onto user@

Regarding GC, can you look in the web UI and see whether the "GC time"
metric dominates the amount of time spent on each task (or at least the
tasks that aren't completing)?
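
If the UI numbers aren't conclusive, here is a minimal sketch of turning on
executor GC logging so pause times show up in the executor stdout (assuming
you can resubmit the job; the -XX flags are standard HotSpot options, not
Spark-specific):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // Print per-collection GC details and timestamps to executor stdout, which
    // you can read through the Spark executors page or the YARN NodeManager UI.
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")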

Also, have you tried bumping your spark.yarn.executor.memoryOverhead?  YARN
may be killing your executors for using too much off-heap space.  You can
see whether this is happening by looking in the Spark AM or YARN
NodeManager logs.
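
For reference, a minimal sketch of bumping it (the 2048 MB figure is only an
illustration, not a recommendation for your workload; in 1.4 the default is
roughly max(384 MB, 10% of the executor heap)):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // Extra off-heap headroom, in MB, added to each executor's YARN container request.
    .set("spark.yarn.executor.memoryOverhead", "2048")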

-Sandy

On Thu, Aug 20, 2015 at 7:39 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi, thanks very much for the response. Yes, I tried the default setting of
> 0.2 too, and it was also hitting the timeout. If it is spending time in GC,
> then why is it not throwing a GC error? I don't see any such error, and the
> YARN logs are not helpful at all. What is Tungsten and how do I use it? I
> believe Spark is doing fine otherwise: my job runs and about 60% of the
> tasks complete; things only go wrong after the first executor gets lost.
> On Aug 20, 2015 7:59 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>
>> What sounds most likely is that you're hitting heavy garbage collection.
>> Did you hit issues when the shuffle memory fraction was at its default of
>> 0.2?  A potential danger with setting the shuffle storage to 0.7 is that it
>> allows shuffle objects to get into the GC old generation, which triggers
>> more stop-the-world garbage collections.
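>>
>> For concreteness, a sketch of the two knobs involved in the pre-Tungsten
>> memory model (the values shown are the 1.4 defaults, not recommendations):
>>
>>   import org.apache.spark.SparkConf
>>
>>   val conf = new SparkConf()
>>     // Fraction of the heap used for shuffle-side aggregation buffers (default 0.2).
>>     .set("spark.shuffle.memoryFraction", "0.2")
>>     // Fraction of the heap used for cached/storage blocks (default 0.6).
>>     .set("spark.storage.memoryFraction", "0.6")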
>>
>> Have you tried enabling Tungsten / unsafe?
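>>
>> A hedged sketch of enabling it; the exact config name changed between
>> releases, so treat the name below as an assumption to verify against the
>> docs for your version:
>>
>>   import org.apache.spark.SparkConf
>>
>>   val conf = new SparkConf()
>>     // Spark 1.4.x switch for the unsafe/binary aggregation path (off by default);
>>     // Spark 1.5 replaces it with spark.sql.tungsten.enabled (on by default).
>>     .set("spark.sql.unsafe.enabled", "true")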
>>
>> Unfortunately, Spark is still not that great at dealing with
>> heavily-skewed shuffle data, because its reduce-side aggregation still
>> operates on Java objects instead of binary data.
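>>
>> One common mitigation, sketched below with a hypothetical table and
>> hypothetical column names (treat them as placeholders), is to pre-aggregate
>> on a salted key so no single reducer has to absorb an entire hot key:
>>
>>   import org.apache.spark.{SparkConf, SparkContext}
>>   import org.apache.spark.sql.hive.HiveContext
>>
>>   val sc = new SparkContext(new SparkConf().setAppName("salted-groupby"))
>>   val hiveContext = new HiveContext(sc)
>>
>>   // "events" and its "key" column are assumed for illustration only.
>>   val partial = hiveContext.sql(
>>     """SELECT key, salt, count(*) AS cnt
>>       |FROM (SELECT key, cast(rand() * 16 AS int) AS salt FROM events) t
>>       |GROUP BY key, salt""".stripMargin)
>>   partial.registerTempTable("partial_counts")
>>
>>   // Second, much smaller shuffle: at most 16 partial rows per key.
>>   val totals = hiveContext.sql(
>>     "SELECT key, sum(cnt) AS cnt FROM partial_counts GROUP BY key")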
>>
>> -Sandy
>>
>> On Thu, Aug 20, 2015 at 7:21 AM, Umesh Kacha <umesh.ka...@gmail.com>
>> wrote:
>>
>>> Hi Sandy, thanks very much for the response. I am using Spark 1.4.1 and I
>>> have set spark.shuffle.storage to 0.7, since my Spark job involves 4 groupBy
>>> queries executed using hiveContext.sql and my data set is skewed, so I
>>> believe there will be more shuffling. I don't know what's wrong: the job
>>> runs fine for almost an hour, but once the shuffle read/write column in the
>>> UI starts to show more than 10 GB, executors start getting lost because of
>>> timeouts, and then slowly the other executors get lost as well. Please guide.
>>> On Aug 20, 2015 7:38 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>>>
>>>> What version of Spark are you using?  Have you set any shuffle configs?
>>>>
>>>> On Wed, Aug 19, 2015 at 11:46 AM, unk1102 <umesh.ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> I have one Spark job which seems to run fine, but after an hour or so
>>>>> executors start getting lost because of a timeout, with an error like the
>>>>> following:
>>>>>
>>>>> cluster.yarnScheduler : Removing an executor 14 650000 timeout exceeds
>>>>> 600000 seconds
>>>>>
>>>>> and because of the above error a couple of chained errors start to
>>>>> appear, such as FetchFailedException, Rpc client disassociated, Connection
>>>>> reset by peer, IOException, etc.
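>>>>>
>>>>> For reference, a sketch of the timeout setting that appears to be involved
>>>>> (assuming it is spark.network.timeout, Spark 1.4's network/heartbeat
>>>>> timeout; raising it only buys time rather than fixing the underlying
>>>>> cause):
>>>>>
>>>>>   import org.apache.spark.SparkConf
>>>>>
>>>>>   val conf = new SparkConf()
>>>>>     // Default is 120s, so the 600000 ms in the log above suggests it may
>>>>>     // already have been raised; 600s here is just an example value.
>>>>>     .set("spark.network.timeout", "600s")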
>>>>>
>>>>> Please see the following UI screenshot. I have noticed that once shuffle
>>>>> read/write grows beyond 10 GB, executors start getting lost because of the
>>>>> timeout. How do I clear this 10 GB of accumulated memory shown in the
>>>>> shuffle read/write section? I don't cache anything, so why is Spark not
>>>>> clearing that memory? Please guide.
>>>>>
>>>>> IMG_20150819_231418358.jpg
>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n24345/IMG_20150819_231418358.jpg>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-avoid-executor-time-out-on-yarn-spark-while-dealing-with-large-shuffle-skewed-data-tp24345.html
>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>
>>>>
>>
