Yes, my input data is partitioned in a completely random manner, so each worker that produces shuffle data processes only a part of it. The way I understand it, before each stage each worker needs to distribute the correct partitions (based on hash key ranges?) to the other workers, and this is where I would expect network input/output to spike. After that the processing should occur, where I would expect CPU to spike. If you check the image I attached earlier in this thread, you can see that, for example, between 8:25 and 8:30 disk, network and CPU are almost completely idle.
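(For reference on the "hash key ranges" guess: by default, groupByKey in Spark uses a hash partitioner rather than key ranges — each key goes to partition `nonNegativeMod(key.hashCode, numPartitions)`, so every mapper can compute the same mapping independently. A minimal standalone sketch of that rule, with no Spark dependency; the object and key names are illustrative:)

```scala
// Standalone sketch (no Spark dependency) of how Spark's default
// HashPartitioner decides which reducer a key belongs to.
object HashPartitionSketch {
  // Mirrors the non-negative modulo Spark uses: hashCode can be
  // negative on the JVM, so a plain % is not enough.
  def getPartition(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("user-1", "user-2", "user-3", "user-4")
    val numPartitions = 3
    // Every worker computes the same mapping, so each mapper knows
    // which reducer will fetch each of its output buckets.
    keys.foreach { k =>
      println(s"$k -> partition ${getPartition(k, numPartitions)}")
    }
  }
}
```

Range partitioning (a `RangePartitioner`) is only used where explicitly requested, e.g. by sortByKey.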
I would be very happy to receive some suggestions on how to debug this, or where you think would be a good place to start looking.

Kind regards, Domen

On Fri, Mar 21, 2014 at 6:58 PM, Mayur Rustagi [via Apache Spark User List] wrote:

> In your task details I don't see a large skew in tasks, so the low CPU usage
> period occurs between stages or during stage execution.
> One possible issue: your data is 89 GB of shuffle read. If the machine
> producing the shuffle data is not the one processing it, shuffling data
> across machines may be causing the delay.
> Can you look at your network traffic during that period to check performance?
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
> On Fri, Mar 21, 2014 at 8:33 AM, sparrow wrote:
>
>> Here is the stage overview:
>> [image: Inline image 2]
>>
>> and here are the stage details for stage 0:
>> [image: Inline image 1]
>>
>> Transformations from the first stage to the second one are trivial, so that
>> should not be the bottleneck (apart from keyBy().groupByKey(), which causes
>> the shuffle write/read).
>>
>> Kind regards, Domen
>>
>> On Thu, Mar 20, 2014 at 8:38 PM, Mayur Rustagi [via Apache Spark User List] wrote:
>>
>>> I would have preferred the stage window details & aggregate task
>>> details (above the task list).
>>> Basically, when you run a job it translates to multiple stages, and each
>>> stage translates to multiple tasks (each run on a worker core).
>>> So give me a breakup like:
>>> my job is taking 16 min;
>>> 3 stages — stage 1: 5 min, stage 2: 10 min, stage 3: 1 min;
>>> and for stage 2, the aggregate task screenshot that shows the 50th,
>>> 75th & 100th percentiles.
>>> Regards
>>> Mayur
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>> On Thu, Mar 20, 2014 at 9:55 AM, sparrow wrote:
>>>
>>>> This is what the web UI looks like:
>>>> [image: Inline image 1]
>>>>
>>>> I also tail all the worker logs, and these are the last entries before
>>>> the waiting begins:
>>>>
>>>> 14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>>>> maxBytesInFlight: 50331648, minRequest: 10066329
>>>> 14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>>>> Getting 29853 non-zero-bytes blocks out of 37714 blocks
>>>> 14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>>>> Started 5 remote gets in 62 ms
>>>> [PSYoungGen: 12464967K->3767331K(10552192K)]
>>>> 36074093K->29053085K(44805696K), 0.6765460 secs] [Times: user=5.35
>>>> sys=0.02, real=0.67 secs]
>>>> [PSYoungGen: 10779466K->3203826K(9806400K)]
>>>> 35384386K->31562169K(44059904K), 0.6925730 secs] [Times: user=5.47
>>>> sys=0.00, real=0.70 secs]
>>>>
>>>> From the screenshot above you can see that tasks take ~6 minutes to
>>>> complete. The amount of time it takes the tasks to complete seems to depend
>>>> on the amount of input data: if the S3 input string captures 2.5 times less
>>>> data (less data to shuffle write and later read), the same tasks take 1
>>>> minute. Any idea how to debug what the workers are doing?
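(A note on those log lines: the `maxBytesInFlight: 50331648` value is 48 MB, which in Spark 0.8.x corresponds to the `spark.reducer.maxMbInFlight` setting — the cap on simultaneous in-flight shuffle-fetch data per reducer. To correlate the stalls with GC pauses or fetch behavior, one option is a `spark-env.sh` fragment along these lines; this is an illustrative sketch that assumes a HotSpot JVM and Spark 0.8.x, so adjust the values for your cluster:)

```shell
# spark-env.sh (Spark 0.8.x) — illustrative fragment, not a drop-in config.
# -verbose:gc & friends: log every GC pause with timestamps, so idle windows
#   in the resource graphs can be matched against (or ruled out as) GC time.
# spark.reducer.maxMbInFlight: in-flight shuffle-fetch cap per reducer;
#   48 MB is the 0.8 default (the maxBytesInFlight value seen in the log).
SPARK_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Dspark.reducer.maxMbInFlight=96"
export SPARK_JAVA_OPTS
```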
>>>>
>>>> Domen
>>>>
>>>> On Wed, Mar 19, 2014 at 5:27 PM, Mayur Rustagi [via Apache Spark User List] wrote:
>>>>
>>>>> You could have some outlier task that is preventing the next set of
>>>>> stages from launching. Can you check the stages' state in the Spark
>>>>> web UI: is any task running, or is everything halted?
>>>>> Regards
>>>>> Mayur
>>>>>
>>>>> Mayur Rustagi
>>>>> Ph: +1 (760) 203 3257
>>>>> http://www.sigmoidanalytics.com
>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>
>>>>> On Wed, Mar 19, 2014 at 5:40 AM, Domen Grabec wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a cluster with 16 nodes; each node has 69 GB of RAM (50 GB
>>>>>> goes to Spark) and 8 cores, running Spark 0.8.1.
>>>>>> I have a groupByKey operation that
>>>>>> causes a wide RDD dependency, so shuffle write and shuffle read are
>>>>>> performed.
>>>>>>
>>>>>> For some reason all worker threads seem to sleep for about 3-4
>>>>>> minutes each time before performing a shuffle read and completing a
>>>>>> set of tasks. See the graphs below showing how no resources are
>>>>>> utilized in specific time windows.
>>>>>>
>>>>>> Each time 3-4 minutes pass, the next set of tasks is grabbed and
>>>>>> processed, and then another waiting period happens.
>>>>>>
>>>>>> Each task has an input of 80 MB +- 5 MB of data to shuffle read.
>>>>>>
>>>>>> [image: Inline image 1]
>>>>>>
>>>>>> Here <http://pastebin.com/UHWMdTRY> is a link to a thread dump
>>>>>> performed in the middle of the waiting period. Any idea what could
>>>>>> cause the long waits?
>>>>>>
>>>>>> Kind regards, Domen
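(One shuffle-volume note on the wide dependency above: groupByKey ships every record across the network untouched, whereas reduceByKey combines values map-side first — so if the downstream step is an aggregation rather than needing the full groups, switching can shrink the shuffle substantially. A plain-Scala sketch of the difference, with no Spark dependency; the object, data, and "sum" aggregation are illustrative assumptions about the workload:)

```scala
// Plain-Scala sketch (no Spark needed) of why groupByKey shuffles more data
// than reduceByKey when the goal is an aggregate: reduceByKey pre-combines
// values on the map side, so each mapper ships one record per key instead
// of every record it holds.
object ShuffleVolumeSketch {
  type Record = (String, Int)

  // What groupByKey ships from one mapper: every record, untouched.
  def groupByKeyShuffle(mapperOutput: Seq[Record]): Seq[Record] =
    mapperOutput

  // What reduceByKey ships: one pre-combined (here: summed) record per key.
  def reduceByKeyShuffle(mapperOutput: Seq[Record]): Seq[Record] =
    mapperOutput.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq

  def main(args: Array[String]): Unit = {
    val mapperOutput = Seq(("a", 1), ("a", 2), ("b", 3), ("a", 4), ("b", 5))
    println(s"groupByKey ships ${groupByKeyShuffle(mapperOutput).size} records")
    println(s"reduceByKey ships ${reduceByKeyShuffle(mapperOutput).size} records")
  }
}
```

Whether this applies depends on what is done with the groups afterwards; if the full per-key value lists are genuinely needed, the shuffle volume cannot be reduced this way.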
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-worker-threads-waiting-tp2859p3090.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.