If it is really due to data skew, would the hanging task have a much bigger Shuffle Write Size in this case? Here, the shuffle write size for that task is 0, and the rest of its I/O is not much larger than that of the tasks that finished quickly. Is that normal? I am also interested in this case: from the statistics on the UI, how can you tell that a task could have skewed data?

Yong
Date: Mon, 13 Apr 2015 12:58:12 -0400
Subject: Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete
From: jcove...@gmail.com
To: deepuj...@gmail.com
CC: user@spark.apache.org

I can promise you that this is also a problem in the pig world :) Not sure why it's not a problem for this data set, though... Are you sure that the two are running the exact same code?

You should inspect your source data. Make a histogram for each side of the join and see what the data distribution looks like. If there is a value or bucket with a disproportionate share of the records, you know you have an issue.

2015-04-13 12:50 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>:

You mean there is a tuple in either RDD that has itemID = 0 or null? And what is a catch-all? Does that imply it is a good idea to run a filter on each RDD first? We do not do this using Pig on M/R. Is it required in the Spark world?

On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney <jcove...@gmail.com> wrote:

My guess would be data skew. Do you know if there is some item id that is a catch-all? Can it be null? Item id 0?
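The histogram suggestion above can be sketched with plain Scala collections (the toy `keys` data and the 50% threshold are illustrative assumptions, not part of the original thread; on a real pair RDD the analogous calls would be `rdd.countByKey()` or, for a large key space, `rdd.mapValues(_ => 1L).reduceByKey(_ + _)`):

```scala
// Toy key sample standing in for the join keys of one RDD side.
val keys: Seq[Long] = Seq(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 4L)

// Build the histogram: occurrences per key.
val histogram: Map[Long, Int] =
  keys.groupBy(identity).map { case (k, vs) => (k, vs.size) }

// Flag keys holding a disproportionate share of the records,
// here anything above 50% of all rows (threshold is an assumption).
val skewedKeys: Set[Long] =
  histogram.collect { case (k, n) if n.toDouble / keys.size > 0.5 => k }.toSet

println(skewedKeys) // key 0 dominates the distribution: Set(0)
```

A single key owning most of the rows means one reducer-side task receives most of the shuffle data, which matches the one-straggler symptom described in this thread.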
Lots of data sets have this sort of value, and it always kills joins.

2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>:

Code:

    val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))] =
      lstgItem.join(viEvents).map {
        case (itemId, (listing, viDetail)) =>
          val viSummary = new VISummary
          viSummary.leafCategoryId = listing.getLeafCategId().toInt
          viSummary.itemSiteId = listing.getItemSiteId().toInt
          viSummary.auctionTypeCode = listing.getAuctTypeCode().toInt
          viSummary.sellerCountryId = listing.getSlrCntryId().toInt
          viSummary.buyerSegment = "0"
          viSummary.isBin = (if (listing.getBinPriceLstgCurncy.doubleValue() > 0) 1 else 0)
          val sellerId = listing.getSlrId.toLong
          (sellerId, (viDetail, viSummary, itemId))
      }

Running tasks (all attempt 0, locality PROCESS_LOCAL, launched 2015/04/13 06:43:53):

Task 0 (ID 216): RUNNING, executor 181 / phxaishdc9dn0474.phx.ebay.com, duration 1.7 h, GC time 13 min, shuffle read 3.0 GB / 56964921 records, shuffle write 0.0 B / 0 records, shuffle spill 21.2 GB memory / 1902.6 MB disk
Task 2 (ID 218): SUCCESS, executor 582 / phxaishdc9dn0235.phx.ebay.com, duration 15 min, GC time 31 s, shuffle read 2.2 GB / 1666851 records, write time 0.1 s, shuffle write 3.0 MB / 2062 records, shuffle spill 54.8 GB memory / 1924.5 MB disk
Task 1 (ID 217): SUCCESS, executor 202 / phxdpehdc9dn2683.stratus.phx.ebay.com, duration 19 min, GC time 1.3 min, shuffle read 2.2 GB / 1687086 records, write time 75 ms, shuffle write 3.9 MB / 2692 records, shuffle spill 33.7 GB memory / 1960.4 MB disk
Task 4 (ID 220): SUCCESS, executor 218 / phxaishdc9dn0855.phx.ebay.com, duration 15 min, GC time 56 s, shuffle read 2.2 GB / 1675654 records, write time 40 ms, shuffle write 3.3 MB / 2260 records, shuffle spill 26.2 GB memory / 1928.4 MB disk

Command:

    ./bin/spark-submit -v --master yarn-cluster \
      --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar \
      --jars /apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar,/home/dvasthimal/spark1.3/spark_reporting_dep_only-1.0-SNAPSHOT.jar \
      --num-executors 3000 --driver-memory 12g --driver-java-options "-XX:MaxPermSize=6G" \
      --executor-memory 12g --executor-cores 1 --queue hdmi-express \
      --class com.ebay.ep.poc.spark.reporting.SparkApp \
      /home/dvasthimal/spark1.3/spark_reporting-1.0-SNAPSHOT.jar \
      startDate=2015-04-6 endDate=2015-04-7 \
      input=/user/dvasthimal/epdatasets_small/exptsession subcommand=viewItem \
      output=/user/dvasthimal/epdatasets/viewItem \
      buffersize=128 maxbuffersize=1068 maxResultSize=2G

What do I do? I killed the job twice, and it's stuck again. Where is it stuck?

--
Deepak
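If the histogram does turn up a catch-all key (e.g. itemID = 0, or null for reference-typed keys), the filter-first idea raised in this thread can be sketched as below. This is a minimal illustration on plain Scala collections with hypothetical toy data named after the job's `lstgItem` and `viEvents` RDDs; on real RDDs the same `filter { case (itemId, _) => itemId != 0L }` call applies before `join`:

```scala
// Hypothetical keyed data sets standing in for the `lstgItem` and
// `viEvents` pair RDDs keyed by itemId (values are placeholders).
val lstgItem: Seq[(Long, String)] =
  Seq((0L, "catch-all listing"), (1L, "listing-1"), (2L, "listing-2"))
val viEvents: Seq[(Long, String)] =
  Seq((0L, "event-a"), (0L, "event-b"), (1L, "event-1"))

// Drop the catch-all key (0 here, an assumed example) from both sides
// before joining, so the skewed key never reaches the shuffle.
def dropCatchAll[V](data: Seq[(Long, V)]): Seq[(Long, V)] =
  data.filter { case (itemId, _) => itemId != 0L }

val cleanListings = dropCatchAll(lstgItem)
val cleanEvents   = dropCatchAll(viEvents)

// A toy equi-join on the remaining keys (RDD.join does this distributed).
val joined: Seq[(Long, (String, String))] =
  for {
    (k1, l) <- cleanListings
    (k2, e) <- cleanEvents if k1 == k2
  } yield (k1, (l, e))

println(joined) // only itemId 1 appears on both sides after filtering
```

Whether dropping the catch-all rows is acceptable depends on the job's semantics; if those rows are needed, they must be handled separately instead of discarded.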