Are you directly caching files from Hadoop or are you doing some transformation on them first? If you are doing a groupBy or some type of transformation, then you could be causing data skew that way.
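To illustrate what key skew does to partition sizes (a plain-Python sketch, not Spark code — the key distribution below is made up for illustration): when records are hash-partitioned by key, every record with the same key lands in the same partition, so one "hot" key fills a single partition while most of the other 150 stay nearly empty — the same shape as the web UI table below.

```python
from collections import Counter

def partition_for(key, num_partitions):
    # Mimic hash partitioning: same key always maps to the same partition,
    # as Spark's HashPartitioner does after a groupBy.
    return hash(key) % num_partitions

# Hypothetical skewed dataset: one hot key dominates the record count.
records = [("hot", i) for i in range(10_000)] + [(f"key{i}", i) for i in range(100)]

sizes = Counter(partition_for(k, 150) for k, _ in records)

# All 10,000 "hot" records end up in one partition; most partitions
# receive at most a record or two.
print(max(sizes.values()), sorted(sizes.values())[:5])
```

If your transformation groups by a key with a distribution like this, the cached blocks will be uneven no matter how evenly the Hadoop input blocks were split.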
On Sun, Aug 3, 2014 at 1:19 PM, iramaraju <iramar...@gmail.com> wrote:
> I am running Spark 1.0.0, Tachyon 0.5 and Hadoop 1.0.4.
>
> I am selecting a subset of a large dataset and trying to run queries on the
> cached schema RDD. Strangely, in the web UI, I see the following.
>
> 150 Partitions
>
> Block Name   Storage Level                        Size in Memory ▴  Size on Disk  Executors
> rdd_30_68    Memory Deserialized 1x Replicated    307.5 MB          0.0 B         ip-172-31-45-100.ec2.internal:37796
> rdd_30_133   Memory Deserialized 1x Replicated    216.0 MB          0.0 B         ip-172-31-45-101.ec2.internal:55947
> rdd_30_18    Memory Deserialized 1x Replicated    194.2 MB          0.0 B         ip-172-31-42-159.ec2.internal:43543
> rdd_30_24    Memory Deserialized 1x Replicated    173.3 MB          0.0 B         ip-172-31-45-101.ec2.internal:55947
> rdd_30_70    Memory Deserialized 1x Replicated    168.2 MB          0.0 B         ip-172-31-18-220.ec2.internal:39847
> rdd_30_105   Memory Deserialized 1x Replicated    154.1 MB          0.0 B         ip-172-31-45-102.ec2.internal:36700
> rdd_30_79    Memory Deserialized 1x Replicated    153.9 MB          0.0 B         ip-172-31-45-99.ec2.internal:59538
> rdd_30_60    Memory Deserialized 1x Replicated    4.2 MB            0.0 B         ip-172-31-45-102.ec2.internal:36700
> rdd_30_99    Memory Deserialized 1x Replicated    112.0 B           0.0 B         ip-172-31-45-102.ec2.internal:36700
> rdd_30_90    Memory Deserialized 1x Replicated    112.0 B           0.0 B         ip-172-31-45-102.ec2.internal:36700
> rdd_30_9     Memory Deserialized 1x Replicated    112.0 B           0.0 B         ip-172-31-18-220.ec2.internal:39847
> rdd_30_89    Memory Deserialized 1x Replicated    112.0 B           0.0 B         ip-172-31-45-102.ec2.internal:36700
>
> What is strange to me is that the size in memory is mostly 112 bytes, except for 8
> of them. (I have 9 data files in Hadoop, which are well-distributed 64 MB blocks.)
>
> The tasks processing the RDD are getting stuck after finishing a few initial
> tasks. I am wondering whether it is because Spark has hit the large blocks and
> is trying to process them on one worker per task.
>
> Any suggestions on how I can distribute them more evenly (block sizes)?
> And why are my Hadoop blocks nicely even while the Spark cached RDD has such an
> uneven distribution? Any help is appreciated.
>
> Regards
> Ram
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cached-RDD-Block-Size-Uneven-Distribution-tp11286.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
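On the question of distributing the blocks more evenly: forcing a shuffle spreads records across partitions round-robin, regardless of key. Below is a plain-Python model of that redistribution (a sketch, not Spark code; the partition sizes are made-up stand-ins for the uneven blocks in the web UI table), showing that dealing records out round-robin leaves every partition within one record of the others.

```python
from itertools import chain

def repartition(partitions, n):
    # Model what a shuffle-based repartition does: flatten all records,
    # then deal them out round-robin so each target partition gets an
    # even share independent of the original layout.
    out = [[] for _ in range(n)]
    for i, rec in enumerate(chain.from_iterable(partitions)):
        out[i % n].append(rec)
    return out

# Made-up skewed layout echoing the web UI: 8 huge partitions, 142 tiny ones.
skewed = [list(range(3000))] * 8 + [[0]] * 142

evened = repartition(skewed, 150)
sizes = [len(p) for p in evened]
print(min(sizes), max(sizes))  # sizes now differ by at most 1
```

In Spark itself the equivalent is calling `rdd.repartition(150)` before caching, at the cost of a full shuffle of the data.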