Are you caching the files straight from Hadoop, or are you doing some
transformation on them first? If you are doing a groupBy or a similar
transformation, you could be introducing data skew that way.
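If the skew is coming from how the cached RDD is partitioned, one option is to
force a shuffle with repartition() before caching so the rows spread evenly
across partitions. Below is a minimal sketch against the Spark 1.0.x Scala API;
the HDFS path, filter predicate, and partition count are placeholders, not
taken from your job:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

object RebalanceBeforeCache {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "rebalance-example")

    // Load the Hadoop files and select the subset of interest.
    // A narrow transformation like filter keeps the parent partitioning,
    // so a selective filter (or a skewed groupBy key) can leave most
    // partitions nearly empty while a few stay large.
    val subset = sc.textFile("hdfs:///path/to/data")   // placeholder path
                   .filter(_.contains("some-key"))     // placeholder predicate

    // repartition() forces a full shuffle that redistributes the rows
    // roughly evenly across the requested number of partitions.
    val balanced = subset.repartition(150)

    balanced.persist(StorageLevel.MEMORY_ONLY)
    println("rows cached: " + balanced.count())

    sc.stop()
  }
}

After the count materializes the cache, the Storage tab should show partitions
of roughly equal size, since the shuffle spreads rows evenly instead of
preserving the original layout.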


On Sun, Aug 3, 2014 at 1:19 PM, iramaraju <iramar...@gmail.com> wrote:

> I am running spark 1.0.0, Tachyon 0.5 and Hadoop 1.0.4.
>
> I am selecting a subset of a large dataset and trying to run queries on the
> cached schema RDD. Strangely, in web UI, I see the following.
>
> 150 Partitions
>
> Block Name   Storage Level                       Size in Memory   Size on Disk   Executors
> rdd_30_68    Memory Deserialized 1x Replicated   307.5 MB         0.0 B          ip-172-31-45-100.ec2.internal:37796
> rdd_30_133   Memory Deserialized 1x Replicated   216.0 MB         0.0 B          ip-172-31-45-101.ec2.internal:55947
> rdd_30_18    Memory Deserialized 1x Replicated   194.2 MB         0.0 B          ip-172-31-42-159.ec2.internal:43543
> rdd_30_24    Memory Deserialized 1x Replicated   173.3 MB         0.0 B          ip-172-31-45-101.ec2.internal:55947
> rdd_30_70    Memory Deserialized 1x Replicated   168.2 MB         0.0 B          ip-172-31-18-220.ec2.internal:39847
> rdd_30_105   Memory Deserialized 1x Replicated   154.1 MB         0.0 B          ip-172-31-45-102.ec2.internal:36700
> rdd_30_79    Memory Deserialized 1x Replicated   153.9 MB         0.0 B          ip-172-31-45-99.ec2.internal:59538
> rdd_30_60    Memory Deserialized 1x Replicated   4.2 MB           0.0 B          ip-172-31-45-102.ec2.internal:36700
> rdd_30_99    Memory Deserialized 1x Replicated   112.0 B          0.0 B          ip-172-31-45-102.ec2.internal:36700
> rdd_30_90    Memory Deserialized 1x Replicated   112.0 B          0.0 B          ip-172-31-45-102.ec2.internal:36700
> rdd_30_9     Memory Deserialized 1x Replicated   112.0 B          0.0 B          ip-172-31-18-220.ec2.internal:39847
> rdd_30_89    Memory Deserialized 1x Replicated   112.0 B          0.0 B          ip-172-31-45-102.ec2.internal:36700
>
> What is strange to me is that the size in memory is mostly 112 bytes, except
> for 8 of the blocks. (I have 9 data files in Hadoop, which are well-distributed
> 64 MB blocks.)
>
> The tasks processing the RDD are getting stuck after finishing a few initial
> tasks. I am wondering whether it is because Spark has hit the large blocks and
> is trying to process each of them on a single worker per task.
>
> Any suggestions on how I can distribute the blocks more evenly in size? And
> why are my Hadoop blocks nicely even while the Spark cached RDD has such an
> uneven distribution? Any help is appreciated.
>
> Regards
> Ram
>
>
>
