I said that because, judging by the HDFS bytes read, it scanned all the
files. The table is not compressed, and the HDFS bytes read matched the
size of the data in the partition.
I had bucketing enabled. But when I joined with another table, the job
had a long-tail problem: most of the data went to a single reducer.
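For a long-tail join like the one described here, one option that may be worth trying is Hive's runtime skew-join handling. The sketch below assumes the second table is called other_table and is joined on id; that name is hypothetical and does not come from this thread. hive.optimize.skewjoin and hive.skewjoin.key are standard Hive settings, but whether they help depends on the Hive version and the join type.

```sql
-- Ask Hive to process heavily repeated join keys in a follow-up map-join job
-- instead of sending all of their rows to one reducer.
SET hive.optimize.skewjoin=true;
-- Number of rows per key above which a key is treated as skewed
-- (100000 is Hive's default).
SET hive.skewjoin.key=100000;

-- other_table is a hypothetical name for the second table in the join.
SELECT a.id, COUNT(*) AS c
FROM addresses_1 a
JOIN other_table b ON a.id = b.id
GROUP BY a.id;
```

Bucketing on id alone would not spread a hot key, because every row with the same id hashes to the same bucket.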
Hi All,
I have the following skewed table addresses_1
select id, count(*) c from addresses_1 group by id order by c desc limit 10;
142624653    1554806
198477395    958492
102641838    220181
138947865    211331
156483436    193429
96411677     179771
210082076    168033
800174765    152421
In my understanding,
when you say it is scanning the entire dataset, it is looking at all your
partitions, because your data is partitioned by the date column.
A skewed table is a table where separate files are created for each of
your skewed keys in every partition.
So for your query
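As a rough illustration of the skewed-table layout described above: as far as I understand, Hive only writes the skewed values into separate subdirectories when the table is declared with STORED AS DIRECTORIES (list bucketing). The table name addresses_lb, the column address, the partition column dt, and the two skewed values are all placeholders chosen for this sketch, not taken from the original table.

```sql
-- Needed so Hive can read and write the per-value subdirectories.
SET hive.mapred.supports.subdirectories=true;
SET mapred.input.dir.recursive=true;
-- Enables the list bucketing optimizer (off by default).
SET hive.optimize.listbucketing=true;

-- Hypothetical table: substitute your own columns and your heaviest ids
-- for the placeholder values 1001 and 1002.
CREATE TABLE addresses_lb (id BIGINT, address STRING)
PARTITIONED BY (dt STRING)
SKEWED BY (id) ON (1001, 1002)
STORED AS DIRECTORIES;
```

With this layout, each listed value gets its own subdirectory inside every partition and all remaining ids share a default directory, so a filter on one of the listed ids has a chance of reading only that subdirectory.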
Thanks, Nitin. I have only one partition in this table, for testing. I
thought that within the partition it would scan only certain files, based
on the skewed fields. However, it is scanning all the data in the
partition.
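One way to see what is actually on disk for the table, sketched here under the assumption that the commands are run from the Hive CLI. The warehouse path is hypothetical; use the location that DESCRIBE FORMATTED reports.

```sql
-- Shows the table's storage details, including its location and the
-- skewed columns and values recorded in the metastore.
DESCRIBE FORMATTED addresses_1;

-- List what sits under the table directory from inside the Hive CLI.
-- The path below is a placeholder, not the real location.
dfs -ls /user/hive/warehouse/addresses_1;
```

If the listing shows no per-value subdirectories under the partition, every query on that partition has to read all of its files.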
On Nov 14, 2013 9:38 AM, Nitin Pawar nitinpawar...@gmail.com wrote:
In my
How did you check that it is looking at all the files inside the partition?
If you want to further restrict the files that are accessed, you can
bucket the table as well. That way you really don't have to worry about
which data is skewed; you let the framework handle it.
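A minimal sketch of the bucketing suggestion, assuming a hypothetical table addresses_bucketed with a placeholder address column, a partition column dt, and an arbitrary bucket count of 32. None of these names or values come from the original table.

```sql
-- Make inserts honor the declared bucket count
-- (needed on Hive versions before 2.0).
SET hive.enforce.bucketing=true;

-- Hypothetical table definition: rows are hashed on id into 32 files
-- per partition.
CREATE TABLE addresses_bucketed (id BIGINT, address STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;

-- Read a single bucket explicitly by sampling on the bucketing column.
SELECT id, address
FROM addresses_bucketed TABLESAMPLE(BUCKET 1 OUT OF 32 ON id);
```

Sampling on the bucketing column is one documented way to touch only one bucket file. Whether an ordinary WHERE filter on id skips buckets depends on the Hive version and execution engine.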
On Thu, Nov 14, 2013 at 11:16 AM, Rajesh