I mentioned that as it scanned all files based on hdfs bytes read.. Table is not compressed and hdfs bytes read matched the data size in the partition.
I had bucketing enabled. But somehow when I joined with another table it had long tail issue where most of the data went to single reducer. Hence I thought of using skewed join.. On Nov 14, 2013 11:33 AM, "Nitin Pawar" <nitinpawar...@gmail.com> wrote: > how did u check its looking at all files inside the partition? > If you want more restriction on limit on filse to be accessed, you can > bucket them as well. That way you really dont have to worry about which > data is skewed and let the framework handle it. > > > > On Thu, Nov 14, 2013 at 11:16 AM, Rajesh Balamohan < > rajesh.balamo...@gmail.com> wrote: > >> Thanks Nitin. I have only one partition in this table for testing. I >> thought within the partition it will scan only certain files based on >> skewed fields. However it is scanning the entire data within the >> partition. >> On Nov 14, 2013 9:38 AM, "Nitin Pawar" <nitinpawar...@gmail.com> wrote: >> >>> In my understanding, >>> when you are saying scanning entire dataset it is looking at all your >>> partitions because your data has been partitioned by the date column. >>> >>> A skewed table is a table where there will be different files created >>> for all your skewed keys in all the partitions. >>> So for your query it will look at all partitions. >>> >>> The setting you have kept is only applicable to join queries as it >>> clearly says skewjoin. Non join queries it does not have an affect. >>> >>> >>> Thanks, >>> Nitin >>> >>> >>> >>> >>> On Thu, Nov 14, 2013 at 6:35 AM, Rajesh Balamohan < >>> rajesh.balamo...@gmail.com> wrote: >>> >>>> Hi All, >>>> >>>> I have the following skewed table "addresses_1" >>>> >>>> select id, count(*) c from addresses_1 group by id order by c desc >>>> limit 10; >>>> 142624653 1554806 >>>> 198477395 958492 >>>> 102641838 220181 >>>> 138947865 211331 >>>> 156483436 193429 >>>> 96411677 179771 >>>> 210082076 168033 >>>> 800174765 152421 >>>> 139116901 141207 >>>> 704352025 137263 >>>> >>>> I was able to create the following table with the skew information. >>>> And I was able to load the data into the table as well. >>>> >>>> CREATE TABLE skew_addresses_1( >>>> id bigint, >>>> address_id bigint, >>>> address_lines string, >>>> city string, >>>> state string, >>>> postal_code string, >>>> country string, >>>> latitude string, >>>> longitude string, >>>> ) PARTITIONED BY (dateTS string) SKEWED BY (id) ON (142624653, >>>> 198477395, 102641838, 138947865, 156483436, 96411677, 210082076, 800174765, >>>> 139116901, 704352025) >>>> stored as rcfile; >>>> >>>> select id,count(*) c from skew_addresses_1 where id=142624653 group by >>>> id order by c limit 10; >>>> >>>> *However, at the time of running select query, entire dataset is >>>> scanned. * I thought only the relevant dataset (with skew information >>>> will be scanned). Am I missing anything here? Any help will be >>>> appreciated. I am using Hive 10.x >>>> >>>> I have enabled hive.optimize.skewjoin.compiletime=true and I can see >>>> the skew information populated in SKEWED_COL_NAMES in metadata. But there >>>> is no information in SKEWED_COL_VALUE_LOC_MAP table. >>>> >>>> >>>> -- >>>> ~Rajesh.B >>>> >>> >>> >>> >>> -- >>> Nitin Pawar >>> >> > > > -- > Nitin Pawar >