Re: Hive skewed tables

2013-11-14 Thread Rajesh Balamohan
I mentioned that as it scanned all files based on hdfs bytes read.. Table
is not compressed and hdfs bytes read matched the data size in the
partition.

I had bucketing enabled.   But somehow when I joined with another table it
had long tail issue where most of the data went to single reducer.   Hence
I thought of using skewed join..
On Nov 14, 2013 11:33 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

 how did u check its looking at all files inside the partition?
 If you want more restriction on limit on filse to be accessed, you can
 bucket them as well. That way you really dont have to worry about which
 data is skewed and let the framework handle it.



 On Thu, Nov 14, 2013 at 11:16 AM, Rajesh Balamohan 
 rajesh.balamo...@gmail.com wrote:

 Thanks Nitin.   I have only one partition in this table for testing.   I
 thought within the partition it will scan only certain files based on
 skewed fields.   However it is scanning the entire data within the
 partition.
  On Nov 14, 2013 9:38 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

 In my understanding,
 when you are saying scanning entire dataset it is looking at all your
 partitions because your data has been partitioned by the date column.

 A skewed table is a table where there will be different files created
 for all your skewed keys in all the partitions.
 So for your query it will look at all partitions.

 The setting you have kept is only applicable to join queries as it
 clearly says skewjoin.  Non join queries it does not have an affect.


 Thanks,
 Nitin




 On Thu, Nov 14, 2013 at 6:35 AM, Rajesh Balamohan 
 rajesh.balamo...@gmail.com wrote:

 Hi All,

 I have the following skewed table addresses_1

 select id, count(*) c from addresses_1 group by id order by c desc
 limit 10;
 1426246531554806
 198477395958492
 102641838220181
 138947865211331
 156483436193429
 96411677179771
 210082076168033
 800174765152421
 139116901141207
 704352025137263

 I was able to create the following table with the skew information.
 And I was able to load the data into the table as well.

  CREATE  TABLE skew_addresses_1(
   id bigint,
   address_id bigint,
   address_lines string,
   city string,
   state string,
   postal_code string,
   country string,
   latitude string,
   longitude string,
   ) PARTITIONED BY (dateTS string) SKEWED BY (id) ON (142624653,
 198477395, 102641838, 138947865, 156483436, 96411677, 210082076, 800174765,
 139116901, 704352025)
stored as rcfile;

 select id,count(*) c from skew_addresses_1 where id=142624653 group by
 id order by c limit 10;

 *However, at the time of running select query,  entire dataset is
 scanned. * I thought only the relevant dataset (with skew information
 will be scanned).  Am I missing anything here?  Any help will be
 appreciated.  I am using Hive 10.x

 I have enabled hive.optimize.skewjoin.compiletime=true and I can see
 the skew information populated in SKEWED_COL_NAMES in metadata.  But there
 is no information in SKEWED_COL_VALUE_LOC_MAP table.


 --
 ~Rajesh.B




 --
 Nitin Pawar




 --
 Nitin Pawar



Hive skewed tables

2013-11-13 Thread Rajesh Balamohan
Hi All,

I have the following skewed table addresses_1

select id, count(*) c from addresses_1 group by id order by c desc limit 10;
1426246531554806
198477395958492
102641838220181
138947865211331
156483436193429
96411677179771
210082076168033
800174765152421
139116901141207
704352025137263

I was able to create the following table with the skew information.  And I
was able to load the data into the table as well.

 CREATE  TABLE skew_addresses_1(
  id bigint,
  address_id bigint,
  address_lines string,
  city string,
  state string,
  postal_code string,
  country string,
  latitude string,
  longitude string,
  ) PARTITIONED BY (dateTS string) SKEWED BY (id) ON (142624653, 198477395,
102641838, 138947865, 156483436, 96411677, 210082076, 800174765, 139116901,
704352025)
   stored as rcfile;

select id,count(*) c from skew_addresses_1 where id=142624653 group by id
order by c limit 10;

*However, at the time of running select query,  entire dataset is
scanned. *I thought only the relevant dataset (with skew information
will be
scanned).  Am I missing anything here?  Any help will be appreciated.  I am
using Hive 10.x

I have enabled hive.optimize.skewjoin.compiletime=true and I can see the
skew information populated in SKEWED_COL_NAMES in metadata.  But there is
no information in SKEWED_COL_VALUE_LOC_MAP table.


-- 
~Rajesh.B


Re: Hive skewed tables

2013-11-13 Thread Nitin Pawar
In my understanding,
when you are saying scanning entire dataset it is looking at all your
partitions because your data has been partitioned by the date column.

A skewed table is a table where there will be different files created for
all your skewed keys in all the partitions.
So for your query it will look at all partitions.

The setting you have kept is only applicable to join queries as it clearly
says skewjoin.  Non join queries it does not have an affect.


Thanks,
Nitin




On Thu, Nov 14, 2013 at 6:35 AM, Rajesh Balamohan 
rajesh.balamo...@gmail.com wrote:

 Hi All,

 I have the following skewed table addresses_1

 select id, count(*) c from addresses_1 group by id order by c desc limit
 10;
 1426246531554806
 198477395958492
 102641838220181
 138947865211331
 156483436193429
 96411677179771
 210082076168033
 800174765152421
 139116901141207
 704352025137263

 I was able to create the following table with the skew information.  And I
 was able to load the data into the table as well.

  CREATE  TABLE skew_addresses_1(
   id bigint,
   address_id bigint,
   address_lines string,
   city string,
   state string,
   postal_code string,
   country string,
   latitude string,
   longitude string,
   ) PARTITIONED BY (dateTS string) SKEWED BY (id) ON (142624653,
 198477395, 102641838, 138947865, 156483436, 96411677, 210082076, 800174765,
 139116901, 704352025)
stored as rcfile;

 select id,count(*) c from skew_addresses_1 where id=142624653 group by id
 order by c limit 10;

 *However, at the time of running select query,  entire dataset is
 scanned. * I thought only the relevant dataset (with skew information
 will be scanned).  Am I missing anything here?  Any help will be
 appreciated.  I am using Hive 10.x

 I have enabled hive.optimize.skewjoin.compiletime=true and I can see the
 skew information populated in SKEWED_COL_NAMES in metadata.  But there is
 no information in SKEWED_COL_VALUE_LOC_MAP table.


 --
 ~Rajesh.B




-- 
Nitin Pawar


Re: Hive skewed tables

2013-11-13 Thread Rajesh Balamohan
Thanks Nitin.   I have only one partition in this table for testing.   I
thought within the partition it will scan only certain files based on
skewed fields.   However it is scanning the entire data within the
partition.
On Nov 14, 2013 9:38 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

 In my understanding,
 when you are saying scanning entire dataset it is looking at all your
 partitions because your data has been partitioned by the date column.

 A skewed table is a table where there will be different files created for
 all your skewed keys in all the partitions.
 So for your query it will look at all partitions.

 The setting you have kept is only applicable to join queries as it clearly
 says skewjoin.  Non join queries it does not have an affect.


 Thanks,
 Nitin




 On Thu, Nov 14, 2013 at 6:35 AM, Rajesh Balamohan 
 rajesh.balamo...@gmail.com wrote:

 Hi All,

 I have the following skewed table addresses_1

 select id, count(*) c from addresses_1 group by id order by c desc limit
 10;
 1426246531554806
 198477395958492
 102641838220181
 138947865211331
 156483436193429
 96411677179771
 210082076168033
 800174765152421
 139116901141207
 704352025137263

 I was able to create the following table with the skew information.  And
 I was able to load the data into the table as well.

  CREATE  TABLE skew_addresses_1(
   id bigint,
   address_id bigint,
   address_lines string,
   city string,
   state string,
   postal_code string,
   country string,
   latitude string,
   longitude string,
   ) PARTITIONED BY (dateTS string) SKEWED BY (id) ON (142624653,
 198477395, 102641838, 138947865, 156483436, 96411677, 210082076, 800174765,
 139116901, 704352025)
stored as rcfile;

 select id,count(*) c from skew_addresses_1 where id=142624653 group by id
 order by c limit 10;

 *However, at the time of running select query,  entire dataset is
 scanned. * I thought only the relevant dataset (with skew information
 will be scanned).  Am I missing anything here?  Any help will be
 appreciated.  I am using Hive 10.x

 I have enabled hive.optimize.skewjoin.compiletime=true and I can see the
 skew information populated in SKEWED_COL_NAMES in metadata.  But there is
 no information in SKEWED_COL_VALUE_LOC_MAP table.


 --
 ~Rajesh.B




 --
 Nitin Pawar



Re: Hive skewed tables

2013-11-13 Thread Nitin Pawar
how did u check its looking at all files inside the partition?
If you want more restriction on limit on filse to be accessed, you can
bucket them as well. That way you really dont have to worry about which
data is skewed and let the framework handle it.



On Thu, Nov 14, 2013 at 11:16 AM, Rajesh Balamohan 
rajesh.balamo...@gmail.com wrote:

 Thanks Nitin.   I have only one partition in this table for testing.   I
 thought within the partition it will scan only certain files based on
 skewed fields.   However it is scanning the entire data within the
 partition.
 On Nov 14, 2013 9:38 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

 In my understanding,
 when you are saying scanning entire dataset it is looking at all your
 partitions because your data has been partitioned by the date column.

 A skewed table is a table where there will be different files created for
 all your skewed keys in all the partitions.
 So for your query it will look at all partitions.

 The setting you have kept is only applicable to join queries as it
 clearly says skewjoin.  Non join queries it does not have an affect.


 Thanks,
 Nitin




 On Thu, Nov 14, 2013 at 6:35 AM, Rajesh Balamohan 
 rajesh.balamo...@gmail.com wrote:

 Hi All,

 I have the following skewed table addresses_1

 select id, count(*) c from addresses_1 group by id order by c desc limit
 10;
 1426246531554806
 198477395958492
 102641838220181
 138947865211331
 156483436193429
 96411677179771
 210082076168033
 800174765152421
 139116901141207
 704352025137263

 I was able to create the following table with the skew information.  And
 I was able to load the data into the table as well.

  CREATE  TABLE skew_addresses_1(
   id bigint,
   address_id bigint,
   address_lines string,
   city string,
   state string,
   postal_code string,
   country string,
   latitude string,
   longitude string,
   ) PARTITIONED BY (dateTS string) SKEWED BY (id) ON (142624653,
 198477395, 102641838, 138947865, 156483436, 96411677, 210082076, 800174765,
 139116901, 704352025)
stored as rcfile;

 select id,count(*) c from skew_addresses_1 where id=142624653 group by
 id order by c limit 10;

 *However, at the time of running select query,  entire dataset is
 scanned. * I thought only the relevant dataset (with skew information
 will be scanned).  Am I missing anything here?  Any help will be
 appreciated.  I am using Hive 10.x

 I have enabled hive.optimize.skewjoin.compiletime=true and I can see the
 skew information populated in SKEWED_COL_NAMES in metadata.  But there is
 no information in SKEWED_COL_VALUE_LOC_MAP table.


 --
 ~Rajesh.B




 --
 Nitin Pawar




-- 
Nitin Pawar