RE: Partition Columns

2015-05-15 Thread Martin Benson
Hi Appan,

I think the answer is that the parser is not able to detect that partitions are 
useful in Query 2, because the where condition is on a derived field. i.e. Hive 
can tell that if you say where some_partition_field=”some partition value” then 
it only needs to scan that partition, but if you bury the partition columns in 
a derived field like in Query 2 it is unable to spot that and so does a full 
table scan. I think (but don’t know for sure) that this will be fairly typical 
of all SQL engines. Your best bet is to use direct conditions like in Query 1. 
In this case it may have been better for you to persist a field containing the 
whole date and partition on that instead, in order to make it simpler to pick 
up a date range along the lines of Query2.

Thanks,

Martin.

From: Appan Thirumaligai [mailto:appanhiv...@gmail.com]
Sent: 15 May 2015 01:18
To: user@hive.apache.org
Subject: Re: Partition Columns

Mungeol,

I did check the # of mappers and that did not change between the two queries 
but when I ran a count(*) query the total execution time reduced significantly 
for Query1 vs Query2. Also, the amount data the query reads does change when 
the where clause changes. I still can't explain why one is faster over the 
other.

Thanks,
Appan

On Thu, May 14, 2015 at 4:46 PM, Mungeol Heo 
mungeol@gmail.commailto:mungeol@gmail.com wrote:
Hi, Appan.

you can just simply check the amount of data your query reads from the
table. or the number of the mapper for running that query.
then, you can know whether it filtering or scanning all table.
Of course, it is a lazy approach. but, you can give a try.
I think query 1 should work fine. because I am using a lot of that
kind of queries and it works fine for me.

Thanks,
mungeol

On Fri, May 15, 2015 at 8:31 AM, Appan Thirumaligai
appanhiv...@gmail.commailto:appanhiv...@gmail.com wrote:
 I agree with you Viral. I see the same behavior as well. We are on Hive 0.13
 for the cluster where I'm testing this.

 On Thu, May 14, 2015 at 2:16 PM, Viral Bajaria 
 viral.baja...@gmail.commailto:viral.baja...@gmail.com
 wrote:

 Hi Appan,

 In my experience I have seen that Query 2 does not use partition pruning
 because it's not a straight up filtering and involves using functions (aka
 UDFs).

 What version of Hive are you using ?

 Thanks,
 Viral



 On Thu, May 14, 2015 at 1:48 PM, Appan Thirumaligai
 appanhiv...@gmail.commailto:appanhiv...@gmail.com wrote:

 Hi,

 I have a question on Hive Optimizer. I have a table with partition
 columns  eg.,Sales partitioned by year, month, day. Assume that I have two
 years worth of data on this table. I'm running two queries on this table.

 Query 1: Select * from Sales where year=2015 and month = 5 and day
 between 1 and 7

 Query 2: Select * from Sales where concat_ws('-',cast(year as
 string),lpad(cast(month as string),2,'0'),lpad(cast(day as string),2,'0'))
 between '2015-01-01' and '2015-01-07'

 When I ran Explain command on the above two queries I get a Filter
 operation for the 2nd Query and there is no Filter Operation for the first
 query.

 My question is: Do both queries use the partitions or is it used only in
 Query 1 and for Query 2 it will be a scan of all the data?

 Thanks for your help.

 Thanks,
 Appan







Registered in England and Wales at Players House, 300 Attercliffe Common, 
Sheffield, S9 2AG. Company number 05935923.

This email and its attachments are confidential and are intended solely for the 
use of the addressed recipient.
Any views or opinions expressed are those of the author and do not necessarily 
represent Jaywing. If you are not
the intended recipient, you must not forward or show this to anyone or take any 
action based upon it.
Please contact the sender if you received this in error.


Re: Partition Columns

2015-05-14 Thread Viral Bajaria
Hi Appan,

In my experience I have seen that Query 2 does not use partition pruning
because it's not a straight up filtering and involves using functions (aka
UDFs).

What version of Hive are you using ?

Thanks,
Viral



On Thu, May 14, 2015 at 1:48 PM, Appan Thirumaligai appanhiv...@gmail.com
wrote:

 Hi,

 I have a question on Hive Optimizer. I have a table with partition columns
  eg.,Sales partitioned by year, month, day. Assume that I have two years
 worth of data on this table. I'm running two queries on this table.

 Query 1: Select * from Sales where year=2015 and month = 5 and day between
 1 and 7

 Query 2: Select * from Sales where concat_ws('-',cast(year as
 string),lpad(cast(month as string),2,'0'),lpad(cast(day as string),2,'0'))
 between '2015-01-01' and '2015-01-07'

 When I ran Explain command on the above two queries I get a Filter
 operation for the 2nd Query and there is no Filter Operation for the first
 query.

 My question is: Do both queries use the partitions or is it used only in
 Query 1 and for Query 2 it will be a scan of all the data?

 Thanks for your help.

 Thanks,
 Appan



Re: Partition Columns

2015-05-14 Thread Mungeol Heo
Hi, Appan.

you can just simply check the amount of data your query reads from the
table. or the number of the mapper for running that query.
then, you can know whether it filtering or scanning all table.
Of course, it is a lazy approach. but, you can give a try.
I think query 1 should work fine. because I am using a lot of that
kind of queries and it works fine for me.

Thanks,
mungeol

On Fri, May 15, 2015 at 8:31 AM, Appan Thirumaligai
appanhiv...@gmail.com wrote:
 I agree with you Viral. I see the same behavior as well. We are on Hive 0.13
 for the cluster where I'm testing this.

 On Thu, May 14, 2015 at 2:16 PM, Viral Bajaria viral.baja...@gmail.com
 wrote:

 Hi Appan,

 In my experience I have seen that Query 2 does not use partition pruning
 because it's not a straight up filtering and involves using functions (aka
 UDFs).

 What version of Hive are you using ?

 Thanks,
 Viral



 On Thu, May 14, 2015 at 1:48 PM, Appan Thirumaligai
 appanhiv...@gmail.com wrote:

 Hi,

 I have a question on Hive Optimizer. I have a table with partition
 columns  eg.,Sales partitioned by year, month, day. Assume that I have two
 years worth of data on this table. I'm running two queries on this table.

 Query 1: Select * from Sales where year=2015 and month = 5 and day
 between 1 and 7

 Query 2: Select * from Sales where concat_ws('-',cast(year as
 string),lpad(cast(month as string),2,'0'),lpad(cast(day as string),2,'0'))
 between '2015-01-01' and '2015-01-07'

 When I ran Explain command on the above two queries I get a Filter
 operation for the 2nd Query and there is no Filter Operation for the first
 query.

 My question is: Do both queries use the partitions or is it used only in
 Query 1 and for Query 2 it will be a scan of all the data?

 Thanks for your help.

 Thanks,
 Appan





Re: Partition Columns

2015-05-14 Thread Appan Thirumaligai
Mungeol,

I did check the # of mappers and that did not change between the two
queries but when I ran a count(*) query the total execution time reduced
significantly for Query1 vs Query2. Also, the amount data the query reads
does change when the where clause changes. I still can't explain why one is
faster over the other.

Thanks,
Appan

On Thu, May 14, 2015 at 4:46 PM, Mungeol Heo mungeol@gmail.com wrote:

 Hi, Appan.

 you can just simply check the amount of data your query reads from the
 table. or the number of the mapper for running that query.
 then, you can know whether it filtering or scanning all table.
 Of course, it is a lazy approach. but, you can give a try.
 I think query 1 should work fine. because I am using a lot of that
 kind of queries and it works fine for me.

 Thanks,
 mungeol

 On Fri, May 15, 2015 at 8:31 AM, Appan Thirumaligai
 appanhiv...@gmail.com wrote:
  I agree with you Viral. I see the same behavior as well. We are on Hive
 0.13
  for the cluster where I'm testing this.
 
  On Thu, May 14, 2015 at 2:16 PM, Viral Bajaria viral.baja...@gmail.com
  wrote:
 
  Hi Appan,
 
  In my experience I have seen that Query 2 does not use partition pruning
  because it's not a straight up filtering and involves using functions
 (aka
  UDFs).
 
  What version of Hive are you using ?
 
  Thanks,
  Viral
 
 
 
  On Thu, May 14, 2015 at 1:48 PM, Appan Thirumaligai
  appanhiv...@gmail.com wrote:
 
  Hi,
 
  I have a question on Hive Optimizer. I have a table with partition
  columns  eg.,Sales partitioned by year, month, day. Assume that I have
 two
  years worth of data on this table. I'm running two queries on this
 table.
 
  Query 1: Select * from Sales where year=2015 and month = 5 and day
  between 1 and 7
 
  Query 2: Select * from Sales where concat_ws('-',cast(year as
  string),lpad(cast(month as string),2,'0'),lpad(cast(day as
 string),2,'0'))
  between '2015-01-01' and '2015-01-07'
 
  When I ran Explain command on the above two queries I get a Filter
  operation for the 2nd Query and there is no Filter Operation for the
 first
  query.
 
  My question is: Do both queries use the partitions or is it used only
 in
  Query 1 and for Query 2 it will be a scan of all the data?
 
  Thanks for your help.
 
  Thanks,
  Appan