RE: Partition Columns
Hi Appan, I think the answer is that the parser is not able to detect that partitions are useful in Query 2, because the where condition is on a derived field. i.e. Hive can tell that if you say where some_partition_field=”some partition value” then it only needs to scan that partition, but if you bury the partition columns in a derived field like in Query 2 it is unable to spot that and so does a full table scan. I think (but don’t know for sure) that this will be fairly typical of all SQL engines. Your best bet is to use direct conditions like in Query 1. In this case it may have been better for you to persist a field containing the whole date and partition on that instead, in order to make it simpler to pick up a date range along the lines of Query2. Thanks, Martin. From: Appan Thirumaligai [mailto:appanhiv...@gmail.com] Sent: 15 May 2015 01:18 To: user@hive.apache.org Subject: Re: Partition Columns Mungeol, I did check the # of mappers and that did not change between the two queries but when I ran a count(*) query the total execution time reduced significantly for Query1 vs Query2. Also, the amount data the query reads does change when the where clause changes. I still can't explain why one is faster over the other. Thanks, Appan On Thu, May 14, 2015 at 4:46 PM, Mungeol Heo mailto:mungeol@gmail.com>> wrote: Hi, Appan. you can just simply check the amount of data your query reads from the table. or the number of the mapper for running that query. then, you can know whether it filtering or scanning all table. Of course, it is a lazy approach. but, you can give a try. I think query 1 should work fine. because I am using a lot of that kind of queries and it works fine for me. Thanks, mungeol On Fri, May 15, 2015 at 8:31 AM, Appan Thirumaligai mailto:appanhiv...@gmail.com>> wrote: > I agree with you Viral. I see the same behavior as well. We are on Hive 0.13 > for the cluster where I'm testing this. > > On Thu, May 14, 2015 at 2:16 PM, Viral Bajaria > mailto:viral.baja...@gmail.com>> > wrote: >> >> Hi Appan, >> >> In my experience I have seen that Query 2 does not use partition pruning >> because it's not a straight up filtering and involves using functions (aka >> UDFs). >> >> What version of Hive are you using ? >> >> Thanks, >> Viral >> >> >> >> On Thu, May 14, 2015 at 1:48 PM, Appan Thirumaligai >> mailto:appanhiv...@gmail.com>> wrote: >>> >>> Hi, >>> >>> I have a question on Hive Optimizer. I have a table with partition >>> columns eg.,Sales partitioned by year, month, day. Assume that I have two >>> years worth of data on this table. I'm running two queries on this table. >>> >>> Query 1: Select * from Sales where year=2015 and month = 5 and day >>> between 1 and 7 >>> >>> Query 2: Select * from Sales where concat_ws('-',cast(year as >>> string),lpad(cast(month as string),2,'0'),lpad(cast(day as string),2,'0')) >>> between '2015-01-01' and '2015-01-07' >>> >>> When I ran Explain command on the above two queries I get a Filter >>> operation for the 2nd Query and there is no Filter Operation for the first >>> query. >>> >>> My question is: Do both queries use the partitions or is it used only in >>> Query 1 and for Query 2 it will be a scan of all the data? >>> >>> Thanks for your help. >>> >>> Thanks, >>> Appan >> >> > Registered in England and Wales at Players House, 300 Attercliffe Common, Sheffield, S9 2AG. Company number 05935923. This email and its attachments are confidential and are intended solely for the use of the addressed recipient. Any views or opinions expressed are those of the author and do not necessarily represent Jaywing. If you are not the intended recipient, you must not forward or show this to anyone or take any action based upon it. Please contact the sender if you received this in error.
Re: Partition Columns
Mungeol, I did check the # of mappers and that did not change between the two queries but when I ran a count(*) query the total execution time reduced significantly for Query1 vs Query2. Also, the amount data the query reads does change when the where clause changes. I still can't explain why one is faster over the other. Thanks, Appan On Thu, May 14, 2015 at 4:46 PM, Mungeol Heo wrote: > Hi, Appan. > > you can just simply check the amount of data your query reads from the > table. or the number of the mapper for running that query. > then, you can know whether it filtering or scanning all table. > Of course, it is a lazy approach. but, you can give a try. > I think query 1 should work fine. because I am using a lot of that > kind of queries and it works fine for me. > > Thanks, > mungeol > > On Fri, May 15, 2015 at 8:31 AM, Appan Thirumaligai > wrote: > > I agree with you Viral. I see the same behavior as well. We are on Hive > 0.13 > > for the cluster where I'm testing this. > > > > On Thu, May 14, 2015 at 2:16 PM, Viral Bajaria > > wrote: > >> > >> Hi Appan, > >> > >> In my experience I have seen that Query 2 does not use partition pruning > >> because it's not a straight up filtering and involves using functions > (aka > >> UDFs). > >> > >> What version of Hive are you using ? > >> > >> Thanks, > >> Viral > >> > >> > >> > >> On Thu, May 14, 2015 at 1:48 PM, Appan Thirumaligai > >> wrote: > >>> > >>> Hi, > >>> > >>> I have a question on Hive Optimizer. I have a table with partition > >>> columns eg.,Sales partitioned by year, month, day. Assume that I have > two > >>> years worth of data on this table. I'm running two queries on this > table. > >>> > >>> Query 1: Select * from Sales where year=2015 and month = 5 and day > >>> between 1 and 7 > >>> > >>> Query 2: Select * from Sales where concat_ws('-',cast(year as > >>> string),lpad(cast(month as string),2,'0'),lpad(cast(day as > string),2,'0')) > >>> between '2015-01-01' and '2015-01-07' > >>> > >>> When I ran Explain command on the above two queries I get a Filter > >>> operation for the 2nd Query and there is no Filter Operation for the > first > >>> query. > >>> > >>> My question is: Do both queries use the partitions or is it used only > in > >>> Query 1 and for Query 2 it will be a scan of all the data? > >>> > >>> Thanks for your help. > >>> > >>> Thanks, > >>> Appan > >> > >> > > >
Re: Partition Columns
Hi, Appan. you can just simply check the amount of data your query reads from the table. or the number of the mapper for running that query. then, you can know whether it filtering or scanning all table. Of course, it is a lazy approach. but, you can give a try. I think query 1 should work fine. because I am using a lot of that kind of queries and it works fine for me. Thanks, mungeol On Fri, May 15, 2015 at 8:31 AM, Appan Thirumaligai wrote: > I agree with you Viral. I see the same behavior as well. We are on Hive 0.13 > for the cluster where I'm testing this. > > On Thu, May 14, 2015 at 2:16 PM, Viral Bajaria > wrote: >> >> Hi Appan, >> >> In my experience I have seen that Query 2 does not use partition pruning >> because it's not a straight up filtering and involves using functions (aka >> UDFs). >> >> What version of Hive are you using ? >> >> Thanks, >> Viral >> >> >> >> On Thu, May 14, 2015 at 1:48 PM, Appan Thirumaligai >> wrote: >>> >>> Hi, >>> >>> I have a question on Hive Optimizer. I have a table with partition >>> columns eg.,Sales partitioned by year, month, day. Assume that I have two >>> years worth of data on this table. I'm running two queries on this table. >>> >>> Query 1: Select * from Sales where year=2015 and month = 5 and day >>> between 1 and 7 >>> >>> Query 2: Select * from Sales where concat_ws('-',cast(year as >>> string),lpad(cast(month as string),2,'0'),lpad(cast(day as string),2,'0')) >>> between '2015-01-01' and '2015-01-07' >>> >>> When I ran Explain command on the above two queries I get a Filter >>> operation for the 2nd Query and there is no Filter Operation for the first >>> query. >>> >>> My question is: Do both queries use the partitions or is it used only in >>> Query 1 and for Query 2 it will be a scan of all the data? >>> >>> Thanks for your help. >>> >>> Thanks, >>> Appan >> >> >
Re: Partition Columns
I agree with you Viral. I see the same behavior as well. We are on Hive 0.13 for the cluster where I'm testing this. On Thu, May 14, 2015 at 2:16 PM, Viral Bajaria wrote: > Hi Appan, > > In my experience I have seen that Query 2 does not use partition pruning > because it's not a straight up filtering and involves using functions (aka > UDFs). > > What version of Hive are you using ? > > Thanks, > Viral > > > > On Thu, May 14, 2015 at 1:48 PM, Appan Thirumaligai > wrote: > >> Hi, >> >> I have a question on Hive Optimizer. I have a table with partition >> columns eg.,Sales partitioned by year, month, day. Assume that I have two >> years worth of data on this table. I'm running two queries on this table. >> >> Query 1: Select * from Sales where year=2015 and month = 5 and day >> between 1 and 7 >> >> Query 2: Select * from Sales where concat_ws('-',cast(year as >> string),lpad(cast(month as string),2,'0'),lpad(cast(day as string),2,'0')) >> between '2015-01-01' and '2015-01-07' >> >> When I ran Explain command on the above two queries I get a Filter >> operation for the 2nd Query and there is no Filter Operation for the first >> query. >> >> My question is: Do both queries use the partitions or is it used only in >> Query 1 and for Query 2 it will be a scan of all the data? >> >> Thanks for your help. >> >> Thanks, >> Appan >> > >
Re: Partition Columns
Hi Appan, In my experience I have seen that Query 2 does not use partition pruning because it's not a straight up filtering and involves using functions (aka UDFs). What version of Hive are you using ? Thanks, Viral On Thu, May 14, 2015 at 1:48 PM, Appan Thirumaligai wrote: > Hi, > > I have a question on Hive Optimizer. I have a table with partition columns > eg.,Sales partitioned by year, month, day. Assume that I have two years > worth of data on this table. I'm running two queries on this table. > > Query 1: Select * from Sales where year=2015 and month = 5 and day between > 1 and 7 > > Query 2: Select * from Sales where concat_ws('-',cast(year as > string),lpad(cast(month as string),2,'0'),lpad(cast(day as string),2,'0')) > between '2015-01-01' and '2015-01-07' > > When I ran Explain command on the above two queries I get a Filter > operation for the 2nd Query and there is no Filter Operation for the first > query. > > My question is: Do both queries use the partitions or is it used only in > Query 1 and for Query 2 it will be a scan of all the data? > > Thanks for your help. > > Thanks, > Appan >