Re: Doubt on Hive Partitioning.

2016-08-02 Thread Jörn Franke
Partition pruning works also with older Hive version, but you have to put the 
filter in the join statement and not only in the where statement 

> On 02 Aug 2016, at 09:53, Furcy Pin  wrote:
> 
> I'm using Hive 1.1 on MR and dynamic partition pruning does not seem to work.
> 
> Since MR is deprecated in 2.0, I assume we should not expect any future perf 
> optimisation on this side.
> 
> It has been implemented for Hive on Spark, though.
> https://issues.apache.org/jira/browse/HIVE-9152
> 
> 
> 
> 
>> On Tue, Aug 2, 2016 at 3:45 AM, Qiuzhuang Lian  
>> wrote:
>> Is this partition pruning fixed in MR too except for TEZ in newer hive 
>> version?
>> 
>> Regards,
>> Q
>> 
>>> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Franke  wrote:
>>> It happens in old hive version of the filter is only in the where clause 
>>> and NOT in the join clause. This should not happen in newer hive version. 
>>> You can check it by executing explain dependency query. 
>>> 
 On 01 Aug 2016, at 11:07, Abhishek Dubey  
 wrote:
 
 Hi All,
 
  
 
 I have a very big table t with billions of rows and it is partitioned on a 
 column p. Column p  has datatype text and values like ‘201601’, 
 ‘201602’…upto ‘201612’.
 
 And, I am running a query like : Select columns from t where p=’201604’.
 
  
 
 My question is : Can there be a scenario/condition/probability that my 
 query will do a complete table scan on t instead of only reading data for 
 specified partition key. If yes, please put some light on those scenario.
 
  
 
 I’m asking this because someone told me that there is a probability that 
 the query will ignore the partitioning and do a complete table scan to 
 fetch output.
 
  
 
 
 Thanks & Regards,
 Abhishek Dubey
> 


Re: Doubt on Hive Partitioning.

2016-08-02 Thread Jörn Franke
I do not think so, but never tested it.

> On 02 Aug 2016, at 03:45, Qiuzhuang Lian  wrote:
> 
> Is this partition pruning fixed in MR too except for TEZ in newer hive 
> version?
> 
> Regards,
> Q
> 
>> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Franke  wrote:
>> It happens in old hive version of the filter is only in the where clause and 
>> NOT in the join clause. This should not happen in newer hive version. You 
>> can check it by executing explain dependency query. 
>> 
>>> On 01 Aug 2016, at 11:07, Abhishek Dubey  wrote:
>>> 
>>> Hi All,
>>> 
>>>  
>>> 
>>> I have a very big table t with billions of rows and it is partitioned on a 
>>> column p. Column p  has datatype text and values like ‘201601’, 
>>> ‘201602’…upto ‘201612’.
>>> 
>>> And, I am running a query like : Select columns from t where p=’201604’.
>>> 
>>>  
>>> 
>>> My question is : Can there be a scenario/condition/probability that my 
>>> query will do a complete table scan on t instead of only reading data for 
>>> specified partition key. If yes, please put some light on those scenario.
>>> 
>>>  
>>> 
>>> I’m asking this because someone told me that there is a probability that 
>>> the query will ignore the partitioning and do a complete table scan to 
>>> fetch output.
>>> 
>>>  
>>> 
>>> 
>>> Thanks & Regards,
>>> Abhishek Dubey
> 


Re: Doubt on Hive Partitioning.

2016-08-02 Thread Furcy Pin
I'm using Hive 1.1 on MR and dynamic partition pruning does not seem to
work.

Since MR is deprecated in 2.0, I assume we should not expect any future
perf optimisation on this side.

It has been implemented for Hive on Spark, though.
https://issues.apache.org/jira/browse/HIVE-9152




On Tue, Aug 2, 2016 at 3:45 AM, Qiuzhuang Lian 
wrote:

> Is this partition pruning fixed in MR too except for TEZ in newer hive
> version?
>
> Regards,
> Q
>
> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Franke  wrote:
>
>> It happens in old hive version of the filter is only in the where clause
>> and NOT in the join clause. This should not happen in newer hive version.
>> You can check it by executing explain dependency query.
>>
>> On 01 Aug 2016, at 11:07, Abhishek Dubey > > wrote:
>>
>> Hi All,
>>
>>
>>
>> I have a very big table *t* with billions of rows and it is partitioned
>> on a column *p*. Column *p * has datatype text and values like ‘201601’,
>> ‘201602’…upto ‘201612’.
>>
>> And, I am running a query like : *Select columns from t where
>> p=’201604’.*
>>
>>
>>
>> My question is : Can there be a scenario/condition/probability that my
>> query will do a complete table scan on *t* instead of only reading data
>> for specified partition key. If yes, please put some light on those
>> scenario.
>>
>>
>>
>> I’m asking this because someone told me that there is a probability that
>> the query will ignore the partitioning and do a complete table scan to
>> fetch output.
>>
>>
>>
>> *Thanks & Regards,*
>> *Abhishek Dubey*
>>
>>
>>
>>
>


Re: Doubt on Hive Partitioning.

2016-08-01 Thread Qiuzhuang Lian
Is this partition pruning fixed in MR too except for TEZ in newer hive
version?

Regards,
Q

On Mon, Aug 1, 2016 at 8:48 PM, Jörn Franke  wrote:

> It happens in old hive version of the filter is only in the where clause
> and NOT in the join clause. This should not happen in newer hive version.
> You can check it by executing explain dependency query.
>
> On 01 Aug 2016, at 11:07, Abhishek Dubey  > wrote:
>
> Hi All,
>
>
>
> I have a very big table *t* with billions of rows and it is partitioned
> on a column *p*. Column *p * has datatype text and values like ‘201601’,
> ‘201602’…upto ‘201612’.
>
> And, I am running a query like : *Select columns from t where p=’201604’.*
>
>
>
> My question is : Can there be a scenario/condition/probability that my
> query will do a complete table scan on *t* instead of only reading data
> for specified partition key. If yes, please put some light on those
> scenario.
>
>
>
> I’m asking this because someone told me that there is a probability that
> the query will ignore the partitioning and do a complete table scan to
> fetch output.
>
>
>
> *Thanks & Regards,*
> *Abhishek Dubey*
>
>
>
>


Re: Doubt on Hive Partitioning.

2016-08-01 Thread Gopal Vijayaraghavan


> WHERE p IN (SELECT p FROM t2)


> here we could argue that Hive could optimize this by computing the sub
>query first, 
> and then do the partition pruning, but sadly I don't think this
>optimisation has been implemented yet

It is implemented already -


In Hive-1.x, the optimization doesn't kick in when the partition column
has a UDF wrapped around it.

In Hive-2.0, it does apply even if the partition column is wrapped with a
UDF.

"explain rewrite  where p IN (Select p from t2);"

will show the rewrite which enables DPP.

> An example of non-deterministic function are rand() and unix_timestamp()
>because it is evaluated differently at each row

Yes, that is exactly right. Another case was TO_DATE() which in Hive-1.x
returned Strings and prevented the removal of partitions.

Cheers,
Gopal





Re: Doubt on Hive Partitioning.

2016-08-01 Thread Jörn Franke
It happens in old hive version of the filter is only in the where clause and 
NOT in the join clause. This should not happen in newer hive version. You can 
check it by executing explain dependency query. 

> On 01 Aug 2016, at 11:07, Abhishek Dubey  wrote:
> 
> Hi All,
>  
> I have a very big table t with billions of rows and it is partitioned on a 
> column p. Column p  has datatype text and values like ‘201601’, ‘201602’…upto 
> ‘201612’.
> And, I am running a query like : Select columns from t where p=’201604’.
>  
> My question is : Can there be a scenario/condition/probability that my query 
> will do a complete table scan on t instead of only reading data for specified 
> partition key. If yes, please put some light on those scenario.
>  
> I’m asking this because someone told me that there is a probability that the 
> query will ignore the partitioning and do a complete table scan to fetch 
> output.
>  
> Thanks & Regards,
> Abhishek Dubey
>  


Re: Doubt on Hive Partitioning.

2016-08-01 Thread Furcy Pin
Hi Abhishek,

Yes, it can happen.

The only such scenarios I can think of are when you use a WHERE clause with
a non-constant clause.
As far as I know, partition only work on constant clauses, because it has
to evaluate them *before* starting the query in order to prune the
partitions.

For instance:

WHERE p = otherColumn
> here the predicate will depend on the row being read, thus all rows must
be read.
> if otherColumn is a partition, I don't think it work either

WHERE p IN (SELECT p FROM t2)
> here we could argue that Hive could optimize this by computing the sub
query first,
> and then do the partition pruning, but sadly I don't think this
optimisation has been implemented yet


WHERE f(p) = 'constant'
or
WHERE p = f('constant')

where f is a non-deterministic or non-stateful UDF.
An example of non-deterministic function are rand() and
unix_timestamp() because
it is evaluated differently at each row

So if you want today's partition, you should use instead current_date(),
which is deterministic,  since it takes the time of compilation of the
query.
It is only available since Hive 1.2.0 though.

You can know if a Hive UDF is deterministic and stateful by looking at the
class annotation UDFType in it's source code.
If you plan on writing your own UDF, don't forget to specifiy this
annotation as well.

hope this helps,

Furcy




On Mon, Aug 1, 2016 at 11:07 AM, Abhishek Dubey 
wrote:

> Hi All,
>
>
>
> I have a very big table *t* with billions of rows and it is partitioned
> on a column *p*. Column *p * has datatype text and values like ‘201601’,
> ‘201602’…upto ‘201612’.
>
> And, I am running a query like : *Select columns from t where p=’201604’.*
>
>
>
> My question is : Can there be a scenario/condition/probability that my
> query will do a complete table scan on *t* instead of only reading data
> for specified partition key. If yes, please put some light on those
> scenario.
>
>
>
> I’m asking this because someone told me that there is a probability that
> the query will ignore the partitioning and do a complete table scan to
> fetch output.
>
>
>
> *Thanks & Regards,*
> *Abhishek Dubey*
>
>
>


Doubt on Hive Partitioning.

2016-08-01 Thread Abhishek Dubey
Hi All,

I have a very big table t with billions of rows and it is partitioned on a 
column p. Column p  has datatype text and values like '201601', '201602'...upto 
'201612'.
And, I am running a query like : Select columns from t where p='201604'.

My question is : Can there be a scenario/condition/probability that my query 
will do a complete table scan on t instead of only reading data for specified 
partition key. If yes, please put some light on those scenario.

I'm asking this because someone told me that there is a probability that the 
query will ignore the partitioning and do a complete table scan to fetch output.

Thanks & Regards,
Abhishek Dubey