Any hooks to invoke the custom database's statistics for aggregate hive queries

2017-09-12 Thread Amey Barve
Hi All,

We have developed a custom storage handler implementing *HiveStorageHandler*.
We also have APIs/statistics for totalCount, max, min, etc. for the data
stored in our database.

See the example queries below:
1. select count(*) from my_table;
2. select max(id_column) from my_table;

So for the queries above, instead of a full table scan, the storage handler
should be able to invoke our totalCount, max, min, etc. methods.

Are there any hooks for invoking these statistics APIs for aggregate Hive
queries, so that such queries become a simple lookup of these statistics?
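
As far as I know, HiveStorageHandler does not expose a dedicated
aggregate-pushdown hook in these versions, so the closest built-in mechanism
is Hive's stats-based answering for native tables, sketched below under that
assumption. If the table statistics are present and marked accurate, count,
min and max can be served from the metastore instead of a scan; for a
non-native table the handler would still need some way to keep those
statistics populated (my_table and id_column are just the names from the
example queries above):

    -- Hedged sketch: Hive's stats-based path for simple aggregates.
    SET hive.compute.query.using.stats=true;

    -- Populate table-level stats (numRows, ...) and column stats (min/max, ...):
    ANALYZE TABLE my_table COMPUTE STATISTICS;
    ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;

    -- With accurate stats these can be answered from the metastore, without a scan:
    SELECT count(*) FROM my_table;
    SELECT max(id_column) FROM my_table;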

Thanks,
Amey


[jira] [Created] (HIVE-13717) Push Predicate information for 'like' and 'between' operators to custom storage handler.

2016-05-09 Thread Amey Barve (JIRA)
Amey Barve created HIVE-13717:
-

 Summary: Push Predicate information for 'like' and 'between' 
operators to custom storage handler.
 Key: HIVE-13717
 URL: https://issues.apache.org/jira/browse/HIVE-13717
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
 Environment: TableScanDesc
Reporter: Amey Barve


Custom storage handlers are able to get predicate information from Hive for 
the '=', '!=', '<', '<=', '>' and '>=' operators, but they cannot get predicate 
information from Hive for the 'like' and 'between' operators.

Eg:
final String expression = conf.get(TableScanDesc.FILTER_EXPR_CONF_STR);

Here expression remains null for 'like' and 'between' operators but is non-null 
for the operators listed above. Custom storage handlers need this predicate 
information in order to push these operators down.

Is there some other mechanism to get predicates for 'like' operator?

I tested with hive version 1.2 and 0.14.
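
For reference, a hedged reader-side sketch of the lookup described above,
using Hive 1.x APIs (on newer releases deserializeExpression moved from
Utilities to SerializationUtilities; the enclosing input-format class and its
jobConf are assumed and not shown):

    import org.apache.hadoop.hive.ql.exec.Utilities;
    import org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc;
    import org.apache.hadoop.hive.ql.plan.TableScanDesc;

    // Inside the storage handler's input format / record reader setup:
    String serializedFilter = jobConf.get(TableScanDesc.FILTER_EXPR_CONF_STR);
    if (serializedFilter == null) {
      // Nothing was pushed down (what we observe for 'like'/'between'):
      // fall back to a full scan and let Hive apply the filter afterwards.
    } else {
      // The pushed predicate arrives as a serialized expression tree.
      ExprNodeGenericFuncDesc filterExpr =
          Utilities.deserializeExpression(serializedFilter);
      // ... translate filterExpr into the native store's filter API here ...
    }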





Re: Predicates for 'like' and 'between' operators to custom storage handler.

2016-05-05 Thread Amey Barve
Hi,

Do you have the equivalent of that operation in pure SQL?
---> This is my Hive query: *select count(*) from u_data where unixtime
like '%888904884%'*
The query evaluates and the results are correct. But the point is that Hive
does not hand the LIKE operator to the custom storage handler during
predicate push down. I am mapping it against *Hive's UDFLike* class.

Also have you tried Spark query tool with Hive table?
---> *No*.

I gather you are doing this through Java?
---> *YES*.

Has anybody tried mapping operators other than  '=', '!=', '<', '<=', '>'
and '>=' ?
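
One hedged explanation (not verified against the Hive source for these
versions): storage handlers that implement HiveStoragePredicateHandler
typically decompose the predicate with IndexPredicateAnalyzer, and that
analyzer only pushes operators that were explicitly registered on it. LIKE
arrives as a bridged UDFLike and BETWEEN as GenericUDFBetween, so unless
decomposePredicate handles them itself they stay in the residual predicate
and nothing reaches FILTER_EXPR_CONF_STR on the reader side. A rough sketch
of such a decomposePredicate body (the column name and the incoming
'predicate' variable are placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hive.ql.index.IndexPredicateAnalyzer;
    import org.apache.hadoop.hive.ql.index.IndexSearchCondition;
    import org.apache.hadoop.hive.ql.metadata.HiveStoragePredicateHandler.DecomposedPredicate;
    import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
    import org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc;

    // Only operators registered here are pushed; everything else becomes
    // residual and is evaluated by Hive after the scan.
    IndexPredicateAnalyzer analyzer = new IndexPredicateAnalyzer();
    analyzer.allowColumnName("unixtime");  // columns the store can filter on
    analyzer.addComparisonOp("org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual");
    analyzer.addComparisonOp("org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPGreaterThan");
    // ... the remaining comparison UDFs; note there is no entry for LIKE or BETWEEN.

    List<IndexSearchCondition> conditions = new ArrayList<IndexSearchCondition>();
    ExprNodeDesc residual = analyzer.analyzePredicate(predicate, conditions);

    DecomposedPredicate decomposed = new DecomposedPredicate();
    decomposed.pushedPredicate = analyzer.translateSearchConditions(conditions);
    decomposed.residualPredicate = (ExprNodeGenericFuncDesc) residual;
    return decomposed;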

Regards,
Amey

On Thu, May 5, 2016 at 5:44 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> Do you have the equivalent of that operation in pure SQL. Also have you
> tried Spark query tool with Hive table.
>
> I gather you are doing this through Java?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 5 May 2016 at 13:10, Amey Barve <ameybarv...@gmail.com> wrote:
>
>> Thanks Mich,
>>
>> It does work, but with operators other than '=', '!=', '<', '<=', '>' and
>> '>=' , my code with custom storage handler gets null expression.
>>
>> *final String expression = conf.get(TableScanDesc.FILTER_EXPR_CONF_STR);
>> // expression is null for hive query that has like operator*
>>
>> Why does above API return *null* for hive query that has like operator?
>> I need to pass like operator for predicate push down to my custom storage
>> handler.
>>
>> Regards,
>> Amey
>>
>> On Thu, May 5, 2016 at 5:30 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> On a normal query using SQL in *Hive 2* the LIKE predicate works fine. Case
>>> in point: in a 1 billion row table with the column random_string of
>>> varchar(50), I have one row that satisfies the following:
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+------------------------------------------------------+-----------------+----------------+
>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                  | dummy.small_vc  | dummy.padding  |
>>> +-----------+------------------+------------------+-------------------+------------------------------------------------------+-----------------+----------------+
>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   |  1              | xx             |
>>> | 2         | 0                | 1                | 926               | UEDJsfIgoYqwreSuuvjIcPZarpxMdCthpDCsgPlJfvIiylLiBS   |  2              | xx             |
>>> +-----------+------------------+------------------+-------------------+------------------------------------------------------+-----------------+----------------+
>>>
>>> Now let us try to select that row with LIKE predicate:
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select count(1) from dummy where
>>> random_string like 'rMLTDXxxqXOZnqYRJ%';
>>>
>>> INFO  :
>>> Query Hive on Spark job[0] stages:
>>> INFO  : 0
>>> INFO  : 1
>>> INFO  :
>>> Status: Running (Hive on Spark job[0])
>>>
>>> INFO  : Completed executing
>>> command(queryId=hduser_20160505125700_cbc415b6-91bb-4ed6-95e4-d177e12988f6);
>>> Time taken: 153.544 seconds
>>> INFO  : OK
>>> +-+--+
>>> | c0  |
>>> +-+--+
>>> | 1   |
>>> +-+--+
>>> 1 row selected (153.959 seconds)
>>>
>>> So it does work
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 5 May 2016 at 11:53, Amey Barve <ameybarv...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have implemented custom storage-handler and able to get predicates
>>>> from hive for '=', '!=', '<', '<=', '>' and '>=' operators.
>>>> But I cannot get predicates from hive for 'like', 'between' operators.
>>>>
>>>> Here's my code:
>>>>
>>>> *final String expression =
>>>> conf.get(TableScanDesc.FILTER_EXPR_CONF_STR);*
>>>>
>>>> here expression remains null for like and between operators but not
>>>> null for above operators.
>>>>
>>>> Does hive not give predicates for 'like' and 'between' operators to
>>>> custom storage handler ?
>>>> *Is there some other mechanism to get predicates for 'like' operator*?
>>>>
>>>> I tested with hive version 1.2 and 0.14.
>>>>
>>>> Thanks and Regards,
>>>> Amey
>>>>
>>>
>>>
>>
>


Re: Predicates for 'like' and 'between' operators to custom storage handler.

2016-05-05 Thread Amey Barve
Thanks Mich,

It does work, but for operators other than '=', '!=', '<', '<=', '>' and
'>=', my custom storage handler code gets a null expression.

*final String expression = conf.get(TableScanDesc.FILTER_EXPR_CONF_STR); //
expression is null for hive query that has like operator*

Why does the above API return *null* for a Hive query that has a LIKE operator?
I need the LIKE operator passed down for predicate push down to my custom
storage handler.

Regards,
Amey

On Thu, May 5, 2016 at 5:30 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> On a normal query using SQL in *Hive 2* the LIKE predicate works fine. Case
> in point: in a 1 billion row table with the column random_string of
> varchar(50), I have one row that satisfies the following:
>
>
> +-----------+------------------+------------------+-------------------+------------------------------------------------------+-----------------+----------------+
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                  | dummy.small_vc  | dummy.padding  |
> +-----------+------------------+------------------+-------------------+------------------------------------------------------+-----------------+----------------+
> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   |  1              | xx             |
> | 2         | 0                | 1                | 926               | UEDJsfIgoYqwreSuuvjIcPZarpxMdCthpDCsgPlJfvIiylLiBS   |  2              | xx             |
> +-----------+------------------+------------------+-------------------+------------------------------------------------------+-----------------+----------------+
>
> Now let us try to select that row with LIKE predicate:
>
> 0: jdbc:hive2://rhes564:10010/default> select count(1) from dummy where
> random_string like 'rMLTDXxxqXOZnqYRJ%';
>
> INFO  :
> Query Hive on Spark job[0] stages:
> INFO  : 0
> INFO  : 1
> INFO  :
> Status: Running (Hive on Spark job[0])
>
> INFO  : Completed executing
> command(queryId=hduser_20160505125700_cbc415b6-91bb-4ed6-95e4-d177e12988f6);
> Time taken: 153.544 seconds
> INFO  : OK
> +-+--+
> | c0  |
> +-+--+
> | 1   |
> +-+--+
> 1 row selected (153.959 seconds)
>
> So it does work
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 5 May 2016 at 11:53, Amey Barve <ameybarv...@gmail.com> wrote:
>
>> Hi All,
>>
>> I have implemented custom storage-handler and able to get predicates from
>> hive for '=', '!=', '<', '<=', '>' and '>=' operators.
>> But I cannot get predicates from hive for 'like', 'between' operators.
>>
>> Here's my code:
>>
>> *final String expression = conf.get(TableScanDesc.FILTER_EXPR_CONF_STR);*
>>
>> here expression remains null for like and between operators but not null
>> for above operators.
>>
>> Does hive not give predicates for 'like' and 'between' operators to
>> custom storage handler ?
>> *Is there some other mechanism to get predicates for 'like' operator*?
>>
>> I tested with hive version 1.2 and 0.14.
>>
>> Thanks and Regards,
>> Amey
>>
>
>


Predicates for 'like' and 'between' operators to custom storage handler.

2016-05-05 Thread Amey Barve
Hi All,

I have implemented a custom storage handler and am able to get predicates from
Hive for the '=', '!=', '<', '<=', '>' and '>=' operators, but I cannot get
predicates from Hive for the 'like' and 'between' operators.

Here's my code:

*final String expression = conf.get(TableScanDesc.FILTER_EXPR_CONF_STR);*

Here expression remains null for 'like' and 'between' operators but not null
for the operators above.

Does Hive not give predicates for the 'like' and 'between' operators to a
custom storage handler?
*Is there some other mechanism to get predicates for the 'like' operator*?

I tested with hive version 1.2 and 0.14.

Thanks and Regards,
Amey


Re: select count(*) from table;

2016-03-31 Thread Amey Barve
Hi All,

Can custom storage handlers get information for queries like count, max,
min, etc. from Hive directly, so that for such queries the RecordReader
need not fetch all the records?

Regards,
Amey

On Tue, Mar 22, 2016 at 1:46 PM, Amey Barve <ameybarv...@gmail.com> wrote:

> Thanks Nitin, Mich,
>
> if its just plain vanilla text file format, it needs to run a job to get
> the count so the longest of all
> --> Hive must be translating some operator like fetch (for count) into a
> map-reduce job and getting the result?
> Can a custom storage handler get information about the operator/s for
> count(*) and then use it to retrieve the results.
>
> I want to know whether custom storage handler can get information about
> operators that hive constructs for queries like count, max, min etc. so
> that storage handler can map these to internal storage functions?
>
> Regards,
> Amey
>
> On Tue, Mar 22, 2016 at 1:32 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> ORC file has the following stats levels for storage indexes
>>
>>
>>1. ORC File itself
>>2. Multiple stripes (chunks) within the ORC file
>>3. Multiple row groups (row batches) within each stripe
>>
>> Assuming that the underlying table has stats updated, count will be
>> stored for each column
>>
>> So when we do something like below:
>>
>> select count(1) from orctest
>>
>> you can see stats collected if you do
>>
>> show create table orctest;
>>
>> TBLPROPERTIES (
>>   'COLUMN_STATS_ACCURATE'='true',
>>   'numFiles'='31',
>>   *'numRows'='25'*,
>>
>>
>> File statistics, Stripe statistics and row group statistics are kept. So
>> ORC table will rely on those if needed
>>
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 22 March 2016 at 07:14, Amey Barve <ameybarv...@gmail.com> wrote:
>>
>>> select count(*) from table;
>>>
>>> How does hive evaluate count(*) on a table?
>>>
>>> Does it return count by actually querying table, or directly return
>>> count by consulting some statistics locally.
>>>
>>> For Hive's Text format it takes few seconds while Hive's Orc format
>>> takes fraction of seconds.
>>>
>>> Regards,
>>> Amey
>>>
>>
>>
>


Re: select count(*) from table;

2016-03-22 Thread Amey Barve
Thanks Nitin, Mich,

if its just plain vanilla text file format, it needs to run a job to get
the count so the longest of all
--> Hive must be translating some operator, like a fetch (for count), into a
map-reduce job and getting the result? Can a custom storage handler get
information about the operator(s) for count(*) and then use it to retrieve
the results?

I want to know whether a custom storage handler can get information about the
operators that Hive constructs for queries like count, max, min, etc., so that
the storage handler can map these to internal storage functions.

Regards,
Amey

On Tue, Mar 22, 2016 at 1:32 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> ORC file has the following stats levels for storage indexes
>
>
>1. ORC File itself
>2. Multiple stripes (chunks) within the ORC file
>3. Multiple row groups (row batches) within each stripe
>
> Assuming that the underlying table has stats updated, count will be stored
> for each column
>
> So when we do something like below:
>
> select count(1) from orctest
>
> you can see stats collected if you do
>
> show create table orctest;
>
> TBLPROPERTIES (
>   'COLUMN_STATS_ACCURATE'='true',
>   'numFiles'='31',
>   *'numRows'='25'*,
>
>
> File statistics, Stripe statistics and row group statistics are kept. So
> ORC table will rely on those if needed
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 March 2016 at 07:14, Amey Barve <ameybarv...@gmail.com> wrote:
>
>> select count(*) from table;
>>
>> How does hive evaluate count(*) on a table?
>>
>> Does it return count by actually querying table, or directly return count
>> by consulting some statistics locally.
>>
>> For Hive's Text format it takes few seconds while Hive's Orc format takes
>> fraction of seconds.
>>
>> Regards,
>> Amey
>>
>
>


select count(*) from table;

2016-03-22 Thread Amey Barve
select count(*) from table;

How does Hive evaluate count(*) on a table?

Does it return the count by actually scanning the table, or by directly
consulting some locally stored statistics?

For Hive's text format it takes a few seconds, while Hive's ORC format takes
a fraction of a second.

Regards,
Amey


predicate push down with custom StorageHandler

2016-02-26 Thread Amey Barve
Hi All,

I have implemented HiveStoragePredicateHandler for my StorageHandler.

I am able to push down predicates for the following query, and it works fine.
Example: select o_custkey,c_custkey from customer,orders where c_mktsegment
= 'BUILDING';
(*column-ref = c_mktsegment, comparison-op = op, constant-value = 'BUILDING'*)

I get no results when I execute the following query:
Example: select o_custkey,c_custkey from customer,orders where
 c_mktsegment = 'BUILDING' and c_custkey = o_custkey limit 10;
In this case (*column-ref = c_mktsegment, comparison-op = op, constant-value
= 'BUILDING'*) is evaluated by my storage handler,
while (*column-ref = c_custkey, comparison-op = op, constant-value =
o_custkey*) is evaluated by Hive.

Does this kind of query work with non-native tables with predicate push
down, or am I missing something?
Do any custom storage handlers, such as HBaseStorageHandler,
AccumuloStorageHandler, CassandraStorageHandler, etc., support this kind of
predicate push down?
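
For what it is worth, my understanding (a hedged sketch, not verified against
the Hive source for these versions) is that only comparisons of a column
against a constant are decomposed into the pushed predicate; c_custkey =
o_custkey compares two columns, so it has no constant side and stays in the
residual predicate, which Hive evaluates itself after the storage handler
returns rows. Roughly, inside decomposePredicate (pushableExpr and
residualExpr are placeholder variables):

    // Sketch of how the two predicates would be split up:
    DecomposedPredicate result = new DecomposedPredicate();

    // c_mktsegment = 'BUILDING'  -> column vs constant: can be pushed to the store.
    result.pushedPredicate = pushableExpr;

    // c_custkey = o_custkey      -> column vs column: no constant side, so it is
    // left for Hive to evaluate after the storage handler returns its rows.
    result.residualPredicate = residualExpr;

    return result;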

Thanks and Regards,
Amey


unique-id for the mapper task with tez execution engine

2015-12-22 Thread Amey Barve
Hi All,

Thanks in advance!

I am running Hive queries with the MR engine and wanted to get a unique id from
the mapper task, so I used the following property from the configuration:
conf.get("mapreduce.task.id");

Now I want to run the same Hive queries with the Tez engine, and I want to know
what my unique id should be. Is there any configuration property, or some other
mechanism, that can give me a unique id for the mapper task?

I am using the conf.get("mapreduce.task.partition") property. Is that guaranteed
to give a unique id under Tez?
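
I do not have a definitive answer, but here is a hedged sketch of what I would
try: probe the MR-compatibility properties first and fall back to the partition
index, since which properties are actually populated depends on the Hive and
Tez versions (and their compatibility shims) in use. The helper class and its
name are just an illustration, not an existing Hive API:

    import org.apache.hadoop.conf.Configuration;

    public final class TaskIds {
      private TaskIds() {}

      // Hedged sketch: derive a per-task id that works under both MR and Tez.
      // Verify against your environment before relying on it.
      public static String uniqueTaskId(Configuration conf) {
        // MRv2-style attempt id (set by the MR engine, and by Tez's MapReduce
        // compatibility layer in some versions).
        String id = conf.get("mapreduce.task.attempt.id");
        if (id == null) {
          id = conf.get("mapred.task.id");  // older/deprecated spelling
        }
        if (id != null) {
          return id;
        }
        // Fall back to the task partition: unique within one vertex/job, but
        // not across jobs, so combine it with the query id.
        return "task_" + conf.get("hive.query.id", "unknown")
            + "_" + conf.getInt("mapreduce.task.partition", -1);
      }
    }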

Regards,
Amey