Any hooks to invoke the custom database's statistics for aggregate hive queries
Hi All, We have developed a custom storgeHandler implementing *HiveStorageHandler*. We also have Api's/statistics for totalCount, max, min etc. for the data stored in our database. See below example queries: 1. select count(*) from my_table; 2. select max(id_column) from my_table; So for above queries instead of full table scan, storageHandler should be able to invoke our totalCount, max, min etc. methods So are there any hooks to invoke these statistics api's for aggregate hive queries which can do simple look up of these statistics? Thanks, Amey
[jira] [Created] (HIVE-13717) Push Predicate information for 'like' and 'between' operators to custom storage handler.
Amey Barve created HIVE-13717: - Summary: Push Predicate information for 'like' and 'between' operators to custom storage handler. Key: HIVE-13717 URL: https://issues.apache.org/jira/browse/HIVE-13717 Project: Hive Issue Type: Bug Components: Query Processor Environment: TableScanDesc Reporter: Amey Barve Custom storage-handler/s are able to get predicate information from hive for '=', '!=', '<', '<=', '>' and '>=' operators. But cannot get predicates information from hive for 'like', 'between' operators. Eg: final String expression = conf.get(TableScanDesc.FILTER_EXPR_CONF_STR); here expression remains null for like and between operators but not null for above operators. custom storage handlers need information for predicate push down for these operators. Is there some other mechanism to get predicates for 'like' operator? I tested with hive version 1.2 and 0.14. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Predicates for 'like' and 'between' operators to custom storage handler.
Hi, Do you have the equivalent of that operation in pure SQL. ---> This is my hive query: *select count(*) from u_data where unixtime like '%888904884%'* Query evaluates and results are correct. But point is that hive does not give like operator during predicate push down to custom storage handler. I am mapping with *hive's UDFLike* class. Also have you tried Spark query tool with Hive table. ---> *No*. I gather you are doing this through Java? ---> *YES*. Has anybody tried mapping operators other than '=', '!=', '<', '<=', '>' and '>=' ? Regards, Amey On Thu, May 5, 2016 at 5:44 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Hi, > > Do you have the equivalent of that operation in pure SQL. Also have you > tried Spark query tool with Hive table. > > I gather you are doing this through Java? > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 5 May 2016 at 13:10, Amey Barve <ameybarv...@gmail.com> wrote: > >> Thanks Mich, >> >> It does work, but with operators other than '=', '!=', '<', '<=', '>' and >> '>=' , my code with custom storage handler gets null expression. >> >> *final String expression = conf.get(TableScanDesc.FILTER_EXPR_CONF_STR); >> // expression is null for hive query that has like operator* >> >> Why does above API return *null* for hive query that has like operator? >> I need to pass like operator for predicate push down to my custom storage >> handler. >> >> Regards, >> Amey >> >> On Thu, May 5, 2016 at 5:30 PM, Mich Talebzadeh < >> mich.talebza...@gmail.com> wrote: >> >>> On a normal query using sql in* Hive 2* LIKE predicate works fine. Case >>> in point in a 1 billion rows table with the column random_string of >>> varchar(50) I have one row that satisfies the followinh@ >>> >>> >>> +---+--+--+---+-+-++--+ >>> | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised >>> | dummy.random_string | dummy.small_vc | >>> dummy.padding | >>> >>> +---+--+--+---+-+-++--+ >>> | 1 | 0| 0| 63| >>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1 | >>> xx | >>> | 2 | 0| 1| 926 | >>> UEDJsfIgoYqwreSuuvjIcPZarpxMdCthpDCsgPlJfvIiylLiBS | 2 | >>> xx | >>> >>> Now let us try to select that row with LIKE predicate: >>> >>> 0: jdbc:hive2://rhes564:10010/default> select count(1) from dummy where >>> random_string like 'rMLTDXxxqXOZnqYRJ%'; >>> >>> INFO : >>> Query Hive on Spark job[0] stages: >>> INFO : 0 >>> INFO : 1 >>> INFO : >>> Status: Running (Hive on Spark job[0]) >>> >>> INFO : Completed executing >>> command(queryId=hduser_20160505125700_cbc415b6-91bb-4ed6-95e4-d177e12988f6); >>> Time taken: 153.544 seconds >>> INFO : OK >>> +-+--+ >>> | c0 | >>> +-+--+ >>> | 1 | >>> +-+--+ >>> 1 row selected (153.959 seconds) >>> >>> So it does work >>> >>> HTH >>> >>> >>> Dr Mich Talebzadeh >>> >>> >>> >>> LinkedIn * >>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> >>> On 5 May 2016 at 11:53, Amey Barve <ameybarv...@gmail.com> wrote: >>> >>>> Hi All, >>>> >>>> I have implemented custom storage-handler and able to get predicates >>>> from hive for '=', '!=', '<', '<=', '>' and '>=' operators. >>>> But I cannot get predicates from hive for 'like', 'between' operators. >>>> >>>> Here's my code: >>>> >>>> *final String expression = >>>> conf.get(TableScanDesc.FILTER_EXPR_CONF_STR);* >>>> >>>> here expression remains null for like and between operators but not >>>> null for above operators. >>>> >>>> Does hive not give predicates for 'like' and 'between' operators to >>>> custom storage handler ? >>>> *Is there some other mechanism to get predicates for 'like' operator*? >>>> >>>> I tested with hive version 1.2 and 0.14. >>>> >>>> Thanks and Regards, >>>> Amey >>>> >>> >>> >> >
Re: Predicates for 'like' and 'between' operators to custom storage handler.
Thanks Mich, It does work, but with operators other than '=', '!=', '<', '<=', '>' and '>=' , my code with custom storage handler gets null expression. *final String expression = conf.get(TableScanDesc.FILTER_EXPR_CONF_STR); // expression is null for hive query that has like operator* Why does above API return *null* for hive query that has like operator? I need to pass like operator for predicate push down to my custom storage handler. Regards, Amey On Thu, May 5, 2016 at 5:30 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > On a normal query using sql in* Hive 2* LIKE predicate works fine. Case > in point in a 1 billion rows table with the column random_string of > varchar(50) I have one row that satisfies the followinh@ > > > +---+--+--+---+-+-++--+ > | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised > | dummy.random_string | dummy.small_vc | > dummy.padding | > > +---+--+--+---+-+-++--+ > | 1 | 0| 0| 63| > rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1 | > xx | > | 2 | 0| 1| 926 | > UEDJsfIgoYqwreSuuvjIcPZarpxMdCthpDCsgPlJfvIiylLiBS | 2 | > xx | > > Now let us try to select that row with LIKE predicate: > > 0: jdbc:hive2://rhes564:10010/default> select count(1) from dummy where > random_string like 'rMLTDXxxqXOZnqYRJ%'; > > INFO : > Query Hive on Spark job[0] stages: > INFO : 0 > INFO : 1 > INFO : > Status: Running (Hive on Spark job[0]) > > INFO : Completed executing > command(queryId=hduser_20160505125700_cbc415b6-91bb-4ed6-95e4-d177e12988f6); > Time taken: 153.544 seconds > INFO : OK > +-+--+ > | c0 | > +-+--+ > | 1 | > +-+--+ > 1 row selected (153.959 seconds) > > So it does work > > HTH > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 5 May 2016 at 11:53, Amey Barve <ameybarv...@gmail.com> wrote: > >> Hi All, >> >> I have implemented custom storage-handler and able to get predicates from >> hive for '=', '!=', '<', '<=', '>' and '>=' operators. >> But I cannot get predicates from hive for 'like', 'between' operators. >> >> Here's my code: >> >> *final String expression = conf.get(TableScanDesc.FILTER_EXPR_CONF_STR);* >> >> here expression remains null for like and between operators but not null >> for above operators. >> >> Does hive not give predicates for 'like' and 'between' operators to >> custom storage handler ? >> *Is there some other mechanism to get predicates for 'like' operator*? >> >> I tested with hive version 1.2 and 0.14. >> >> Thanks and Regards, >> Amey >> > >
Predicates for 'like' and 'between' operators to custom storage handler.
Hi All, I have implemented custom storage-handler and able to get predicates from hive for '=', '!=', '<', '<=', '>' and '>=' operators. But I cannot get predicates from hive for 'like', 'between' operators. Here's my code: *final String expression = conf.get(TableScanDesc.FILTER_EXPR_CONF_STR);* here expression remains null for like and between operators but not null for above operators. Does hive not give predicates for 'like' and 'between' operators to custom storage handler ? *Is there some other mechanism to get predicates for 'like' operator*? I tested with hive version 1.2 and 0.14. Thanks and Regards, Amey
Re: select count(*) from table;
Hi All, Can custom storage handlers get information for queries like count, max, min etc. from hive directly so that for each of such queries RecordReader need not fetch all the records? Regards, Amey On Tue, Mar 22, 2016 at 1:46 PM, Amey Barve <ameybarv...@gmail.com> wrote: > Thanks Nitin, Mich, > > if its just plain vanilla text file format, it needs to run a job to get > the count so the longest of all > --> Hive must be translating some operator like fetch (for count) into a > map-reduce job and getting the result? > Can a custom storage handler get information about the operator/s for > count(*) and then use it to retrieve the results. > > I want to know whether custom storage handler can get information about > operators that hive constructs for queries like count, max, min etc. so > that storage handler can map these to internal storage functions? > > Regards, > Amey > > On Tue, Mar 22, 2016 at 1:32 PM, Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> ORC file has the following stats levels for storage indexes >> >> >>1. ORC File itself >>2. Multiple stripes (chunks) within the ORC file >>3. Multiple row groups (row batches) within each stripe >> >> Assuming that the underlying table has stats updated, count will be >> stored for each column >> >> So when we do something like below: >> >> select count(1) from orctest >> >> you can see stats collected if you do >> >> show create table orctest; >> >> TBLPROPERTIES ( | >> | 'COLUMN_STATS_ACCURATE'='true',| >> | 'numFiles'='31', | >> | *'numRows'='25'*,| >> >> >> File statistics, Stripe statistics and row group statistics are kept. So >> ORC table will rely on those if needed >> >> >> HTH >> >> >> >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> >> On 22 March 2016 at 07:14, Amey Barve <ameybarv...@gmail.com> wrote: >> >>> select count(*) from table; >>> >>> How does hive evaluate count(*) on a table? >>> >>> Does it return count by actually querying table, or directly return >>> count by consulting some statistics locally. >>> >>> For Hive's Text format it takes few seconds while Hive's Orc format >>> takes fraction of seconds. >>> >>> Regards, >>> Amey >>> >> >> >
Re: select count(*) from table;
Thanks Nitin, Mich, if its just plain vanilla text file format, it needs to run a job to get the count so the longest of all --> Hive must be translating some operator like fetch (for count) into a map-reduce job and getting the result? Can a custom storage handler get information about the operator/s for count(*) and then use it to retrieve the results. I want to know whether custom storage handler can get information about operators that hive constructs for queries like count, max, min etc. so that storage handler can map these to internal storage functions? Regards, Amey On Tue, Mar 22, 2016 at 1:32 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > ORC file has the following stats levels for storage indexes > > >1. ORC File itself >2. Multiple stripes (chunks) within the ORC file >3. Multiple row groups (row batches) within each stripe > > Assuming that the underlying table has stats updated, count will be stored > for each column > > So when we do something like below: > > select count(1) from orctest > > you can see stats collected if you do > > show create table orctest; > > TBLPROPERTIES ( | > | 'COLUMN_STATS_ACCURATE'='true',| > | 'numFiles'='31', | > | *'numRows'='25'*,| > > > File statistics, Stripe statistics and row group statistics are kept. So > ORC table will rely on those if needed > > > HTH > > > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 22 March 2016 at 07:14, Amey Barve <ameybarv...@gmail.com> wrote: > >> select count(*) from table; >> >> How does hive evaluate count(*) on a table? >> >> Does it return count by actually querying table, or directly return count >> by consulting some statistics locally. >> >> For Hive's Text format it takes few seconds while Hive's Orc format takes >> fraction of seconds. >> >> Regards, >> Amey >> > >
select count(*) from table;
select count(*) from table; How does hive evaluate count(*) on a table? Does it return count by actually querying table, or directly return count by consulting some statistics locally. For Hive's Text format it takes few seconds while Hive's Orc format takes fraction of seconds. Regards, Amey
predicate push down with custom StorageHandler
Hi All, I have implemented HiveStoragePredicateHandler for my StorageHandler. I am able to push down predicates for following query and this works fine. Example: select o_custkey,c_custkey from customer,orders where c_mktsegment = 'BUILDING'; (*column-ref = c_mktsegment comparison-op = op constant-value = 'BUILDING'*) I get no results when I execute following query Example: select o_custkey,c_custkey from customer,orders where c_mktsegment = 'BUILDING' and c_custkey = o_custkey limit 10; in this case (*column-ref = c_mktsegment comparison-op = op constant-value = 'BUILDING'*) is evaluated by my storageHandler. While (*column-ref = c_custkey comparison-op = op constant-value = o_custkey*) is evaluated by hive. Does such kind of query work with non native tables with predicate push down or am I missing anything? Does any custom StorageHandler like HBaseStorageHandler, AccumuloStorageHandler, CassandraStorageHandler etc. support such kind of predicate push down. Thanks and Regards, Amey
unique-id for the mapper task with tez execution engine
Hi All, Thanks in advance! I am running hive queries with mr engine and I wanted to get unique-id from the mapper task so I used following property from configuration conf.get("mapreduce.task.id"); Now I want to run same hive queries with tez engine and I want to know what should be my unique-id. Is there any property from configuration or other that can give me unique-id from the mapper task? I am using conf.get("mapreduce.task.partition") property, Is that guranteed to give unique - id for tez? Regards, Amey