Not able to get selective columns in ORC using MapReduce

2016-02-18 Thread Akansha Jain
Hi,
I am reading ORC files in a MapReduce program, using OrcNewInputFormat as
the input format. I want to select only a few fields out of a 50+ column
table while reading. Although I am setting hive.io.file.readcolumn.ids and
hive.io.file.read.all.columns, all columns are still being selected in the
mapper. What am I missing?
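
For reference, this is roughly how my driver sets the properties (a
minimal sketch; the column ids 0, 3 and 7, the input path argument and the
class name OrcProjectionDriver are placeholders):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class OrcProjectionDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Turn off "read all columns" and list the zero-based ids of the
      // columns to project; both must be set on the configuration before
      // the job (and hence the ORC record readers) is created.
      conf.setBoolean("hive.io.file.read.all.columns", false);
      conf.set("hive.io.file.readcolumn.ids", "0,3,7");

      Job job = Job.getInstance(conf, "orc-column-projection");
      job.setJarByClass(OrcProjectionDriver.class);
      job.setInputFormatClass(OrcNewInputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      // ... mapper, output key/value classes and output path as usual ...
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }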

Thanks,
Akansha jain


-- 
-Akansha


RE: Hive Bucketing

2016-01-25 Thread Akansha Jain
Thanks for the detailed explanation. Even without bucket pruning, the
expectation from bucketing is a performance improvement. I am joining two
tables that are bucketed on the same column into the same number of
buckets, and comparing the performance with a join of two unbucketed
tables on the bucketing column. I am using the MapR distribution, and map
join conversion is enabled by default. Performance is the same in both
cases. I increased the data size from 120 GB to 400 GB to see a clear
performance difference, but the results are still the same. I have set
hive.optimize.bucketmapjoin=true.

Any clue why that would be happening?
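
For reference, the test looks roughly like this (a sketch; table and
column names are placeholders):

  SET hive.optimize.bucketmapjoin = true;

  -- both tables bucketed on user_id into the same number of buckets
  SELECT COUNT(*)
  FROM orders_bucketed o
  JOIN users_bucketed u
    ON o.user_id = u.user_id;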

Thanks,
AJ
On Jan 22, 2016 4:31 PM, "Mich Talebzadeh"  wrote:

> Hi,
>
> In general, my understanding is that it will soon be possible to use
> bucket pruning, much like partition pruning (elimination).
>
> Bucketing in Hive refers to hash partitioning, where a hashing function
> is applied. As in an RDBMS such as Oracle, Hive will apply a linear
> hashing algorithm to prevent data from clustering within specific
> partitions. Hashing is very effective if the column selected for
> bucketing has very high selectivity, like an ID column where selectivity
> (select count(distinct(column))/count(column)) = 1. In this case, the
> created partitions/files will be as evenly sized as possible. In a
> nutshell, bucketing is a method to get data evenly distributed over many
> partitions/files. One should define the number of buckets as a power of
> two -- 2^n, like 2, 4, 8, 16 etc. -- to achieve the best results.
> Bucketing will also help concurrency in Hive. It may even allow a
> partition-wise join, i.e. a join between two tables that are bucketed on
> the same column with the same number of buckets (has anyone tried this?).
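>
> As a sketch (the table and column names and the bucket count are
> illustrative):
>
>   -- selectivity check on the candidate bucketing column
>   SELECT COUNT(DISTINCT user_id) / COUNT(user_id) FROM user_events;
>
>   CREATE TABLE user_events_b (
>     user_id BIGINT,
>     event   STRING
>   )
>   CLUSTERED BY (user_id) INTO 16 BUCKETS
>   STORED AS ORC;
>
>   -- in releases before Hive 2.0 this must, if I remember correctly, be
>   -- set so that inserts really produce one file per bucket
>   SET hive.enforce.bucketing = true;
>   INSERT OVERWRITE TABLE user_events_b
>   SELECT user_id, event FROM user_events;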
>
> One more thing: when one defines the number of buckets at table creation
> time in Hive, the number of partitions/files is fixed. In contrast, with
> partitioning you do not have this limitation.
>
> Have you considered creating these tables as ORC tables?
>
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
>
> From: Akansha Jain [mailto:akansha.15au...@gmail.com]
> Sent: 22 January 2016 23:20
> To: user@hive.apache.org
> Subject: RE: Hive Bucketing
>
>
>
> Thanks for the response. I am using the MapR version of Hive 0.13. Could
> you tell me more about bucket pruning?
>
> On Jan 22, 2016 3:09 PM, "Mich Talebzadeh"  wrote:
>
> OK, we are talking about bucket pruning here.
>
> What version of Hive are you using?
>
> Bucket pruning, I believe, is available from version 2.0.
>
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh

RE: Hive Bucketing

2016-01-22 Thread Akansha Jain
Thanks for the response. I am using the MapR version of Hive 0.13. Could
you tell me more about bucket pruning?
On Jan 22, 2016 3:09 PM, "Mich Talebzadeh"  wrote:

> OK, we are talking about bucket pruning here.
>
> What version of Hive are you using?
>
> Bucket pruning, I believe, is available from version 2.0.
>
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> From: Akansha Jain [mailto:akansha.15au...@gmail.com]
> Sent: 22 January 2016 21:55
> To: user@hive.apache.org
> Subject: Hive Bucketing
>
>
>
> Hi All,
> I have enabled bucketing on a table: I created 256 buckets on user id.
> Now, when I query that table (select count(*) from table where userid =
> 172839393), MapReduce should use only a single bucket file as input to
> the mappers. But it considers all files as input to the mapper, and I do
> not see any performance benefit over running the same query on an
> unbucketed table.
>
> Do I have to set any property before running queries on bucketed tables?
> I tried a join query as well, but saw no performance improvement. In
> fact, I think it takes a few seconds longer than the unbucketed table.
>
> Thanks,
> AJ
>


Hive Bucketing

2016-01-22 Thread Akansha Jain
Hi All,
I have enabled bucketing on a table: I created 256 buckets on user id.
Now, when I query that table (select count(*) from table where userid =
172839393), MapReduce should use only a single bucket file as input to the
mappers. But it considers all files as input to the mapper, and I do not
see any performance benefit over running the same query on an unbucketed
table.

Do I have to set any property before running queries on bucketed tables? I
tried a join query as well, but saw no performance improvement. In fact, I
think it takes a few seconds longer than the unbucketed table.
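
For reference, the setup looks roughly like this (a sketch; the table and
column names are placeholders):

  CREATE TABLE events (
    userid  BIGINT,
    payload STRING
  )
  CLUSTERED BY (userid) INTO 256 BUCKETS;

  SELECT COUNT(*) FROM events WHERE userid = 172839393;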

Thanks,
AJ


Adding JAR in Hive classpath

2015-09-04 Thread Akansha Jain
Hi All,

I am facing an issue with the Hive classpath. I have written a UDAF that
uses Apache Commons Math version 3.3, so when creating the temporary
function I first add the Commons Math 3.3 jar, then the UDAF jar, and then
create the temporary function.
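
The sequence I run is roughly this (the paths, function name and class
name are placeholders):

  ADD JAR /path/to/commons-math3-3.3.jar;
  ADD JAR /path/to/my-udaf.jar;
  CREATE TEMPORARY FUNCTION my_udaf AS 'com.example.MyUDAF';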

There is another version, Commons Math 3.1, present under the
HADOOP_HOME/lib directory.

The problem is that even after adding Commons Math 3.3 to the Hive
classpath (via ADD JAR ...), Hive picks up Commons Math 3.1 from the
HADOOP_HOME/lib folder. How do I remove the 3.1 version from the
classpath?

I tried using DELETE JAR ..., but it doesn't work.

Is there any way I can force Hive to pick my version rather than the one
in the Hadoop lib directory? Any help is appreciated.

Thanks

AJ


Fwd: Future date getting converted to epoch date with windowing function

2014-03-25 Thread Akansha Jain
Hi,
I am trying to use Hive windowing functions for a business use case. The
Hive version is Apache Hive 0.11.
I have a table with a column end_date whose value is 2999-12-31. When
using a Hive windowing function with this value, Hive converts it to a
1970s date.

Query used:

SELECT account_id,
       device_id,
       status,
       LEAD(status) OVER (PARTITION BY device_id ORDER BY start_date DESC)
         prev_status,
       start_date,
       end_date
FROM my_table;

Sample data:

account_id  device_id  status  primary_min  start_date           end_date
9           111        2       111          2012-08-29 00:00:00  2013-08-14 00:00:00
9           111        5       111          2013-08-15 00:00:00  2013-08-15 00:00:00
9           111        4       111          2013-08-16 00:00:00  2013-11-30 00:00:00
9           111        4       111          2013-12-01 00:00:00  2013-12-01 00:00:00
9           111        4       111          2013-12-02 00:00:00  2014-01-15 00:00:00
9           111        4       111          2014-01-16 00:00:00  2999-12-31 00:00:00

Output:

account_id  device_id  status  prev_status  start_date           end_date
9           111        2       NULL         2012-08-29 00:00:00  2013-08-14 00:00:00
9           111        5       2            2013-08-15 00:00:00  2013-08-15 00:00:00
9           111        4       5            2013-08-16 00:00:00  2013-11-30 00:00:00
9           111        4       4            2013-12-01 00:00:00  2013-12-01 00:00:00
9           111        4       4            2013-12-02 00:00:00  2014-01-15 00:00:00
9           111        4       4            2014-01-16 00:00:00  1979-03-26 23:28:00

Here, the date 2999-12-31 got converted to 1979-03-26. I have tried
converting the date type to String, but it did not help.
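
The String conversion I tried looked roughly like this (a sketch):

  SELECT account_id,
         device_id,
         status,
         LEAD(status) OVER (PARTITION BY device_id ORDER BY start_date DESC)
           prev_status,
         start_date,
         CAST(end_date AS STRING) end_date
  FROM my_table;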

Please let me know if anyone has faced the same issue and resolved it.

Thanks in advance,

Akansha