Re: Is it worth storing in ORC for one time read. And can be replace hive with HBase

venkatesh b Thu, 06 Aug 2015 06:33:15 -0700

I'm really sorry, I posted this to the Spark mailing list by mistake.

Jörn Franke, thanks for your reply.
I have many joins and many complex queries, and all of them are table scans.
So I think HBase will not work for me.


On Thursday, August 6, 2015, Jörn Franke <jornfra...@gmail.com> wrote:

> Additionally, it is of key importance to use the right data types for the
> columns. Use int for ids, and int, decimal, float or double for numeric
> values. A bad data model that uses varchar and string where they are not
> appropriate is a significant bottleneck.
> Furthermore, include partition columns in the join statements (not in the
> where clause); otherwise you do a full table scan that ignores the
> partitions.
>
> On Thu, 6 Aug 2015 at 15:07, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Yes, you should use ORC; it is much faster and more compact. Additionally,
>> you can apply compression (Snappy) to increase performance. Your data
>> processing pipeline does not seem to be very well optimized. You should use
>> the newest Hive version and enable storage indexes and bloom filters on
>> appropriate columns. Ideally, you should insert the data sorted
>> appropriately. Partitioning and setting the execution engine to Tez are
>> also beneficial.
>>
>> HBase with Phoenix should currently only be used if you have few joins,
>> no very complex queries and not many full table scans.
>>
>> On Thu, 6 Aug 2015 at 14:54, venkatesh b <venkateshmailingl...@gmail.com> wrote:
>>
>>> Hi, I have two questions.
>>> FIRST:
>>> In our project we use Hive.
>>> We get new data daily. We need to process this new data only once and
>>> then send the processed data to an RDBMS. The processing mostly uses
>>> complex queries with joins, where conditions and grouping functions.
>>> Around 50 intermediate tables are generated during processing. Until now
>>> we have used text format for storage. We came across the ORC file
>>> format. Since each table is queried only once, I would like to know
>>> whether it is worth storing it in ORC format.
>>>
>>> SECOND:
>>> I have learned about HBase, which is said to be faster.
>>> Can I replace Hive with HBase to make the daily processing faster?
>>> Currently it takes 15 hours per day with Hive.
>>>
>>>
>>> Please inform me if any other information is needed.
>>>
>>> Thanks & regards
>>> Venkatesh
>>>
>>
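
For reference, the ORC advice quoted above (Snappy compression, bloom filters on selected columns, partitioning, Tez as the execution engine) could be sketched roughly as follows. This is only a minimal sketch: it is a small Python helper that prints the HiveQL one might run, and the table name `daily_events_orc`, the column names and the partition column `load_date` are hypothetical placeholders, not taken from the thread. The table properties `orc.compress` and `orc.bloom.filter.columns` and the settings `hive.execution.engine` and `hive.optimize.index.filter` are standard Hive names, but whether they help depends on the Hive version and the workload.

```python
def orc_table_ddl(table, columns, partition_columns=(),
                  bloom_filter_columns=(), compression="SNAPPY"):
    """Return a HiveQL CREATE TABLE statement for an ORC-backed table."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    lines = [f"CREATE TABLE {table} (\n  {cols}\n)"]
    if partition_columns:
        parts = ", ".join(f"{name} {hive_type}"
                          for name, hive_type in partition_columns)
        lines.append(f"PARTITIONED BY ({parts})")
    lines.append("STORED AS ORC")
    # ORC table properties: compression codec and optional bloom filter columns.
    props = {"orc.compress": compression}
    if bloom_filter_columns:
        props["orc.bloom.filter.columns"] = ",".join(bloom_filter_columns)
    rendered = ", ".join(f'"{key}"="{value}"' for key, value in props.items())
    lines.append(f"TBLPROPERTIES ({rendered})")
    return "\n".join(lines) + ";"


# Session settings: run on Tez and let ORC storage indexes filter row groups.
SESSION_SETTINGS = [
    "SET hive.execution.engine=tez;",
    "SET hive.optimize.index.filter=true;",
]

# Hypothetical table: numeric types for ids and amounts, as advised above.
ddl = orc_table_ddl(
    "daily_events_orc",
    columns=[("event_id", "INT"), ("customer_id", "INT"),
             ("amount", "DECIMAL(18,2)")],
    partition_columns=[("load_date", "STRING")],
    bloom_filter_columns=["customer_id"],
)

print("\n".join(SESSION_SETTINGS))
print(ddl)
```

The intermediate tables would then be filled with an `INSERT ... SELECT` from the existing text-format tables, ideally sorted on the columns used most in joins and filters so that the storage indexes can skip data.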
