I'm really sorry, I posted this to the Spark mailing list by mistake.

Jörn Franke, thanks for your reply. I have many joins and many complex queries, and all of them are table scans, so I think HBase will not work for me.

On Thursday, August 6, 2015, Jörn Franke <jornfra...@gmail.com> wrote:

> Additionally, it is of key importance to use the right data types for the
> columns. Use int for ids, and int, decimal, float or double etc. for
> numeric values. A bad data model that uses varchar and string where they
> are not appropriate is a significant bottleneck.
> Furthermore, include partition columns in join statements (not in where),
> otherwise you do a full table scan ignoring partitions.
>
> On Thu, Aug 6, 2015 at 15:07, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Yes, you should use ORC; it is much faster and more compact. Additionally,
>> you can apply compression (Snappy) to increase performance. Your data
>> processing pipeline does not seem to be very optimized. You should use the
>> newest Hive version, enabling storage indexes and bloom filters on
>> appropriate columns. Ideally you should insert the data sorted
>> appropriately. Partitioning and setting the execution engine to Tez are
>> also beneficial.
>>
>> HBase with Phoenix should currently only be used if you do few joins, not
>> very complex queries and not many full table scans.
>>
>> On Thu, Aug 6, 2015 at 14:54, venkatesh b <venkateshmailingl...@gmail.com> wrote:
>>
>>> Hi, I have two questions.
>>> FIRST:
>>> In our project we use Hive.
>>> We get new data daily. We need to process this new data only once and
>>> send the processed data to an RDBMS. The processing mostly uses many
>>> complex queries with joins, where conditions and grouping functions.
>>> Around 50 intermediate tables are generated while processing. Until
>>> now we have used text format as storage. We came across the ORC file
>>> format. Since the tables are queried only once, I would like to know
>>> whether it is worth storing them in ORC format.
>>>
>>> SECOND:
>>> I came to know about HBase, which is faster.
>>> Can I replace Hive with HBase to make the daily data processing faster?
>>> Currently it takes 15 hours daily with Hive.
>>>
>>> Please let me know if any other information is needed.
>>>
>>> Thanks & regards
>>> Venkatesh
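
For anyone following the ORC advice quoted above, here is a minimal sketch of what the settings might look like. It is Python that only assembles HiveQL text and runs nothing against Hive. The table name `daily_events`, its columns, and the helper `build_orc_ddl` are hypothetical and not from this thread; the property names `orc.compress`, `orc.create.index`, `orc.bloom.filter.columns` and the `hive.execution.engine` setting are the standard Hive/ORC ones, but check them against your Hive version.

```python
def build_orc_ddl(table, columns, partition_cols, bloom_cols):
    """Return HiveQL text for a partitioned, Snappy-compressed ORC table.

    All names passed in are illustrative; nothing is executed against Hive.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_cols)
    props = {
        "orc.compress": "SNAPPY",            # compression codec for ORC files
        "orc.create.index": "true",          # keep the ORC storage indexes
        "orc.bloom.filter.columns": ",".join(bloom_cols),
    }
    prop_text = ",\n  ".join(f"'{k}'='{v}'" for k, v in props.items())
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "STORED AS ORC\n"
        f"TBLPROPERTIES (\n  {prop_text}\n);"
    )


statements = [
    # Run the session on Tez instead of MapReduce.
    "SET hive.execution.engine=tez;",
    # Hypothetical table: typed id and amount columns, partitioned by load date.
    build_orc_ddl(
        table="daily_events",
        columns=[("event_id", "INT"), ("amount", "DECIMAL(18,2)")],
        partition_cols=[("load_date", "STRING")],
        bloom_cols=["event_id"],
    ),
]

print("\n".join(statements))
```

The generated statements can be pasted into a Hive session or script. Whether ORC pays off for tables that are read only once still has to be measured on the actual 15-hour job.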