A caveat here.

An OLTP database such as Oracle or SAP ASE will use indexes for point
queries, in other words when the search is performed via an index scan. In
that case the search will be very fast, because typically only a few blocks
are needed: the index scan yields a RowID pointer into the underlying data
blocks, which is then used to fetch the records from disk.

When an OLAP-type read is required there is less need for an index, as the
optimiser does a serial (full table) scan and that work is reasonably
efficient. As a rule of thumb (if I remember correctly), if the Oracle CBO
estimates that the result set will be more than around 4% of the underlying
rows, it will favour a table scan.
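As an illustrative sketch of that choice (the table and column names here
are hypothetical, not from a real schema), you can see which path the
Oracle CBO picks in the execution plan:

```sql
-- Hypothetical Oracle example. With an index on customer_id, a
-- selective point query typically gets an INDEX RANGE SCAN followed
-- by TABLE ACCESS BY INDEX ROWID:
EXPLAIN PLAN FOR
SELECT * FROM sales WHERE customer_id = 42;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- A predicate the CBO estimates will return a large fraction of the
-- rows (well above a few per cent) will generally be costed as a
-- TABLE ACCESS FULL instead:
EXPLAIN PLAN FOR
SELECT * FROM sales WHERE amount_sold > 0;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```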

The issues with Hive are twofold (excluding the storage index in ORC tables):

1) Hive does not take advantage of indexes (indexes in the conventional
sense) at the moment. Yes, you can even create bitmap indexes on FACT
tables in Hive, but they are not used by the optimiser yet.
0: jdbc:hive2://rhes564:10010/default> show index on sales;
INFO  : OK
+--------------------+-----------+-------------+------------------------------------------+-----------+----------+--+
|      idx_name      | tab_name  |  col_names  |               idx_tab_name               | idx_type  | comment  |
+--------------------+-----------+-------------+------------------------------------------+-----------+----------+--+
| sales_cust_bix     | sales     | cust_id     | oraclehadoop__sales_sales_cust_bix__     | bitmap    |          |
| sales_channel_bix  | sales     | channel_id  | oraclehadoop__sales_sales_channel_bix__  | bitmap    |          |
| sales_prod_bix     | sales     | prod_id     | oraclehadoop__sales_sales_prod_bix__     | bitmap    |          |
| sales_promo_bix    | sales     | promo_id    | oraclehadoop__sales_sales_promo_bix__    | bitmap    |          |
| sales_time_bix     | sales     | time_id     | oraclehadoop__sales_sales_time_bix__     | bitmap    |          |
+--------------------+-----------+-------------+------------------------------------------+-----------+----------+--+
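For completeness, a sketch of how one of the indexes above could have been
created (standard Hive CREATE INDEX DDL; note the index table is only
populated after an explicit rebuild, and the optimiser still ignores it):

```sql
-- Hive bitmap index DDL (this feature was later removed in Hive 3.0).
CREATE INDEX sales_cust_bix
ON TABLE sales (cust_id)
AS 'BITMAP'
WITH DEFERRED REBUILD;

-- Populate the index table:
ALTER INDEX sales_cust_bix ON sales REBUILD;
```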

2) The blocks of a Hive table are not stored sequentially. The issue here
is that HDFS lacks the ability to co-locate blocks, so a table scan in the
conventional RDBMS sense does not really exist. However, I believe there
are plans to start making indexes available to the Hive CBO, in which case
indexes will speed up queries. Alan Gates may have more on this.
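The ORC "storage index" exception mentioned above refers to the per-stripe
min/max statistics (and optional bloom filters) that let the reader skip
data at scan time. A hedged sketch, with an illustrative table definition
(the property and setting names are from Hive's ORC support):

```sql
-- Illustrative ORC table: per-stripe min/max stats are collected
-- automatically; a bloom filter on a column helps point lookups.
CREATE TABLE sales_orc (
  prod_id     INT,
  cust_id     INT,
  amount_sold DECIMAL(10,2)
)
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns' = 'cust_id');

-- Enable predicate push-down into the ORC reader:
SET hive.optimize.index.filter=true;

SELECT SUM(amount_sold) FROM sales_orc WHERE cust_id = 42;
```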

HTH,




Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 20 April 2016 at 13:07, Sabarish Sasidharan <sabarish....@gmail.com>
wrote:

> HBase is very good for direct key-based lookups, and for scans over a
> range of keys (data is sorted by key).
>
> Whereas Hive is not good for seeks (the needle-in-a-haystack problem).
> You can optimize with ORC, stripes, sorting etc., but it is still a
> needle-in-a-haystack problem.
>
> Apache Kylin takes a different approach. It maintains the cubes in HBase
> but routes ad-hoc queries to Hive. So that's one way to see them as
> complementary technologies, each solving the problems relevant to its
> space in an efficient manner.
>
> Regards
> Sab
>
> On Wed, Apr 20, 2016 at 11:20 AM, Amey Barve <ameybarv...@gmail.com>
> wrote:
>
>> Thanks Peyman,
>>
>> Is running and evaluating TPCH queries with HBaseStorageHandler vs
>> Hive's Text format comparable?
>> What is the standard set of queries generally used for performance
>> comparison, What queries did you use above?
>>
>> Regards,
>> Amey
>>
>>
>>
>> On Tue, Apr 19, 2016 at 7:28 PM, Peyman Mohajerian <mohaj...@gmail.com>
>> wrote:
>>
>>> Hi Amey,
>>>
>>> It is about seek vs scan. HBase is great when a rowkey or a range of
>>> rowkeys is part of the where clause: you do a seek, and reading
>>> ORC/Parquet off HDFS would not do better in the absence of an index.
>>> However, that is generally not what you do in a Data Warehouse; you
>>> mostly do scans, e.g. when doing aggregation you aren't looking for
>>> particular record(s). In this case IO throughput (generally) dominates,
>>> because you have to read lots of data; reading large blocks of data and
>>> using header info (predicate push-down) in ORC or Parquet will then be
>>> faster than reading lots of HFiles in HBase. Of course compaction in
>>> HBase can turn the files into larger chunks, but it will still
>>> 'typically' be slower.
>>> I should strongly emphasize that making statements about what is faster
>>> is very dangerous; there can be many exceptions depending on the type
>>> of query and other factors. When I did this test I was using
>>> map/reduce, and with newer engines queries will be faster. Caching in
>>> HBase is also critical: if all your data is cached, you have lots of
>>> memory, and the system isn't busy handling compaction and lots of new
>>> writes, then your read performance in all cases will improve. Always do
>>> your own POC and use your own data to test.
>>>
>>> Thanks,
>>> Peyman
>>>
>>>
>>>
>>> On Tue, Apr 19, 2016 at 2:26 AM, Amey Barve <ameybarv...@gmail.com>
>>> wrote:
>>>
>>>> Hi Peyman,
>>>>
>>>> You say: "you can use Hive storage handler to read data from HBase the
>>>> performance would be lower than reading from HDFS directly for analytic."
>>>> Why is that so? Is it slower compared to the ORC, Parquet, and even
>>>> Text file formats?
>>>>
>>>> Regards,
>>>> Amey
>>>>
>>>> On Tue, Apr 19, 2016 at 4:32 AM, Peyman Mohajerian <mohaj...@gmail.com>
>>>> wrote:
>>>>
>>>>> HBase can handle high read/write throughput, e.g. IoT use cases. It
>>>>> is not an analytic engine; even though you can use the Hive storage
>>>>> handler to read data from HBase, the performance would be lower than
>>>>> reading from HDFS directly for analytics. But HBase has an index, the
>>>>> rowkey, and you can add secondary indexes, usually with Elasticsearch
>>>>> or other means. You can also run Phoenix over HBase to do analytics,
>>>>> but again only if your data collection/use case mandates HBase, e.g.
>>>>> small amounts of data from millions of devices. It is common to copy
>>>>> data from HBase to HDFS (even though HBase is sitting on top of HDFS)
>>>>> as ORC/Parquet for very large analytics. But again you do have the
>>>>> choice of using Phoenix or Hive to run analytics over HBase if you
>>>>> don't want to pay the cost of data copying.
>>>>> HBase can only be part of a DW solution in a limited way, e.g. as an
>>>>> index to data in HDFS, partition discovery, etc. Pretty soon it will
>>>>> be the metadata store for Hive (optional, instead of an RDBMS). HBase
>>>>> can sit on the edge of the DW to collect fast-landing data.
>>>>> I don't see any competition between Hive and HBase; they work
>>>>> together, and I don't see a modern DW having a monolithic engine:
>>>>> Tez+Spark+MPP+...
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 18, 2016 at 3:51 PM, Marcin Tustin <mtus...@handybook.com>
>>>>> wrote:
>>>>>
>>>>>> We use a Hive with ORC setup now. Queries may take thousands of
>>>>>> seconds with joins, and potentially tens of seconds with selects on
>>>>>> very large tables.
>>>>>>
>>>>>> My understanding is that the goal of HBase is to provide much lower
>>>>>> latency for queries. Obviously, this comes at the cost of not being
>>>>>> able to perform joins. I don't actually use HBase, so I hesitate to
>>>>>> say more about it.
>>>>>>
>>>>>> On Mon, Apr 18, 2016 at 6:48 PM, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Marcin.
>>>>>>>
>>>>>>> What is the definition of low latency here? Are you referring to
>>>>>>> the performance of SQL against HBase tables compared to Hive? As I
>>>>>>> understand it, HBase is a column-family database. Would it be
>>>>>>> possible to use Hive against ORC to achieve the same?
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn:
>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 18 April 2016 at 23:43, Marcin Tustin <mtus...@handybook.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> HBase has a different use case - it's for low-latency querying of
>>>>>>>> big tables. If you combined it with Hive, you might have something 
>>>>>>>> nice for
>>>>>>>> certain queries, but I wouldn't think of them as direct competitors.
>>>>>>>>
>>>>>>>> On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I notice that Impala is rarely mentioned these days. I may be
>>>>>>>>> missing something; however, I gather it is coming to an end now,
>>>>>>>>> as I don't recall many use cases for it (or customers asking for
>>>>>>>>> it). In contrast, Hive has held its ground with the new addition
>>>>>>>>> of Spark and Tez as execution engines, support for ACID and ORC,
>>>>>>>>> and new stuff in Hive 2. In addition, provided a good choice is
>>>>>>>>> made for its metastore, it scales well.
>>>>>>>>>
>>>>>>>>> If Hive had the (organic) ability to support local variables and
>>>>>>>>> stored procedures, then it would be a top-notch Data Warehouse.
>>>>>>>>> Given its metastore, I don't see any technical reason why it
>>>>>>>>> cannot support these constructs.
>>>>>>>>>
>>>>>>>>> I was recently asked to comment on migration from commercial DWs
>>>>>>>>> to Big Data (primarily for TCO reasons) and really could not
>>>>>>>>> recall any better candidate than Hive. Is HBase a viable
>>>>>>>>> alternative? Obviously whatever one decides, there is still HDFS,
>>>>>>>>> a good engine for Hive (it sounds like many prefer Tez, although
>>>>>>>>> I am a Spark fan) and the ubiquitous YARN.
>>>>>>>>>
>>>>>>>>> Let me know your thoughts.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> LinkedIn:
>>>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Want to work at Handy? Check out our culture deck and open roles
>>>>>>>> <http://www.handy.com/careers>
>>>>>>>> Latest news <http://www.handy.com/press> at Handy
>>>>>>>> Handy just raised $50m
>>>>>>>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>>>>>>>>  led
>>>>>>>> by Fidelity
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
