Re: Hive footprint

Amey Barve Tue, 19 Apr 2016 22:51:58 -0700

Thanks Peyman,

Is running and evaluating TPC-H queries with HBaseStorageHandler vs. Hive's
Text format comparable?
What is the standard set of queries generally used for performance
comparison? What queries did you use above?


Regards,
Amey



On Tue, Apr 19, 2016 at 7:28 PM, Peyman Mohajerian <mohaj...@gmail.com>
wrote:

> Hi Amey,
>
> It is about seek vs. scan. HBase is great when a rowkey or a range of
> rowkeys is part of the WHERE clause: then you do a seek, and reading
> ORC/Parquet off HDFS would not do better in the absence of an index.
> However, for a Data Warehouse that is generally not what you do; you mostly
> scan, e.g. when aggregating you aren't looking for particular records. In
> this case the IO throughput dominates (generally), because you have to read
> lots of data, so reading large blocks of data and using header info
> (predicate push-down) in ORC or Parquet will be faster than reading
> lots of HFiles in HBase. Of course compaction in HBase can turn the files
> into larger chunks, but still 'typically' it will be slower.
> I should strongly emphasize that making statements about what is faster or
> not is very dangerous; there could be many exceptions depending on the type
> of query and other factors. When I did this test I was using map/reduce, and
> with newer engines queries will be faster. Also, caching in HBase is
> critical: if all your data is cached, you have lots of memory, and the
> system isn't busy handling compaction and lots of new writes, then your read
> performance in all cases will improve. Always do your own POC and use your
> own data to test.
>
> Thanks,
> Peyman
>
>
>
> On Tue, Apr 19, 2016 at 2:26 AM, Amey Barve <ameybarv...@gmail.com> wrote:
>
>> Hi Peyman,
>>
>> You say: "you can use Hive storage handler to read data from HBase the
>> performance would be lower than reading from HDFS directly for analytic."
>> Why is that? Is it slow compared to ORC, Parquet, and even the Text file
>> format?
>>
>> Regards,
>> Amey
>>
>> On Tue, Apr 19, 2016 at 4:32 AM, Peyman Mohajerian <mohaj...@gmail.com>
>> wrote:
>>
>>> HBase can handle high read/write throughput, e.g. IoT use cases. It is
>>> not an analytic engine: even though you can use the Hive storage handler to
>>> read data from HBase, the performance would be lower than reading from HDFS
>>> directly for analytics. But HBase has an index, the rowkey, and you can add
>>> a secondary index, usually with Elasticsearch or other means. You can also
>>> run Phoenix over HBase to do analytics, but again only if your data
>>> collection/use case mandates HBase, e.g. small amounts of data from millions
>>> of devices. It is common to copy data from HBase to HDFS (even though HBase
>>> is sitting on top of HDFS) as ORC/Parquet for very large analytics. But
>>> again, you do have the choice of using Phoenix or Hive to run analytics over
>>> HBase if you don't want to pay the cost of data copying.
>>> HBase can only be part of a DW solution in a limited way, e.g. as an index
>>> to data in HDFS, partition discovery, etc. Pretty soon it will be the
>>> metadata store for Hive (optionally, instead of an RDBMS). HBase can sit on
>>> the edge of the DW to collect fast-landing data.
>>> I don't see any competition between Hive and HBase; they work together, and
>>> I don't see a modern DW having a monolithic engine: Tez+Spark+MPP+...
>>>
>>>
>>>
>>> On Mon, Apr 18, 2016 at 3:51 PM, Marcin Tustin <mtus...@handybook.com>
>>> wrote:
>>>
>>>> We use a Hive with ORC setup now. Queries may take thousands of seconds
>>>> with joins, and potentially tens of seconds with selects on very large
>>>> tables.
>>>>
>>>> My understanding is that the goal of HBase is to provide much lower
>>>> latency for queries. Obviously, this comes at the cost of not being able to
>>>> perform joins. I don't actually use HBase, so I hesitate to say more about
>>>> it.
>>>>
>>>> On Mon, Apr 18, 2016 at 6:48 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Thanks Marcin.
>>>>>
>>>>> What is the definition of low latency here? Are you referring to the
>>>>> performance of SQL against HBase tables compared to Hive? As I understand
>>>>> it, HBase is a columnar database. Would it be possible to use Hive against
>>>>> ORC to achieve the same?
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn:
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 18 April 2016 at 23:43, Marcin Tustin <mtus...@handybook.com>
>>>>> wrote:
>>>>>
>>>>>> HBase has a different use case - it's for low-latency querying of big
>>>>>> tables. If you combined it with Hive, you might have something nice for
>>>>>> certain queries, but I wouldn't think of them as direct competitors.
>>>>>>
>>>>>> On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I notice that Impala is rarely mentioned these days. I may be
>>>>>>> missing something. However, I gather it is coming to an end now, as I
>>>>>>> don't recall many use cases for it (or customers asking for it). In
>>>>>>> contrast, Hive has held its ground with the new addition of Spark and Tez
>>>>>>> as execution engines, support for ACID and ORC, and new stuff in Hive 2.
>>>>>>> In addition, provided a good choice is made for its metastore, it scales
>>>>>>> well.
>>>>>>>
>>>>>>> If Hive had the (organic) ability to support local variables and stored
>>>>>>> procedures, then it would be a top-notch Data Warehouse. Given its
>>>>>>> metastore, I don't see any technical reason why it cannot support these
>>>>>>> constructs.
>>>>>>>
>>>>>>> I was recently asked to comment on migration from commercial DWs to
>>>>>>> Big Data (primarily for TCO reasons) and really could not recall any
>>>>>>> better candidate than Hive. Is HBase a viable alternative? Obviously,
>>>>>>> whatever one decides, there is still HDFS, a good engine for Hive (sounds
>>>>>>> like many prefer Tez, although I am a Spark fan) and the ubiquitous YARN.
>>>>>>>
>>>>>>> Let me know your thoughts.
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
