Re: Hive footprint

2016-04-25 Thread Mich Talebzadeh
Hi Naveen, Thank you for your detailed explanation. Please allow me to explain my points, if I may. I think a viable solution for a big data stack will encompass (again, this is my view) Spark with Hive, HDFS and YARN as a winning combination. Hadoop encompasses HDFS and it is almost impossibl

Re: Hive footprint

2016-04-25 Thread Naveen Gangam
Hi Mich, I am a developer at Cloudera and contribute to Apache Hive. Hive and MPP query engine projects like Impala have settled into their respective positions so there is less confusion between these projects. For example, across Cloudera's customer base the majority of customers use Impala to

Re: Hive footprint

2016-04-21 Thread Mich Talebzadeh
This simply does not work, but we need to make Hive use external indexes. This is a must. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://t

Re: Hive footprint

2016-04-20 Thread Mich Talebzadeh
Hi, If I may, I would also like to see where the Hive optimizer shows that it is used with explain ... or other means. It will be interesting. HTH

Re: Hive footprint

2016-04-20 Thread Marcin Tustin
Could you expand on this? This sounds like something that would be great to know, and probably fold into the wiki. On Wed, Apr 20, 2016 at 11:57 AM, Jörn Franke wrote: > Hive has working indexes. However many people overlook that a block is > usually much larger than in a relational database and

Re: Hive footprint

2016-04-20 Thread Jörn Franke
Hive has working indexes. However, many people overlook that a block is usually much larger than in a relational database, and thus do not use them correctly. > On 19 Apr 2016, at 09:31, Mich Talebzadeh wrote: > > The issue is that Hive has indexes (not index store) but they don't work so > there we

Re: Hive footprint

2016-04-20 Thread Jörn Franke
It really depends on what you want to do. Hive is more for queries involving a lot of data, whereas HBase+Phoenix is more for OLTP scenarios or sensor ingestion. I think the reason is that Hive has been the entry point for many engines and formats. Additionally, there are a lot of tuning capabilities fr

Re: Hive footprint

2016-04-20 Thread Mich Talebzadeh
A caveat here. An OLTP database such as Oracle or SAP ASE will use indexes for point queries, in other words when the search is via an index scan. In that case the search will be very fast, because typically few blocks are needed when using an index scan and a RowID pointer to the underlying data blo

Re: Hive footprint

2016-04-20 Thread Sabarish Sasidharan
HBase is very good for direct key-based lookups, and for scans over a range of keys (data is sorted by key), whereas Hive is not good for seeks (the needle-in-a-haystack problem). You can optimize with ORC, stripes, sorting etc., but it is still a needle-in-a-haystack problem. Apache Kyl

Re: Hive footprint

2016-04-19 Thread Amey Barve
Thanks Peyman, Is running and evaluating TPC-H queries with HBaseStorageHandler vs Hive's Text format comparable? What is the standard set of queries generally used for performance comparison? What queries did you use above? Regards, Amey On Tue, Apr 19, 2016 at 7:28 PM, Peyman Mohajerian wrot

Re: Hive footprint

2016-04-19 Thread Peyman Mohajerian
Hi Amey, It is about seek vs scan. HBase is great when a rowkey or a range of rowkeys is part of the WHERE clause: then you do a seek, and reading ORC/Parquet off HDFS would not do better in the absence of an index. However, for a data warehouse that is generally not what you do; you mostly do scans, e

Re: Hive footprint

2016-04-19 Thread Amey Barve
Hi Peyman, You say: "you can use Hive storage handler to read data from HBase the performance would be lower than reading from HDFS directly for analytic." Why is it so? Is it slow as compared to ORC, Parquet, and even Text file format? Regards, Amey On Tue, Apr 19, 2016 at 4:32 AM, Peyman Mohaj

Re: Hive footprint

2016-04-19 Thread Mich Talebzadeh
BTW what is the situation with Impala? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 19 April 2016 a

Re: Hive footprint

2016-04-19 Thread Mich Talebzadeh
The issue is that Hive has indexes (not an index store) but they don't work, so there we go. Maybe in later releases we can make use of these indexes for faster queries. Hive even allows bitmap indexes on a fact table, but they are never used by the CBO. show indexes on sales; +---+

Re: Hive footprint

2016-04-18 Thread Alan Gates
> On Apr 18, 2016, at 15:34, Mich Talebzadeh wrote: > > Hi, > > > If Hive had the ability (organic) to have local variable and stored procedure > support then it would be top notch Data Warehouse. Given its metastore, I > don't see any technical reason why it cannot support these constructs.

Re: Hive footprint

2016-04-18 Thread Peyman Mohajerian
HBase can handle high read/write throughput, e.g. IoT use cases. It is not an analytic engine; even though you can use Hive storage handler to read data from HBase the performance would be lower than reading from HDFS directly for analytic. But HBase has an index, the rowkey, and you can add secondary inde

Re: Hive footprint

2016-04-18 Thread Marcin Tustin
We use a Hive-with-ORC setup now. Queries may take thousands of seconds with joins, and potentially tens of seconds with selects on very large tables. My understanding is that the goal of HBase is to provide much lower latency for queries. Obviously, this comes at the cost of not being able to per

Re: Hive footprint

2016-04-18 Thread Mich Talebzadeh
Thanks Marcin. What is the definition of low latency here? Are you referring to the performance of SQL against HBase tables compared to Hive? As I understand it, HBase is a columnar database. Would it be possible to use Hive against ORC to achieve the same?

Re: Hive footprint

2016-04-18 Thread Marcin Tustin
HBase has a different use case - it's for low-latency querying of big tables. If you combined it with Hive, you might have something nice for certain queries, but I wouldn't think of them as direct competitors. On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh wrote: > Hi, > > I notice that Impal

Hive footprint

2016-04-18 Thread Mich Talebzadeh
Hi, I notice that Impala is rarely mentioned these days. I may be missing something. However, I gather it is coming to an end now, as I don't recall many use cases for it (or customers asking for it). In contrast, Hive has held its ground with the new addition of Spark and Tez as execution engines, s