Hi Michael,

"If there is predicate pushdown, then you will be faster, assuming that the query triggers an implied range scan"

---> Does this bring results faster than plain Hive querying over ORC / Text file formats?
In other words, is querying over plain Hive (ORC or Text) *always* faster than through HiveStorageHandler?

Regards,
Amey

On 9 June 2017 at 15:08, Michael Segel <msegel_had...@hotmail.com> wrote:

> The pro is that you have the ability to update a table without having to worry about duplication of the row. Tez is doing some form of compaction for you that already exists in HBase.
>
> The cons:
>
> 1) It's slower. Reads from HBase have more overhead than just reading a file. Read Lars George’s book on what takes place when you do a read.
>
> 2) HBase is not a relational store. (You have to think about what that implies.)
>
> 3) You need to query against your row key for best performance; otherwise it will always be a complete table scan.
>
> HBase was designed to give you fast access for direct get() and limited range scans. Otherwise you have to perform full table scans. This means that unless you’re able to do a range scan, your full table scan will be slower than if you did this on a flat file set. Again, the reason you would want to use HBase is if your data set is mutable.
>
> You also have to trigger a range scan when you write your Hive query, and you have to make sure that you’re querying off your row key.
>
> HBase was designed as a <key,value> store. Plain and simple. If you don’t use the key, you have to do a full table scan. So even though you are partitioning on row key, you never use your partitions. However, in Hive or Spark, you can create an alternative partition pattern (e.g. your key is the transaction_id, yet you partition on the month/year portion of the transaction_date).
>
> You can speed things up a little by using an inverted table as a secondary index. However, this assumes that you want to use joins. If you have a single base table with no joins, then you can limit your range scans by making sure you are querying against the row key. Note: this will mean that you have limited querying capabilities.
>
> And yes, I’ve done this before but can’t share it with you.
>
> HTH
>
> P.S.
> I haven’t tried Hive queries where you have what would be the equivalent of a get().
>
> In earlier versions of Hive, the issue would be that “SELECT * FROM foo where rowkey=BAR” would still do a full table scan because of the lack of predicate pushdown.
> This may have been fixed in later releases of Hive. That would be your test case. If there is predicate pushdown, then you will be faster, assuming that the query triggers an implied range scan.
> This would be a simple thing. However, keep in mind that you’re going to generate a map/reduce job (unless using a query engine like Tez) where you wouldn’t if you just wrote your code in Java.
>
>
> On Jun 7, 2017, at 5:13 AM, Ramasubramanian Narayanan <ramasubramanian.naraya...@gmail.com> wrote:
> >
> > Hi,
> >
> > Can you please let us know the pros and cons of using an HBase table as an external table in Hive?
> >
> > Will there be any performance degradation when using Hive over HBase instead of using a direct Hive table?
> >
> > The table that I am planning to use in HBase will be a master table like account or customer. I want to achieve a Slowly Changing Dimension. Please shed some light on that too if you have done any such implementations.
> >
> > Thanks and Regards,
> > Rams
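
For anyone reading this thread later: the row-key point above (a range scan on the key touches only the matching slice, while a filter on any other column has to touch every row) can be illustrated with a toy model. This is only a sketch in plain Python, not HBase or Hive code; the names `rows`, `range_scan` and `full_scan`, the `txnNNNNN` key format and the sample data are all made up for illustration, and it says nothing about actual timings against ORC or text files.

```python
from bisect import bisect_left

# Hypothetical toy store: rows kept sorted by row key, loosely mimicking
# a key-ordered <key,value> store. The data is invented for this sketch.
rows = sorted(
    (f"txn{i:05d}", {"month": f"2017-{(i % 12) + 1:02d}", "amount": i})
    for i in range(10_000)
)
keys = [k for k, _ in rows]


def range_scan(start_key, stop_key):
    """Return (matching rows, rows touched) for start_key <= key < stop_key."""
    lo = bisect_left(keys, start_key)  # binary search for the first key in range
    hi = bisect_left(keys, stop_key)   # binary search for the first key past the range
    return rows[lo:hi], hi - lo


def full_scan(predicate):
    """Return (matching rows, rows touched) by testing every row's value."""
    hits = [(k, v) for k, v in rows if predicate(v)]
    return hits, len(rows)


# Querying on the row key: only the slice between the two keys is read.
by_key, touched_key = range_scan("txn00100", "txn00200")

# Querying on a non-key column: every row has to be inspected.
by_col, touched_col = full_scan(lambda v: v["month"] == "2017-06")

print("range scan:", len(by_key), "hits,", touched_key, "rows touched")
print("full scan: ", len(by_col), "hits,", touched_col, "rows touched")
```

Whether a given Hive query over the storage handler actually turns into the first case depends on predicate pushdown in the Hive release in use, which is the test case suggested above.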