[ https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073958#comment-14073958 ]
Carter Shanklin commented on HIVE-6584:
---------------------------------------
I tested HIVE-6584.12.patch on a 20-node cluster to see what sort of
performance gains might be expected.
I did a YCSB load of 180 million rows and ran a few simple SQL queries in Hive
while simultaneously running a YCSB 32-thread workload.
TL;DR: the snapshot approach provides a nice performance boost of about 2.5x
(between 1.98x and 3.02x) across different types of queries. The more fields a
query read, the larger the speedup.
|Query|Run|Workload|Snapshot Time (s)|Direct Time (s)|Speedup (Direct / Snapshot)|
|count(*)|1|a|191.019|488.915|2.56x|
|count(*)|2|a|200.641|480.837|2.40x|
|Aggregate 1 field|1|a|214.452|499.304|2.33x|
|Aggregate 1 field|2|a|217.744|500.07|2.30x|
|Aggregate 9 fields|1|a|281.514|802.799|2.85x|
|Aggregate 9 fields|2|a|272.358|785.816|2.89x|
|Aggregate 1 with GBY|1|a|248.874|558.143|2.24x|
|Aggregate 1 with GBY|2|a|269.658|533.562|1.98x|
|count(*)|1|b|194.739|482.261|2.48x|
|count(*)|2|b|195.178|481.437|2.47x|
|Aggregate 1 field|1|b|220.325|498.956|2.26x|
|Aggregate 1 field|2|b|227.117|489.27|2.15x|
|Aggregate 9 fields|1|b|276.939|817.118|2.95x|
|Aggregate 9 fields|2|b|290.288|876.753|3.02x|
|Aggregate 1 with GBY|1|b|244.025|563.884|2.31x|
|Aggregate 1 with GBY|2|b|225.431|570.723|2.53x|
|count(*)|1|c|194.568|502.79|2.58x|
|count(*)|2|c|205.418|508.319|2.47x|
|Aggregate 1 field|1|c|209.709|531.39|2.53x|
|Aggregate 1 field|2|c|217.551|526.878|2.42x|
|Aggregate 9 fields|1|c|267.93|756.476|2.82x|
|Aggregate 9 fields|2|c|273.107|723.459|2.65x|
|Aggregate 1 with GBY|1|c|240.991|526.053|2.18x|
|Aggregate 1 with GBY|2|c|258.06|527.845|2.05x|
For those not familiar with YCSB, it uses a table with 9 fields, each filled
with 100 characters of random junk. It defines workloads A-F, of which I've
used A-C.
The main point to note is that the more fields a query fetches, the more it
benefits from snapshot mode: aggregating 1 field averaged about 2.3x, while
aggregating 9 fields averaged about 2.9x.
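As a rough cross-check of the table above, the speedup column can be recomputed
as direct time divided by snapshot time and then averaged per query type. This
is only a sketch: the timings are copied from the table, and the names `RUNS`,
`speedup` and `mean_speedup_by_query` are illustrative, not part of the patch.

```python
from collections import defaultdict
from statistics import mean

# Timings copied from the table: (query, workload, snapshot_s, direct_s).
RUNS = [
    ("count(*)", "a", 191.019, 488.915),
    ("count(*)", "a", 200.641, 480.837),
    ("Aggregate 1 field", "a", 214.452, 499.304),
    ("Aggregate 1 field", "a", 217.744, 500.07),
    ("Aggregate 9 fields", "a", 281.514, 802.799),
    ("Aggregate 9 fields", "a", 272.358, 785.816),
    ("Aggregate 1 with GBY", "a", 248.874, 558.143),
    ("Aggregate 1 with GBY", "a", 269.658, 533.562),
    ("count(*)", "b", 194.739, 482.261),
    ("count(*)", "b", 195.178, 481.437),
    ("Aggregate 1 field", "b", 220.325, 498.956),
    ("Aggregate 1 field", "b", 227.117, 489.27),
    ("Aggregate 9 fields", "b", 276.939, 817.118),
    ("Aggregate 9 fields", "b", 290.288, 876.753),
    ("Aggregate 1 with GBY", "b", 244.025, 563.884),
    ("Aggregate 1 with GBY", "b", 225.431, 570.723),
    ("count(*)", "c", 194.568, 502.79),
    ("count(*)", "c", 205.418, 508.319),
    ("Aggregate 1 field", "c", 209.709, 531.39),
    ("Aggregate 1 field", "c", 217.551, 526.878),
    ("Aggregate 9 fields", "c", 267.93, 756.476),
    ("Aggregate 9 fields", "c", 273.107, 723.459),
    ("Aggregate 1 with GBY", "c", 240.991, 526.053),
    ("Aggregate 1 with GBY", "c", 258.06, 527.845),
]


def speedup(snapshot_s, direct_s):
    """How many times faster the snapshot run finished than the direct run."""
    return direct_s / snapshot_s


def mean_speedup_by_query(runs):
    """Average the per-run speedup over all runs and workloads of each query."""
    by_query = defaultdict(list)
    for query, _workload, snapshot_s, direct_s in runs:
        by_query[query].append(speedup(snapshot_s, direct_s))
    return {query: mean(factors) for query, factors in by_query.items()}


if __name__ == "__main__":
    for query, factor in mean_speedup_by_query(RUNS).items():
        print(f"{query}: {factor:.2f}x")
```

The sketch averages the per-run ratios rather than dividing total times, so
each run counts equally regardless of how long it took.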
The other thing I measured was throughput as reported by the YCSB tool.
Throughput was higher in every workload when the query ran over a snapshot: by
almost half for workloads a and b, and by about 8% for workload c.
|Workload|Throughput (Snapshot)|Throughput (Direct)|Throughput Improvement (Snapshot)|
|a|83443.11623|56267.34148|48.30%|
|b|45709.15011|31224.30376|46.39%|
|c|46634.58415|43224.86383|7.89%|
The throughput when using the snapshot seems to be close to the throughput when
not scanning data, but I didn't run the baseline tests long enough to get
anything conclusive here.
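For reference, the improvement column in the throughput table appears to be
the relative gain of the snapshot figure over the direct figure. A minimal
sketch of that arithmetic follows; the figures are copied from the table, and
the names `THROUGHPUT` and `improvement_pct` are illustrative.

```python
# YCSB throughput copied from the table: workload -> (snapshot, direct).
THROUGHPUT = {
    "a": (83443.11623, 56267.34148),
    "b": (45709.15011, 31224.30376),
    "c": (46634.58415, 43224.86383),
}


def improvement_pct(snapshot, direct):
    """Relative throughput gain of the snapshot run over the direct run, in percent."""
    return (snapshot / direct - 1.0) * 100.0


if __name__ == "__main__":
    for workload, (snapshot, direct) in THROUGHPUT.items():
        print(f"workload {workload}: {improvement_pct(snapshot, direct):.2f}%")
```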
In any event this looks like a good patch, especially considering its small
size.
The numbers quoted here are for reference only, YMMV, etc.
> Add HiveHBaseTableSnapshotInputFormat
> -------------------------------------
>
> Key: HIVE-6584
> URL: https://issues.apache.org/jira/browse/HIVE-6584
> Project: Hive
> Issue Type: Improvement
> Components: HBase Handler
> Reporter: Nick Dimiduk
> Assignee: Nick Dimiduk
> Fix For: 0.14.0
>
> Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch,
> HIVE-6584.10.patch, HIVE-6584.11.patch, HIVE-6584.12.patch,
> HIVE-6584.2.patch, HIVE-6584.3.patch, HIVE-6584.4.patch, HIVE-6584.5.patch,
> HIVE-6584.6.patch, HIVE-6584.7.patch, HIVE-6584.8.patch, HIVE-6584.9.patch
>
>
> HBASE-8369 provided mapreduce support for reading from HBase table snapshots.
> This allows a MR job to consume a stable, read-only view of an HBase table
> directly off of HDFS. Bypassing the online region server API provides a nice
> performance boost for the full scan. HBASE-10642 is backporting that feature
> to 0.94/0.96 and also adding a {{mapred}} implementation. Once that's
> available, we should add an input format. A follow-on patch could work out
> how to integrate this functionality into the StorageHandler, similar to how
> HIVE-6473 integrates the HFileOutputFormat into existing table definitions.