I would guess that Hive will always be capable of out-matching what
HBase/Phoenix can do for this type of workload (bulk transformation).
That said, I'm not ready to tell you that you can't get better
performance out of the Phoenix-Spark integration. See the other thread
where you provided more details.
It's important to remember that Phoenix is designed to shine when you
have workloads which require updates to a single row/column. The
underlying I/O system in HBase is very different from Hive's, in order
to serve the random-update use case.
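For reference, the table copy in your test was presumably done along these lines (a minimal pyspark sketch; the function name, zkUrl, table names, and write mode are my assumptions, not taken from your message, and it of course needs a live cluster with the phoenix-spark jar on the classpath):

```python
def copy_phoenix_to_hive(zk_url, phoenix_table, hive_table):
    """Sketch of the presumed benchmark: bulk-read a Phoenix table
    through the Phoenix-Spark connector and write it into Hive.
    All names here are illustrative."""
    # Deferred import: requires pyspark and a running cluster.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("phoenix-to-hive-copy")
             .enableHiveSupport()
             .getOrCreate())

    # Phoenix 4.x Spark integration: full-table read via the
    # DataSource API, parallelized across HBase regions.
    df = (spark.read
          .format("org.apache.phoenix.spark")
          .option("table", phoenix_table)
          .option("zkUrl", zk_url)
          .load())

    # Plain bulk write into a Hive table -- no per-row upserts.
    df.write.mode("overwrite").saveAsTable(hive_table)
```

Note that the read side scans through the HBase region servers rather than reading files directly, which is one place the two storage layers diverge for a scan-heavy job like this.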
On 3/7/18 4:08 AM, Stepan Migunov wrote:
Some more details... We have done some simple tests to compare the
read/write performance of Spark+Hive and Spark+Phoenix, with the
following results.
Copying a table (no transformations, about 800 million records):
Hive (TEZ) - 752 sec
Spark:
From Hive to Hive - 2463 sec
From Phoenix to Hive - 13310 sec
From Hive to Phoenix - > 30240 sec
We use Spark 2.2.1, HBase 1.1.2, Phoenix 4.13, Hive 2.1.1.
So it seems that Spark + Phoenix leads to severe performance
degradation. Any thoughts?
On 2018/03/04 11:08:56, Stepan Migunov <stepan.migu...@firstlinesoftware.com>
wrote:
In our software we need to combine fast interactive access to the data
with quite complex data processing. I know that Phoenix is intended for
fast access, but I hoped that I would also be able to use Phoenix as a
source for complex processing with Spark. Unfortunately, Phoenix + Spark
shows very poor performance. E.g., querying a big table (about a billion
records) with DISTINCT takes about 2 hours, while the same task with a
Hive source takes a few minutes. Is this expected? Does it mean that
Phoenix is simply not suitable for batch processing with Spark, and that
I should duplicate the data to Hive and process it there?
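The DISTINCT comparison looks roughly like this in pyspark (a sketch only; zkUrl, table, and column names are placeholders, and it requires a cluster with the phoenix-spark jar):

```python
def count_distinct(zk_url, phoenix_table, hive_table, column):
    """Illustrative comparison: the same DISTINCT count against a
    Phoenix-backed DataFrame and a Hive table. Names are assumed."""
    # Deferred import: requires pyspark and a running cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Phoenix source: scanned through the HBase region servers.
    phoenix_df = (spark.read
                  .format("org.apache.phoenix.spark")
                  .option("table", phoenix_table)
                  .option("zkUrl", zk_url)
                  .load())
    phoenix_count = phoenix_df.select(column).distinct().count()

    # Hive source: scanned directly from the underlying files.
    hive_count = (spark.table(hive_table)
                  .select(column).distinct().count())

    return phoenix_count, hive_count
```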