I would guess that Hive will always be capable of out-matching what
HBase/Phoenix can do for this type of workload (bulk transformation).
That said, I'm not ready to tell you that you can't get better
performance out of the Phoenix-Spark integration. See the other thread
where you provided more details.
It's important to remember that Phoenix is designed to shine when you
have workloads which require updates to a single row/column. The
underlying I/O system in HBase is very different from Hive's, in order
to serve the random-update use case.
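For reference, the table copy in your test was presumably done along these lines (a minimal pyspark sketch; the function name, zkUrl, table names, and write mode are my assumptions, not taken from your message, and it of course needs a live cluster with the phoenix-spark jar on the classpath):

```python
def copy_phoenix_to_hive(zk_url, phoenix_table, hive_table):
    """Sketch of the presumed benchmark: bulk-read a Phoenix table
    through the Phoenix-Spark connector and write it into Hive.
    All names here are illustrative."""
    # Deferred import: requires pyspark and a running cluster.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("phoenix-to-hive-copy")
             .enableHiveSupport()
             .getOrCreate())

    # Phoenix 4.x Spark integration: full-table read via the
    # DataSource API, parallelized across HBase regions.
    df = (spark.read
          .format("org.apache.phoenix.spark")
          .option("table", phoenix_table)
          .option("zkUrl", zk_url)
          .load())

    # Plain bulk write into a Hive table -- no per-row upserts.
    df.write.mode("overwrite").saveAsTable(hive_table)
```

Note that the read side scans through the HBase region servers rather than reading files directly, which is one place the two storage layers diverge for a scan-heavy job like this.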
On 3/7/18 4:08 AM, Stepan Migunov wrote:
Some more details... We have done some simple tests to compare the
read/write performance of Spark+Hive and Spark+Phoenix, with the
following results.
Copying a table (no transformations, about 800 million records):
Hive (TEZ) - 752 sec
Spark:
From Hive to Hive - 2463 sec
From Phoenix to Hive - 13310 sec
From Hive to Phoenix - > 30240 sec
We use Spark 2.2.1, HBase 1.1.2, Phoenix 4.13, Hive 2.1.1.
So it seems that Spark + Phoenix leads to severe performance
degradation. Any thoughts?
On 2018/03/04 11:08:56, Stepan Migunov <stepan.migu...@firstlinesoftware.com>
wrote:
In our software we need to combine fast interactive access to the data
with quite complex data processing. I know that Phoenix is intended for
fast access, but I hoped that I would also be able to use Phoenix as a
source for complex processing with Spark. Unfortunately, Phoenix + Spark
shows very poor performance. E.g., querying a big table (about a billion
records) with DISTINCT takes about 2 hours, while the same task with a
Hive source takes a few minutes. Is this expected? Does it mean that
Phoenix is simply not suitable for batch processing with Spark, and that
I should duplicate the data to Hive and process it there?
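The DISTINCT comparison looks roughly like this in pyspark (a sketch only; zkUrl, table, and column names are placeholders, and it requires a cluster with the phoenix-spark jar):

```python
def count_distinct(zk_url, phoenix_table, hive_table, column):
    """Illustrative comparison: the same DISTINCT count against a
    Phoenix-backed DataFrame and a Hive table. Names are assumed."""
    # Deferred import: requires pyspark and a running cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Phoenix source: scanned through the HBase region servers.
    phoenix_df = (spark.read
                  .format("org.apache.phoenix.spark")
                  .option("table", phoenix_table)
                  .option("zkUrl", zk_url)
                  .load())
    phoenix_count = phoenix_df.select(column).distinct().count()

    # Hive source: scanned directly from the underlying files.
    hive_count = (spark.table(hive_table)
                  .select(column).distinct().count())

    return phoenix_count, hive_count
```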