Hi Stepan,
Can you give a better ballpark of the Phoenix-Spark performance you've seen
(e.g. how much hardware you have, how many Spark executors you used, how
many RegionServers)? Also, what software versions are you using?
I don't think there are any firm guidelines on how to solve this
problem, but you've already found the tools available to you:
* You can try Phoenix+Spark to run over the Phoenix tables in place (see the sketch after this list)
* You can use Phoenix+Hive to offload the data into Hive for queries
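For reference, a minimal sketch of the first option using the phoenix-spark
DataSource API. The table name, column, and ZooKeeper URL below are
placeholders, and the exact API depends on your Phoenix and Spark versions:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("phoenix-spark-sketch")
    .getOrCreate()

  // Load the Phoenix table as a DataFrame. The connector handles column
  // pruning and simple filter pushdown, but heavier operations such as
  // distinct still execute on the Spark side.
  val df = spark.read
    .format("org.apache.phoenix.spark")
    .option("table", "TABLE1")
    .option("zkUrl", "zkhost:2181")
    .load()

  df.select("COL1").distinct().count()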
If Phoenix-Spark wasn't fast enough, I'd imagine that querying the data
through the Phoenix-Hive integration would be similarly slow.
It's possible that the bottleneck is something we could fix in the
integration, or something that could be addressed in the configuration of
Spark and/or Phoenix. We'd need your help to quantify this better :)
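One quick experiment that would help narrow this down: run the same DISTINCT
directly through the Phoenix JDBC driver, so the aggregation runs server-side
on the RegionServers instead of pulling every row into Spark first. A hedged
sketch (table, column, and JDBC URL are placeholders):

  import java.sql.DriverManager

  // Connect with the Phoenix thick JDBC driver against the ZK quorum.
  val conn = DriverManager.getConnection("jdbc:phoenix:zkhost:2181")
  val stmt = conn.createStatement()
  // Phoenix evaluates the DISTINCT server-side and streams back only
  // the distinct values.
  val rs = stmt.executeQuery("SELECT DISTINCT COL1 FROM TABLE1")
  while (rs.next()) {
    println(rs.getString(1))
  }
  rs.close(); stmt.close(); conn.close()

If the query is fast over plain JDBC but slow through the connector, that
points at the Spark integration rather than Phoenix itself.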
On 3/4/18 6:08 AM, Stepan Migunov wrote:
In our software we need to combine fast interactive access to the data with
quite complex data processing. I know that Phoenix is intended for fast
access, but I hoped that I would also be able to use Phoenix as a source for
complex processing with Spark. Unfortunately, Phoenix + Spark shows very poor
performance. E.g., querying a big table (about a billion records) with
DISTINCT takes about 2 hours. The same task with a Hive source takes a few
minutes. Is this expected? Does it mean that Phoenix is not suitable at all
for batch processing with Spark, and that I should duplicate the data to Hive
and process it with Hive?