You're mostly at the mercy of HBase and Phoenix to ensure that your data is
evenly distributed in the underlying regions. You could look at
pre-splitting or salting [1] your tables, as well as adjusting the
guidepost parameters [2] if you need finer-grained control.
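As a sketch, a hypothetical table could be salted or pre-split at creation time (table and column names below are made up; SALT_BUCKETS accepts 1-256, and the guidepost width is controlled by the `phoenix.stats.guidepost.width` property, in bytes, set in hbase-site.xml):

```sql
-- Salting: Phoenix prepends a hash-derived byte to each row key,
-- spreading sequential keys across 8 buckets (and hence regions)
CREATE TABLE EXAMPLE_TABLE (
    ID BIGINT NOT NULL PRIMARY KEY,
    COL1 VARCHAR
) SALT_BUCKETS = 8;

-- Alternative: pre-split on known key boundaries instead of salting
-- CREATE TABLE EXAMPLE_TABLE (
--     ID VARCHAR NOT NULL PRIMARY KEY,
--     COL1 VARCHAR
-- ) SPLIT ON ('g', 'n', 'u');
```

Lowering `phoenix.stats.guidepost.width` produces more guideposts and therefore more parallel scans, at the cost of more statistics to maintain.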
If you end up with more idle Spar
Thanks for the details.
I tested it out and saw that the number of partitions equals the number of
parallel scans run upon DataFrame load in Phoenix 4.10.
Also, how can we ensure that the read gets evenly distributed as tasks
across the number of executors set for the job? I'm running phoenixTableAsDataFrame
Hi,
Phoenix is able to parallelize queries based on the underlying HBase region
splits, as well as its own internal guideposts based on statistics
collection [1].
The phoenix-spark connector exposes those splits to Spark for the RDD /
DataFrame parallelism. In order to test this out, you can try r
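To build intuition for how salting spreads the split boundaries, here is a toy, stdlib-only sketch of bucket assignment. Phoenix computes a salt byte from a hash of the row key modulo SALT_BUCKETS; the hash below is a simplified stand-in, not Phoenix's actual hash function:

```scala
object SaltSketch {
  // Number of salt buckets; Phoenix allows 1-256 via SALT_BUCKETS.
  val SaltBuckets = 4

  // Toy bucket assignment: Phoenix hashes the row key bytes with its
  // own function; String.hashCode stands in for it here.
  def bucketFor(rowKey: String): Int =
    ((rowKey.hashCode % SaltBuckets) + SaltBuckets) % SaltBuckets

  def main(args: Array[String]): Unit = {
    // Monotonically increasing keys (the classic hotspot case)
    // still spread across all buckets once salted.
    val keys = (1 to 1000).map(i => f"ROW$i%05d")
    val perBucket = keys.groupBy(bucketFor).map { case (b, ks) => b -> ks.size }
    perBucket.toSeq.sortBy(_._1).foreach { case (b, n) =>
      println(s"bucket $b: $n rows")
    }
  }
}
```

Since each bucket maps to a distinct region boundary, the parallel scans (and hence Spark partitions) stay balanced even when the unsalted keys would all land in one region.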
Also, I'm using the phoenixTableAsDataFrame API to read from a pre-split
phoenix table. How can we ensure read is parallelized across all executors?
Would salting/pre-splitting tables help in providing parallelism?
Appreciate any inputs.
Thanks
Kanagha
On Wed, Aug 16, 2017 at 10:16 PM, kanagha wrote:
Hi Josh,
Per your previous post, it is mentioned "The phoenix-spark parallelism is
based on the splits provided by the Phoenix query planner, and has no
requirements on specifying partition columns or upper/lower bounds."
Does it depend upon the region splits on the input table for parallelism?
C
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

// Load the columns 'ID' and 'COL1' from TABLE1 as a DataFrame
val df = sqlContext.phoenixTableAsDataFrame(
  "TABLE1", Array("ID", "COL1"), conf = configuration
)
df.show
From: Josh Mahonin [mailto:jmaho...@gmail.com]
Sent: June 9, 2016 9:44
To: user@phoenix.apache.org
Subject: Re: phoenix spark options not supporting query in dbtable
Hi Xindian,
The phoenix-spark integration is based on the Phoenix MapReduce layer,
which doesn't support aggregate functions. However, as you mentioned, both
filtering and pruning predicates are pushed down to Phoenix. With an RDD or
DataFrame loaded, all of Spark's various aggregation methods are available.
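For example, column pruning and a filter predicate push down to Phoenix, while the aggregation itself runs in Spark across the partitions. A sketch, assuming a live cluster, a Hadoop `configuration` pointing at the cluster's ZooKeeper quorum, and a hypothetical TABLE1 with ID and COL1 columns:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._

val sc = new SparkContext("local", "phoenix-aggregation-sketch")
val sqlContext = new SQLContext(sc)

// Column pruning (ID, COL1) happens in Phoenix at scan time
val df = sqlContext.phoenixTableAsDataFrame(
  "TABLE1", Array("ID", "COL1"), conf = configuration)

// One partition per parallel Phoenix scan
println(df.rdd.getNumPartitions)

// The filter is pushed down to Phoenix; the groupBy/count
// aggregation runs in Spark on the loaded partitions
df.filter(df("COL1") === "some_value")
  .groupBy("COL1")
  .count()
  .show()
```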