You're mostly at the mercy of HBase and Phoenix to ensure that your data is
evenly distributed in the underlying regions. You could look at
pre-splitting or salting [1] your tables, as well as adjusting the
guidepost parameters [2] if you need finer-grained control.
If you end up with more idle
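For reference, both salting and guidepost tuning are plain DDL, so a minimal
sketch over the Phoenix JDBC driver looks like the following (the table name,
bucket count, guidepost width, and ZooKeeper quorum are all illustrative
placeholders):

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181") // placeholder quorum
val stmt = conn.createStatement()

// SALT_BUCKETS pre-splits the table and spreads row keys across that many regions
stmt.execute(
  "CREATE TABLE IF NOT EXISTS MY_TABLE (ID BIGINT PRIMARY KEY, TXT VARCHAR) SALT_BUCKETS = 8")

// A smaller guidepost width yields more guideposts, hence more parallel scans (Phoenix 4.9+)
stmt.execute("ALTER TABLE MY_TABLE SET GUIDE_POSTS_WIDTH = 10000000")

// Guideposts come from statistics collection, which can also be kicked off on demand
stmt.execute("UPDATE STATISTICS MY_TABLE")
conn.close()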
I'm running EMR 5.8.0 with these applications installed:
Pig 0.16.0, Phoenix 4.11.0, HBase 1.3.1
Here is my pig script (try.pig):
REGISTER /usr/lib/phoenix/phoenix-4.11.0-HBase-1.3-client.jar;
A = load '/steve/a.txt' as (TXT:chararray);
store A into 'hbase://A_TABLE' using
org.apache.phoenix.pig.PhoenixHBaseStorage('localhost', '-batchSize 1000');
-- 'localhost' stands in for the cluster's ZooKeeper quorum; the batch size is tunable
Thanks for the details.
I tested it out and saw that the number of partitions equals the number of
parallel scans run upon DataFrame load in Phoenix 4.10.
Also, how can we ensure that the read gets evenly distributed as tasks
across the number of executors set for the job? I'm running phoenixTableAsDataFrame
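A quick way to confirm that mapping (a sketch assuming phoenix-spark 4.10+;
the table, columns, and ZooKeeper quorum below are placeholders):

import org.apache.phoenix.spark._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-partition-check").getOrCreate()

// phoenixTableAsDataFrame is added to SQLContext by the phoenix-spark implicits
val df = spark.sqlContext.phoenixTableAsDataFrame(
  "MY_TABLE", Seq("ID", "COL1"), zkUrl = Some("zk-host:2181"))

// One RDD partition per parallel Phoenix scan
println(df.rdd.getNumPartitions)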
Hi Luqman,
I just responded to another query on the list about phoenix-spark that may
help shed some light. In addition, the preferred locations the
phoenix-spark connector exposes are determined in the general
PhoenixInputFormat MapReduce code [1]
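To see those preferred locations from the Spark side, the RDD API exposes them
per partition; a sketch (table, column, and quorum are placeholders):

import org.apache.phoenix.spark._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-locality-check").getOrCreate()
val df = spark.sqlContext.phoenixTableAsDataFrame(
  "MY_TABLE", Seq("ID"), zkUrl = Some("zk-host:2181"))

// Each partition's preferred hosts should line up with the HBase region servers
val rdd = df.rdd
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}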
I'm not very familiar with PrestoDB, but if
Hi,
Phoenix is able to parallelize queries based on the underlying HBase region
splits, as well as its own internal guideposts based on statistics
collection [1]
The phoenix-spark connector exposes those splits to Spark for the RDD /
DataFrame parallelism. In order to test this out, you can try
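For instance, the EXPLAIN output reports how many chunks (parallel scans)
Phoenix plans for a query; a minimal sketch over JDBC (the URL and table are
placeholders):

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
val rs = conn.createStatement().executeQuery("EXPLAIN SELECT * FROM MY_TABLE")
// Expect a first plan line along the lines of:
//   CLIENT 12-CHUNK PARALLEL 12-WAY FULL SCAN OVER MY_TABLE
while (rs.next()) println(rs.getString(1))
conn.close()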
Hi,
We are evaluating the possibility of writing a custom connector for Phoenix
to access tables stored in HBase. However, we need some help.
The connector for Presto should be able to read from the HBase cluster using
parallel collections. For that, the connector has a "ConnectorSplitManager"
Also, I'm using the phoenixTableAsDataFrame API to read from a pre-split
Phoenix table. How can we ensure the read is parallelized across all executors?
Would salting/pre-splitting tables help in providing parallelism?
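For illustration, if the scan count comes out below the executor core count,
an explicit repartition after the load is one (shuffle-incurring) way to use
every slot; a sketch with placeholder names:

import org.apache.phoenix.spark._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-even-read").getOrCreate()
val df = spark.sqlContext.phoenixTableAsDataFrame(
  "MY_TABLE", Seq("ID", "COL1"), zkUrl = Some("zk-host:2181"))

// The initial scan parallelism is bounded by regions/guideposts;
// repartitioning afterwards spreads downstream work across all executors
val evened = df.repartition(spark.sparkContext.defaultParallelism)
println(evened.rdd.getNumPartitions)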
Appreciate any inputs.
Thanks
Kanagha