Hi,

Phoenix can parallelize queries based on the underlying HBase region
splits, as well as its own internal guideposts, which come from statistics
collection [1].
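
If the table's stats are missing or stale, you can regenerate the
guideposts with UPDATE STATISTICS before checking the plan. A minimal
sketch via the Phoenix JDBC driver (the ZooKeeper quorum "zk-host:2181"
and table "MY_TABLE" are placeholders for your environment, and the
Phoenix client jar needs to be on the classpath):

    import java.sql.DriverManager

    // Connect through the Phoenix JDBC driver; URL format is
    // jdbc:phoenix:<zookeeper quorum>
    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    try {
      // Recompute the guideposts the query planner uses for parallelism [1]
      conn.createStatement().execute("UPDATE STATISTICS MY_TABLE")
    } finally {
      conn.close()
    }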

The phoenix-spark connector exposes those splits to Spark for the RDD /
DataFrame parallelism. To test this out, you can run an
EXPLAIN SELECT... query [2] that mimics the DataFrame load to see how many
parallel scans will be run, and then compare that to the RDD / DataFrame
partition count (some_rdd.partitions.size). In Phoenix 4.10 and above [3],
the two numbers will be the same. In earlier versions, the partition count
will equal the number of regions for that table.
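
As a rough Scala sketch of that comparison (table name, columns, and
ZooKeeper quorum are placeholders; this assumes Spark 1.x with the
phoenix-spark jar on the classpath):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.phoenix.spark._

    val sc = new SparkContext(
      new SparkConf().setAppName("phoenix-parallelism-check"))
    val sqlContext = new SQLContext(sc)

    // Load the table as a DataFrame via the phoenix-spark connector
    val df = sqlContext.phoenixTableAsDataFrame(
      "MY_TABLE", Seq("ID", "COL1"), zkUrl = Some("zk-host:2181"))

    // Compare this count against the number of parallel scans ("chunks")
    // reported by: EXPLAIN SELECT ID, COL1 FROM MY_TABLE
    println(df.rdd.partitions.size)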

Josh

[1] https://phoenix.apache.org/update_statistics.html
[2] https://phoenix.apache.org/tuning_guide.html
[3] https://issues.apache.org/jira/browse/PHOENIX-3600


On Thu, Aug 17, 2017 at 3:07 AM, Kanagha <er.kana...@gmail.com> wrote:

> Also, I'm using the phoenixTableAsDataFrame API to read from a pre-split
> Phoenix table. How can we ensure the read is parallelized across all
> executors? Would salting/pre-splitting tables help in providing parallelism?
> Appreciate any inputs.
>
> Thanks
>
>
> Kanagha
>
> On Wed, Aug 16, 2017 at 10:16 PM, kanagha <er.kana...@gmail.com> wrote:
>
>> Hi Josh,
>>
>> In your previous post, you mentioned: "The phoenix-spark parallelism is
>> based on the splits provided by the Phoenix query planner, and has no
>> requirements on specifying partition columns or upper/lower bounds."
>>
>> Does it depend upon the region splits on the input table for parallelism?
>> Could you please provide more details?
>>
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context: http://apache-phoenix-user-list.1124778.n5.nabble.com/phoenix-spark-options-not-supporint-query-in-dbtable-tp1915p3810.html
>> Sent from the Apache Phoenix User List mailing list archive at Nabble.com.
>>
>
>
