You're mostly at the mercy of HBase and Phoenix to ensure that your data is
evenly distributed in the underlying regions. You could look at
pre-splitting or salting [1] your tables, as well as adjusting the
guidepost parameters [2] if you need finer-grained control.
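
As a rough sketch of the salting route (table, column, and ZooKeeper quorum
names below are just placeholders), you can create a salted table with the
SALT_BUCKETS option through the Phoenix JDBC driver:

  import java.sql.DriverManager

  // Placeholder ZK quorum; adjust for your cluster
  val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
  conn.createStatement().execute(
    """CREATE TABLE IF NOT EXISTS MY_TABLE (
      |  ID   BIGINT NOT NULL PRIMARY KEY,
      |  COL1 VARCHAR
      |) SALT_BUCKETS = 16""".stripMargin)  // 16 buckets => up to 16-way scans
  conn.close()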

If you end up with more idle Spark workers than RDD partitions, a pattern
I've seen is to simply repartition() the RDD / DataFrame to a higher level
of parallelism after it's loaded. You pay some overhead to redistribute the
data between executors, but you may make it up by having more workers
processing the data.
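
A minimal sketch of that pattern in spark-shell (table, columns, zkUrl, and
the target partition count are placeholders; pick a count that matches your
cluster):

  import org.apache.phoenix.spark._

  val df = sqlContext.phoenixTableAsDataFrame(
    "MY_TABLE", Seq("ID", "COL1"), zkUrl = Some("zk-host:2181"))

  df.rdd.partitions.size          // parallelism as provided by the Phoenix splits

  val wider = df.repartition(32)  // shuffle up to e.g. 32 partitions
  wider.rdd.partitions.size       // now 32, so more executors can work concurrently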

Josh

[1] https://phoenix.apache.org/salted.html
[2] https://phoenix.apache.org/tuning_guide.html

On Thu, Aug 17, 2017 at 2:36 PM, Kanagha <er.kana...@gmail.com> wrote:

> Thanks for the details.
>
> I tested it out and saw that the number of partitions equals the number of
> parallel scans run upon DataFrame load in Phoenix 4.10.
> Also, how can we ensure that the read gets evenly distributed as tasks
> across the number of executors set for the job? I'm running the
> phoenixTableAsDataFrame API on a table with 4-way parallel scans and with
> 4 executors set for the job. Thanks for the inputs.
>
>
> Kanagha
>
> On Thu, Aug 17, 2017 at 7:17 AM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Hi,
>>
>> Phoenix is able to parallelize queries based on the underlying HBase
>> region splits, as well as its own internal guideposts based on statistics
>> collection [1]
>>
>> The phoenix-spark connector exposes those splits to Spark for the RDD /
>> DataFrame parallelism. In order to test this out, you can try running an
>> EXPLAIN SELECT... query [2] to mimic the DataFrame load to see how many
>> parallel scans will be run, and then compare those to the RDD / DataFrame
>> partition count (some_rdd.partitions.size). In Phoenix 4.10 and above [3],
>> they will be the same. In versions below that, the partition count will
>> equal the number of regions for that table.
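>>
>> As a rough illustration (table, columns, and zkUrl are placeholders): run
>> the EXPLAIN in sqlline, note the N-WAY parallel scan count it reports, and
>> then compare that against the Spark-side partition count:
>>
>>   import org.apache.phoenix.spark._
>>
>>   val df = sqlContext.phoenixTableAsDataFrame(
>>     "MY_TABLE", Seq("ID", "COL1"), zkUrl = Some("zk-host:2181"))
>>   df.rdd.partitions.size  // on 4.10+ this should match the EXPLAIN scan count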
>>
>> Josh
>>
>> [1] https://phoenix.apache.org/update_statistics.html
>> [2] https://phoenix.apache.org/tuning_guide.html
>> [3] https://issues.apache.org/jira/browse/PHOENIX-3600
>>
>>
>> On Thu, Aug 17, 2017 at 3:07 AM, Kanagha <er.kana...@gmail.com> wrote:
>>
>>> Also, I'm using the phoenixTableAsDataFrame API to read from a pre-split
>>> Phoenix table. How can we ensure the read is parallelized across all
>>> executors?
>>> Would salting/pre-splitting tables help in providing parallelism?
>>> Appreciate any inputs.
>>>
>>> Thanks
>>>
>>>
>>> Kanagha
>>>
>>> On Wed, Aug 16, 2017 at 10:16 PM, kanagha <er.kana...@gmail.com> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> Per your previous post, you mentioned "The phoenix-spark parallelism is
>>>> based on the splits provided by the Phoenix query planner, and has no
>>>> requirements on specifying partition columns or upper/lower bounds."
>>>>
>>>> Does it depend upon the region splits on the input table for
>>>> parallelism?
>>>> Could you please provide more details?
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
