You're mostly at the mercy of HBase and Phoenix to ensure that your data is evenly distributed in the underlying regions. You could look at pre-splitting or salting [1] your tables, as well as adjusting the guidepost parameters [2] if you need finer-grained control.
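For illustration, here's roughly what that looks like through the Phoenix JDBC driver. This is a minimal sketch: the table name, columns, ZooKeeper quorum, and the bucket / width values are all made-up placeholders, and the per-table GUIDE_POSTS_WIDTH property assumes Phoenix 4.9+:

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
val stmt = conn.createStatement()

// Salting [1]: SALT_BUCKETS pre-splits the table into N regions and
// prepends a salt byte to each row key so load spreads across them.
stmt.execute(
  """CREATE TABLE IF NOT EXISTS EXAMPLE_TABLE (
    |  ID BIGINT NOT NULL PRIMARY KEY,
    |  VAL VARCHAR
    |) SALT_BUCKETS = 16""".stripMargin)

// Guideposts [2]: a smaller guidepost width means more, finer-grained
// parallel scans for this table.
stmt.execute("ALTER TABLE EXAMPLE_TABLE SET GUIDE_POSTS_WIDTH = 10000000")

// Recollect statistics so the new guideposts take effect.
stmt.execute("UPDATE STATISTICS EXAMPLE_TABLE")
conn.close()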
If you end up with more idle Spark workers than RDD partitions, a pattern I've seen is to simply repartition() the RDD / DataFrame to a higher level of parallelism after it's loaded. You pay some overhead cost to redistribute the data between executors, but you may make it up by having more workers processing the data. A sketch of this pattern follows below the quoted thread.

Josh

[1] https://phoenix.apache.org/salted.html
[2] https://phoenix.apache.org/tuning_guide.html

On Thu, Aug 17, 2017 at 2:36 PM, Kanagha <er.kana...@gmail.com> wrote:

> Thanks for the details.
>
> I tested it out and saw that the number of partitions equals the number
> of parallel scans run upon DataFrame load in Phoenix 4.10.
> Also, how can we ensure that the read gets evenly distributed as tasks
> across the number of executors set for the job? I'm running the
> phoenixTableAsDataFrame API on a table with 4-way parallel scans and with
> 4 executors set for the job. Thanks for the inputs.
>
> Kanagha
>
> On Thu, Aug 17, 2017 at 7:17 AM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Hi,
>>
>> Phoenix is able to parallelize queries based on the underlying HBase
>> region splits, as well as its own internal guideposts based on
>> statistics collection [1].
>>
>> The phoenix-spark connector exposes those splits to Spark for the RDD /
>> DataFrame parallelism. To test this out, you can try running an
>> EXPLAIN SELECT... query [2] that mimics the DataFrame load to see how
>> many parallel scans will be run, and then compare that to the RDD /
>> DataFrame partition count (some_rdd.partitions.size). In Phoenix 4.10
>> and above [3], the two will be the same. In versions below that, the
>> partition count will equal the number of regions for that table.
>>
>> Josh
>>
>> [1] https://phoenix.apache.org/update_statistics.html
>> [2] https://phoenix.apache.org/tuning_guide.html
>> [3] https://issues.apache.org/jira/browse/PHOENIX-3600
>>
>> On Thu, Aug 17, 2017 at 3:07 AM, Kanagha <er.kana...@gmail.com> wrote:
>>
>>> Also, I'm using the phoenixTableAsDataFrame API to read from a
>>> pre-split Phoenix table. How can we ensure the read is parallelized
>>> across all executors? Would salting/pre-splitting tables help in
>>> providing parallelism? Appreciate any inputs.
>>>
>>> Thanks
>>>
>>> Kanagha
>>>
>>> On Wed, Aug 16, 2017 at 10:16 PM, kanagha <er.kana...@gmail.com> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> Per your previous post, it's mentioned that "the phoenix-spark
>>>> parallelism is based on the splits provided by the Phoenix query
>>>> planner, and has no requirements on specifying partition columns or
>>>> upper/lower bounds."
>>>>
>>>> Does it depend upon the region splits of the input table for
>>>> parallelism? Could you please provide more details?
>>>>
>>>> Thanks
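For completeness, here's a minimal sketch of the load / check / repartition pattern discussed at the top of this thread, assuming the phoenix-spark integration in 4.10 and Spark's SQLContext API. The table name, columns, ZooKeeper quorum, and the target partition count of 32 are placeholders, not values from the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._

val sc = new SparkContext(new SparkConf().setAppName("phoenix-read"))
val sqlContext = new SQLContext(sc)

// Load the table; the partitioning comes from Phoenix's region splits
// and guideposts, as described above.
val df = sqlContext.phoenixTableAsDataFrame(
  "EXAMPLE_TABLE", Seq("ID", "VAL"), zkUrl = Some("zk-host:2181"))

// In Phoenix 4.10+, this should match the number of parallel scans
// reported by EXPLAIN for the equivalent SELECT.
println(s"partitions = ${df.rdd.partitions.size}")

// If idle executors outnumber partitions, pay a one-time shuffle to
// spread the work more widely before the heavy processing begins.
val wider = df.repartition(32)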