Hi,

The phoenix-spark integration inherits the underlying splits provided by Phoenix, which are a function of the HBase regions, salting, and other aspects determined by the Phoenix query planner.
Re: #1, as I understand the Spark JDBC connector, it evenly segments the range, although it will only work on a numeric column, not a compound row key.

Re: #2, again, as I understand Spark JDBC, I don't believe that's an option; or perhaps it will default to only providing one partition, i.e., one very large query.

Re: data locality, the underlying Phoenix Hadoop InputFormat isn't yet node-aware. There are some data-locality advantages gained by co-locating the Spark executors with the RegionServers, but it could be improved. It's worth filing a JIRA enhancement ticket for that.

Best,

Josh

On Mon, Sep 19, 2016 at 12:48 PM, Long, Xindian <xindian.l...@sensus.com> wrote:
> How are DataFrames/Datasets/RDDs partitioned by default when using Spark,
> assuming the DataFrame/Dataset/RDD is the result of a query like this:
>
> select col1, col2, col3 from table3 where col3 > xxx
>
> I noticed that for HBase, a partitioner partitions the row keys based on
> region splits; can Phoenix do this as well?
>
> I also read that if I use Spark with the Phoenix JDBC interface, "it's
> only able to parallelize queries by partitioning on a numeric column. It
> also requires a known lower bound, upper bound and partition count in
> order to create split queries."
>
> Question 1: If I specify an option like this, is the partitioning based
> on segmenting the range evenly, i.e. each partition gets a row-key range
> of size (upperBound - lowerBound)/partitionCount?
>
> Question 2: If I do not specify any range, or the row key is not a
> numeric column, how is the result partitioned using JDBC?
>
> If I use the phoenix-spark plugin, it is mentioned that it is able to
> leverage the underlying splits provided by Phoenix.
> Are there any example scenarios of that? E.g., can it partition the
> resulting DataFrame based on regions in the underlying HBase table, so
> that Spark can take advantage of the locality of the data?
>
> Thanks
>
> Xindian
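To illustrate the answer to #1: below is a rough Python sketch of how an evenly segmented numeric range turns into per-partition bounds. The function name and edge handling are my own for illustration; Spark's actual stride arithmetic lives in its JDBC relation code and may differ slightly at the boundaries, but the idea is the same: each partition covers roughly (upperBound - lowerBound)/numPartitions values, and the first and last split queries are left open-ended so no rows fall outside the range.

```python
def jdbc_partition_bounds(lower, upper, num_partitions):
    """Sketch of even range segmentation for a numeric partition column.

    Returns a list of (inclusive_lower, exclusive_upper) tuples, one per
    partition; None means unbounded, mirroring the open-ended first and
    last split queries a JDBC source generates so no rows are lost.
    """
    stride = (upper - lower) // num_partitions
    bounds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        bounds.append((lo if i > 0 else None,
                       hi if i < num_partitions - 1 else None))
    return bounds

# Four partitions over [0, 100) -> strides of 25, open-ended at both edges.
print(jdbc_partition_bounds(0, 100, 4))
# [(None, 25), (25, 50), (50, 75), (75, None)]
```

Each tuple then corresponds to one WHERE clause (e.g. `col >= 25 AND col < 50`) issued as a separate query, which is also why this scheme needs a numeric column with known bounds and cannot be applied to a compound row key.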