Hi,

The phoenix-spark integration inherits the underlying splits provided by Phoenix, which are a function of the HBase regions, salting, and other aspects determined by the Phoenix query planner.
Re: #1, as I understand the Spark JDBC connector, it evenly segments the range, although it will only work on a numeric column, not a compound row key.

Re: #2, again, as I understand Spark JDBC, I don't believe that's an option; or perhaps it will default to only providing one partition, i.e., one very large query.

Re: data locality, the underlying Phoenix Hadoop InputFormat isn't yet node-aware. There are some data-locality advantages gained by co-locating the Spark executors with the RegionServers, but it could be improved. It's worth filing a JIRA enhancement ticket for that.

Best,

Josh

On Mon, Sep 19, 2016 at 12:48 PM, Long, Xindian <xindian.l...@sensus.com> wrote:
> How are DataFrames/Datasets/RDDs partitioned by default when using Spark,
> assuming the DataFrame/Dataset/RDD is the result of a query like this:
>
> select col1, col2, col3 from table3 where col3 > xxx
>
> I noticed that for HBase, a partitioner partitions the row keys based on
> region splits; can Phoenix do this as well?
>
> I also read that if I use Spark with the Phoenix JDBC interface, "it's
> only able to parallelize queries by partitioning on a numeric column. It
> also requires a known lower bound, upper bound and partition count in
> order to create split queries."
>
> Question 1: If I specify an option like this, is the partitioning based
> on segmenting the range evenly, i.e. each partition gets a row-key range
> of size (upperBound - lowerBound)/partitionCount?
>
> Question 2: If I do not specify any range, or the row key is not a
> numeric column, how is the result partitioned using JDBC?
>
> If I use the phoenix-spark plugin, it is mentioned that it is able to
> leverage the underlying splits provided by Phoenix.
> Are there any example scenarios of that? E.g., can it partition the
> resulting DataFrame based on regions in the underlying HBase table, so
> that Spark can take advantage of the locality of the data?
>
> Thanks
>
> Xindian
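To illustrate the answer to #1: below is a rough Python sketch of how an evenly segmented numeric range turns into per-partition bounds. The function name and edge handling are my own for illustration; Spark's actual stride arithmetic lives in its JDBC relation code and may differ slightly at the boundaries, but the idea is the same: each partition covers roughly (upperBound - lowerBound)/numPartitions values, and the first and last split queries are left open-ended so no rows fall outside the range.

```python
def jdbc_partition_bounds(lower, upper, num_partitions):
    """Sketch of even range segmentation for a numeric partition column.

    Returns a list of (inclusive_lower, exclusive_upper) tuples, one per
    partition; None means unbounded, mirroring the open-ended first and
    last split queries a JDBC source generates so no rows are lost.
    """
    stride = (upper - lower) // num_partitions
    bounds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        bounds.append((lo if i > 0 else None,
                       hi if i < num_partitions - 1 else None))
    return bounds

# Four partitions over [0, 100) -> strides of 25, open-ended at both edges.
print(jdbc_partition_bounds(0, 100, 4))
# [(None, 25), (25, 50), (50, 75), (75, None)]
```

Each tuple then corresponds to one WHERE clause (e.g. `col >= 25 AND col < 50`) issued as a separate query, which is also why this scheme needs a numeric column with known bounds and cannot be applied to a compound row key.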