[ https://issues.apache.org/jira/browse/PHOENIX-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828357#comment-15828357 ]
Josh Mahonin commented on PHOENIX-3601:
---------------------------------------

Trivial patch; most of the functionality comes from PHOENIX-3600. Unfortunately, since PhoenixRDD extends {{RDD}} rather than {{NewHadoopRDD}}, we don't get some of the niceties for free. There was a good reason for this that's now lost to me...

TL;DR: If used in conjunction with PHOENIX-3600, I observed Spark data load times decrease by 30-40%.

Longer version: Using [~elserj]'s take on the very cool https://github.com/joshelser/phoenix-performance toolset, I generated about 114M rows of TPC-DS data on a 5-RegionServer setup. I used a load factor of 5, which created a 256-way split table we'll refer to as SALES. I also created a new table, pre-salted with 5 buckets, which we'll call SALES2, and UPSERT SELECTed the data over. Major compaction and UPDATE STATISTICS were also run on both tables.

Using HDP 2.5 (Phoenix 4.7, Spark 1.6), I invoked spark-shell with 5 executors and 2 cores each, with each executor co-located with one RegionServer. I then created a Phoenix RDD for each table and ran a Spark {{rdd.count}} operation on it. This effectively loads the entire table into Spark, and then Spark counts the rows. I ran this for each table in three configurations (the default case, just the location changes, and the location changes plus the split.by.stats changes), recording the run times 4 times each. I also closed out the spark-shell and ensured any Spark-cached files were removed, although I didn't account for caching on the HBase or OS side.
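For anyone unfamiliar with the mechanism: Spark asks each RDD, per partition, which hosts hold that partition's data, and the scheduler then tries to run the task there. A minimal, self-contained model of that pattern is below; note these are plain-Java stand-ins for Spark's Partition/RDD types and for the PhoenixInputSplit host lookup, with illustrative names, not the actual patch code:

```java
import java.util.Arrays;
import java.util.List;

/**
 * Simplified model of Spark's preferred-locations hook.
 * PhoenixPartition and getPreferredLocations here are stand-ins,
 * not the real Spark or Phoenix classes.
 */
public class LocalitySketch {

    // Stand-in for a Spark Partition that carries the host list of the
    // HBase region backing it (what PhoenixInputSplit knows after PHOENIX-3600).
    static class PhoenixPartition {
        final int index;
        final List<String> regionHosts;

        PhoenixPartition(int index, List<String> regionHosts) {
            this.index = index;
            this.regionHosts = regionHosts;
        }
    }

    // Stand-in for the RDD hook: the real patch overrides
    // RDD.getPreferredLocations so the scheduler can place each task
    // on the RegionServer hosting that partition's region.
    static List<String> getPreferredLocations(PhoenixPartition split) {
        return split.regionHosts;
    }

    public static void main(String[] args) {
        PhoenixPartition p0 =
            new PhoenixPartition(0, Arrays.asList("rs1.example.com"));
        System.out.println(getPreferredLocations(p0)); // [rs1.example.com]
    }
}
```

Surfacing the split's hosts this way is what lets Spark co-locate tasks with RegionServers, which is where the reduction in network overhead comes from.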
||SALES (256 regions, 261 stats splits)||t1||t2||t3||t4||
|control|120s|116s|111s|125s|
|location|96s|106s|94s|100s|
|location+stats|82s|74s|82s|82s|

||SALES2 (10 regions, 50 stats splits)||t1||t2||t3||t4||
|control|102s|83s|92s|96s|
|location|94s|78s|90s|81s|
|location+stats|62s|70s|79s|58s|

I have more screencaps of the Spark executors that report on the various task jobs, but in short, what we see is that the individual task times are much more evenly distributed (i.e. fewer outliers), and the overall task time is also decreased due to less network overhead.

If anyone's using phoenix-spark and is able to test it out, that would be great. Also cc [~maghamraviki...@gmail.com] [~ndimiduk] [~sergey.soldatov] [~elserj] [~jamestaylor]

> PhoenixRDD doesn't expose the preferred node locations to Spark
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-3601
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3601
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.8.0
>            Reporter: Josh Mahonin
>            Assignee: Josh Mahonin
>         Attachments: PHOENIX-3601.patch
>
>
> Follow-up to PHOENIX-3600: in order to let Spark know the preferred node
> locations to assign partitions to, we need to update PhoenixRDD to retrieve
> the underlying node location information from the splits.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)