[ https://issues.apache.org/jira/browse/PHOENIX-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828357#comment-15828357 ]
Josh Mahonin commented on PHOENIX-3601:
---------------------------------------

Trivial patch; most of the functionality comes from PHOENIX-3600. Unfortunately, since PhoenixRDD extends {{RDD}} rather than {{NewHadoopRDD}}, we don't get some of the niceties for free. There was a good reason for this that's now lost to me...

TL;DR: If used in conjunction with PHOENIX-3600, I observed Spark data load times decrease by 30-40%.

Longer version: Using [~elserj]'s take on the very cool https://github.com/joshelser/phoenix-performance toolset, I generated about 114M rows of TPC-DS data on a 5-RegionServer setup. I used a load factor of 5, which created a 256-way split table we'll refer to as SALES. I also created a new table, pre-salted with 5 buckets, which we'll call SALES2, and UPSERT SELECTed the data over. Major compaction and UPDATE STATISTICS were also run on both tables.

Using HDP 2.5 (Phoenix 4.7, Spark 1.6), I invoked spark-shell with 5 executors and 2 cores each, with each executor co-located with one RegionServer. I then created a Phoenix RDD for each table and ran a Spark {{rdd.count}} operation on it. This effectively loads the entire table into Spark, and then Spark counts the rows. I ran this for each table in three configurations (the default case, just the location changes, and the location changes plus the split.by.stats changes), recording the run times 4 times each. I also closed out the spark-shell and ensured any Spark-cached files were removed, although I didn't account for caching on the HBase or OS side.
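For anyone unfamiliar with the mechanism: Spark asks each RDD, per partition, which hosts hold that partition's data, and the scheduler then tries to run the task there. A minimal, self-contained model of that pattern is below; note these are plain-Java stand-ins for Spark's Partition/RDD types and for the PhoenixInputSplit host lookup, with illustrative names, not the actual patch code:

```java
import java.util.Arrays;
import java.util.List;

/**
 * Simplified model of Spark's preferred-locations hook.
 * PhoenixPartition and getPreferredLocations here are stand-ins,
 * not the real Spark or Phoenix classes.
 */
public class LocalitySketch {

    // Stand-in for a Spark Partition that carries the host list of the
    // HBase region backing it (what PhoenixInputSplit knows after PHOENIX-3600).
    static class PhoenixPartition {
        final int index;
        final List<String> regionHosts;

        PhoenixPartition(int index, List<String> regionHosts) {
            this.index = index;
            this.regionHosts = regionHosts;
        }
    }

    // Stand-in for the RDD hook: the real patch overrides
    // RDD.getPreferredLocations so the scheduler can place each task
    // on the RegionServer hosting that partition's region.
    static List<String> getPreferredLocations(PhoenixPartition split) {
        return split.regionHosts;
    }

    public static void main(String[] args) {
        PhoenixPartition p0 =
            new PhoenixPartition(0, Arrays.asList("rs1.example.com"));
        System.out.println(getPreferredLocations(p0)); // [rs1.example.com]
    }
}
```

Surfacing the split's hosts this way is what lets Spark co-locate tasks with RegionServers, which is where the reduction in network overhead comes from.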
||SALES (256 regions, 261 stats splits)||t1||t2||t3||t4||
|control|120s|116s|111s|125s|
|location|96s|106s|94s|100s|
|location+stats|82s|74s|82s|82s|

||SALES2 (10 regions, 50 stats splits)||t1||t2||t3||t4||
|control|102s|83s|92s|96s|
|location|94s|78s|90s|81s|
|location+stats|62s|70s|79s|58s|

I have more screencaps of the Spark executors that report on the various task jobs, but in short, what we see is that the individual task times are much more evenly distributed (i.e. fewer outliers), and the overall task time is also decreased due to less network overhead.

If anyone's using phoenix-spark and is able to test it out, that would be great. Also cc [~maghamraviki...@gmail.com] [~ndimiduk] [~sergey.soldatov] [~elserj] [~jamestaylor]

> PhoenixRDD doesn't expose the preferred node locations to Spark
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-3601
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3601
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.8.0
>            Reporter: Josh Mahonin
>            Assignee: Josh Mahonin
>         Attachments: PHOENIX-3601.patch
>
>
> Follow-up to PHOENIX-3600: in order to let Spark know the preferred node
> locations to assign partitions to, we need to update PhoenixRDD to retrieve
> the underlying node location information from the splits.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)