[ https://issues.apache.org/jira/browse/SPARK-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-1061:
------------------------------------
    Component/s: Spark Core

> allow Hadoop RDDs to be read w/ a partitioner
> ---------------------------------------------
>
>                 Key: SPARK-1061
>                 URL: https://issues.apache.org/jira/browse/SPARK-1061
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>
> Using partitioners to get narrow dependencies can save a lot of time by 
> avoiding a shuffle.  However, after saving an RDD to HDFS and then 
> reloading it, all partitioner information is lost.  This means that you 
> can never get a narrow dependency when loading data from Hadoop.
> I think we could get around this by:
> 1) having a modified version of HadoopRDD that keeps track of the original 
> part file (or maybe just prevents splits altogether ...)
> 2) adding an "assumePartition(partitioner: Partitioner, verify: Boolean)" 
> method to RDD.  It would create a new RDD with exactly the same data that 
> just pretends the given partitioner has been applied to it.  If 
> verify=true, it could add a mapPartitionsWithIndex to check that each 
> record is in the right partition.  (A sketch of this idea follows the 
> quoted description below.)
> http://apache-spark-user-list.1001560.n3.nabble.com/setting-partitioners-with-hadoop-rdds-td976.html
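As a rough illustration of option 2, here is a minimal Scala sketch of an RDD 
wrapper that asserts a partitioner without moving any data.  The class name 
AssumedPartitionedRDD and its error message are illustrative assumptions, not 
existing Spark API; RDD, Partitioner, Partition, and TaskContext are real 
Spark types.  The verify check here is a per-record map over each partition's 
iterator, which has the same effect as the mapPartitionsWithIndex check 
described above.

import scala.reflect.ClassTag

import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper: claims that `assumed` is the partitioner of
// `prev` without shuffling or moving any records.
class AssumedPartitionedRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)],
    assumed: Partitioner,
    verify: Boolean)
  extends RDD[(K, V)](prev) {

  // Downstream operations (e.g. a join against another RDD with the
  // same partitioner) see this and can build a narrow dependency.
  override val partitioner: Option[Partitioner] = Some(assumed)

  // Identical partitions to the parent; the data stays where it is.
  override protected def getPartitions: Array[Partition] =
    firstParent[(K, V)].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] = {
    val iter = firstParent[(K, V)].iterator(split, context)
    if (!verify) iter
    else iter.map { case kv @ (k, _) =>
      // Fail fast if a record is not where the assumed partitioner
      // says it should be.
      val expected = assumed.getPartition(k)
      require(expected == split.index,
        s"key $k found in partition ${split.index}, but the assumed " +
        s"partitioner places it in partition $expected")
      kv
    }
  }
}

Usage would look something like new AssumedPartitionedRDD(reloaded, new 
HashPartitioner(reloaded.partitions.length), verify = true), assuming the 
reload (option 1 above) preserved the original part-file boundaries.  The 
verify=true path costs one getPartition call per record, so it is best 
treated as a one-off sanity check.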


