[ https://issues.apache.org/jira/browse/SPARK-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726903#comment-14726903 ]
Apache Spark commented on SPARK-9821:
-------------------------------------

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/8569

> pyspark reduceByKey should allow a custom partitioner
> -----------------------------------------------------
>
>                 Key: SPARK-9821
>                 URL: https://issues.apache.org/jira/browse/SPARK-9821
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.3.0
>            Reporter: Diana Carroll
>            Priority: Minor
>
> In Scala, I can supply a custom partitioner to reduceByKey (and other
> aggregation/repartitioning methods like aggregateByKey and combineByKey),
> but as far as I can tell from the PySpark API, there's no way to do the
> same in Python.
> Here's an example of my code in Scala:
> {code}
> weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(), _ + _)
> {code}
> But I can't figure out how to do the same in Python. The closest I can get
> is to call partitionBy before reduceByKey, like so:
> {code}
> weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3, hash_filetype).reduceByKey(lambda v1, v2: v1 + v2).collect()
> {code}
> But that defeats the purpose, because I'm shuffling twice instead of once,
> so my performance is worse instead of better.
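For reference, a minimal sketch of what the requested API could look like, assuming reduceByKey gains a partitionFunc keyword argument along the lines the linked pull request proposes. The input path, FILE_TYPES list, get_file_type, and hash_filetype here are hypothetical stand-ins for the reporter's helpers, not part of the issue:

{code}
from pyspark import SparkContext

# Hypothetical set of known file types; used to partition deterministically.
# (Python's built-in hash() of strings can differ across worker processes,
# so a fixed mapping is safer for a partition function.)
FILE_TYPES = ["html", "jpg", "css"]

def get_file_type(line):
    # Stand-in for the reporter's getFileType helper:
    # classify a weblog line by its file extension.
    return line.rsplit(".", 1)[-1]

def hash_filetype(file_type):
    # Custom partition function: PySpark takes the returned int
    # modulo the number of partitions to choose a partition per key.
    return FILE_TYPES.index(file_type) if file_type in FILE_TYPES else 0

sc = SparkContext(appName="reduceByKeyCustomPartitioner")
weblogs = sc.textFile("weblogs")  # placeholder input path

counts = weblogs.map(lambda s: (get_file_type(s), 1)) \
                .reduceByKey(lambda v1, v2: v1 + v2,
                             numPartitions=3,
                             partitionFunc=hash_filetype)
print(counts.collect())
{code}

With the partition function carried on the reduceByKey call itself, the aggregation and the custom partitioning share a single shuffle, which is exactly what the partitionBy-then-reduceByKey workaround above cannot achieve.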