[ https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331975#comment-15331975 ]
Bryan Cutler commented on SPARK-15861:
--------------------------------------

If you change your function to this
{noformat}
def to_np(data):
    return [np.array(list(data))]
{noformat}
I think you would get what you expect, but this is probably not a good way to go about it. I feel like you should be aggregating your lists into numpy arrays instead, but someone else might know better.

> pyspark mapPartitions with none generator functions / functors
> --------------------------------------------------------------
>
>                 Key: SPARK-15861
>                 URL: https://issues.apache.org/jira/browse/SPARK-15861
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.1
>            Reporter: Greg Bowyer
>            Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it
> is fed a normal subroutine.
> For instance, let's say we have the following
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
>
> def to_np(data):
>     return np.array(list(data))
>
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
>
> rdd.mapPartitions(to_np, preservesPartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes the provided function, which returns a single value, act as if the end
> user had called {code}rdd.map{code}.
> I think that maybe a check should be put in to call
> {code}inspect.isgeneratorfunction{code}
> ?
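A minimal sketch of the behavior discussed above, assuming a live SparkContext named {{sc}} and numpy available as {{np}} (this snippet is illustrative and not from the ticket itself): mapPartitions treats whatever the function returns as an iterable of output elements, so a returned 2-D array is iterated row by row, while wrapping it in a list or using a generator yields one array per partition.
{code}
import numpy as np

# Assumes a running SparkContext `sc`; illustrative sketch only.
rows = [list(range(i, i + 5)) for i in range(0, 25, 5)]
rdd = sc.parallelize(rows, 2)

def to_np_plain(iterator):
    # Returns one 2-D array; mapPartitions then iterates it, so each row
    # becomes a separate output element (the map-like behavior reported above).
    return np.array(list(iterator))

def to_np_wrapped(iterator):
    # Wrapping in a list makes the return value an iterable with a single
    # element: the whole partition as one 2-D array (Bryan's suggestion).
    return [np.array(list(iterator))]

def to_np_generator(iterator):
    # Generator-function form, which is what mapPartitions conventionally expects.
    yield np.array(list(iterator))

rdd.mapPartitions(to_np_plain).collect()      # five 1-D row arrays
rdd.mapPartitions(to_np_wrapped).collect()    # two 2-D arrays, one per partition
rdd.mapPartitions(to_np_generator).collect()  # same as the wrapped version
{code}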