[ https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328245#comment-15328245 ]
Bryan Cutler commented on SPARK-15861:
--------------------------------------

[~gbow...@fastmail.co.uk] {{mapPartitions}} expects a function that takes an iterator as input and returns an iterable sequence, and the function in your example does provide that. What is happening here is that your function maps the partition iterator to a NumPy array, which internally looks something like

{noformat}
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
{noformat}

for the first partition. {{collect}} then iterates over that sequence and returns each element, each of which is itself a NumPy array, so you get

{noformat}
array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])
{noformat}

for the first two elements, and so on. I believe this is working as it is supposed to. In general, {{mapPartitions}} will not give the same result as {{map}}, and it will fail if the function does not return a valid sequence. The documentation could perhaps be a little clearer in that regard.

> pyspark mapPartitions with none generator functions / functors
> --------------------------------------------------------------
>
>                 Key: SPARK-15861
>                 URL: https://issues.apache.org/jira/browse/SPARK-15861
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.1
>            Reporter: Greg Bowyer
>            Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it
> is fed a normal subroutine.
> For instance, let's say we have the following:
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
>
> def to_np(data):
>     return np.array(list(data))
>
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
>
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes a provided function that returns (rather than yields)
> act as if the end user had called {{rdd.map}}.
> I think that maybe a check should be put in to call
> {{inspect.isgeneratorfunction}}?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
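The iteration semantics described in the comment can be sketched in plain Python, with no Spark or NumPy required. The `fake_map_partitions` helper below is a hypothetical stand-in for the real `RDD.mapPartitions` (assumed behavior: call the function on each partition's iterator, then treat each item of whatever it returns as one output record), using nested lists in place of arrays:

```python
def fake_map_partitions(partitions, f):
    """Mimic RDD.mapPartitions + collect: apply f to each partition's
    iterator and iterate over whatever f returns, gathering every
    yielded item as one output record."""
    out = []
    for part in partitions:
        for record in f(iter(part)):  # each item is one record
            out.append(record)
    return out

# Two partitions of two 5-element "rows" each, as in the issue.
partitions = [
    [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]],
    [[10, 11, 12, 13, 14], [15, 16, 17, 18, 19]],
]

def to_rows(it):
    # A plain function returning a list-of-rows: iterating the result
    # yields the rows one by one, so collect() sees one record per row,
    # exactly like the 2-D array case in the issue.
    return list(it)

def whole_partition(it):
    # A generator that yields the whole partition as ONE record keeps
    # partition granularity.
    yield list(it)

print(fake_map_partitions(partitions, to_rows))
# -> four records, one per row (looks like a per-element map)
print(fake_map_partitions(partitions, whole_partition))
# -> two records, one per partition
```

The same distinction applies to `to_np` in the issue: returning `np.array(list(data))` hands back a sequence whose elements are the rows, while `yield np.array(list(data))` would produce one array per partition.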