[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734047#comment-14734047
 ] 

Ryan Schmitt commented on SPARK-3369:
-------------------------------------

You know what type of change is guaranteed not to break existing code? Javadoc 
changes. Why has the {{FlatMapFunction}} interface (and other affected types 
and methods) not been documented as defective? At a minimum, it needs to be 
pointed out that the returned {{Iterable}} will only be traversed once, and 
(more importantly) will not be eagerly computed all at once. As you already 
pointed out, this is more or less the opposite of what {{Iterable}} means. 
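
To make the mismatch concrete, here is a minimal Java sketch of an {{Iterable}} that behaves the way described above: it is lazy and can only be traversed once. The names {{SinglePassDemo}}, {{singlePassSquares}} and {{drain}} are illustrative assumptions and are not part of Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: none of these names come from Spark.
class SinglePassDemo {

    // Returns an Iterable whose iterator() always hands back the same lazy
    // iterator. `computed` counts how many elements have actually been produced.
    static Iterable<Integer> singlePassSquares(List<Integer> source, AtomicInteger computed) {
        Iterator<Integer> in = source.iterator();
        Iterator<Integer> lazy = new Iterator<Integer>() {
            @Override
            public boolean hasNext() {
                return in.hasNext();
            }

            @Override
            public Integer next() {
                int x = in.next();
                computed.incrementAndGet(); // work happens only when an element is pulled
                return x * x;
            }
        };
        // Iterable has a single abstract method, so a lambda is enough here.
        return () -> lazy;
    }

    // Collects everything a for-each loop over `items` would see.
    static <T> List<T> drain(Iterable<T> items) {
        List<T> out = new ArrayList<>();
        for (T item : items) {
            out.add(item);
        }
        return out;
    }

    public static void main(String[] args) {
        AtomicInteger computed = new AtomicInteger();
        Iterable<Integer> squares = singlePassSquares(Arrays.asList(1, 2, 3), computed);
        System.out.println(computed.get());  // 0: nothing is computed up front
        System.out.println(drain(squares));  // [1, 4, 9]
        System.out.println(drain(squares));  // []: the second traversal sees nothing
    }
}
```

A caller that treats this object as an ordinary {{Iterable}} (for example, by looping over it twice) silently gets an empty second pass, which is the kind of behavior a Javadoc note would need to call out.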

Also, I don't see any comment in the thread that purports to explain why this 
interface was originally written in this way. Everyone seems to agree that it's 
an error, but what kind of error is it? Was this some sort of misfeature 
intended to allow collections to be returned directly? Without feedback from 
the original authors, all we can really do is speculate. 

> Java mapPartitions Iterator->Iterable is inconsistent with Scala's 
> Iterator->Iterator
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-3369
>                 URL: https://issues.apache.org/jira/browse/SPARK-3369
>             Project: Spark
>          Issue Type: Improvement
>          Components: Java API
>    Affects Versions: 1.0.2, 1.2.1
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>              Labels: breaking_change
>         Attachments: FlatMapIterator.patch
>
>
> {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
> {{Iterator}} to an {{Iterator}}: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
> In the Java RDD API, the equivalent is a FlatMapFunction, which operates on 
> an {{Iterator}} but is required to return an {{Iterable}}, which is a 
> stronger condition and appears inconsistent. It's a problematic inconsistency 
> though because this seems to require copying all of the input into memory in 
> order to create an object that can be iterated many times, since the input 
> does not afford this itself.
> Similarly for other {{mapPartitions*}} methods and other 
> {{*FlatMapFunction}}s in Java.
> (Is there a reason for this difference that I'm overlooking?)
> If I'm right that this was an inadvertent inconsistency, then the big issue here 
> is that of course this is part of a public API. Workarounds I can think of:
> Promise that Spark will only call {{iterator()}} once, so implementors can 
> use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
> Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
> desired signature, and deprecate existing ones.
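
The first workaround in the description could look roughly like the sketch below. It assumes the caller invokes {{iterator()}} exactly once, as the description proposes. {{PartitionFunction}} is a hypothetical stand-in for the Java API shape ({{Iterator}} in, {{Iterable}} out) and is not Spark's actual interface; {{UPPER_CASE}} and {{runOnce}} are likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

class IteratorIterableSketch {

    // Hypothetical stand-in for the Iterator -> Iterable shape described in the
    // issue. This is NOT Spark's FlatMapFunction.
    interface PartitionFunction<T, R> {
        Iterable<R> call(Iterator<T> input);
    }

    // The "hacky" wrapper: every call to iterator() returns the same Iterator,
    // so it is only safe if the caller traverses it a single time.
    static final class IteratorIterable<T> implements Iterable<T> {
        private final Iterator<T> iterator;

        IteratorIterable(Iterator<T> iterator) {
            this.iterator = iterator;
        }

        @Override
        public Iterator<T> iterator() {
            return iterator;
        }
    }

    // Example function: upper-cases elements one at a time, without copying
    // the whole partition into memory first.
    static final PartitionFunction<String, String> UPPER_CASE = input ->
        new IteratorIterable<>(new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return input.hasNext();
            }

            @Override
            public String next() {
                return input.next().toUpperCase(Locale.ROOT);
            }
        });

    // Mimics a caller that honors the "iterator() is called once" promise.
    static <T, R> List<R> runOnce(PartitionFunction<T, R> f, Iterator<T> partition) {
        List<R> out = new ArrayList<>();
        Iterator<R> results = f.call(partition).iterator();
        while (results.hasNext()) {
            out.add(results.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [A, B, C]
        System.out.println(runOnce(UPPER_CASE, Arrays.asList("a", "b", "c").iterator()));
    }
}
```

The wrapper keeps the existing signature and avoids buffering, but it is only correct for as long as the single-call promise holds, which is why the description calls it hacky.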
