[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735323#comment-14735323
 ] 

Nicholas Chammas commented on SPARK-3369:
-----------------------------------------

Sean said:

{quote}
I don't think there's a "why" – just hasn't been done by someone who wants to 
do it. I think it's fine to document this. It would be more constructive if you 
opened a PR to this effect.
{quote}

I was about to comment to this effect.

There is a known problem here that we cannot fix until Spark 2.0 due to API 
compatibility guarantees. The only thing that can be done now is to perhaps add 
some documentation explaining this issue.

Ryan said:

{quote}
You know what type of change is guaranteed not to break existing code? Javadoc 
changes. Why has the FlatMapFunction interface (and other affected types and 
methods) not been documented as defective?
{quote}

The answer is simply that no one has stepped up to do it yet. In open source 
projects, people generally work on what interests them. The person in the best 
position to fix an issue like this is one to whom the issue matters and who is 
willing to take the initiative.

> Java mapPartitions Iterator->Iterable is inconsistent with Scala's 
> Iterator->Iterator
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-3369
>                 URL: https://issues.apache.org/jira/browse/SPARK-3369
>             Project: Spark
>          Issue Type: Improvement
>          Components: Java API
>    Affects Versions: 1.0.2, 1.2.1
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>              Labels: breaking_change
>         Attachments: FlatMapIterator.patch
>
>
> {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
> {{Iterator}} to an {{Iterator}}: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
> In the Java RDD API, the equivalent is a {{FlatMapFunction}}, which operates on 
> an {{Iterator}} but is required to return an {{Iterable}}, which is a 
> stronger condition and appears inconsistent. It's a problematic inconsistency, 
> though, because this seems to require copying all of the input into memory in 
> order to create an object that can be iterated many times, since the input 
> does not afford this itself.
> Similarly for other {{mapPartitions*}} methods and other 
> {{*FlatMapFunction}}s in Java.
> (Is there a reason for this difference that I'm overlooking?)
> If I'm right that this was an inadvertent inconsistency, then the big issue here 
> is that of course this is part of a public API. Workarounds I can think of:
> Promise that Spark will only call {{iterator()}} once, so implementors can 
> use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
> Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
> desired signature, and deprecate existing ones.
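
For illustration, the first workaround in the description above might look roughly like the following. This is only a sketch under stated assumptions: {{IteratorIterable}} and {{UpperCasePartition}} are hypothetical names rather than Spark classes, the code assumes Java 8 or later, and the lazy variant is only safe if the caller really does call {{iterator()}} once, which Spark does not currently promise. The {{eager}} method shows the copy-into-memory approach that the {{Iterable}} return type otherwise pushes implementors toward.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper (not part of Spark): exposes an Iterator as a
// single-use Iterable by handing back the same Iterator.
final class IteratorIterable<T> implements Iterable<T> {
    private final Iterator<T> iterator;
    private boolean consumed = false;

    IteratorIterable(Iterator<T> iterator) {
        this.iterator = iterator;
    }

    @Override
    public Iterator<T> iterator() {
        // Fail loudly on a second call instead of silently returning
        // an already exhausted iterator.
        if (consumed) {
            throw new IllegalStateException("iterator() may only be called once");
        }
        consumed = true;
        return iterator;
    }
}

// Hypothetical per-partition logic with the Iterator -> Iterable shape
// that the issue describes.
final class UpperCasePartition {
    // Eager variant: copies the whole partition into a List so that the
    // result can be iterated any number of times.
    static Iterable<String> eager(Iterator<String> input) {
        List<String> out = new ArrayList<>();
        while (input.hasNext()) {
            out.add(input.next().toUpperCase());
        }
        return out;
    }

    // Lazy variant: transforms elements on demand and never holds the
    // whole partition in memory, at the cost of being single-use.
    static Iterable<String> lazy(final Iterator<String> input) {
        return new IteratorIterable<>(new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return input.hasNext();
            }

            @Override
            public String next() {
                return input.next().toUpperCase();
            }
        });
    }
}
```

The single-use check is a design choice for the sketch: it turns a violated assumption into an immediate error rather than an empty second pass.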


