I think the right choice here depends on how the object is used. For
developer and internal APIs, I think standardizing on Java collections
makes the most sense.

For user-facing APIs, it is awkward to return Java collections to Scala
code -- I think that's the motivation for Tom's comment. For user APIs, I
think most methods should return Scala collections, and I don't have a
strong opinion about whether the conversion (or lack thereof) is done in a
separate object (#1) or in parallel methods (#3).

Both #1 and #3 seem like about the same amount of work and have the same
likelihood that a developer will leave out a Java method version. One thing
we could do here is use Java collections internally and make the Scala API
a thin wrapper around Java -- like how Python works. Then adding a method
to the Scala API would require adding it to the Java API and we would keep
the two more in sync. It would also help avoid Scala collections leaking
into internals.
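
A rough sketch of what I mean (the class and method names here are made
up for illustration, not actual Spark APIs):

import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

class ResourceRequests {
  // Internal state uses Java collections only, so Scala collections
  // can't leak into internals.
  private val reqs = new ConcurrentHashMap[String, String]()

  // Java-facing (and internal) accessor returns the Java map directly.
  def requestsJava: java.util.Map[String, String] = reqs

  // The Scala API is a thin wrapper that converts at the boundary.
  def requests: Map[String, String] = reqs.asScala.toMap
}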

On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Let's stick to the option with less maintenance effort, then, rather than
> leaving it undecided and delaying while this inconsistency remains.
>
> I don't think we will have very meaningful data about this soon, given
> that we haven't heard many complaints about this in general so far.
>
> The point of this thread is to make a call rather than defer it to the future.
>
> On Mon, 27 Apr 2020, 23:15 Wenchen Fan, <cloud0...@gmail.com> wrote:
>
>> IIUC we are moving away from having two classes for Java and Scala, like
>> JavaRDD and RDD. A single class is much simpler to maintain and use.
>>
>> I don't have a strong preference between options 3 and 4. We may need to
>> collect more data points from actual users.
>>
>> On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> Scala users are arguably more prevalent than Java users, yes.
>>> Using the Java instances on the Scala side is legitimate, and they are
>>> already being used in multiple places. I don't believe Scala
>>> users find this Scala-unfriendly, as it's legitimate and already in
>>> use. I personally find it more troublesome to make Java
>>> users search for which APIs to call. Yes, I understand the pros and cons
>>> - we should also find the balance considering the actual usage.
>>>
>>> One more argument from me, though, is that one of the goals of the Spark
>>> APIs is a unified API set, to my knowledge,
>>> e.g., JavaRDD <> RDD vs DataFrame.
>>> If neither way is particularly preferred over the other, I would just
>>> choose the one that keeps the API set unified.
>>>
>>>
>>>
>>> On Mon, Apr 27, 2020 at 10:37 PM, Tom Graves <tgraves...@yahoo.com> wrote:
>>>
>>>> I agree that general guidance is good so we keep the APIs consistent. I
>>>> don't necessarily agree that 4 is the best solution though. I agree it's
>>>> nice to have one API, but it is less friendly for the Scala side.
>>>> Searching for the equivalent Java API shouldn't be hard, as it should be
>>>> very close in name, and if we make it a general rule, users should
>>>> understand it. I guess one good question is which API most of our users
>>>> use, Java or Scala, and what the ratio is. I don't know the answer
>>>> to that. I've seen more using Scala than Java. If the majority use Scala,
>>>> then I think the API should be more friendly to that.
>>>>
>>>> Tom
>>>>
>>>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <
>>>> gurwls...@gmail.com> wrote:
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> I would like to discuss Java specific APIs and which design we will
>>>> choose.
>>>> This has been discussed in multiple places so far, for example, at
>>>> https://github.com/apache/spark/pull/28085#discussion_r407334754
>>>>
>>>>
>>>> *The problem:*
>>>>
>>>> In short, I would like us to have clear guidance on how we support Java
>>>> specific APIs when
>>>> they require returning a Java instance. The problem is simple:
>>>>
>>>> def requests: Map[String, ExecutorResourceRequest] = ...
>>>> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>>>>
>>>> vs
>>>>
>>>> def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>>>>
>>>>
>>>> *Current codebase:*
>>>>
>>>> My understanding so far was that the latter is preferred, being more
>>>> consistent and more prevalent in the
>>>> existing codebase; for example, see StateOperatorProgress and
>>>> StreamingQueryProgress in Structured Streaming.
>>>> However, I realised that we also have other approaches in the current
>>>> codebase. There appear to be
>>>> four approaches to dealing with Java specifics in general:
>>>>
>>>>    1. Java specific classes such as JavaRDD and JavaSparkContext.
>>>>    2. Java specific methods with the same name that overload their
>>>>    parameters, see functions.scala (a sketch follows this list).
>>>>    3. Java specific methods with a different name that need to return
>>>>    a different type, such as TaskContext.resourcesJMap vs
>>>>    TaskContext.resources.
>>>>    4. One method that returns a Java instance for both the Scala and Java
>>>>    sides, see StateOperatorProgress and StreamingQueryProgress.
>>>>
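>>>> For instance, approach 2 looks like same-name overloads, as in
>>>> Dataset.map (a rough sketch; bodies omitted):
>>>>
>>>> // Scala-specific variant takes a Scala function:
>>>> def map[U: Encoder](func: T => U): Dataset[U] = ...
>>>> // Java-specific variant overloads the same name, taking a Java
>>>> // MapFunction and an explicit Encoder:
>>>> def map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U] = ...
>>>>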
>>>>
>>>> *Analysis on the current codebase:*
>>>>
>>>> I agree with approach 2 because the corresponding cases give you
>>>> consistent API usage across
>>>> the other language APIs in general. Approach 1 is from the old world,
>>>> when we didn't have unified APIs.
>>>> This might be the worst approach.
>>>>
>>>> Approaches 3 and 4 are controversial.
>>>>
>>>> For 3, if you have to use the Java APIs, then you have to search every
>>>> time for whether there is a Java-specific variant of that API. But yes,
>>>> it gives you Java/Scala friendly instances.
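>>>>
>>>> For instance, on the caller side of 3 (a rough sketch):
>>>>
>>>> // The Scala side calls:
>>>> val res: Map[String, ResourceInformation] = TaskContext.get().resources()
>>>> // while the Java side has to find and call the separate variant,
>>>> // TaskContext.get().resourcesJMap(), which returns java.util.Map.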
>>>>
>>>> For 4, having one API that returns a Java instance lets you use it on
>>>> both the Scala and Java
>>>> sides, although it makes you call asScala on the Scala side specifically.
>>>> But you don't
>>>> have to search for a variant of the API, and it gives you
>>>> consistent API usage across languages.
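>>>>
>>>> For example, with approach 4, the Scala side would look like this (a
>>>> rough sketch; rp stands for any object exposing such an API):
>>>>
>>>> import scala.collection.JavaConverters._
>>>>
>>>> // requests returns java.util.Map for both languages; Scala callers
>>>> // convert once at the call site.
>>>> val requests: Map[String, ExecutorResourceRequest] =
>>>>   rp.requests.asScala.toMap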
>>>>
>>>> Also, note that calling Java from Scala is legitimate, but the opposite
>>>> is not, to the best of my knowledge.
>>>> In addition, you need a method that returns a Java instance anyway for
>>>> PySpark and SparkR support.
>>>>
>>>>
>>>> *Proposal:*
>>>>
>>>> I would like to have general guidance on this that the Spark dev
>>>> community agrees upon: take approach 4. If that's not possible, take
>>>> approach 3. Avoid approach 1 at almost all costs.
>>>>
>>>> Note that this isn't a hard requirement but *general guidance*;
>>>> therefore, the decision might be up to
>>>> the specific context. For example, when there are strong arguments
>>>> for having a separate Java specific API, that's fine.
>>>> Of course, we won't change the existing methods, given Michael's rubric
>>>> added before. I am talking about new
>>>> methods in unreleased branches.
>>>>
>>>> Any concern or opinion on this?
>>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix
