Spark has aimed for a unified API set rather than separate Java classes, to
reduce the maintenance cost, e.g., JavaRDD <> RDD vs DataFrame. These JavaXXX
classes are largely legacy.

I think it's best to stick to approach 4 in general cases.
Other options might have to be considered based on a specific context.
For example, if we *must* add a bunch of Java-specific methods
to a particular class for an unavoidable reason somewhere, I would consider
having a Java-specific class.
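
To make the trade-off concrete, here is a minimal sketch of what approach 4
looks like from the caller's side. It is illustrative only: ResourceRequest
and ResourceRequests below are made-up stand-ins, not the real Spark classes
mentioned in this thread.

  import scala.collection.JavaConverters._  // scala.jdk.CollectionConverters._ on Scala 2.13

  // Hypothetical request type, standing in for a real resource-request class.
  final case class ResourceRequest(amount: Long)

  class ResourceRequests {
    private val reqs = new java.util.LinkedHashMap[String, ResourceRequest]()

    def add(name: String, req: ResourceRequest): this.type = { reqs.put(name, req); this }

    // Approach 4: one method returning a Java type for both Scala and Java callers.
    def requests: java.util.Map[String, ResourceRequest] =
      java.util.Collections.unmodifiableMap(reqs)
  }

  object Approach4Demo extends App {
    val rr = new ResourceRequests().add("gpu", ResourceRequest(2))
    // Java callers use the returned java.util.Map directly;
    // Scala callers convert once at the call site.
    val scalaView: Map[String, ResourceRequest] = rr.requests.asScala.toMap
    println(scalaView("gpu").amount)  // 2
  }

The cost is the single .asScala call on the Scala side; the benefit is one API
surface shared with Java, and one method for PySpark/SparkR to call into.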



On Tue, Apr 28, 2020 at 4:38 PM, ZHANG Wei <wezh...@outlook.com> wrote:

> Frankly, I also love having the pure Java type in the Java API and the Scala
> type in the Scala API. :-)
>
> If we don't treat Java as a "FRIEND" of Scala, but just like Python, maybe we
> can adopt the approach of option 1, the specific Java classes. (But I don't
> like the `Java` prefix, which is redundant when I'm writing a Java app, such
> as JavaRDD; why not distinguish it by package namespace...) The specific
> Java API can also leverage some native Java language features in newer
> versions.
>
> And precisely because of the friendly relationship between Scala and Java, a
> Java user can call the Scala API with the help of `.asScala` or `.asJava` if
> the Java API is not ready, then switch to the Java API once it's well cooked.
>
> The con is more effort to maintain.
>
> My 2 cents.
>
> --
> Cheers,
> -z
>
> On Tue, 28 Apr 2020 12:07:36 +0900
> Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> > The problem is that calling Scala instances on the Java side is discouraged
> > in general, to the best of my knowledge.
> > A Java user likely won't know about asJava in Scala, but a Scala user will
> > likely know both asScala and asJava.
> >
> >
> > On Tue, Apr 28, 2020 at 11:35 AM, ZHANG Wei <wezh...@outlook.com> wrote:
> >
> > > How about making a small change to option 4:
> > >   Keep the Scala API returning a Scala type instance, while providing an
> > >   `asJava` method that returns a Java type instance.
> > >
> > > Scala 2.13 provides CollectionConverters [1][2][3], so after the upcoming
> > > Spark dependency upgrade this can be supported natively. For the current
> > > Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4] the way
> > > Scala 2.13 does and add the implicit conversions.
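> > >
> > > (A rough sketch of that variant, illustrative only -- the `requests`
> > > method and the values below are made up, not a real Spark API:)
> > >
> > >   import scala.collection.JavaConverters._  // scala.jdk.CollectionConverters._ once on 2.13
> > >
> > >   object AsJavaVariantSketch extends App {
> > >     // The Scala API keeps the Scala type...
> > >     def requests: Map[String, Long] = Map("gpu" -> 2L)
> > >
> > >     // ...and a Java instance is derived with `asJava` where the Java API needs it.
> > >     val requestsForJava: java.util.Map[String, Long] = requests.asJava
> > >     println(requestsForJava.get("gpu"))  // 2
> > >   }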
> > >
> > > Just my 2 cents.
> > >
> > > --
> > > Cheers,
> > > -z
> > >
> > > [1] https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
> > > [2] https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
> > > [3] https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
> > > [4] https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html
> > >
> > >
> > > On Tue, 28 Apr 2020 08:52:57 +0900
> > > Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > >
> > > > I would like to make clear that I am open to other options that can be
> > > > considered situationally and based on the context.
> > > > That's okay, and I don't intend to restrict that here. For example, DSv2:
> > > > I understand it's written in Java because Java interfaces arguably bring
> > > > better performance. That's why the vectorized readers are written in Java
> > > > too.
> > > >
> > > > Maybe the "general" wasn't explicit in my previous email. Adding APIs
> > > > that return a Java instance is still rather rare in general, given my few
> > > > years of monitoring.
> > > > The problem I would rather deal with is when we need to add one or a
> > > > couple of user-facing Java-specific APIs that return Java instances, which
> > > > is relatively more frequent than needing a whole bunch of Java-specific
> > > > APIs.
> > > >
> > > > In this case, I think the guidance should be to use approach 4. There are
> > > > pros and cons between 3 and 4, of course, but it looks to me that approach
> > > > 4 is closer to what Spark has targeted so far.
> > > >
> > > >
> > > >
> > > > On Tue, Apr 28, 2020 at 8:34 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > > >
> > > > > > One thing we could do here is use Java collections internally and make
> > > > > > the Scala API a thin wrapper around Java -- like how Python works.
> > > > > > Then adding a method to the Scala API would require adding it to the
> > > > > > Java API and we would keep the two more in sync.
> > > > >
> > > > > I think it could be an appropriate idea if we had to deal with this case
> > > > > a lot, but I don't think there are many user-facing APIs that return Java
> > > > > collections; it's rather rare. Also, there are relatively fewer Java
> > > > > users than Scala users.
> > > > > This case is slightly different from Python in that there are so many
> > > > > differences to deal with in the PySpark case.
> > > > >
> > > > > Also, in the case of `Seq`, we can simply use `Array` instead on both
> > > > > the Scala and Java sides. I don't find such cases notably awkward.
> > > > > The problematic cases might be specific to a few Java collections or
> > > > > instances, and I would like to avoid overkill here.
> > > > >
> > > > > Of course, if there is a place where other options should be considered,
> > > > > let's do so. I don't mean to say this is the only acceptable option.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Apr 28, 2020 at 1:18 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > > >
> > > > >> I think the right choice here depends on how the object is used. For
> > > > >> developer and internal APIs, I think standardizing on Java collections
> > > > >> makes the most sense.
> > > > >>
> > > > >> For user-facing APIs, it is awkward to return Java collections to Scala
> > > > >> code -- I think that's the motivation for Tom's comment. For user APIs,
> > > > >> I think most methods should return Scala collections, and I don't have a
> > > > >> strong opinion about whether the conversion (or lack thereof) is done in
> > > > >> a separate object (#1) or in parallel methods (#3).
> > > > >>
> > > > >> Both #1 and #3 seem like about the same amount of work and have the
> > > > >> same likelihood that a developer will leave out a Java method version.
> > > > >> One thing we could do here is use Java collections internally and make
> > > > >> the Scala API a thin wrapper around Java -- like how Python works. Then
> > > > >> adding a method to the Scala API would require adding it to the Java API
> > > > >> and we would keep the two more in sync. It would also help avoid Scala
> > > > >> collections leaking into internals.
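> > > > >>
> > > > >> (Sketch only, with invented names, of what that could look like: the
> > > > >> internal state and the Java accessor share the java.util collection, and
> > > > >> the Scala accessor is a thin converting wrapper.)
> > > > >>
> > > > >>   import scala.collection.JavaConverters._
> > > > >>
> > > > >>   class ProgressSketch {
> > > > >>     // Internal state is a Java collection, so Scala collections never leak inward.
> > > > >>     private val metrics = new java.util.HashMap[String, java.lang.Long]()
> > > > >>
> > > > >>     def addMetric(name: String, value: Long): Unit =
> > > > >>       metrics.put(name, java.lang.Long.valueOf(value))
> > > > >>
> > > > >>     // The Java API returns the Java collection directly...
> > > > >>     def metricsJava: java.util.Map[String, java.lang.Long] = metrics
> > > > >>
> > > > >>     // ...and the Scala API is just a thin wrapper that converts on the way out.
> > > > >>     def metricsScala: Map[String, Long] =
> > > > >>       metrics.asScala.map { case (k, v) => k -> v.longValue }.toMap
> > > > >>   }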
> > > > >>
> > > > >> On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > > > >>
> > > > >>> Let's stick to the option with less maintenance effort then, rather
> > > > >>> than leaving it undecided and delaying while this inconsistency remains.
> > > > >>>
> > > > >>> I don't think we will have very meaningful data about this soon, given
> > > > >>> that we haven't heard many complaints about this in general so far.
> > > > >>>
> > > > >>> The point of this thread is to make a call rather than defer it to the
> > > > >>> future.
> > > > >>>
> > > > >>> On Mon, 27 Apr 2020, 23:15 Wenchen Fan <cloud0...@gmail.com> wrote:
> > > > >>>
> > > > >>>> IIUC we are moving away from having two classes for Java and Scala,
> > > > >>>> like JavaRDD and RDD. It's much simpler to maintain and use a single
> > > > >>>> class.
> > > > >>>>
> > > > >>>> I don't have a strong preference between options 3 and 4. We may need
> > > > >>>> to collect more data points from actual users.
> > > > >>>>
> > > > >>>> On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > > > >>>>
> > > > >>>>> Scala users are arguably more prevalent than Java users, yes.
> > > > >>>>> Using Java instances on the Scala side is legitimate, and they are
> > > > >>>>> already being used in multiple places. I don't believe Scala
> > > > >>>>> users find this un-Scala-friendly, as it's legitimate and already
> > > > >>>>> in use. I personally find it more troublesome to make Java
> > > > >>>>> users search for which APIs to call. Yes, I understand the pros and
> > > > >>>>> cons - we should also find the balance considering the actual usage.
> > > > >>>>>
> > > > >>>>> One more argument from me, though: I think one of the goals of the
> > > > >>>>> Spark APIs is a unified API set, to my knowledge,
> > > > >>>>> e.g., JavaRDD <> RDD vs DataFrame.
> > > > >>>>> If neither way is particularly preferred over the other, I would just
> > > > >>>>> choose the one that gives a unified API set.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Mon, Apr 27, 2020 at 10:37 PM, Tom Graves <tgraves...@yahoo.com> wrote:
> > > > >>>>>
> > > > >>>>>> I agree that general guidance is good so we keep the APIs
> > > > >>>>>> consistent. I don't necessarily agree that 4 is the best solution
> > > > >>>>>> though. I agree it's nice to have one API, but it is less friendly
> > > > >>>>>> for the Scala side. Searching for the equivalent Java API shouldn't
> > > > >>>>>> be hard, as it should be very close in name, and if we make it a
> > > > >>>>>> general rule users should understand it. I guess one good question
> > > > >>>>>> is which API most of our users use between Java and Scala, and what
> > > > >>>>>> is the ratio? I don't know the answer to that. I've seen more people
> > > > >>>>>> using Scala than Java. If the majority use Scala, then I think the
> > > > >>>>>> API should be more friendly to that.
> > > > >>>>>>
> > > > >>>>>> Tom
> > > > >>>>>>
> > > > >>>>>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> Hi all,
> > > > >>>>>>
> > > > >>>>>> I would like to discuss Java-specific APIs and which design we will
> > > > >>>>>> choose.
> > > > >>>>>> This has been discussed in multiple places so far, for example, at
> > > > >>>>>> https://github.com/apache/spark/pull/28085#discussion_r407334754
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> *The problem:*
> > > > >>>>>>
> > > > >>>>>> In short, I would like us to have clear guidance on how we support
> > > > >>>>>> Java-specific APIs when they need to return a Java instance. The
> > > > >>>>>> problem is simple:
> > > > >>>>>>
> > > > >>>>>> def requests: Map[String, ExecutorResourceRequest] = ...
> > > > >>>>>> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
> > > > >>>>>>
> > > > >>>>>> vs
> > > > >>>>>>
> > > > >>>>>> def requests: java.util.Map[String, ExecutorResourceRequest] = ...
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> *Current codebase:*
> > > > >>>>>>
> > > > >>>>>> My understanding so far was that the latter is preferred, more
> > > > >>>>>> consistent, and prevailing in the existing codebase; for example, see
> > > > >>>>>> StateOperatorProgress and StreamingQueryProgress in Structured
> > > > >>>>>> Streaming.
> > > > >>>>>> However, I realised that we also have other approaches in the current
> > > > >>>>>> codebase. There appear to be four approaches to dealing with Java
> > > > >>>>>> specifics in general:
> > > > >>>>>>
> > > > >>>>>>    1. Java-specific classes such as JavaRDD and JavaSparkContext.
> > > > >>>>>>    2. Java-specific methods with the same name that overload their
> > > > >>>>>>    parameters; see functions.scala.
> > > > >>>>>>    3. Java-specific methods with a different name that return a
> > > > >>>>>>    different type, such as TaskContext.resourcesJMap vs
> > > > >>>>>>    TaskContext.resources.
> > > > >>>>>>    4. One method that returns a Java instance for both the Scala and
> > > > >>>>>>    Java sides; see StateOperatorProgress and StreamingQueryProgress.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> *Analysis on the current codebase:*
> > > > >>>>>>
> > > > >>>>>> I agree with approach 2 because the corresponding cases give you
> > > > >>>>>> consistent API usage across the other language APIs in general.
> > > > >>>>>> Approach 1 is from the old world, when we didn't have unified APIs.
> > > > >>>>>> It is arguably the worst approach.
> > > > >>>>>>
> > > > >>>>>> Approaches 3 and 4 are controversial.
> > > > >>>>>>
> > > > >>>>>> For 3, if you have to use Java APIs, you have to search every time
> > > > >>>>>> for the Java-specific variant of an API. But yes, it gives you
> > > > >>>>>> Java/Scala-friendly instances.
> > > > >>>>>>
> > > > >>>>>> For 4, having one API that returns a Java instance lets you use it on
> > > > >>>>>> both the Scala and Java sides, although it makes you call asScala
> > > > >>>>>> specifically on the Scala side. But you don't have to search for a
> > > > >>>>>> variant of the API, and it gives you consistent API usage across
> > > > >>>>>> languages.
> > > > >>>>>>
> > > > >>>>>> Also, note that calling Java from Scala is legitimate, but the
> > > > >>>>>> opposite is not, to the best of my knowledge.
> > > > >>>>>> In addition, you need a method that returns a Java instance in order
> > > > >>>>>> to support PySpark or SparkR.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> *Proposal:*
> > > > >>>>>>
> > > > >>>>>> I would like to have general guidance on this that the Spark devs
> > > > >>>>>> agree upon: use approach 4. If that is not possible, use approach 3.
> > > > >>>>>> Avoid approach 1 at almost all costs.
> > > > >>>>>>
> > > > >>>>>> Note that this isn't a hard requirement but *general guidance*;
> > > > >>>>>> therefore, the decision might be up to the specific context. For
> > > > >>>>>> example, when there are strong arguments for a separate Java-specific
> > > > >>>>>> API, that's fine.
> > > > >>>>>> Of course, we won't change the existing methods, given Michael's
> > > > >>>>>> rubric added before. I am talking about new methods in unreleased
> > > > >>>>>> branches.
> > > > >>>>>>
> > > > >>>>>> Any concern or opinion on this?
> > > > >>>>>>
> > > > >>>>>
> > > > >>
> > > > >> --
> > > > >> Ryan Blue
> > > > >> Software Engineer
> > > > >> Netflix
> > > > >>
> > > > >
> > >
>
