The con is much more than just the extra effort to maintain a parallel API. It
puts the burden on all libraries and library developers to maintain a
parallel API as well. That’s one of the primary reasons we moved away from
this RDD vs JavaRDD approach in the old RDD API.
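
A minimal sketch of that burden, under stated assumptions: `Records`,
`JavaRecords` and `ThirdPartyLib` are hypothetical names invented for
illustration, loosely modelled on the RDD / JavaRDD split, and are not
Spark classes.

```scala
// Hypothetical Scala-first class (stand-in for something like RDD).
final class Records(val values: Seq[Int]) {
  def filter(p: Int => Boolean): Records = new Records(values.filter(p))
}

// Hypothetical Java-facing twin (stand-in for something like JavaRDD).
// Every Scala method needs a hand-written counterpart here.
final class JavaRecords(val underlying: Records) {
  def filter(p: java.util.function.Predicate[Integer]): JavaRecords =
    new JavaRecords(underlying.filter(v => p.test(v)))
}

// A hypothetical downstream library that adds one operation has to
// publish it twice, once per class, and keep both in sync.
object ThirdPartyLib {
  def positives(r: Records): Records = r.filter(_ > 0)
  def positives(r: JavaRecords): JavaRecords =
    new JavaRecords(positives(r.underlying))
}
```

With a single class there would be only one `positives` to write and
maintain.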


On Tue, Apr 28, 2020 at 12:38 AM ZHANG Wei <wezh...@outlook.com> wrote:

> Frankly, I also love having pure Java types in the Java API and Scala types
> in the Scala API. :-)
>
> If we don't treat Java as a "FRIEND" of Scala, but just like Python, maybe
> we can adopt option 1, the Java-specific classes. (But I don't
> like the `Java` prefix, which is redundant when I'm coding a Java app,
> such as JavaRDD; why not distinguish them by package namespace...) The
> Java-specific API can also leverage native Java language features in newer
> versions.
>
> And because of the friendly relationship between Scala and Java, Java
> users can call the Scala API with the help of `.asScala` or `.asJava` if
> the Java API is not ready, then switch to the Java API when it is mature.
>
> The con is more effort to maintain.
>
> My 2 cents.
>
> --
> Cheers,
> -z
>
> On Tue, 28 Apr 2020 12:07:36 +0900
> Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> > The problem is that calling Scala instances from the Java side is
> > discouraged in general, to the best of my knowledge.
> > A Java user likely won't know asJava in Scala, but a Scala user will
> > likely know both asScala and asJava.
> >
> >
> > On Tue, Apr 28, 2020 at 11:35 AM ZHANG Wei <wezh...@outlook.com> wrote:
> >
> > > How about making a small change to option 4:
> > >   Keep the Scala API returning a Scala type instance, and provide an
> > >   `asJava` method that returns a Java type instance.
> > >
> > > Scala 2.13 provides CollectionConverters [1][2][3], which will be
> > > supported naturally once Spark's dependencies are upgraded. For the
> > > current Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4]
> > > as Scala 2.13 does and add implicit conversions.
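
A rough sketch of this variant of option 4, assuming Scala 2.13 or later
(on 2.12 the equivalent import would be `scala.collection.JavaConverters._`).
`ResourceProfileSketch` and its contents are hypothetical and only
illustrate the shape of the idea.

```scala
import scala.jdk.CollectionConverters._ // Scala 2.13+

// Hypothetical API object; the Scala API keeps returning a Scala type.
object ResourceProfileSketch {
  def requests: Map[String, Int] = Map("cores" -> 4, "memoryMb" -> 2048)
}

object AsJavaDemo {
  // A caller that needs a Java type converts at the call site.
  // `asJava` returns a wrapper view over the Scala map, not a copy.
  val javaView: java.util.Map[String, Int] =
    ResourceProfileSketch.requests.asJava
}
```

The implicit `.asJava` syntax is only available in Scala source. From Java
source, the same conversions are exposed as static methods on
`scala.jdk.javaapi.CollectionConverters` (reference [2]).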
> > >
> > > Just my 2 cents.
> > >
> > > --
> > > Cheers,
> > > -z
> > >
> > > [1] https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
> > > [2] https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
> > > [3] https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
> > > [4] https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html
> > >
> > >
> > > On Tue, 28 Apr 2020 08:52:57 +0900
> > > Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > >
> > > > I would like to make clear that I am open to other options that can
> > > > be considered situationally, based on the context.
> > > > That's okay, and I don't aim to restrict it here. For example, I
> > > > understand DSv2 is written in Java because Java interfaces arguably
> > > > bring better performance. That's why the vectorized readers are
> > > > written in Java too.
> > > >
> > > > Maybe the "general" wasn't explicit in my previous email. Adding APIs
> > > > that return a Java instance is still rather rare in general, from my
> > > > few years of monitoring.
> > > > The problem I would most like to deal with is when we need to add one
> > > > or a couple of user-facing Java-specific APIs that return Java
> > > > instances, which is relatively more frequent than when we need a
> > > > bunch of Java-specific APIs.
> > > >
> > > > In this case, I think the guidance should be to use approach 4. There
> > > > are pros and cons between 3 and 4, of course.
> > > > But it looks to me that approach 4 is closer to what Spark has
> > > > targeted so far.
> > > >
> > > >
> > > >
> > > > On Tue, Apr 28, 2020 at 8:34 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > > >
> > > > > > One thing we could do here is use Java collections internally and
> > > > > > make the Scala API a thin wrapper around Java -- like how Python
> > > > > > works. Then adding a method to the Scala API would require adding
> > > > > > it to the Java API and we would keep the two more in sync.
> > > > >
> > > > > I think that can be an appropriate idea when we have to deal with
> > > > > this case a lot, but I don't think there are that many user-facing
> > > > > APIs that return Java collections; it's rather rare. Also, there
> > > > > are relatively fewer Java users than Scala users.
> > > > > This case is slightly different from Python in that there are so
> > > > > many differences to deal with in the PySpark case.
> > > > >
> > > > > Also, in the case of `Seq`, we can actually just use `Array`
> > > > > instead on both the Scala and Java sides. I don't find such cases
> > > > > notably awkward.
> > > > > The problematic cases might be specific to a few Java collections
> > > > > or instances, and I would like to avoid overkill here.
> > > > >
> > > > > Of course, if there is room to consider other options, let's do so.
> > > > > I don't want to say this is the only acceptable option.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Apr 28, 2020 at 1:18 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > > >
> > > > >> I think the right choice here depends on how the object is used.
> > > > >> For developer and internal APIs, I think standardizing on Java
> > > > >> collections makes the most sense.
> > > > >>
> > > > >> For user-facing APIs, it is awkward to return Java collections to
> > > > >> Scala code -- I think that's the motivation for Tom's comment. For
> > > > >> user APIs, I think most methods should return Scala collections,
> > > > >> and I don't have a strong opinion about whether the conversion (or
> > > > >> lack thereof) is done in a separate object (#1) or in parallel
> > > > >> methods (#3).
> > > > >>
> > > > >> Both #1 and #3 seem like about the same amount of work and have the
> > > > >> same likelihood that a developer will leave out a Java method
> > > > >> version. One thing we could do here is use Java collections
> > > > >> internally and make the Scala API a thin wrapper around Java --
> > > > >> like how Python works. Then adding a method to the Scala API would
> > > > >> require adding it to the Java API and we would keep the two more in
> > > > >> sync. It would also help avoid Scala collections leaking into
> > > > >> internals.
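
One way the thin-wrapper idea could look, as a sketch only: `JProfile` and
`Profile` are hypothetical classes made up for this example (Scala 2.13+
converters assumed), not an existing or proposed Spark design.

```scala
import scala.jdk.CollectionConverters._ // Scala 2.13+

// Hypothetical Java-facing class: state lives in Java collections only.
final class JProfile(private val reqs: java.util.Map[String, Integer]) {
  def requests(): java.util.Map[String, Integer] =
    java.util.Collections.unmodifiableMap(reqs)
}

// Hypothetical Scala-facing wrapper: it holds no state of its own, so a
// new Scala method can only be written by delegating to a Java method.
final class Profile(val asJavaProfile: JProfile) {
  def requests: Map[String, Int] =
    asJavaProfile.requests().asScala.map { case (k, v) => k -> v.intValue }.toMap
}
```

The cost in this sketch is a copy on each Scala-side call (`toMap`); whether
that matters depends on how hot the API is.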
> > > > >>
> > > > >> On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > > > >>
> > > > >>> Let's stick with the option that needs less maintenance effort,
> > > > >>> rather than leaving it undecided and prolonging this inconsistency.
> > > > >>>
> > > > >>> I don't think we can get very meaningful data about this soon,
> > > > >>> given that we haven't heard many complaints about this in general
> > > > >>> so far.
> > > > >>>
> > > > >>> The point of this thread is to make a call rather than defer to the
> > > > >>> future.
> > > > >>>
> > > > >>> On Mon, 27 Apr 2020, 23:15 Wenchen Fan, <cloud0...@gmail.com> wrote:
> > > > >>>
> > > > >>>> IIUC We are moving away from having 2 classes for Java and Scala,
> > > > >>>> like JavaRDD and RDD. It's much simpler to maintain and use with a
> > > > >>>> single class.
> > > > >>>>
> > > > >>>> I don't have a strong preference over option 3 or 4. We may need
> > > > >>>> to collect more data points from actual users.
> > > > >>>>
> > > > >>>> On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > > > >>>>
> > > > >>>>> Scala users are arguably more prevalent than Java users, yes.
> > > > >>>>> Using Java instances on the Scala side is legitimate, and they
> > > > >>>>> are already being used in multiple places. I don't believe Scala
> > > > >>>>> users find this Scala-unfriendly, as it's legitimate and already
> > > > >>>>> being used. I personally find it more troublesome to make Java
> > > > >>>>> users search for which APIs to call. Yes, I understand the pros
> > > > >>>>> and cons - we should also find the balance considering the actual
> > > > >>>>> usage.
> > > > >>>>>
> > > > >>>>> One more argument from me, though: I think one of the goals of
> > > > >>>>> the Spark APIs is a unified API set, to my knowledge,
> > > > >>>>>  e.g., JavaRDD <> RDD vs DataFrame.
> > > > >>>>> If neither way is particularly preferred over the other, I would
> > > > >>>>> just choose the one that gives the unified API set.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Mon, Apr 27, 2020 at 10:37 PM Tom Graves <tgraves...@yahoo.com> wrote:
> > > > >>>>>
> > > > >>>>>> I agree general guidance is good so we stay consistent in the
> > > > >>>>>> APIs. I don't necessarily agree that 4 is the best solution
> > > > >>>>>> though. I agree it's nice to have one API, but it is less
> > > > >>>>>> friendly for the Scala side. Searching for the equivalent Java
> > > > >>>>>> API shouldn't be hard, as it should be very close in name, and
> > > > >>>>>> if we make it a general rule users should understand it. I guess
> > > > >>>>>> one good question is which API most of our users use, Java or
> > > > >>>>>> Scala, and what the ratio is. I don't know the answer to that.
> > > > >>>>>> I've seen more using Scala over Java. If the majority use Scala
> > > > >>>>>> then I think the API should be more friendly to that.
> > > > >>>>>>
> > > > >>>>>> Tom
> > > > >>>>>>
> > > > >>>>>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> Hi all,
> > > > >>>>>>
> > > > >>>>>> I would like to discuss Java specific APIs and which design we
> > > > >>>>>> will choose.
> > > > >>>>>> This has been discussed in multiple places so far, for example, at
> > > > >>>>>> https://github.com/apache/spark/pull/28085#discussion_r407334754
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> *The problem:*
> > > > >>>>>>
> > > > >>>>>> In short, I would like us to have clear guidance on how we
> > > > >>>>>> support Java specific APIs when they need to return a Java
> > > > >>>>>> instance. The problem is simple:
> > > > >>>>>>
> > > > >>>>>> def requests: Map[String, ExecutorResourceRequest] = ...
> > > > >>>>>> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
> > > > >>>>>>
> > > > >>>>>> vs
> > > > >>>>>>
> > > > >>>>>> def requests: java.util.Map[String, ExecutorResourceRequest] = ...
> > > > >>>>>>
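
The two shapes above can be sketched side by side as follows. This is an
illustration under assumptions: `Request` stands in for
ExecutorResourceRequest, and `ProfileWithVariant`, `ProfileUnified` and
`CallSites` are hypothetical names (Scala 2.13+ converters assumed).

```scala
import scala.jdk.CollectionConverters._ // Scala 2.13+

// Hypothetical value type standing in for ExecutorResourceRequest.
final case class Request(amount: Long)

// First shape: a Scala-typed method plus a separately named Java variant.
final class ProfileWithVariant(reqs: Map[String, Request]) {
  def requests: Map[String, Request] = reqs
  def requestsJMap: java.util.Map[String, Request] = reqs.asJava
}

// Second shape: one method that returns a Java type to both languages.
final class ProfileUnified(reqs: Map[String, Request]) {
  def requests: java.util.Map[String, Request] = reqs.asJava
}

object CallSites {
  private val data = Map("cores" -> Request(4L))

  // Scala caller, first shape: no conversion needed.
  val scalaFromVariant: Map[String, Request] =
    new ProfileWithVariant(data).requests

  // Scala caller, second shape: converts back explicitly with asScala.
  val scalaFromUnified: Map[String, Request] =
    new ProfileUnified(data).requests.asScala.toMap
}
```

A Java caller would use `requestsJMap()` in the first shape and plain
`requests()` in the second.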
> > > > >>>>>>
> > > > >>>>>> *Current codebase:*
> > > > >>>>>>
> > > > >>>>>> My understanding so far was that the latter is preferred, more
> > > > >>>>>> consistent, and more prevalent in the existing codebase; for
> > > > >>>>>> example, see StateOperatorProgress and StreamingQueryProgress
> > > > >>>>>> in Structured Streaming.
> > > > >>>>>> However, I realised that we also have other approaches in the
> > > > >>>>>> current codebase. There appear to be four approaches to dealing
> > > > >>>>>> with Java specifics in general:
> > > > >>>>>>
> > > > >>>>>>    1. Java specific classes such as JavaRDD and JavaSparkContext.
> > > > >>>>>>    2. Java specific methods with the same name that overload their
> > > > >>>>>>    parameters; see functions.scala.
> > > > >>>>>>    3. Java specific methods with a different name that need to
> > > > >>>>>>    return a different type, such as TaskContext.resourcesJMap vs
> > > > >>>>>>    TaskContext.resources.
> > > > >>>>>>    4. One method that returns a Java instance for both the Scala
> > > > >>>>>>    and Java sides; see StateOperatorProgress and
> > > > >>>>>>    StreamingQueryProgress.
> > > > >>>>>>
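
Approach 2 in the list above (same name, overloaded parameters) could look
roughly like this. `FunctionsSketch` and `concatAll` are hypothetical;
Spark's functions.scala is only the inspiration, and nothing here is copied
from it. Scala 2.13+ converters are assumed.

```scala
import scala.jdk.CollectionConverters._ // Scala 2.13+

// Hypothetical helper object illustrating overloads on the parameter type.
object FunctionsSketch {
  // Scala-friendly overload.
  def concatAll(parts: Seq[String]): String = parts.mkString

  // Java-friendly overload with the same name; it converts its argument
  // and delegates, so there is only one real implementation.
  def concatAll(parts: java.util.List[String]): String =
    concatAll(parts.asScala.toSeq)
}
```

Overloading works here because the difference is in the parameters. It
cannot express approaches 3 and 4, where only the return type differs,
because overloads cannot differ by return type alone.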
> > > > >>>>>>
> > > > >>>>>> *Analysis on the current codebase:*
> > > > >>>>>>
> > > > >>>>>> I agree with approach 2 because the corresponding cases give
> > > > >>>>>> you a consistent API usage across other language APIs in
> > > > >>>>>> general. Approach 1 is from the old world, when we didn't have
> > > > >>>>>> unified APIs. This might be the worst approach.
> > > > >>>>>>
> > > > >>>>>> Approaches 3 and 4 are controversial.
> > > > >>>>>>
> > > > >>>>>> For 3, if you have to use the Java APIs, then every time you
> > > > >>>>>> should search whether there is a variant of that API
> > > > >>>>>> specifically for Java. But yes, it gives you Java/Scala
> > > > >>>>>> friendly instances.
> > > > >>>>>>
> > > > >>>>>> For 4, having one API that returns a Java instance lets you use
> > > > >>>>>> it from both the Scala and Java sides, although it makes you
> > > > >>>>>> call asScala on the Scala side specifically. But you don’t
> > > > >>>>>> have to search for a variant of this API, and it gives you a
> > > > >>>>>> consistent API usage across languages.
> > > > >>>>>>
> > > > >>>>>> Also, note that calling Java from Scala is legitimate but the
> > > > >>>>>> opposite case is not, to the best of my knowledge.
> > > > >>>>>> In addition, you should have a method that returns a Java
> > > > >>>>>> instance for PySpark or SparkR to support.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> *Proposal:*
> > > > >>>>>>
> > > > >>>>>> I would like to have general guidance on this that Spark devs
> > > > >>>>>> agree upon: Use approach 4. If that is not possible, use 3.
> > > > >>>>>> Avoid 1 at almost any cost.
> > > > >>>>>>
> > > > >>>>>> Note that this isn't a hard requirement but *general guidance*;
> > > > >>>>>> therefore, the decision might be up to the specific context.
> > > > >>>>>> For example, when there are strong arguments for a separate
> > > > >>>>>> Java specific API, that’s fine.
> > > > >>>>>> Of course, we won’t change the existing methods given Michael’s
> > > > >>>>>> rubric added before. I am talking about new methods in
> > > > >>>>>> unreleased branches.
> > > > >>>>>>
> > > > >>>>>> Any concern or opinion on this?
> > > > >>>>>>
> > > > >>>>>
> > > > >>
> > > > >> --
> > > > >> Ryan Blue
> > > > >> Software Engineer
> > > > >> Netflix
> > > > >>
> > > > >
> > >
>
