I am not seeing any explicit objection here; rather, people seem to agree with the proposal in general. I would like to move forward rather than leave this in a deadlock: the worst choice here is to postpone or abandon this discussion and keep the inconsistency.
I don't currently plan to document this, since such cases are rather rare and we haven't really documented the JavaRDD <> RDD vs DataFrame case either. Let's keep monitoring and see whether this discussion thread clarifies things enough for the cases I mentioned. Let me know if you think differently.

On Tue, Apr 28, 2020 at 5:03 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Spark has aimed to have a unified API set rather than separate
> Java classes, to reduce the maintenance cost,
> e.g. JavaRDD <> RDD vs DataFrame. These JavaXXX classes are mostly legacy.
>
> I think it's best to stick to approach 4 in general cases.
> Other options might have to be considered based on the specific context.
> For example, if we *must* add a bunch of Java-specifics
> to a specific class for an unavoidable reason somewhere, I would consider
> having a Java-specific class.
>
> On Tue, Apr 28, 2020 at 4:38 PM, ZHANG Wei <wezh...@outlook.com> wrote:
>
>> Frankly, I also love the pure Java type in the Java API and the Scala type
>> in the Scala API. :-)
>>
>> If we don't treat Java as a "FRIEND" of Scala, just as Python, maybe we
>> can adopt the status of option 1, the specific Java classes. (But I don't
>> like the `Java` prefix, which is redundant when I'm coding a Java app,
>> such as JavaRDD; why not distinguish it by package namespace...) The
>> specific Java API can also leverage some native Java language features
>> with new versions.
>>
>> And because of the friendly relationship between Scala and Java, the Java
>> user can call the Scala API with the help of `.asScala` or `.asJava` if the
>> Java API is not ready, then switch to the Java API when it's well cooked.
>>
>> The con is more maintenance effort.
>>
>> My 2 cents.
>>
>> --
>> Cheers,
>> -z
>>
>> On Tue, 28 Apr 2020 12:07:36 +0900
>> Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>> > The problem is that calling Scala instances on the Java side is
>> > discouraged in general, to the best of my knowledge.
>> > A Java user won't likely know asJava in Scala, but a Scala user will
>> > likely know both asScala and asJava.
>> >
>> > On Tue, Apr 28, 2020 at 11:35 AM, ZHANG Wei <wezh...@outlook.com> wrote:
>> >
>> > > How about making a small change to option 4:
>> > > Keep the Scala API returning a Scala type instance, and provide an
>> > > `asJava` method to return a Java type instance.
>> > >
>> > > Scala 2.13 provides CollectionConverters [1][2][3], which can be
>> > > supported naturally once Spark upgrades its dependencies. For the
>> > > current Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4]
>> > > as Scala 2.13 does and add implicit conversions.
>> > >
>> > > Just my 2 cents.
>> > >
>> > > --
>> > > Cheers,
>> > > -z
>> > >
>> > > [1] https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
>> > > [2] https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
>> > > [3] https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
>> > > [4] https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html
>> > >
>> > >
>> > > On Tue, 28 Apr 2020 08:52:57 +0900
>> > > Hyukjin Kwon <gurwls...@gmail.com> wrote:
>> > >
>> > > > I would like to make sure I am open to other options that can be
>> > > > considered situationally and based on the context.
>> > > > That's okay, and I don't aim to restrict this here. For example, DSv2:
>> > > > I understand it's written in Java because Java
>> > > > interfaces arguably bring better performance. That's why vectorized
>> > > > readers are written in Java too.
>> > > >
>> > > > Maybe the "general" wasn't explicit in my previous email. Adding APIs
>> > > > that return a Java instance is still
>> > > > rather rare in general, from my few years of monitoring.
>> > > > The problem I would like to deal with is when we need to
>> > > > add one or a couple of user-facing
>> > > > Java-specific APIs that return Java instances, which is relatively more
>> > > > frequent than when we need a bunch
>> > > > of Java-specific APIs.
>> > > >
>> > > > In this case, I think the guidance should be to use approach 4. There
>> > > > are pros and cons between 3 and 4, of course.
>> > > > But it looks to me that approach 4 is closer to what Spark has aimed
>> > > > for so far.
>> > > >
>> > > >
>> > > >
>> > > > On Tue, Apr 28, 2020 at 8:34 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>> > > >
>> > > > > > One thing we could do here is use Java collections internally and
>> > > > > > make the Scala API a thin wrapper around Java -- like how Python works.
>> > > > > > Then adding a method to the Scala API would require adding it to the
>> > > > > > Java API and we would keep the two more in sync.
>> > > > >
>> > > > > I think it can be an appropriate idea when we have to deal with this
>> > > > > case a lot, but I don't think there are many
>> > > > > user-facing APIs that return Java collections; it's rather rare. Also,
>> > > > > there are relatively fewer Java users than Scala users.
>> > > > > This case is slightly different from Python in that there are so
>> > > > > many differences to deal with in the PySpark case.
>> > > > >
>> > > > > Also, in the case of `Seq`, we can actually just use `Array` instead on
>> > > > > both the Scala and Java sides. I don't find such cases notably awkward.
>> > > > > The problematic cases might be specific to a few Java collections or
>> > > > > instances, and I would like to avoid overkill here.
>> > > > >
>> > > > > Of course, if there is a place to consider other options, let's do so.
>> > > > > I don't want to say this is the only required option.
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Tue, Apr 28, 2020 at 1:18 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>> > > > >
>> > > > >> I think the right choice here depends on how the object is used. For
>> > > > >> developer and internal APIs, I think standardizing on Java collections
>> > > > >> makes the most sense.
>> > > > >>
>> > > > >> For user-facing APIs, it is awkward to return Java collections to
>> > > > >> Scala code -- I think that's the motivation for Tom's comment. For user
>> > > > >> APIs, I think most methods should return Scala collections, and I don't
>> > > > >> have a strong opinion about whether the conversion (or lack thereof) is
>> > > > >> done in a separate object (#1) or in parallel methods (#3).
>> > > > >>
>> > > > >> Both #1 and #3 seem like about the same amount of work and have the
>> > > > >> same likelihood that a developer will leave out a Java method
>> > > > >> version. One thing we could do here is use Java collections internally
>> > > > >> and make the Scala API a thin wrapper around Java -- like how Python
>> > > > >> works. Then adding a method to the Scala API would require adding it to
>> > > > >> the Java API and we would keep the two more in sync. It would also help
>> > > > >> avoid Scala collections leaking into internals.
>> > > > >>
>> > > > >> On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>> > > > >>
>> > > > >>> Let's stick to the option with less maintenance effort then, rather
>> > > > >>> than leaving it undecided and delaying while this inconsistency remains.
>> > > > >>>
>> > > > >>> I don't think we can get very meaningful data about this soon, given
>> > > > >>> that we haven't heard many complaints about this in general so far.
>> > > > >>>
>> > > > >>> The point of this thread is to make a call rather than defer it to the
>> > > > >>> future.
>> > > > >>>
>> > > > >>> On Mon, 27 Apr 2020, 23:15 Wenchen Fan, <cloud0...@gmail.com> wrote:
>> > > > >>>
>> > > > >>>> IIUC we are moving away from having 2 classes for Java and Scala,
>> > > > >>>> like JavaRDD and RDD. It's much simpler to maintain and use with a
>> > > > >>>> single class.
>> > > > >>>>
>> > > > >>>> I don't have a strong preference between options 3 and 4. We may need
>> > > > >>>> to collect more data points from actual users.
>> > > > >>>>
>> > > > >>>> On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon <gurwls...@gmail.com>
>> > > > >>>> wrote:
>> > > > >>>>
>> > > > >>>>> Scala users are arguably more prevalent than Java users, yes.
>> > > > >>>>> Using Java instances on the Scala side is legitimate, and they are
>> > > > >>>>> already being used in multiple places. I don't believe Scala
>> > > > >>>>> users find this Scala-unfriendly, as it's legitimate and already
>> > > > >>>>> being used. I personally find it more troublesome to make Java
>> > > > >>>>> users search for which APIs to call. Yes, I understand the pros and
>> > > > >>>>> cons - we should also find the balance considering the actual usage.
>> > > > >>>>>
>> > > > >>>>> One more argument from me, though: I think one of the goals of
>> > > > >>>>> the Spark APIs is a unified API set, to my knowledge,
>> > > > >>>>> e.g., JavaRDD <> RDD vs DataFrame.
>> > > > >>>>> If neither way is particularly preferred over the other, I would
>> > > > >>>>> just choose the one that gives the unified API set.
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>> On Mon, Apr 27, 2020 at 10:37 PM, Tom Graves <tgraves...@yahoo.com> wrote:
>> > > > >>>>>
>> > > > >>>>>> I agree general guidance is good so we stay consistent in the APIs.
>> > > > >>>>>> I don't necessarily agree that 4 is the best solution though. I
>> > > > >>>>>> agree it's nice to have one API, but it is less friendly for the Scala
>> > > > >>>>>> side. Searching for the equivalent Java API shouldn't be hard, as it
>> > > > >>>>>> should be very close in name, and if we make it a general rule users
>> > > > >>>>>> should understand it. I guess one good question is which API most of
>> > > > >>>>>> our users use between Java and Scala, and what the ratio is. I don't
>> > > > >>>>>> know the answer to that. I've seen more using Scala than Java. If the
>> > > > >>>>>> majority use Scala, then I think the API should be more friendly to that.
>> > > > >>>>>>
>> > > > >>>>>> Tom
>> > > > >>>>>>
>> > > > >>>>>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <
>> > > > >>>>>> gurwls...@gmail.com> wrote:
>> > > > >>>>>>
>> > > > >>>>>>
>> > > > >>>>>> Hi all,
>> > > > >>>>>>
>> > > > >>>>>> I would like to discuss Java specific APIs and which design we will
>> > > > >>>>>> choose.
>> > > > >>>>>> This has been discussed in multiple places so far, for example, at
>> > > > >>>>>> https://github.com/apache/spark/pull/28085#discussion_r407334754
>> > > > >>>>>>
>> > > > >>>>>>
>> > > > >>>>>> *The problem:*
>> > > > >>>>>>
>> > > > >>>>>> In short, I would like us to have clear guidance on how we support
>> > > > >>>>>> Java specific APIs when
>> > > > >>>>>> they need to return a Java instance. The problem is simple:
>> > > > >>>>>>
>> > > > >>>>>> def requests: Map[String, ExecutorResourceRequest] = ...
>> > > > >>>>>> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>> > > > >>>>>>
>> > > > >>>>>> vs
>> > > > >>>>>>
>> > > > >>>>>> def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>> > > > >>>>>>
>> > > > >>>>>>
>> > > > >>>>>> *Current codebase:*
>> > > > >>>>>>
>> > > > >>>>>> My understanding so far was that the latter is preferred and more
>> > > > >>>>>> consistent and prevalent in the
>> > > > >>>>>> existing codebase, for example, see StateOperatorProgress and
>> > > > >>>>>> StreamingQueryProgress in Structured Streaming.
>> > > > >>>>>> However, I realised that we also have other approaches in the
>> > > > >>>>>> current codebase. There appear to be
>> > > > >>>>>> four approaches to deal with Java specifics in general:
>> > > > >>>>>>
>> > > > >>>>>> 1. Java specific classes such as JavaRDD and JavaSparkContext.
>> > > > >>>>>> 2. Java specific methods with the same name that overload their
>> > > > >>>>>>    parameters, see functions.scala.
>> > > > >>>>>> 3. Java specific methods with a different name that need to
>> > > > >>>>>>    return a different type, such as TaskContext.resourcesJMap vs
>> > > > >>>>>>    TaskContext.resources.
>> > > > >>>>>> 4. One method that returns a Java instance for both the Scala and
>> > > > >>>>>>    Java sides, see StateOperatorProgress and StreamingQueryProgress.
>> > > > >>>>>>
>> > > > >>>>>>
>> > > > >>>>>> *Analysis on the current codebase:*
>> > > > >>>>>>
>> > > > >>>>>> I agree with approach 2 because the corresponding cases give you a
>> > > > >>>>>> consistent API usage across
>> > > > >>>>>> other language APIs in general. Approach 1 is from the old world
>> > > > >>>>>> when we didn't have unified APIs.
>> > > > >>>>>> This might be the worst approach.
>> > > > >>>>>>
>> > > > >>>>>> 3 and 4 are controversial.
>> > > > >>>>>>
>> > > > >>>>>> For 3, if you have to use Java APIs, then you have to search
>> > > > >>>>>> every time for whether there is a variant of that API
>> > > > >>>>>> specifically for Java. But yes, it gives you
>> > > > >>>>>> Java/Scala friendly instances.
>> > > > >>>>>>
>> > > > >>>>>> For 4, having one API that returns a Java instance lets you
>> > > > >>>>>> use it on both the Scala and Java
>> > > > >>>>>> sides, although it makes you call asScala on the Scala side
>> > > > >>>>>> specifically. But you don’t
>> > > > >>>>>> have to search for a variant of this API, and it gives you a
>> > > > >>>>>> consistent API usage across languages.
>> > > > >>>>>>
>> > > > >>>>>> Also, note that calling Java in Scala is legitimate but the
>> > > > >>>>>> opposite case is not, to the best of my knowledge.
>> > > > >>>>>> In addition, you need a method that returns a Java instance
>> > > > >>>>>> for PySpark or SparkR to support.
>> > > > >>>>>>
>> > > > >>>>>>
>> > > > >>>>>> *Proposal:*
>> > > > >>>>>>
>> > > > >>>>>> I would like to have general guidance on this that the Spark dev
>> > > > >>>>>> community agrees upon: do approach 4. If that is not possible, do 3.
>> > > > >>>>>> Avoid 1 at almost all costs.
>> > > > >>>>>>
>> > > > >>>>>> Note that this isn't a hard requirement but *general guidance*;
>> > > > >>>>>> therefore, the decision might depend on
>> > > > >>>>>> the specific context. For example, when there are some strong
>> > > > >>>>>> arguments to have a separate Java specific API, that’s fine.
>> > > > >>>>>> Of course, we won’t change the existing methods given Michael’s
>> > > > >>>>>> rubric added before. I am talking about new
>> > > > >>>>>> methods in unreleased branches.
>> > > > >>>>>>
>> > > > >>>>>> Any concern or opinion on this?
>> > > > >>>>>>
>> > > > >>>>>
>> > > > >>
>> > > > >> --
>> > > > >> Ryan Blue
>> > > > >> Software Engineer
>> > > > >> Netflix
>> > > > >>
>> > > > >
>> > >
>>
>
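
To make the difference between approaches 3 and 4 in the quoted proposal concrete, here is a minimal Scala sketch. It is only an illustration under assumptions: `Request`, `ProfileApproach3`, `ProfileApproach4` and `ApproachDemo` are hypothetical names, not Spark classes, and the import assumes Scala 2.13 or later (on Scala 2.12 the equivalent would be `scala.collection.JavaConverters._`).

```scala
import java.util.{Map => JMap}
// Assumes Scala 2.13+; on 2.12 use scala.collection.JavaConverters._ instead.
import scala.jdk.CollectionConverters._

// Hypothetical stand-in for a value type such as ExecutorResourceRequest.
final case class Request(resourceName: String, amount: Long)

// Approach 3: a Scala-typed method plus a differently named Java-typed variant.
class ProfileApproach3(underlying: Map[String, Request]) {
  def requests: Map[String, Request] = underlying
  def requestsJMap: JMap[String, Request] = underlying.asJava
}

// Approach 4: a single method that returns a Java instance for both languages.
class ProfileApproach4(underlying: Map[String, Request]) {
  def requests: JMap[String, Request] = underlying.asJava
}

object ApproachDemo {
  def main(args: Array[String]): Unit = {
    val data = Map("gpu" -> Request("gpu", 2L))

    // Approach 3: Scala callers get a Scala Map directly,
    // while Java callers have to discover the requestsJMap variant.
    val p3 = new ProfileApproach3(data)
    println(p3.requests("gpu").amount)
    println(p3.requestsJMap.get("gpu").amount)

    // Approach 4: one method name in every language;
    // Scala callers convert with asScala when they want a Scala view.
    val p4 = new ProfileApproach4(data)
    println(p4.requests.get("gpu").amount)
    println(p4.requests.asScala("gpu").amount)
  }
}
```

The trade-off discussed in the thread shows up directly: approach 3 doubles the method surface but needs no conversion on the Scala side, while approach 4 keeps one method and moves the `asScala` call to Scala callers.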