Re: Apache Spark 2.4.8 (and EOL of 2.4)
Thank you, Liang-Chi! Next Monday sounds good.

To all: please ping Liang-Chi if you have a missed backport.

Bests,
Dongjoon

On Thu, Mar 4, 2021 at 7:00 PM Xiao Li wrote:

> Thank you, Liang-Chi!
>
> Xiao
Re: Apache Spark 2.4.8 (and EOL of 2.4)
Thank you, Liang-Chi!

Xiao

On Thu, Mar 4, 2021 at 6:25 PM Hyukjin Kwon wrote:

> Thanks @Liang-Chi Hsieh for driving this.
Re: Apache Spark 2.4.8 (and EOL of 2.4)
Thanks @Liang-Chi Hsieh for driving this.

On Fri, Mar 5, 2021 at 5:21 AM, Liang-Chi Hsieh wrote:

> Thanks all for the input.
>
> If there is no objection, I am going to cut the branch next Monday.
>
> Thanks.
> Liang-Chi
Re: Apache Spark 2.4.8 (and EOL of 2.4)
Thanks all for the input.

If there is no objection, I am going to cut the branch next Monday.

Thanks.
Liang-Chi

Takeshi Yamamuro wrote:

> +1 for releasing 2.4.8 and thanks, Liang-Chi, for volunteering.
> Btw, does anyone roughly know how many v2.4 users there still are, based on
> some stats (e.g., the number of v2.4.7 downloads from the official repos)?
> Have most users started using v3.x?
>
> On Thu, Mar 4, 2021 at 8:34 AM Hyukjin Kwon <gurwls223@> wrote:
>
>> Yeah, I would prefer to have a 2.4.8 release as the EOL release too. I don't
>> mind having 2.4.9 as the EOL release if that's preferred by more people.
>
> Takeshi Yamamuro

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: [DISCUSS] SPIP: FunctionCatalog
Yeah, in short, this is a great compromise approach and I would like to see this proposal move forward to the next step. This discussion is valuable.

Chao Sun wrote:

> +1 on Dongjoon's proposal. Great to see this is getting moved forward and
> thanks everyone for the insightful discussion!
>
> On Thu, Mar 4, 2021 at 8:58 AM Ryan Blue <rblue@> wrote:
>
>> Okay, great. I'll update the SPIP doc and call a vote in the next day or
>> two.
Re: [DISCUSS] SPIP: FunctionCatalog
+1 on Dongjoon's proposal. Great to see this is getting moved forward and thanks everyone for the insightful discussion!

On Thu, Mar 4, 2021 at 8:58 AM Ryan Blue wrote:

> Okay, great. I'll update the SPIP doc and call a vote in the next day or
> two.
Re: minikube and kubernetes cluster versions for integration testing
FWIW, upgrading Minikube and the associated VM drivers is potentially a PITA. Your PR will absolutely be tested before merging. :)

On Thu, Mar 4, 2021 at 10:13 AM attilapiros wrote:

> Thanks Shane!
>
> I can do the documentation task, and the Minikube version check can be
> incorporated into my PR.
> When my PR is finalized (probably next week) I will create a JIRA for you
> so you can set up the test systems, and you can even test my PR before
> merging it. Is this possible / fine for you?

--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
Re: minikube and kubernetes cluster versions for integration testing
Thanks Shane!

I can do the documentation task, and the Minikube version check can be incorporated into my PR.

When my PR is finalized (probably next week) I will create a JIRA for you so you can set up the test systems, and you can even test my PR before merging it. Is this possible / fine for you?
Re: [DISCUSS] SPIP: FunctionCatalog
Okay, great. I'll update the SPIP doc and call a vote in the next day or two.

On Thu, Mar 4, 2021 at 8:26 AM Erik Krogen wrote:

> +1 on Dongjoon's proposal. This is a very nice compromise between the
> reflective/magic-method approach and the InternalRow approach, providing
> a lot of flexibility for our users, and allowing for the more complicated
> reflection-based approach to evolve at its own pace, since you can always
> fall back to InternalRow for situations which aren't yet supported by
> reflection.
Re: [DISCUSS] SPIP: FunctionCatalog
+1 on Dongjoon's proposal. This is a very nice compromise between the reflective/magic-method approach and the InternalRow approach, providing a lot of flexibility for our users, and allowing for the more complicated reflection-based approach to evolve at its own pace, since you can always fall back to InternalRow for situations which aren't yet supported by reflection.

We can even consider having Spark code detect that you haven't overridden the default produceResult (IIRC this is discoverable via reflection), and raise an error at query analysis time instead of at runtime when it can't find a reflective method or an overridden produceResult.

I'm very pleased we have found a compromise that everyone seems happy with! Big thanks to everyone who participated.

On Wed, Mar 3, 2021 at 8:34 PM John Zhuge wrote:

> +1 Good plan to move forward.
>
> Thank you all for the constructive and comprehensive discussions in this
> thread! Decisions on this important feature will have ramifications for
> years to come.
>
> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan wrote:
>
>> +1 to this proposal. If people don't like the ScalarFunction0, ScalarFunction1,
>> ... variants and prefer the "magical methods", then we can have a single
>> ScalarFunction interface which has the row-parameter API (with a default
>> implementation to fail) and documentation describing the "magical methods"
>> (which can be done later).
>>
>> I'll start the PR review this week to check the naming, docs, etc.
>>
>> Thanks all for the discussion here, and let's move forward!
>>
>> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue wrote:
>>
>>> Good point, Dongjoon. I think we can probably come to some compromise here:
>>>
>>> - Remove SupportsInvoke since it isn't really needed. We should always
>>>   try to find the right method to invoke in the codegen path.
>>> - Add a default implementation of produceResult so that implementations
>>>   don't have to use it. If they don't implement it and a magic function
>>>   can't be found, then it will throw UnsupportedOperationException.
>>>
>>> This is assuming that we can agree not to introduce all of the
>>> ScalarFunction interface variations, which would have limited utility
>>> because of type erasure.
>>>
>>> Does that sound like a good plan to everyone? If so, I'll update the
>>> SPIP doc so we can move forward.
>>>
>>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun wrote:
>>>
>>>> Hi, all.
>>>>
>>>> We shared many opinions from different perspectives. However, we didn't
>>>> reach a consensus even on a partial merge that excludes something
>>>> (on the PR by me, on this mailing thread by Wenchen).
>>>>
>>>> For the following claims, we have another alternative to mitigate them:
>>>>
>>>> > I don't like it because it promotes the row-parameter API and forces
>>>> > users to implement it, even if the users want to only use the
>>>> > individual-parameters API.
>>>>
>>>> Why don't we merge the as-is PR by adding something instead of
>>>> excluding something?
>>>>
>>>> - R produceResult(InternalRow input);
>>>> + default R produceResult(InternalRow input) throws Exception {
>>>> +   throw new UnsupportedOperationException();
>>>> + }
>>>>
>>>> By providing the default implementation, it will not technically *force
>>>> users to implement it*. And we can provide documentation about our
>>>> expected usage. What do you think?
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue wrote:
>>>>
>>>>> > Yes, GenericInternalRow is safe when types mismatch, with the cost
>>>>> > of using Object[], and primitive types need to do boxing
>>>>>
>>>>> The question is not whether to use the magic functions, which would
>>>>> not need boxing. The question here is whether to use multiple
>>>>> ScalarFunction interfaces. Those interfaces would require boxing or
>>>>> using Object[], so there isn't a benefit.
>>>>>
>>>>> > If we do want to reuse one UDF for different types, using "magical
>>>>> > methods" solves the problem
>>>>>
>>>>> Yes, that's correct. We agree that magic methods are a good option
>>>>> for this.
>>>>>
>>>>> Again, the question we need to decide is whether to use InternalRow
>>>>> or interfaces like ScalarFunction2 for non-codegen. The option to use
>>>>> multiple interfaces is limited by type erasure, because you can only
>>>>> have one set of type parameters. If you wanted to support both
>>>>> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>,
>>>>> you'd have to fall back to ScalarFunction2<Object, Object> and cast.
>>>>>
>>>>> The point is that type erasure will commonly lead either to many
>>>>> different implementation classes (one for each type combination) or
>>>>> will lead to parameterizing by Object, which defeats the purpose.
>>>>>
>>>>> The alternative adds safety, because correct types are produced by
>>>>> calls like getLong(0). Yes, this depends on the implementation making
>>>>> the correct calls, but it is better than using
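The compromise discussed in this thread (a single ScalarFunction interface whose row-parameter default implementation fails, with typed "magic" methods found reflectively) can be sketched in a self-contained toy. Everything below is illustrative only: the Row stand-in, the `invoke` method name, and the lookup loop are simplified assumptions, not Spark's actual connector API or analyzer logic.

```java
import java.lang.reflect.Method;

// Toy model of the single-interface compromise; the real Spark
// interfaces in org.apache.spark.sql.connector differ.
public class ScalarFunctionSketch {

    // Stand-in for Spark's InternalRow: boxed values, typed accessors.
    interface Row {
        Object get(int i);
        default long getLong(int i) { return (Long) get(i); }
    }

    // Single ScalarFunction interface: the row-parameter API has a
    // default implementation that fails, so implementors may provide
    // only a typed "magic" method instead.
    interface ScalarFunction<R> {
        default R produceResult(Row input) {
            throw new UnsupportedOperationException(
                "no produceResult override and no magic method found");
        }
    }

    // A UDF that implements only the magic method: no boxing needed.
    static class LongAdd implements ScalarFunction<Long> {
        public long invoke(long a, long b) { return a + b; }
    }

    // Rough model of the planner's choice: prefer a matching magic
    // method, otherwise fall back to produceResult on a row.
    static Object call(ScalarFunction<?> fn, Object... args) throws Exception {
        for (Method m : fn.getClass().getMethods()) {
            if (m.getName().equals("invoke")
                    && m.getParameterCount() == args.length) {
                return m.invoke(fn, args);
            }
        }
        Row row = i -> args[i]; // wrap the arguments as a row
        return fn.produceResult(row);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(call(new LongAdd(), 3L, 4L)); // 7
    }
}
```

In the actual proposal, the magic method would be resolved once during analysis/codegen rather than per call; this sketch only shows the fallback shape and why a function with neither a magic method nor a produceResult override fails with UnsupportedOperationException.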
using accumulators in (MicroBatch) InputPartitionReader
I tried to create a data source, but our use case is a bit hard: we only know the available offsets within the tasks, not on the driver. I therefore planned to use accumulators in the InputPartitionReader, but they seem not to work. Example accumulation is done here:

https://github.com/kortemik/spark-source/blob/master/src/main/java/com/teragrep/pth06/ArchiveMicroBatchInputPartitionReader.java#L118

The task logs show that the System.out.println() calls are reached, so the flow itself cannot be broken, but the accumulators seem to work only while on the driver, as shown in the logs at https://github.com/kortemik/spark-source/tree/master

Is it intentional that accumulators do not work within the data source?

One might ask why all this, so here is a brief explanation. We use gzipped files as the storage blobs, and it is unknown prior to execution how many records they contain. Of course this can be mitigated by decompressing the files on the driver and then sending the offsets through to the executors, but that is double effort. The aim was to decompress them only once by doing a forward-lookup into the data and using an accumulator to inform the driver either that there is stuff available for the next batch as well, or that the file is done and the driver needs to pull the next one to keep the executors busy.

Any advice is welcome.

Kind regards,
-Mikko Kortelainen
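One relevant piece of context for the behavior described above: executor-side accumulator updates generally become visible in the driver-side value only when a task completes, so the driver cannot observe per-record progress while a task is still running. A toy, Spark-free sketch of that timing (assumed semantics for illustration, not actual Spark internals):

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model of accumulator timing (not Spark code): each task keeps a
// local partial value, and the driver-side accumulator incorporates it
// only at task completion. Mid-task, the driver sees nothing, which is
// why an accumulator is an awkward channel for telling the driver what
// to plan for the next micro-batch while tasks are still running.
public class AccumulatorToyModel {
    // Driver-side accumulator value (merged task results only).
    static final AtomicLong driverValue = new AtomicLong(0);

    // Simulates one task: the reader "adds" per record into a
    // task-local copy; the merge happens once, when the task ends.
    static void runTask(long recordsInPartition) {
        long taskLocal = 0;
        for (long i = 0; i < recordsInPartition; i++) {
            taskLocal++; // like accumulator.add(1) per record on the executor
        }
        driverValue.addAndGet(taskLocal); // reported at task completion
    }

    public static void main(String[] args) {
        runTask(40);
        runTask(60);
        // Only after both tasks finish does the driver see the total.
        System.out.println(driverValue.get()); // 100
    }
}
```

If the next batch's offsets must be known when the driver plans that batch, values accumulated by still-running tasks will not yet be available; waiting for task completion, or reporting the discovered offsets to the driver through some other channel, would be needed.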