Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-04 Thread Dongjoon Hyun
Thank you, Liang-Chi! Next Monday sounds good.

To all: please ping Liang-Chi if you have a missed backport.

Bests,
Dongjoon.



On Thu, Mar 4, 2021 at 7:00 PM Xiao Li  wrote:

> Thank you, Liang-Chi!
>
> Xiao
>
> On Thu, Mar 4, 2021 at 6:25 PM Hyukjin Kwon  wrote:
>
>> Thanks @Liang-Chi Hsieh for driving this.
>>
>> On Fri, Mar 5, 2021 at 5:21 AM, Liang-Chi Hsieh wrote:
>>
>>>
>>> Thanks all for the input.
>>>
>>> If there is no objection, I am going to cut the branch next Monday.
>>>
>>> Thanks.
>>> Liang-Chi
>>>
>>>
>>> Takeshi Yamamuro wrote
>>> > +1 for releasing 2.4.8 and thanks, Liang-chi, for volunteering.
>>> > Btw, does anyone roughly know how many v2.4 users there still are, based on
>>> > some stats (e.g., # of v2.4.7 downloads from the official repos)?
>>> > Have most users started using v3.x?
>>> >
>>> > On Thu, Mar 4, 2021 at 8:34 AM Hyukjin Kwon <gurwls223@> wrote:
>>> >
>>> >> Yeah, I would prefer to have a 2.4.8 release as an EOL too. I don't mind
>>> >> having 2.4.9 as EOL too if that's preferred by more people.
>>> >>
>>> > Takeshi Yamamuro
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-04 Thread Xiao Li
Thank you, Liang-Chi!

Xiao

On Thu, Mar 4, 2021 at 6:25 PM Hyukjin Kwon  wrote:

> Thanks @Liang-Chi Hsieh for driving this.
>
> On Fri, Mar 5, 2021 at 5:21 AM, Liang-Chi Hsieh wrote:
>
>>
>> Thanks all for the input.
>>
>> If there is no objection, I am going to cut the branch next Monday.
>>
>> Thanks.
>> Liang-Chi
>>
>>
>> Takeshi Yamamuro wrote
>> > +1 for releasing 2.4.8 and thanks, Liang-chi, for volunteering.
>> > Btw, does anyone roughly know how many v2.4 users there still are, based on
>> > some stats (e.g., # of v2.4.7 downloads from the official repos)?
>> > Have most users started using v3.x?
>> >
>> > On Thu, Mar 4, 2021 at 8:34 AM Hyukjin Kwon <gurwls223@> wrote:
>> >
>> >> Yeah, I would prefer to have a 2.4.8 release as an EOL too. I don't mind
>> >> having 2.4.9 as EOL too if that's preferred by more people.
>> >>
>> > Takeshi Yamamuro
>>
>>
>>
>>
>>


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-04 Thread Hyukjin Kwon
Thanks @Liang-Chi Hsieh for driving this.

On Fri, Mar 5, 2021 at 5:21 AM, Liang-Chi Hsieh wrote:

>
> Thanks all for the input.
>
> If there is no objection, I am going to cut the branch next Monday.
>
> Thanks.
> Liang-Chi
>
>
> Takeshi Yamamuro wrote
> > +1 for releasing 2.4.8 and thanks, Liang-chi, for volunteering.
> > Btw, does anyone roughly know how many v2.4 users there still are, based on
> > some stats (e.g., # of v2.4.7 downloads from the official repos)?
> > Have most users started using v3.x?
> >
> > On Thu, Mar 4, 2021 at 8:34 AM Hyukjin Kwon <gurwls223@> wrote:
> >
> >> Yeah, I would prefer to have a 2.4.8 release as an EOL too. I don't mind
> >> having 2.4.9 as EOL too if that's preferred by more people.
> >>
> > Takeshi Yamamuro
>
>
>
>
>
>
>


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-04 Thread Liang-Chi Hsieh


Thanks all for the input.

If there is no objection, I am going to cut the branch next Monday.

Thanks.
Liang-Chi


Takeshi Yamamuro wrote
> +1 for releasing 2.4.8 and thanks, Liang-chi, for volunteering.
> Btw, does anyone roughly know how many v2.4 users there still are, based on
> some stats (e.g., # of v2.4.7 downloads from the official repos)?
> Have most users started using v3.x?
>
> On Thu, Mar 4, 2021 at 8:34 AM Hyukjin Kwon <gurwls223@> wrote:
>
>> Yeah, I would prefer to have a 2.4.8 release as an EOL too. I don't mind
>> having 2.4.9 as EOL too if that's preferred by more people.
>>
> Takeshi Yamamuro








Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Liang-Chi Hsieh


Yeah, in short, this is a great compromise and I would like to see this
proposal move forward to the next step. This discussion has been valuable.


Chao Sun wrote
> +1 on Dongjoon's proposal. Great to see this is getting moved forward and
> thanks everyone for the insightful discussion!
> 
> 
> 
> On Thu, Mar 4, 2021 at 8:58 AM Ryan Blue <rblue@> wrote:
> 
>> Okay, great. I'll update the SPIP doc and call a vote in the next day or
>> two.








Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Chao Sun
+1 on Dongjoon's proposal. Great to see this is getting moved forward and
thanks everyone for the insightful discussion!



On Thu, Mar 4, 2021 at 8:58 AM Ryan Blue  wrote:

> Okay, great. I'll update the SPIP doc and call a vote in the next day or
> two.
>
> On Thu, Mar 4, 2021 at 8:26 AM Erik Krogen  wrote:
>
>> +1 on Dongjoon's proposal. This is a very nice compromise between the
>> reflective/magic-method approach and the InternalRow approach, providing
>> a lot of flexibility for our users, and allowing for the more complicated
>> reflection-based approach to evolve at its own pace, since you can always
>> fall back to InternalRow for situations which aren't yet supported by
>> reflection.
>>
>> We can even consider having Spark code detect that you haven't overridden
>> the default produceResult (IIRC this is discoverable via reflection),
>> and raise an error at query analysis time instead of at runtime when it
>> can't find a reflective method or an overridden produceResult.
>>
>> I'm very pleased we have found a compromise that everyone seems happy
>> with! Big thanks to everyone who participated.
>>
>> On Wed, Mar 3, 2021 at 8:34 PM John Zhuge  wrote:
>>
>>> +1 Good plan to move forward.
>>>
>>> Thank you all for the constructive and comprehensive discussions in this
>>> thread! Decisions on this important feature will have ramifications for
>>> years to come.
>>>
>>> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan  wrote:
>>>
>>>> +1 to this proposal. If people don't like the ScalarFunction0,1, ...
>>>> variants and prefer the "magical methods", then we can have a single
>>>> ScalarFunction interface which has the row-parameter API (with a
>>>> default implementation to fail) and documents to describe the "magical
>>>> methods" (which can be done later).
>>>>
>>>> I'll start the PR review this week to check the naming, doc, etc.
>>>>
>>>> Thanks all for the discussion here and let's move forward!
>>>>
>>>> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue wrote:
>>>>
>>>>> Good point, Dongjoon. I think we can probably come to some compromise
>>>>> here:
>>>>>
>>>>>    - Remove SupportsInvoke since it isn’t really needed. We should
>>>>>      always try to find the right method to invoke in the codegen path.
>>>>>    - Add a default implementation of produceResult so that
>>>>>      implementations don’t have to use it. If they don’t implement it and a
>>>>>      magic function can’t be found, then it will throw
>>>>>      UnsupportedOperationException
>>>>>
>>>>> This is assuming that we can agree not to introduce all of the
>>>>> ScalarFunction interface variations, which would have limited utility
>>>>> because of type erasure.
>>>>>
>>>>> Does that sound like a good plan to everyone? If so, I’ll update the
>>>>> SPIP doc so we can move forward.
>>>>>
>>>>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> We shared many opinions from different perspectives.
>>>>>> However, we didn't reach a consensus even on a partial merge by
>>>>>> excluding something
>>>>>> (on the PR by me, on this mailing thread by Wenchen).
>>>>>>
>>>>>> For the following claim, we have another alternative to mitigate it.
>>>>>>
>>>>>> > I don't like it because it promotes the row-parameter API and
>>>>>> > forces users to implement it, even if the users want to only use the
>>>>>> > individual-parameters API.
>>>>>>
>>>>>> Why don't we merge the AS-IS PR by adding something instead of
>>>>>> excluding something?
>>>>>>
>>>>>> - R produceResult(InternalRow input);
>>>>>> + default R produceResult(InternalRow input) throws Exception {
>>>>>> +   throw new UnsupportedOperationException();
>>>>>> + }
>>>>>>
>>>>>> By providing the default implementation, it will technically not be
>>>>>> *forcing users to implement it*.
>>>>>> And, we can provide a document about our expected usage properly.
>>>>>> What do you think?
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue wrote:
>>>>>>
>>>>>>> Yes, GenericInternalRow is safe if when type mismatches, with the
>>>>>>> cost of using Object[], and primitive types need to do boxing
>>>>>>>
>>>>>>> The question is not whether to use the magic functions, which would
>>>>>>> not need boxing. The question here is whether to use multiple
>>>>>>> ScalarFunction interfaces. Those interfaces would require boxing or
>>>>>>> using Object[] so there isn’t a benefit.
>>>>>>>
>>>>>>> If we do want to reuse one UDF for different types, using “magical
>>>>>>> methods” solves the problem
>>>>>>>
>>>>>>> Yes, that’s correct. We agree that magic methods are a good option
>>>>>>> for this.
>>>>>>>
>>>>>>> Again, the question we need to decide is whether to use InternalRow
>>>>>>> or interfaces like ScalarFunction2 for non-codegen. The option to
>>>>>>> use multiple interfaces is limited by type erasure because you can only
>>>>>>> have one set of type parameters. If you wanted to support both
>>>>>>> ScalarF

Re: minikube and kubernetes cluster versions for integration testing

2021-03-04 Thread shane knapp ☠
fwiw, upgrading minikube and the associated VM drivers is potentially a
PITA.

your PR will absolutely be tested before merging.  :)

On Thu, Mar 4, 2021 at 10:13 AM attilapiros wrote:

> Thanks Shane!
>
> I can do the documentation task and the Minikube version check can be
> incorporated into my PR.
> When my PR is finalized (probably next week) I will create a jira for you
> and you can set up the test systems and you can even test my PR before
> merging it. Is this possible / fine for you?
>
>
>
>
>
>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: minikube and kubernetes cluster versions for integration testing

2021-03-04 Thread attilapiros
Thanks Shane!

I can do the documentation task and the Minikube version check can be
incorporated into my PR. 
When my PR is finalized (probably next week) I will create a jira for you
and you can set up the test systems and you can even test my PR before
merging it. Is this possible / fine for you?
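
For illustration only, a Minikube version check of the kind mentioned above could look roughly like the Java sketch below. Nothing here is taken from the PR: the class name, the sample version strings, and the minimum version 1.7.3 are placeholders chosen for the example, and the sketch only parses a version string rather than calling the minikube CLI.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MinikubeVersionCheck {

    // Matches the first "major.minor.patch" triple, with an optional leading "v".
    private static final Pattern VERSION = Pattern.compile("v?(\\d+)\\.(\\d+)\\.(\\d+)");

    /** Extracts {major, minor, patch} from free-form text such as CLI output. */
    public static int[] parse(String text) {
        Matcher m = VERSION.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("no version found in: " + text);
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))
        };
    }

    /** Compares numerically, component by component, so 1.18.1 is newer than 1.7.3. */
    public static boolean isAtLeast(String actual, String minimum) {
        int[] a = parse(actual);
        int[] b = parse(minimum);
        for (int i = 0; i < 3; i++) {
            if (a[i] != b[i]) {
                return a[i] > b[i];
            }
        }
        return true; // all three components are equal
    }

    public static void main(String[] args) {
        // Placeholder inputs; a real check would read the installed version instead.
        String reported = "minikube version: v1.18.1";
        String minimum = "1.7.3";
        if (!isAtLeast(reported, minimum)) {
            throw new IllegalStateException("Minikube is older than " + minimum);
        }
        System.out.println("Minikube version is recent enough");
    }
}
```

The comparison is numeric on purpose: a plain string comparison would rank 1.7.x above 1.18.x.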
 
 






Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Ryan Blue
Okay, great. I'll update the SPIP doc and call a vote in the next day or
two.

On Thu, Mar 4, 2021 at 8:26 AM Erik Krogen  wrote:

> +1 on Dongjoon's proposal. This is a very nice compromise between the
> reflective/magic-method approach and the InternalRow approach, providing
> a lot of flexibility for our users, and allowing for the more complicated
> reflection-based approach to evolve at its own pace, since you can always
> fall back to InternalRow for situations which aren't yet supported by
> reflection.
>
> We can even consider having Spark code detect that you haven't overridden
> the default produceResult (IIRC this is discoverable via reflection), and
> raise an error at query analysis time instead of at runtime when it can't
> find a reflective method or an overridden produceResult.
>
> I'm very pleased we have found a compromise that everyone seems happy
> with! Big thanks to everyone who participated.
>
> On Wed, Mar 3, 2021 at 8:34 PM John Zhuge  wrote:
>
>> +1 Good plan to move forward.
>>
>> Thank you all for the constructive and comprehensive discussions in this
>> thread! Decisions on this important feature will have ramifications for
>> years to come.
>>
>> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan  wrote:
>>
>>> +1 to this proposal. If people don't like the ScalarFunction0,1, ...
>>> variants and prefer the "magical methods", then we can have a single
>>> ScalarFunction interface which has the row-parameter API (with a
>>> default implementation to fail) and documents to describe the "magical
>>> methods" (which can be done later).
>>>
>>> I'll start the PR review this week to check the naming, doc, etc.
>>>
>>> Thanks all for the discussion here and let's move forward!
>>>
>>> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue  wrote:
>>>
>>>> Good point, Dongjoon. I think we can probably come to some compromise
>>>> here:
>>>>
>>>>    - Remove SupportsInvoke since it isn’t really needed. We should
>>>>      always try to find the right method to invoke in the codegen path.
>>>>    - Add a default implementation of produceResult so that
>>>>      implementations don’t have to use it. If they don’t implement it and a
>>>>      magic function can’t be found, then it will throw
>>>>      UnsupportedOperationException
>>>>
>>>> This is assuming that we can agree not to introduce all of the
>>>> ScalarFunction interface variations, which would have limited utility
>>>> because of type erasure.
>>>>
>>>> Does that sound like a good plan to everyone? If so, I’ll update the
>>>> SPIP doc so we can move forward.
>>>>
>>>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> We shared many opinions from different perspectives.
>>>>> However, we didn't reach a consensus even on a partial merge by
>>>>> excluding something
>>>>> (on the PR by me, on this mailing thread by Wenchen).
>>>>>
>>>>> For the following claim, we have another alternative to mitigate it.
>>>>>
>>>>> > I don't like it because it promotes the row-parameter API and
>>>>> > forces users to implement it, even if the users want to only use the
>>>>> > individual-parameters API.
>>>>>
>>>>> Why don't we merge the AS-IS PR by adding something instead of
>>>>> excluding something?
>>>>>
>>>>> - R produceResult(InternalRow input);
>>>>> + default R produceResult(InternalRow input) throws Exception {
>>>>> +   throw new UnsupportedOperationException();
>>>>> + }
>>>>>
>>>>> By providing the default implementation, it will technically not be
>>>>> *forcing users to implement it*.
>>>>> And, we can provide a document about our expected usage properly.
>>>>> What do you think?
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue wrote:
>>>>>
>>>>>> Yes, GenericInternalRow is safe if when type mismatches, with the
>>>>>> cost of using Object[], and primitive types need to do boxing
>>>>>>
>>>>>> The question is not whether to use the magic functions, which would
>>>>>> not need boxing. The question here is whether to use multiple
>>>>>> ScalarFunction interfaces. Those interfaces would require boxing or
>>>>>> using Object[] so there isn’t a benefit.
>>>>>>
>>>>>> If we do want to reuse one UDF for different types, using “magical
>>>>>> methods” solves the problem
>>>>>>
>>>>>> Yes, that’s correct. We agree that magic methods are a good option
>>>>>> for this.
>>>>>>
>>>>>> Again, the question we need to decide is whether to use InternalRow
>>>>>> or interfaces like ScalarFunction2 for non-codegen. The option to
>>>>>> use multiple interfaces is limited by type erasure because you can only
>>>>>> have one set of type parameters. If you wanted to support both
>>>>>> ScalarFunction2<…, Integer> and ScalarFunction2<…> you’d have to fall
>>>>>> back to ScalarFunction2<…, Object> and cast.
>>>>>>
>>>>>> The point is that type erasure will commonly lead either to many
>>>>>> different implementation classes (one for each type combination) or will
>>>>>> lead to

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Erik Krogen
+1 on Dongjoon's proposal. This is a very nice compromise between the
reflective/magic-method approach and the InternalRow approach, providing a
lot of flexibility for our users, and allowing for the more complicated
reflection-based approach to evolve at its own pace, since you can always
fall back to InternalRow for situations which aren't yet supported by
reflection.

We can even consider having Spark code detect that you haven't overridden
the default produceResult (IIRC this is discoverable via reflection), and
raise an error at query analysis time instead of at runtime when it can't
find a reflective method or an overridden produceResult.

I'm very pleased we have found a compromise that everyone seems happy with!
Big thanks to everyone who participated.
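
As a rough illustration of the analysis-time check described above, here is a minimal, self-contained Java sketch. It does not use Spark's classes: `Row` stands in for InternalRow, `ScalarFunction` is a hypothetical single-interface shape with the failing default from the proposal, and the magic-method name `invoke` is an assumption made only for this example.

```java
import java.lang.reflect.Method;

public class ProduceResultCheck {

    /** Stand-in for Spark's InternalRow; only what this sketch needs. */
    public interface Row {
        long getLong(int ordinal);
    }

    /** Hypothetical interface: row-parameter API with a default that fails. */
    public interface ScalarFunction<R> {
        default R produceResult(Row input) {
            throw new UnsupportedOperationException();
        }
    }

    /** Overrides the row-parameter method. */
    public static class RowAdd implements ScalarFunction<Long> {
        @Override
        public Long produceResult(Row input) {
            return input.getLong(0) + input.getLong(1);
        }
    }

    /** Provides only a reflective method (the name "invoke" is assumed). */
    public static class MagicAdd implements ScalarFunction<Long> {
        public long invoke(long left, long right) {
            return left + right;
        }
    }

    /** Provides neither, so it should be rejected before execution. */
    public static class Unusable implements ScalarFunction<Long> {
    }

    /** True if cls declares its own produceResult instead of inheriting the default. */
    public static boolean overridesProduceResult(Class<?> cls) {
        try {
            Method m = cls.getMethod("produceResult", Row.class);
            return m.getDeclaringClass() != ScalarFunction.class;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** True if cls has a public "invoke" method with exactly these parameter types. */
    public static boolean hasMagicMethod(Class<?> cls, Class<?>... argTypes) {
        try {
            cls.getMethod("invoke", argTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Fails early when neither path is available, instead of at execution time. */
    public static void validate(Class<?> cls, Class<?>... argTypes) {
        if (!hasMagicMethod(cls, argTypes) && !overridesProduceResult(cls)) {
            throw new IllegalStateException(
                cls.getSimpleName() + " has no matching invoke method and does not override produceResult");
        }
    }

    public static void main(String[] args) {
        validate(RowAdd.class, long.class, long.class);   // accepted: overrides produceResult
        validate(MagicAdd.class, long.class, long.class); // accepted: has a matching invoke
        try {
            validate(Unusable.class, long.class, long.class);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The declaring-class comparison is one way to tell an inherited default from an override; how Spark itself would implement such a check is not specified in this thread.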

On Wed, Mar 3, 2021 at 8:34 PM John Zhuge  wrote:

> +1 Good plan to move forward.
>
> Thank you all for the constructive and comprehensive discussions in this
> thread! Decisions on this important feature will have ramifications for
> years to come.
>
> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan  wrote:
>
>> +1 to this proposal. If people don't like the ScalarFunction0,1, ...
>> variants and prefer the "magical methods", then we can have a single
>> ScalarFunction interface which has the row-parameter API (with a default
>> implementation to fail) and documents to describe the "magical methods"
>> (which can be done later).
>>
>> I'll start the PR review this week to check the naming, doc, etc.
>>
>> Thanks all for the discussion here and let's move forward!
>>
>> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue  wrote:
>>
>>> Good point, Dongjoon. I think we can probably come to some compromise
>>> here:
>>>
>>>    - Remove SupportsInvoke since it isn’t really needed. We should
>>>      always try to find the right method to invoke in the codegen path.
>>>    - Add a default implementation of produceResult so that
>>>      implementations don’t have to use it. If they don’t implement it and a
>>>      magic function can’t be found, then it will throw
>>>      UnsupportedOperationException
>>>
>>> This is assuming that we can agree not to introduce all of the
>>> ScalarFunction interface variations, which would have limited utility
>>> because of type erasure.
>>>
>>> Does that sound like a good plan to everyone? If so, I’ll update the
>>> SPIP doc so we can move forward.
>>>
>>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun wrote:
>>>
>>>> Hi, All.
>>>>
>>>> We shared many opinions from different perspectives.
>>>> However, we didn't reach a consensus even on a partial merge by
>>>> excluding something
>>>> (on the PR by me, on this mailing thread by Wenchen).
>>>>
>>>> For the following claim, we have another alternative to mitigate it.
>>>>
>>>> > I don't like it because it promotes the row-parameter API and
>>>> > forces users to implement it, even if the users want to only use the
>>>> > individual-parameters API.
>>>>
>>>> Why don't we merge the AS-IS PR by adding something instead of
>>>> excluding something?
>>>>
>>>> - R produceResult(InternalRow input);
>>>> + default R produceResult(InternalRow input) throws Exception {
>>>> +   throw new UnsupportedOperationException();
>>>> + }
>>>>
>>>> By providing the default implementation, it will technically not be
>>>> *forcing users to implement it*.
>>>> And, we can provide a document about our expected usage properly.
>>>> What do you think?
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>>
>>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue wrote:
>>>>
>>>>> Yes, GenericInternalRow is safe if when type mismatches, with the cost
>>>>> of using Object[], and primitive types need to do boxing
>>>>>
>>>>> The question is not whether to use the magic functions, which would
>>>>> not need boxing. The question here is whether to use multiple
>>>>> ScalarFunction interfaces. Those interfaces would require boxing or
>>>>> using Object[] so there isn’t a benefit.
>>>>>
>>>>> If we do want to reuse one UDF for different types, using “magical
>>>>> methods” solves the problem
>>>>>
>>>>> Yes, that’s correct. We agree that magic methods are a good option for
>>>>> this.
>>>>>
>>>>> Again, the question we need to decide is whether to use InternalRow
>>>>> or interfaces like ScalarFunction2 for non-codegen. The option to use
>>>>> multiple interfaces is limited by type erasure because you can only have
>>>>> one set of type parameters. If you wanted to support both
>>>>> ScalarFunction2<…, Integer> and ScalarFunction2<…> you’d have to fall
>>>>> back to ScalarFunction2<…, Object> and cast.
>>>>>
>>>>> The point is that type erasure will commonly lead either to many
>>>>> different implementation classes (one for each type combination) or will
>>>>> lead to parameterizing by Object, which defeats the purpose.
>>>>>
>>>>> The alternative adds safety because correct types are produced by
>>>>> calls like getLong(0). Yes, this depends on the implementation making
>>>>> the correct calls, but it is better than using

using accumulators in (MicroBatch) InputPartitionReader

2021-03-04 Thread kordex
I tried to create a data source, but our use case is a bit tricky: we only
know the available offsets within the tasks, not on the driver. I therefore
planned to use accumulators in the InputPartitionReader, but they do not
seem to work.

An example of the accumulation is here:
https://github.com/kortemik/spark-source/blob/master/src/main/java/com/teragrep/pth06/ArchiveMicroBatchInputPartitionReader.java#L118

The task logs show that the System.out.println() calls are executed, so the
flow itself cannot be broken, but the accumulators seem to work only on the
driver, as shown by the logs at
https://github.com/kortemik/spark-source/tree/master

Is it intentional that the accumulators do not work within the data source?

One might ask why all this, so here is a brief explanation. We use gzipped
files as the storage blobs, and it is unknown prior to execution how many
records they contain. Of course this can be mitigated by decompressing the
files on the driver and then sending the offsets to the executors, but that
doubles the effort. The aim was instead to decompress them only once by
doing a forward-lookup into the data, and to use an accumulator to inform
the driver either that there is more data available for the next batch or
that the file is done and the driver needs to pull the next one to keep the
executors busy.
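
To make the forward-lookup idea concrete, here is a minimal plain-Java sketch with no Spark dependency. It is not the code from the linked repository: the class name, the method names, and the assumption that records are newline-delimited are all made up for this illustration. The reader decompresses the stream once, always keeps one record of lookahead, and can therefore report both the current offset and whether more data remains, which are the two facts the driver would need.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class LookaheadGzipReader implements AutoCloseable {
    private final BufferedReader reader;
    private String lookahead; // next record, already decompressed; null at end of file
    private long offset;      // number of records handed out so far

    public LookaheadGzipReader(InputStream gzipped) {
        try {
            this.reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8));
            this.lookahead = reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns up to maxRecords records and advances the offset accordingly. */
    public List<String> nextBatch(int maxRecords) {
        List<String> batch = new ArrayList<>();
        try {
            while (batch.size() < maxRecords && lookahead != null) {
                batch.add(lookahead);
                offset++;
                lookahead = reader.readLine(); // peek one record ahead
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return batch;
    }

    /** True while at least one more record is known to exist in the file. */
    public boolean hasMore() {
        return lookahead != null;
    }

    public long offset() {
        return offset;
    }

    @Override
    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo: gzips the given lines into an in-memory blob. */
    public static byte[] gzipLines(String... lines) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            for (String line : lines) {
                gz.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] blob = gzipLines("r1", "r2", "r3");
        try (LookaheadGzipReader r = new LookaheadGzipReader(new ByteArrayInputStream(blob))) {
            System.out.println(r.nextBatch(2) + " offset=" + r.offset() + " more=" + r.hasMore());
            System.out.println(r.nextBatch(2) + " offset=" + r.offset() + " more=" + r.hasMore());
        }
    }
}
```

How the values of offset() and hasMore() travel from the task back to the driver is exactly the open question in this message; the sketch only shows the task-side bookkeeping.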

Any advice is welcome.

Kind regards,
-Mikko Kortelainen
