Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Wenchen Fan
If people have such a big concern about reflection, we can follow the
current Spark Java UDF and Transport, and create ScalarFunction0[R],
ScalarFunction1[T1, R], etc. to avoid reflection. But we may need to
investigate how to avoid boxing with this API design.

To put forward a detailed proposal: let's have ScalarFunction0,
ScalarFunction1, ..., ScalarFunction9 and VarargsScalarFunction. At
execution time, if Spark sees ScalarFunction0-9, it passes the input
columns to the UDF directly, one column per parameter. So a string-type
input is a UTF8String, and an array-type input is an ArrayData. If Spark
sees VarargsScalarFunction, it wraps the input columns in an InternalRow
and passes that to the UDF.
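
To make the shape concrete, here is a minimal sketch of what these
interfaces could look like (illustrative only, not a committed API; the
generic signatures are an assumption about where the boxing question
would show up):

import org.apache.spark.sql.catalyst.InternalRow;

// Sketch of the proposed interfaces (illustrative, not an existing API).
interface ScalarFunction2<T1, T2, R> {
  // Inputs arrive as Catalyst internal types, e.g. UTF8String for string
  // columns and ArrayData for array columns. With generics, a bigint
  // column would be boxed to java.lang.Long here, which is the boxing
  // concern mentioned above.
  R produceResult(T1 input1, T2 input2);
}

// Fallback for arities beyond 9 or true varargs: Spark wraps the input
// columns in an InternalRow and passes that single row to the UDF.
interface VarargsScalarFunction<R> {
  R produceResult(InternalRow args);
}

With primitive specializations (e.g. a long/long overload) the boxing
could be avoided, at the cost of many more interface variants.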

In general, if VarargsScalarFunction is implemented, the UDF should not
implement ScalarFunction0-9. We could also define a priority order to allow
both. I don't have a strong preference here.

What do you think?

On Fri, Feb 19, 2021 at 1:24 PM Walaa Eldin Moustafa 
wrote:

> I agree with Ryan on the questions around the expressivity of the Invoke
> method. It is not clear to me how the Invoke method can be used to declare
> UDFs with type-parameterized parameters. For example: a UDF to get the Nth
> element of an array (regardless of the Array element type) or a UDF to
> merge two Arrays (of generic types) to a Map.
>
> Also, to address Wenchen's InternalRow question, can we create a number of
> Function classes, each corresponding to a different number of input parameters
> (e.g., ScalarFunction1, ScalarFunction2, etc)?
>
> Thanks,
> Walaa.
>
>
> On Thu, Feb 18, 2021 at 6:07 PM Ryan Blue 
> wrote:
>
>> I agree with you that it is better in many cases to directly call a
>> method. But it is not better in all cases, which is why I don’t think it is
>> the right general-purpose choice.
>>
>> First, if codegen isn’t used for some reason, the reflection overhead is
>> really significant. That gets much better when you have an interface to
>> call. That’s one reason I’d use this pattern:
>>
>> class DoubleAdd implements ScalarFunction<Double>, SupportsInvoke {
>>   Double produceResult(InternalRow row) {
>> return produceResult(row.getDouble(0), row.getDouble(1));
>>   }
>>
>>   double produceResult(double left, double right) {
>> return left + right;
>>   }
>> }
>>
>> There’s little overhead to adding the InternalRow variation, but we
>> could call it in eval to avoid the reflect overhead. To the point about
>> UDF developers, I think this is a reasonable cost.
>>
>> Second, I think usability is better and helps avoid runtime issues.
>> Here’s an example:
>>
>> class StrLen implements ScalarFunction<Integer>, SupportsInvoke {
>>   Integer produceResult(InternalRow row) {
>> return produceResult(row.getString(0));
>>   }
>>
>>   Integer produceResult(String str) {
>> return str.length();
>>   }
>> }
>>
>> See the bug? I forgot to use UTF8String instead of String. With the
>> InternalRow method, I get a compiler warning because getString produces
>> UTF8String that can’t be passed to produceResult(String). If I decided
>> to implement length separately, then we could still run the InternalRow
>> version and log a warning. The code would be slightly slower, but wouldn’t
>> fail.
>>
>> There are similar situations with varargs where it’s better to call
>> methods that produce concrete types than to cast from Object to some
>> expected type.
>>
>> I think that using invoke is a great extension to the proposal, but I
>> don’t think that it should be the only way to call functions. By all means,
>> let’s work on it in parallel and use it where possible. But I think we do
>> need the fallback of using InternalRow and that it isn’t a usability
>> problem to include it.
>>
>> Oh, and one last thought is that we already have users that call
>> Dataset.map and use InternalRow. This would allow converting that code
>> directly to a UDF.
>>
>> I think we’re closer to agreeing here than it actually looks. Hopefully
>> you’ll agree that having the InternalRow method isn’t a big usability
>> problem.
>>
>> On Wed, Feb 17, 2021 at 11:51 PM Wenchen Fan  wrote:
>>
>>> I don't see any objections to the rest of the proposal (loading
>>> functions from the catalog, function binding stuff, etc.) and I assume
>>> everyone is OK with it. We can commit that part first.
>>>
>>> Currently, the discussion focuses on the `ScalarFunction` API, where I
>>> think it's better to directly take the input columns as the UDF parameter,
>>> instead of wrapping the input columns with InternalRow and taking the
>>> InternalRow as the UDF parameter. It's not only for better performance,
>>> but also for ease of use. For example, it's easier for the UDF developer to
>>> write `input1 + input2` than `inputRow.getLong(0) + inputRow.getLong(1)`,
>> as they don't need to specify the type and index by themselves
>> (getLong(0)), which is error-prone.

Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-18 Thread Mridul Muralidharan
  I agree, Assignee has been used primarily to give recognition to the
contributor who ended up submitting the patch that got merged.
Typically JIRAs remain unassigned - even if one were to be assigned, it
conveys no meaning of ownership or ongoing work: IMO it is equivalent to
an unassigned JIRA.
There could of course be discussions on dev@ or comments in the JIRA or
design docs or WIP PRs - but if a better approach comes along, or previous
work stalls - contributors can (and please, must!) contribute to the JIRA.
Of course, if there is active work going on, such as a PR under review or an
SPIP/proposal being discussed, taking part in that process would help, IMO.

Regards,
Mridul


On Thu, Feb 18, 2021 at 11:00 PM Sean Owen  wrote:

> I don't believe Assignee has ever been used for anything except to give a
> bit of informal credit to the person who drove most of the work on the
> issue, when it's resolved.
> If that's the question - does Assignee mean only that person can work on
> the issue? - then no, it has never meant that.
>
> You say you have an example, one that was resolved. Is this a single case
> or systemic? I don't think I recall seeing problems of this form.
>
> We _have_ had multiple incompatible PRs for a JIRA before, occasionally.
> We have also definitely had people file huge umbrella JIRAs, parts of
> which _nobody_ ever completes, but, for lack of any interest from the filer
> or anyone else.
>
> I think it's fair to give a person a reasonable shot at producing a
> solution if they propose a problem or feature.
> We have had instances where a new contributor files a relatively simple
> issue, and finds another contributor opened the obvious PR before they had
> a chance (maybe they needed a day to get the PR together). That seemed a
> bit discourteous.
>
>  If you need a solution as well, and one isn't forthcoming, just open a PR
> and propose your own? I don't hear that anyone told you not to, but I also
> don't know what this is about. You can always propose a PR as an
> alternative to compare with, to facilitate collaboration. Nothing wrong
> with that.
>
> On Thu, Feb 18, 2021 at 10:45 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> (Actually the real-world case was fixed somehow, and I wouldn't like to
>> point out a fixed one. I just want to make sure that what I think is
>> correct and is considered "consensus".)
>>
>> Just consider a simple case - someone files two different JIRA
>> issues for new features and assigns both to him/herself, without
>> sharing anything about whatever ongoing efforts they have made. (So you
>> cannot even tell whether someone just filed two different JIRA issues
>> without "any" progress and has them in a backlog.) The new features are
>> not novel ideas and are likely something others could work on in parallel.
>>
>> This way, committers can explicitly signal "I'm working on this, so
>> please refrain from making redundant efforts" by assigning the issue,
>> which is effectively similar to the comment "I'm working on this".
>> Unfortunately, this works only when the person who filed the JIRA issue
>> is the only one working on the feature. The occasional opposite case
>> isn't always a matter of ignoring the signal of "I'm working on this";
>> there are also coincidences where two different individuals/teams are
>> working on exactly the same thing at the same time.
>>
>> My concern is that "assignment" might be considered much stronger
>> than just commenting "I'm working on this" - it's like "Regardless of your
>> current progress, I started working on this, so don't consider your effort
>> worth proposing. You should have filed the JIRA issue before I did."
>> Is it possible for contributors to do the same? I guess not.
>>
>> The other problem is multiple assignments in parallel. I wouldn't
>> like to assume someone is over-using the power of assignment, but
>> technically it's entirely possible for someone to file JIRA issues from
>> his/her backlog that could be done in a couple of months or so, assign
>> them to him/herself, and effectively block others from working on or
>> proposing the same. I consider this preemption, which seems bad and even
>> unfair.
>>
>> On Fri, Feb 19, 2021 at 12:14 AM Sean Owen  wrote:
>>
>>> I think it's OK to raise particular instances. It's hard for me to
>>> evaluate further in the abstract.
>>>
>>> I don't think we use Assignee much at all, except to kinda give credit
>>> when something is done. No piece of code or work can be solely owned by one
>>> person; this is just ASF policy.
>>>
>>> I think we've seen the occasional opposite case too: someone starts
>>> working on an issue, and then someone else also starts working on it with a
>>> competing fix or change.
>>>
>>> These are ultimately issues of communication. If an issue is pretty
>>> stalled, and you have a proposal, nothing wrong with just going ahead with
>>> a proposal. There may be no disagreement. It might result in the
>>> other person joining your PR. As I say, not sure if there's a deeper
>>> issue than that if even this hasn't been tried?

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Walaa Eldin Moustafa
I agree with Ryan on the questions around the expressivity of the Invoke
method. It is not clear to me how the Invoke method can be used to declare
UDFs with type-parameterized parameters. For example: a UDF to get the Nth
element of an array (regardless of the Array element type) or a UDF to
merge two Arrays (of generic types) to a Map.

Also, to address Wenchen's InternalRow question, can we create a number of
Function classes, each corresponding to a different number of input parameters
(e.g., ScalarFunction1, ScalarFunction2, etc)?

Thanks,
Walaa.


On Thu, Feb 18, 2021 at 6:07 PM Ryan Blue  wrote:

> I agree with you that it is better in many cases to directly call a
> method. But it is not better in all cases, which is why I don’t think it is
> the right general-purpose choice.
>
> First, if codegen isn’t used for some reason, the reflection overhead is
> really significant. That gets much better when you have an interface to
> call. That’s one reason I’d use this pattern:
>
> class DoubleAdd implements ScalarFunction<Double>, SupportsInvoke {
>   Double produceResult(InternalRow row) {
> return produceResult(row.getDouble(0), row.getDouble(1));
>   }
>
>   double produceResult(double left, double right) {
> return left + right;
>   }
> }
>
> There’s little overhead to adding the InternalRow variation, but we could
> call it in eval to avoid the reflect overhead. To the point about UDF
> developers, I think this is a reasonable cost.
>
> Second, I think usability is better and helps avoid runtime issues. Here’s
> an example:
>
> class StrLen implements ScalarFunction<Integer>, SupportsInvoke {
>   Integer produceResult(InternalRow row) {
> return produceResult(row.getString(0));
>   }
>
>   Integer produceResult(String str) {
> return str.length();
>   }
> }
>
> See the bug? I forgot to use UTF8String instead of String. With the
> InternalRow method, I get a compiler warning because getString produces
> UTF8String that can’t be passed to produceResult(String). If I decided to
> implement length separately, then we could still run the InternalRow
> version and log a warning. The code would be slightly slower, but wouldn’t
> fail.
>
> There are similar situations with varargs where it’s better to call
> methods that produce concrete types than to cast from Object to some
> expected type.
>
> I think that using invoke is a great extension to the proposal, but I
> don’t think that it should be the only way to call functions. By all means,
> let’s work on it in parallel and use it where possible. But I think we do
> need the fallback of using InternalRow and that it isn’t a usability
> problem to include it.
>
> Oh, and one last thought is that we already have users that call
> Dataset.map and use InternalRow. This would allow converting that code
> directly to a UDF.
>
> I think we’re closer to agreeing here than it actually looks. Hopefully
> you’ll agree that having the InternalRow method isn’t a big usability
> problem.
>
> On Wed, Feb 17, 2021 at 11:51 PM Wenchen Fan  wrote:
>
>> I don't see any objections to the rest of the proposal (loading functions
>> from the catalog, function binding stuff, etc.) and I assume everyone is OK
>> with it. We can commit that part first.
>>
>> Currently, the discussion focuses on the `ScalarFunction` API, where I
>> think it's better to directly take the input columns as the UDF parameter,
>> instead of wrapping the input columns with InternalRow and taking the
>> InternalRow as the UDF parameter. It's not only for better performance,
>> but also for ease of use. For example, it's easier for the UDF developer to
>> write `input1 + input2` than `inputRow.getLong(0) + inputRow.getLong(1)`,
>> as they don't need to specify the type and index by themselves
>> (getLong(0)), which is error-prone.
>>
>> It does push more work to the Spark side, but I think it's worth it if
>> implementing UDF gets easier. I don't think the work is very challenging,
>> as we can leverage the infra we built for the expression encoder.
>>
>> I think it's also important to look at the UDF API from the user's
>> perspective (UDF developers). How do you like the UDF API without
>> considering how Spark can support it? Do you prefer the
>> individual-parameters version or the row-parameter version?
>>
>> To move forward, how about we implement the function loading and binding
>> first? Then we can have PRs for both the individual-parameters (I can take
>> it) and row-parameter approaches, if we still can't reach a consensus at
>> that time and need to see all the details.
>>
>> On Thu, Feb 18, 2021 at 4:48 AM Ryan Blue 
>> wrote:
>>
>>> Thanks, Hyukjin. I think that's a fair summary. And I agree with the
>>> idea that we should focus on what Spark will do by default.
>>>
>>> I think we should focus on the proposal, for two reasons: first, there
>>> is a straightforward path to incorporate Wenchen's suggestion via
>>> `SupportsInvoke`, and second, the proposal is more complete: it defines a
>>> solution for many concerns like loading a function and finding out what
>>> types to use -- not just how to call code -- and supports more use cases
>>> like varargs functions.

Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-18 Thread Sean Owen
I don't believe Assignee has ever been used for anything except to give a
bit of informal credit to the person who drove most of the work on the
issue, when it's resolved.
If that's the question - does Assignee mean only that person can work on
the issue? - then no, it has never meant that.

You say you have an example, one that was resolved. Is this a single case
or systemic? I don't think I recall seeing problems of this form.

We _have_ had multiple incompatible PRs for a JIRA before, occasionally.
We have also definitely had people file huge umbrella JIRAs, parts of which
_nobody_ ever completes for lack of any interest from the filer or
anyone else.

I think it's fair to give a person a reasonable shot at producing a
solution if they propose a problem or feature.
We have had instances where a new contributor files a relatively simple
issue, and finds another contributor opened the obvious PR before they had
a chance (maybe they needed a day to get the PR together). That seemed a
bit discourteous.

 If you need a solution as well, and one isn't forthcoming, just open a PR
and propose your own? I don't hear that anyone told you not to, but I also
don't know what this is about. You can always propose a PR as an
alternative to compare with, to facilitate collaboration. Nothing wrong
with that.

On Thu, Feb 18, 2021 at 10:45 PM Jungtaek Lim 
wrote:

> (Actually the real-world case was fixed somehow, and I wouldn't like to
> point out a fixed one. I just want to make sure that what I think is
> correct and is considered "consensus".)
>
> Just consider a simple case - someone files two different JIRA issues
> for new features and assigns both to him/herself, without sharing
> anything about whatever ongoing efforts they have made. (So you cannot
> even tell whether someone just filed two different JIRA issues without
> "any" progress and has them in a backlog.) The new features are not novel
> ideas and are likely something others could work on in parallel.
>
> This way, committers can explicitly signal "I'm working on this, so
> please refrain from making redundant efforts" by assigning the issue,
> which is effectively similar to the comment "I'm working on this".
> Unfortunately, this works only when the person who filed the JIRA issue
> is the only one working on the feature. The occasional opposite case
> isn't always a matter of ignoring the signal of "I'm working on this";
> there are also coincidences where two different individuals/teams are
> working on exactly the same thing at the same time.
>
> My concern is that "assignment" might be considered much stronger
> than just commenting "I'm working on this" - it's like "Regardless of your
> current progress, I started working on this, so don't consider your effort
> worth proposing. You should have filed the JIRA issue before I did."
> Is it possible for contributors to do the same? I guess not.
>
> The other problem is multiple assignments in parallel. I wouldn't like
> to assume someone is over-using the power of assignment, but technically
> it's entirely possible for someone to file JIRA issues from his/her
> backlog that could be done in a couple of months or so, assign them to
> him/herself, and effectively block others from working on or proposing
> the same. I consider this preemption, which seems bad and even unfair.
>
> On Fri, Feb 19, 2021 at 12:14 AM Sean Owen  wrote:
>
>> I think it's OK to raise particular instances. It's hard for me to
>> evaluate further in the abstract.
>>
>> I don't think we use Assignee much at all, except to kinda give credit
>> when something is done. No piece of code or work can be solely owned by one
>> person; this is just ASF policy.
>>
>> I think we've seen the occasional opposite case too: someone starts
>> working on an issue, and then someone else also starts working on it with a
>> competing fix or change.
>>
>> These are ultimately issues of communication. If an issue is pretty
>> stalled, and you have a proposal, nothing wrong with just going ahead with
>> a proposal. There may be no disagreement. It might result in the
>> other person joining your PR. As I say, not sure if there's a deeper issue
>> than that if even this hasn't been tried?
>>
>> On Mon, Feb 15, 2021 at 8:35 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Thanks for the input, Hyukjin!
>>>
>>> I have been keeping to my own policy in all the discussions I have
>>> raised - I provide a hypothetical example close to the actual one and
>>> avoid pointing at it directly. The main purpose of the discussion is to
>>> ensure our policy / consensus makes sense, no more. I can provide a more
>>> detailed explanation if someone feels this one wasn't sufficient to
>>> understand.
>>>
>>> Probably this discussion could serve as a "reminder" to every committer
>>> if a similar discussion was raised before and succeeded in building
>>> consensus. If there's some point where we haven't built consensus yet,
>>> it'd be a good time to discuss further.

Re: Please use Jekyll via "bundle exec" from now on

2021-02-18 Thread Jungtaek Lim
Nice fix. Thanks!

On Thu, Feb 18, 2021 at 7:13 PM Hyukjin Kwon  wrote:

> Thanks Attila for fixing and sharing this.
>
> On Thu, Feb 18, 2021 at 6:17 PM, Attila Zsolt Piros wrote:
>
>> Hello everybody,
>>
>> To pin the exact same version of Jekyll across all the contributors, Ruby
>> Bundler is introduced.
>> This way the differences in the generated documentation, which were
>> caused by using different Jekyll versions, are avoided.
>>
>> Regarding its usage this simply means an extra prefix "*bundle exec*"
>> must be added to call Jekyll, so:
>> - instead of "jekyll build" please use "*bundle exec jekyll build*"
>> - instead of "jekyll serve --watch" please use "*bundle exec jekyll
>> serve --watch*"
>>
>> If you are using an earlier Ruby version, please install Bundler by "*gem
>> install bundler*" (as of Ruby 2.6 Bundler is part of core Ruby).
>>
>> This applies to both "apache/spark" and "apache/spark-website"
>> repositories, where all the documentation has been updated accordingly.
>>
>> For details please check the PRs introducing this:
>> - https://github.com/apache/spark/pull/31559
>> - https://github.com/apache/spark-website/pull/303
>>
>> Sincerely,
>> Attila Piros
>>
>


Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-18 Thread Jungtaek Lim
(Actually the real-world case was fixed somehow, and I wouldn't like to
point out a fixed one. I just want to make sure that what I think is
correct and is considered "consensus".)

Just consider a simple case - someone files two different JIRA issues
for new features and assigns both to him/herself, without sharing
anything about whatever ongoing efforts they have made. (So you cannot
even tell whether someone just filed two different JIRA issues without
"any" progress and has them in a backlog.) The new features are not novel
ideas and are likely something others could work on in parallel.

This way, committers can explicitly signal "I'm working on this, so
please refrain from making redundant efforts" by assigning the issue,
which is effectively similar to the comment "I'm working on this".
Unfortunately, this works only when the person who filed the JIRA issue
is the only one working on the feature. The occasional opposite case
isn't always a matter of ignoring the signal of "I'm working on this";
there are also coincidences where two different individuals/teams are
working on exactly the same thing at the same time.

My concern is that "assignment" might be considered much stronger
than just commenting "I'm working on this" - it's like "Regardless of your
current progress, I started working on this, so don't consider your effort
worth proposing. You should have filed the JIRA issue before I did."
Is it possible for contributors to do the same? I guess not.

The other problem is multiple assignments in parallel. I wouldn't like
to assume someone is over-using the power of assignment, but technically
it's entirely possible for someone to file JIRA issues from his/her
backlog that could be done in a couple of months or so, assign them to
him/herself, and effectively block others from working on or proposing
the same. I consider this preemption, which seems bad and even unfair.

On Fri, Feb 19, 2021 at 12:14 AM Sean Owen  wrote:

> I think it's OK to raise particular instances. It's hard for me to
> evaluate further in the abstract.
>
> I don't think we use Assignee much at all, except to kinda give credit
> when something is done. No piece of code or work can be solely owned by one
> person; this is just ASF policy.
>
> I think we've seen the occasional opposite case too: someone starts
> working on an issue, and then someone else also starts working on it with a
> competing fix or change.
>
> These are ultimately issues of communication. If an issue is pretty
> stalled, and you have a proposal, nothing wrong with just going ahead with
> a proposal. There may be no disagreement. It might result in the
> other person joining your PR. As I say, not sure if there's a deeper issue
> than that if even this hasn't been tried?
>
> On Mon, Feb 15, 2021 at 8:35 PM Jungtaek Lim 
> wrote:
>
>> Thanks for the input, Hyukjin!
>>
>> I have been keeping to my own policy in all the discussions I have raised
>> - I provide a hypothetical example close to the actual one and avoid
>> pointing at it directly. The main purpose of the discussion is to ensure
>> our policy / consensus makes sense, no more. I can provide a more detailed
>> explanation if someone feels this one wasn't sufficient to understand.
>>
>> Probably this discussion could serve as a "reminder" to every committer
>> if a similar discussion was raised before and succeeded in building
>> consensus. If there's some point where we haven't built consensus yet, it'd
>> be a good time to discuss further. I don't know exactly what the earlier
>> discussion and its result were, so I'm not sure what is new here, but I
>> guess this might be a duplicate since you say the issue is similar.
>>
>>
>>
>> On Tue, Feb 16, 2021 at 11:09 AM Hyukjin Kwon 
>> wrote:
>>
>>> I remember I raised a similar issue a long time ago in the dev mailing
>>> list. I agree that setting no assignee makes sense in most of the cases,
>>> and also think we share similar thoughts about the assignee on
>>> umbrella JIRAs, followup tasks, the case when it's clear with a design doc,
>>> etc.
>>> It makes me think that actual problems from setting an assignee happen
>>> rarely, and that it is an issue in several specific cases that would need
>>> a case-by-case look.
>>> Were there specific cases that made you concerned?
>>>
>>>
>>> On Mon, Feb 15, 2021 at 8:58 AM, Jungtaek Lim wrote:
>>>
 Hi devs,

 I'd like to raise a discussion and hear voices on the "assignee"
 practice among committers, which may lead to issues of preemption.

 I feel this is one of the major unfairnesses between contributors and
 committers if used improperly, especially when someone assigns multiple
 JIRA issues to themselves.

 Let's say there're features A and B, each of which may take a month (or
 require a design doc) - both are individual major features, not
 subtasks or some sort of "follow-up".

 Technically, committers can file two JIRA issues and assign both
 issues "without actually making any progress", and implicitly ensure no
 one works on these issues for a couple of months. Even just a plan on a
 backlog can prevent others from taking up the work.


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Ryan Blue
I agree with you that it is better in many cases to directly call a method.
But it is not better in all cases, which is why I don’t think it is the
right general-purpose choice.

First, if codegen isn’t used for some reason, the reflection overhead is
really significant. That gets much better when you have an interface to
call. That’s one reason I’d use this pattern:

class DoubleAdd implements ScalarFunction<Double>, SupportsInvoke {
  Double produceResult(InternalRow row) {
return produceResult(row.getDouble(0), row.getDouble(1));
  }

  double produceResult(double left, double right) {
return left + right;
  }
}

There’s little overhead to adding the InternalRow variation, but we could
call it in eval to avoid the reflect overhead. To the point about UDF
developers, I think this is a reasonable cost.
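
To spell that out, a minimal sketch of the fallback call path (illustrative
only; `evalRow` here is a hypothetical stand-in for the expression's
non-codegen path, and ScalarFunction is the proposed interface with the
InternalRow method):

// With the InternalRow variant on the interface, the non-codegen path
// can make a plain virtual call per row instead of a reflective
// Method.invoke on the typed produceResult overload.
Object evalRow(ScalarFunction<?> udf, InternalRow input) {
  return udf.produceResult(input);
}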

Second, I think usability is better and helps avoid runtime issues. Here’s
an example:

class StrLen implements ScalarFunction<Integer>, SupportsInvoke {
  Integer produceResult(InternalRow row) {
return produceResult(row.getString(0));
  }

  Integer produceResult(String str) {
return str.length();
  }
}

See the bug? I forgot to use UTF8String instead of String. With the
InternalRow method, I get a compiler warning because getString produces
UTF8String that can’t be passed to produceResult(String). If I decided to
implement length separately, then we could still run the InternalRow
version and log a warning. The code would be slightly slower, but wouldn’t
fail.
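
For comparison, a corrected sketch could look like this (illustrative only:
ScalarFunction and SupportsInvoke are still the proposed interfaces, and it
assumes Spark's existing InternalRow.getUTF8String and UTF8String.numChars):

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.unsafe.types.UTF8String;

class StrLen implements ScalarFunction<Integer>, SupportsInvoke {
  Integer produceResult(InternalRow row) {
    // getUTF8String returns Spark's internal string type directly
    return produceResult(row.getUTF8String(0));
  }

  Integer produceResult(UTF8String str) {
    return str.numChars(); // length in characters, not bytes
  }
}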

There are similar situations with varargs where it’s better to call methods
that produce concrete types than to cast from Object to some expected type.

I think that using invoke is a great extension to the proposal, but I don’t
think that it should be the only way to call functions. By all means, let’s
work on it in parallel and use it where possible. But I think we do need
the fallback of using InternalRow and that it isn’t a usability problem to
include it.

Oh, and one last thought is that we already have users that call
Dataset.map and use InternalRow. This would allow converting that code
directly to a UDF.

I think we’re closer to agreeing here than it actually looks. Hopefully
you’ll agree that having the InternalRow method isn’t a big usability
problem.

On Wed, Feb 17, 2021 at 11:51 PM Wenchen Fan  wrote:

> I don't see any objections to the rest of the proposal (loading functions
> from the catalog, function binding stuff, etc.) and I assume everyone is OK
> with it. We can commit that part first.
>
> Currently, the discussion focuses on the `ScalarFunction` API, where I
> think it's better to directly take the input columns as the UDF parameter,
> instead of wrapping the input columns with InternalRow and taking the
> InternalRow as the UDF parameter. It's not only for better performance,
> but also for ease of use. For example, it's easier for the UDF developer to
> write `input1 + input2` than `inputRow.getLong(0) + inputRow.getLong(1)`,
> as they don't need to specify the type and index by themselves (getLong(0))
> which is error-prone.
>
> It does push more work to the Spark side, but I think it's worth it if
> implementing UDF gets easier. I don't think the work is very challenging,
> as we can leverage the infra we built for the expression encoder.
>
> I think it's also important to look at the UDF API from the user's
> perspective (UDF developers). How do you like the UDF API without
> considering how Spark can support it? Do you prefer the
> individual-parameters version or the row-parameter version?
>
> To move forward, how about we implement the function loading and binding
> first? Then we can have PRs for both the individual-parameters (I can take
> it) and row-parameter approaches, if we still can't reach a consensus at
> that time and need to see all the details.
>
> On Thu, Feb 18, 2021 at 4:48 AM Ryan Blue 
> wrote:
>
>> Thanks, Hyukjin. I think that's a fair summary. And I agree with the idea
>> that we should focus on what Spark will do by default.
>>
>> I think we should focus on the proposal, for two reasons: first, there is
>> a straightforward path to incorporate Wenchen's suggestion via
>> `SupportsInvoke`, and second, the proposal is more complete: it defines a
>> solution for many concerns like loading a function and finding out what
>> types to use -- not just how to call code -- and supports more use cases
>> like varargs functions. I think we can continue to discuss the rest of the
>> proposal and be confident that we can support an invoke code path where it
>> makes sense.
>>
>> Does everyone agree? If not, I think we would need to solve a lot of the
>> challenges that I initially brought up with the invoke idea. It seems like
>> a good way to call a function, but needs a real proposal behind it if we
>> don't use it via `SupportsInvoke` in the current proposal.
>>
>> On Tue, Feb 16, 2021 at 11:07 PM Hyukjin Kwon 
>> wrote:
>>
>>> Just to make sure we don’t move past, I think we haven’t decided yet:
>>>
>>>

Re: Bug?

2021-02-18 Thread Tyson
I am not sure if the problem persists in 3.x.

On Thu, Feb 18, 2021 at 12:14 PM Dongjoon Hyun 
wrote:

> Thank you for sharing, Tyson.
>
> Spark 2.4.4 looks too old to me. Do you think it will occur in 3.x?
>
> Bests,
> Dongjoon.
>
>
> On Thu, Feb 18, 2021 at 11:07 AM Tyson  wrote:
>
>> We observed an interesting stack trace that I'd like to share with you.
>> The logging level is WARN, but it appears to be causing task failures.
>> Please let me know if anyone has any insights. It appears to be an integer
>> overflow issue, judging from the code in Spark 2.4.4.
>>
>> WARN TaskSetManager [task-result-getter-0]: Lost task 3175.0 in stage
>> 518.0 (TID 186951, executor 150): java.lang.NegativeArraySizeException
>>
>>  at 
>> org.apache.spark.sql.catalyst.expressions.UnsafeRow.getBinary(UnsafeRow.java:438)
>>  at 
>> org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:414)
>>  at 
>> org.apache.spark.sql.catalyst.expressions.JoinedRow.getDecimal(JoinedRow.scala:95)
>>  at 
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_4$(Unknown
>>  Source)
>>  at 
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>>  Source)
>>  at 
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>>  Source)
>>  at 
>> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:218)
>>  at 
>> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:216)
>>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>>  at 
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
>>  Source)
>>  at 
>> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>>  at 
>> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>>  at 
>> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
>>  at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>>  at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>  at 
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>  at java.lang.Thread.run(Thread.java:748)
>>
>>


Re: Bug?

2021-02-18 Thread Dongjoon Hyun
Thank you for sharing, Tyson.

Spark 2.4.4 looks too old to me. Do you think it will occur in 3.x?

Bests,
Dongjoon.


On Thu, Feb 18, 2021 at 11:07 AM Tyson  wrote:

> We observed an interesting stack trace that I'd like to share with you.
> The logging level is WARN, but it appears to be causing task failures.
> Please let me know if anyone has any insights. It appears to be an integer
> overflow issue, judging from the code in Spark 2.4.4.
>
> WARN TaskSetManager [task-result-getter-0]: Lost task 3175.0 in stage
> 518.0 (TID 186951, executor 150): java.lang.NegativeArraySizeException
>
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.getBinary(UnsafeRow.java:438)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:414)
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getDecimal(JoinedRow.scala:95)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_4$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:218)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:216)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>
>


Bug?

2021-02-18 Thread Tyson
We observed an interesting stack trace that I'd like to share with you. The
logging level is WARN, but it appears to be causing task failures. Please
let me know if anyone has any insights. It appears to be an integer overflow
issue, judging from the code in Spark 2.4.4.

WARN TaskSetManager [task-result-getter-0]: Lost task 3175.0 in stage 518.0
(TID 186951, executor 150): java.lang.NegativeArraySizeException

at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.getBinary(UnsafeRow.java:438)
at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:414)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getDecimal(JoinedRow.scala:95)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_4$(Unknown
Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:218)
at 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:216)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
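
For context, a simplified re-creation of the decode path where this
surfaces (illustrative only, not the verbatim Spark source): UnsafeRow
stores a variable-length field as one long that packs the offset in the
high 32 bits and the size in the low 32 bits, so a corrupted or overflowed
size shows up as a negative array length exactly where the trace points.

// Simplified sketch of UnsafeRow's variable-length field decoding.
static byte[] getBinary(long offsetAndSize, byte[] baseObject) {
  final int offset = (int) (offsetAndSize >> 32);
  final int size = (int) offsetAndSize;  // negative if bit 31 is set
  final byte[] bytes = new byte[size];   // NegativeArraySizeException here
  System.arraycopy(baseObject, offset, bytes, 0, size);
  return bytes;
}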


Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Reynold Xin
Enrico - do feel free to reopen the PRs or email people directly, unless you 
are told otherwise.

On Thu, Feb 18, 2021 at 9:09 AM, Nicholas Chammas < nicholas.cham...@gmail.com 
> wrote:

> 
> On Thu, Feb 18, 2021 at 10:34 AM Sean Owen  wrote:
> 
> 
>> There is no way to force people to review or commit something of course.
>> And keep in mind we get a lot of, shall we say, unuseful pull requests.
>> There is occasionally some blowback to closing someone's PR, so the path
>> of least resistance is often the timeout / 'soft close'. That is, it takes
>> a lot more time to satisfactorily debate down the majority of PRs that
>> probably shouldn't get merged, and there just isn't that much bandwidth.
>> That said of course it's bad if lots of good PRs are getting lost in the
>> shuffle and I am sure there are some.
>> 
>> 
>> One other aspect is that a committer is taking some degree of
>> responsibility for merging a change, so the ask is more than just a few
>> minutes of eyeballing. If it breaks something the merger pretty much owns
>> resolving it, and, the whole project owns any consequence of the change
>> for the future.
>> 
> 
> 
> 
> +1
>



Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 10:34 AM Sean Owen  wrote:

> There is no way to force people to review or commit something of course.
> And keep in mind we get a lot of, shall we say, unuseful pull requests.
> There is occasionally some blowback to closing someone's PR, so the path of
> least resistance is often the timeout / 'soft close'. That is, it takes a
> lot more time to satisfactorily debate down the majority of PRs that
> probably shouldn't get merged, and there just isn't that much bandwidth.
> That said of course it's bad if lots of good PRs are getting lost in the
> shuffle and I am sure there are some.
>
> One other aspect is that a committer is taking some degree of
> responsibility for merging a change, so the ask is more than just a few
> minutes of eyeballing. If it breaks something the merger pretty much owns
> resolving it, and, the whole project owns any consequence of the change for
> the future.
>

+1


Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 9:58 AM Enrico Minack 
wrote:

> *What is the approved way to ...*
>
> *... prevent it from being auto-closed?* Committing and commenting to the
> PR does not prevent it from being closed the next day.
>
Committing and commenting should prevent the PR from being closed. It may
be that commenting after the stale message has been posted does not work
(which would likely be a bug in the action or in our config),
but there are PRs that have been open for months with consistent activity
that do not get closed.

So at the very least, proactively committing or commenting every month will
keep the PR open. However, that's not the real problem, right? The real
problem is getting committer attention.

> *...* *re-open it?* The comment says "If you'd like to revive this PR,
> please reopen it ...", but there is no re-open button anywhere on the PR!

I don't know if there is a repo setting here that allows non-committers to
reopen their own closed PRs. At the very worst, you can always open a new
PR from the same branch, though we should update the stale message text if
contributors cannot in fact reopen their own PRs.

> What is the expected contributor's response to a PR that does not get
> feedback? Giving up?
>
I've baby-sat several PRs that took months to get in. Here's an example off
the top of my head (5-6 months to be merged in). I'm sure everyone on here,
including most committers themselves, has had this experience. It's common.
The expected
response is to be persistent, to try to find a committer or shepherd for
your PR, and to proactively make your PR easier to review.

> Are there processes in place to increase the probability PRs do not get
> forgotten, auto-closed and lost?
>
There are things you can do as a contributor to increase the likelihood
your PR will get reviewed, but I wouldn't call them "processes". This is an
open source project built on corporate sponsorship for some stuff and
volunteer energy for everything else. There is no guarantee or formal
obligation for anyone to do the work of reviewing PRs. That's just the
nature of open source work.

The things that you can do are:

   - Make your PR as small and focused as possible.
   - Make sure the build is passing and that you've followed the
   contributing guidelines.
   - Find the people who most recently worked on the area you're touching
   and ask them for help.
   - Address reviewers' requests and concerns.
   - Try to get some committer buy-in for your idea before spending time
   developing it.
   - Ask for input on the dev list for your PR.

Basically, most of the advice boils down to "make it easy for reviewers".
Even then, though, sometimes things won't work out (5-6 months and closed
without merging). It's just the nature of contributing to a large project like
Spark where there is a lot going on.


Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Sean Owen
Holden is absolutely correct - pinging relevant individuals is probably
your best bet. I skim the 40-50 PRs that have activity each day and look
into a few that look like I would know something about by the title, but
it's easy to miss something I could weigh in on.

There is no way to force people to review or commit something of course.
And keep in mind we get a lot of, shall we say, unuseful pull requests.
There is occasionally some blowback to closing someone's PR, so the path of
least resistance is often the timeout / 'soft close'. That is, it takes a
lot more time to satisfactorily debate down the majority of PRs that
probably shouldn't get merged, and there just isn't that much bandwidth.
That said of course it's bad if lots of good PRs are getting lost in the
shuffle and I am sure there are some.

One other aspect is that a committer is taking some degree of
responsibility for merging a change, so the ask is more than just a few
minutes of eyeballing. If it breaks something the merger pretty much owns
resolving it, and, the whole project owns any consequence of the change for
the future.

I think it might just be committers that can reopen at this point, not sure
if that changed. But you'll probably need someone's attention anyway to
make progress.

Without knowing specific PRs, I can't say whether there was a good reason,
bad reason, or no particular reason it wasn't engaged. I think it's OK to
float a PR or two you really believe should have gotten attention to dev@,
but yeah in the end you need to find the person who has most touched that
code really.

The general advice from https://spark.apache.org/contributing.html is still
valuable. Clear fixes are easier to say 'yes' to than big refactorings.
Improving docs, tests, existing features is better than adding big new
things, etc.


On Thu, Feb 18, 2021 at 8:58 AM Enrico Minack 
wrote:

> Hi Spark Developers,
>
> I have a fundamental question on the process of contributing to Apache
> Spark from outside the circle of committers.
>
> I have gone through a number of pull requests and I always found it hard
> to get feedback, especially from committers. I understand there is a very
> high competition for getting attention of those few committers. Given
> Spark's code base is so huge, only very few people will feel comfortable
> approving code changes for a specific code section. Still, the motivation
> of those that want to contribute suffers from this.
>
> In particular I am getting annoyed by the auto-closing PR feature on
> GitHub. I understand the usefulness of this feature for such a frequent
> project, but I personally am impacted by the weaknesses of this approach. I
> hope, this can be improved.
>
> The feature first warns in advance that it is "... closing this PR because
> it hasn't been updated in a while". This comment looks a bit silly in
> situations where the contributor is waiting for committers' feedback.
>
> *What is the approved way to ...*
>
> *... prevent it from being auto-closed?* Committing and commenting to the
> PR does not prevent it from being closed the next day.
> *...* *re-open it?* The comment says "If you'd like to revive this PR,
> please reopen it ...", but there is no re-open button anywhere on the PR!
>
> *... remove the Stale tag?* The comment says "...  ask a committer to
> remove the Stale tag!". Where can I find a list of committers and their
> contact details? What is the best way to contact them? E-Mail? Mentioning
> them in a PR comment?
>
> *... find the right committer to review a PR?* The contributors page
> states "ping likely reviewers", but it does not state how to identify
> likely reviewers. Do you recommend git-blaming the relevant code section?
> What if those committers are not available any more? Whom to ask next?
>
> *... contact committers to get their attention?* Cc'ing them in PR
> comments? Sending E-Mails? Doesn't that contribute to their cognitive load?
>
> What is the expected contributor's response to a PR that does not get
> feedback? Giving up?
>
> Are there processes in place to increase the probability PRs do not get
> forgotten, auto-closed and lost?
>
>
> This is not about my specific pull requests or reviewers of those. I
> appreciate their time and engagement.
> This is about the general process of getting feedback and needed
> improvements for it in order to increase contributor community happiness.
>
> Cheers,
> Enrico
>


Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-18 Thread Sean Owen
I think it's OK to raise particular instances. It's hard for me to evaluate
further in the abstract.

I don't think we use Assignee much at all, except to kinda give credit when
something is done. No piece of code or work can be solely owned by one
person; this is just ASF policy.

I think we've seen the occasional opposite case too: someone starts working
on an issue, and then someone else also starts working on it with a
competing fix or change.

These are ultimately issues of communication. If an issue is pretty
stalled, and you have a proposal, nothing wrong with just going ahead with
a proposal. There may be no disagreement. It might result in the
other person joining your PR. As I say, not sure if there's a deeper issue
than that if even this hasn't been tried?

On Mon, Feb 15, 2021 at 8:35 PM Jungtaek Lim 
wrote:

> Thanks for the input, Hyukjin!
>
> I have been keeping to my own policy in all the discussions I have raised -
> I provide a hypothetical example close to the actual one and avoid pointing
> at it directly. The main purpose of the discussion is to ensure our
> policy / consensus makes sense, no more. I can provide a more detailed
> explanation if someone feels this one wasn't sufficient to understand.
>
> Probably this discussion could serve as a "reminder" to every committer if
> a similar discussion was raised before and succeeded in building consensus.
> If there's some point where we haven't built consensus yet, it'd be a good
> time to discuss further. I don't know exactly what the earlier discussion
> and its result were, so I'm not sure what is new here, but I guess this
> might be a duplicate since you say the issue is similar.
>
>
>
> On Tue, Feb 16, 2021 at 11:09 AM Hyukjin Kwon  wrote:
>
>> I remember I raised a similar issue a long time ago in the dev mailing
>> list. I agree that setting no assignee makes sense in most of the cases,
>> and also think we share similar thoughts about the assignee on
>> umbrella JIRAs, followup tasks, the case when it's clear with a design doc,
>> etc.
>> It makes me think that actual problems from setting an assignee happen
>> rarely, and that it is an issue in several specific cases that would need
>> a case-by-case look.
>> Were there specific cases that made you concerned?
>>
>>
>> On Mon, Feb 15, 2021 at 8:58 AM, Jungtaek Lim wrote:
>>
>>> Hi devs,
>>>
>>> I'd like to raise a discussion and hear voices on the "assignee"
>>> practice among committers, which may lead to issues of preemption.
>>>
>>> I feel this is one of the major unfairnesses between contributors and
>>> committers if used improperly, especially when someone assigns multiple
>>> JIRA issues to themselves.
>>>
>>> Let's say there're features A and B, each of which may take a month (or
>>> require a design doc) - both are individual major features, not subtasks
>>> or some sort of "follow-up".
>>>
>>> Technically, committers can file two JIRA issues and assign both
>>> issues "without actually making any progress", and implicitly ensure no
>>> one works on these issues for a couple of months. Even just a plan on a
>>> backlog can prevent others from taking up the work.
>>>
>>> I don't think this is fair to contributors, because contributors don't
>>> tend to file a JIRA issue unless they have made a lot of progress. (I'd
>>> like to remind you, competition from the contributor's position is quite
>>> tense and
>>> stressful.) Say they already spent a month working on it and testing it in
>>> production. They feel ready and visit JIRA, and realize the JIRA issue was
>>> made and assigned to someone, while there's no progress on the JIRA issue.
>>> No idea how much progress "someone" makes. They "might" ask about the
>>> progress, but nothing will change if "someone" simply says "I'm still
>>> working on this" (with even 1% of progress). Isn't this actually against
>>> the reason we don't allow setting the assignee to a contributor?
>>>
>>> For sure, assigning the issue would make sense if the issue is a subtask
>>> or follow-up, or the issue has made explicit progress, like a design doc
>>> being put up. In other cases I don't see any reason to assign the issue
>>> explicitly. Someone may say to contributors, just leave a comment "I'm
>>> working on it", but isn't that something committers can also do when they
>>> are "actually" working?
>>>
>>> I think committers should have no advantage in any possible competition
>>> over contributions, and setting an assignee without explicit progress
>>> makes me worried.
>>> To make it fair, we should either allow contributors to assign themselves
>>> or not allow committers to assign themselves except in extreme cases -
>>> they can still use the approach contributors do.
>>> (Again, I'd feel OK with assigning if there's a design doc proving that
>>> they really have spent non-trivial effort already. My point is about
>>> preempting JIRA issues with only sketched ideas or even just
>>> rationalizations.)
>>>
>>> Would like to hear everyone's voices.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> ps. better yet, probably it's better the

Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Holden Karau
Git blame is a good way to figure out likely reviewers (e.g., who’s
been working in the area). Another is who filed the JIRA if it’s not you.

On Thu, Feb 18, 2021 at 6:58 AM Enrico Minack 
wrote:

> Hi Spark Developers,
>
> I have a fundamental question on the process of contributing to Apache
> Spark from outside the circle of committers.
>
> I have gone through a number of pull requests and I always found it hard
> to get feedback, especially from committers. I understand there is a very
> high competition for getting attention of those few committers. Given
> Spark's code base is so huge, only very few people will feel comfortable
> approving code changes for a specific code section. Still, the motivation
> of those that want to contribute suffers from this.
>
> In particular I am getting annoyed by the auto-closing PR feature on
> GitHub. I understand the usefulness of this feature for such a frequent
> project, but I personally am impacted by the weaknesses of this approach. I
> hope, this can be improved.
>
> The feature first warns in advance that it is "... closing this PR because
> it hasn't been updated in a while". This comment looks a bit silly in
> situations where the contributor is waiting for committers' feedback.
>
> *What is the approved way to ...*
>
> *... prevent it from being auto-closed?* Committing and commenting to the
> PR does not prevent it from being closed the next day.
> *...* *re-open it?* The comment says "If you'd like to revive this PR,
> please reopen it ...", but there is no re-open button anywhere on the PR!
>
> *... remove the Stale tag?* The comment says "...  ask a committer to
> remove the Stale tag!". Where can I find a list of committers and their
> contact details? What is the best way to contact them? E-Mail? Mentioning
> them in a PR comment?
>
> *... find the right committer to review a PR?* The contributors page
> states "ping likely reviewers", but it does not state how to identify
> likely reviewers. Do you recommend git-blaming the relevant code section?
> What if those committers are not available any more? Whom to ask next?
>
> *... contact committers to get their attention?* Cc'ing them in PR
> comments? Sending E-Mails? Doesn't that contribute to their cognitive load?
>
> What is the expected contributor's response to a PR that does not get
> feedback? Giving up?
>
> Are there processes in place to increase the probability PRs do not get
> forgotten, auto-closed and lost?
>
>
> This is not about my specific pull requests or reviewers of those. I
> appreciate their time and engagement.
> This is about the general process of getting feedback and needed
> improvements for it in order to increase contributor community happiness.
>
> Cheers,
> Enrico
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Enrico Minack

Hi Spark Developers,

I have a fundamental question on the process of contributing to Apache 
Spark from outside the circle of committers.


I have gone through a number of pull requests and I always found it hard 
to get feedback, especially from committers. I understand there is a 
very high competition for getting attention of those few committers. 
Given Spark's code base is so huge, only very few people will feel 
comfortable approving code changes for a specific code section. Still, 
the motivation of those that want to contribute suffers from this.


In particular I am getting annoyed by the auto-closing PR feature on 
GitHub. I understand the usefulness of this feature for such a frequent 
project, but I personally am impacted by the weaknesses of this 
approach. I hope, this can be improved.


The feature first warns in advance that it is "... closing this PR 
because it hasn't been updated in a while". This comment looks a bit 
silly in situations where the contributor is waiting for committers' 
feedback.


*What is the approved way to ...*

*... prevent it from being auto-closed?* Committing and commenting to 
the PR does not prevent it from being closed the next day.


*...* *re-open it?* The comment says "If you'd like to revive this PR, 
please reopen it ...", but there is no re-open button anywhere on the PR!


*... remove the Stale tag?* The comment says "...  ask a committer to 
remove the Stale tag!". Where can I find a list of committers and their 
contact details? What is the best way to contact them? E-Mail? 
Mentioning them in a PR comment?


*... find the right committer to review a PR?* The contributors page 
states "ping likely reviewers", but it does not state how to identify 
likely reviewers. Do you recommend git-blaming the relevant code 
section? What if those committers are not available any more? Whom to 
ask next?


*... contact committers to get their attention?* Cc'ing them in PR 
comments? Sending E-Mails? Doesn't that contribute to their cognitive load?


What is the expected contributor's response to a PR that does not get 
feedback? Giving up?


Are there processes in place to increase the probability PRs do not get 
forgotten, auto-closed and lost?



This is not about my specific pull requests or reviewers of those. I 
appreciate their time and engagement.
This is about the general process of getting feedback and needed 
improvements for it in order to increase contributor community happiness.


Cheers,
Enrico



Re: Please use Jekyll via "bundle exec" from now on

2021-02-18 Thread Hyukjin Kwon
Thanks Attila for fixing and sharing this.

On Thu, Feb 18, 2021 at 6:17 PM, Attila Zsolt Piros wrote:

> Hello everybody,
>
> To pin the exact same version of Jekyll across all the contributors, Ruby
> Bundler is introduced.
> This way the differences in the generated documentation, which were caused
> by using different Jekyll versions, are avoided.
>
> Regarding its usage this simply means an extra prefix "*bundle exec*"
> must be added to call Jekyll, so:
> - instead of "jekyll build" please use "*bundle exec jekyll build*"
> - instead of "jekyll serve --watch" please use "*bundle exec jekyll serve
> --watch*"
>
> If you are using an earlier Ruby version, please install Bundler by "*gem
> install bundler*" (as of Ruby 2.6 Bundler is part of core Ruby).
>
> This applies to both "apache/spark" and "apache/spark-website"
> repositories, where all the documentation has been updated accordingly.
>
> For details please check the PRs introducing this:
> - https://github.com/apache/spark/pull/31559
> - https://github.com/apache/spark-website/pull/303
>
> Sincerely,
> Attila Piros
>


Please use Jekyll via "bundle exec" from now on

2021-02-18 Thread Attila Zsolt Piros
Hello everybody,

To pin the exact same version of Jekyll across all the contributors, Ruby
Bundler is introduced.
This way the differences in the generated documentation, which were caused
by using different Jekyll versions, are avoided.

Regarding its usage this simply means an extra prefix "*bundle exec*" must
be added to call Jekyll, so:
- instead of "jekyll build" please use "*bundle exec jekyll build*"
- instead of "jekyll serve --watch" please use "*bundle exec jekyll serve
--watch*"

If you are using an earlier Ruby version, please install Bundler by "*gem
install bundler*" (as of Ruby 2.6 Bundler is part of core Ruby).

This applies to both "apache/spark" and "apache/spark-website" repositories
where all the documentation has been updated accordingly.

For details please check the PRs introducing this:
- https://github.com/apache/spark/pull/31559
- https://github.com/apache/spark-website/pull/303

Sincerely,
Attila Piros