Re: Improving PR workload management for Arrow maintainers

2021-06-29 Thread Weston Pace
I apologize.  I did plan on working on this but it's taken a back seat
for a while.  I would still recommend shying away from a standalone
UI.  You will end up making a lot of requests (and possibly running
into Github throttles) if you want detailed PR information for all of
the PRs.  To work around those limitations the Spark example that I
looked at kept a standalone database and polled Github on a regular
basis.  This works but then you have quite a bit of complexity (it's
no longer a simple static web page you can just host somewhere, you'll
need to pay for a backend server and also the cost of maintaining that
server).  Also, you may find yourself continuously playing catchup to
add features that exist in Github or face users migrating away from
the custom tool.

The approach I was pursuing was a single Github action repository to
add labels similar to those described by Andrew Lamb.  You could make
it quite complex but I think a simple state machine would be:

New PR Created (not in draft) -> Add "Needs review" label
PR moved into draft -> Remove "Needs review" and "changes requested" labels.
PR Review added with state "Changes Needed" -> Remove "Needs review",
add "changes requested", add comment explaining how to report changes
have been made
Comment made with "I have completed all requested changes" -> Remove
"changes requested", add "needs review", re-request all reviewers
Nightly cron job -> Any PR that has had the "needs review" label for X
days gets "needs attention" label
Nightly cron job -> Any PR that has had the "changes requested" label
for Y days gets "stale" label, add comment explaining why that
happened and encouraging the user to state if they want someone else
to take over the PR.

I investigated automatic adding/removing of labels based on passing /
failing checks but the checks in Arrow are not stable enough I think
and getting that information out of Github is rather tricky.

I don't know that I'll have time to work on this at the moment but I
think it'd be pretty straightforward to build such an action if anyone
is interested.  Also, it sounds like cython has something similar.  If
it is simple enough we could jsut steal it.

On Tue, Jun 29, 2021 at 6:00 AM Antoine Pitrou  wrote:
>
>
> Le 29/06/2021 à 15:25, Wes McKinney a écrit :
> > On Tue, Jun 29, 2021 at 3:10 PM Andrew Lamb  wrote:
> >>
> >> The thing that would make me more efficient reviewing PRs is figuring out
> >> which one of the open reviews are ready for additional feedback.
> >
> > Yes, I think this would be the single most significant quality-of-life
> > improvement for reviewers.
>
> Agreed as well.
>
> The CPython project uses dedicated labels for that (some automatically
> set/unset) as well as a bot that pesters contributors to mention when
> their PR is ready for review again.  It helps assert that the labelled
> PR status reflects their actual status accurately.
>
> See some examples here:
> https://github.com/python/cpython/pull/26941#issuecomment-870643346
> https://github.com/python/cpython/pull/26772#issuecomment-866020819
> https://github.com/python/cpython/pull/26677#pullrequestreview-682724234
>
> Regards
>
> Antoine.
>
>
> >
> >> I think the idea of a webapp or something that shows active reviews would
> >> be helpful (though I get most of that from appropriate email filters).
> >>
> >> What about a system involving labels (for which there is already a basic
> >> GUI in github)? Something low tech like
> >>
> >> (Waiting for Review)
> >> (Addressing Feedback)
> >> (Approved, waiting for Merge)
> >>
> >> With maybe some automation prompting people to add the "Waiting on Review"
> >> label when they want feedback
> >
> > I think it would have to be a bot that automatically sets the labels.
> > If it requires contributors to take some action outside of pushing new
> > work (new commits or a rebased version of the patch) to the PR and
> > leaving responses to comments on the PR, the system is likely to fail
> > some non-trivial percentage of the time.
> >
> > Given the quality of off-the-shelf web app components nowadays (e.g.
> > https://material-ui.com), throwing together a read-only PR dashboard
> > that shows what has changed since you last interacted with them (along
> > with some other helpful things, like whether the build is passing) is
> > "probably" not a super heavy lift. I haven't done any frontend
> > development in years so while the backend part (writing Python code to
> > wrangle data from GitHub's REST API and put it in a SQLite database)
> > wouldn't take very long I would need some help on the front end
> > portion and setting it up for deployment on DigitalOcean or somewhere.
> >
> >> Andrew
> >>
> >> On Tue, Jun 29, 2021 at 4:28 AM Wes McKinney  wrote:
> >>
> >>> hi folks,
> >>>
> >>> I've noted that the volume of PRs for Arrow has been steadily
> >>> increasing (and will likely continue to increase), and while I've
> >>> personally had less time for development / maintenance / code 

Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Niranda Perera
Ah. Sorry, my mistake! there's a separate `ExecNonNull()` method! 😇

On Tue, Jun 29, 2021 at 12:00 PM Antoine Pitrou  wrote:

>
> Le 29/06/2021 à 17:58, Niranda Perera a écrit :
> > So, FWIU, in vector selection, the output array would always have a
> > non-null validity buffer, isn't it?
>
> Why?
>
>
> >
> > On Tue, Jun 29, 2021 at 11:54 AM Antoine Pitrou 
> wrote:
> >
> >>
> >>
> >> Le 29/06/2021 à 17:49, Niranda Perera a écrit :
> >>> Hi all,
> >>>
> >>> I'm looking into this now, and I feel like there's a potential memory
> >>> corruption at the very end of the out_data_ array.
> >>>
> >>> algo:
> >>> bool advance = BitUtil::GetBit(filter_data_, filter_offset_ +
> >> in_position);
> >>> BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
> >>> out_data_[out_position_] = values_data_[in_position];
> >>> out_position_ += advance; // may need static_cast here
> >>> ++in_position;
> >>> say,
> >>> in.data =   [1, 2, 3, 4] // all valid
> >>> filt.data = [1, 1, 0, 0] // all valid
> >>> At in_position = 1, out_position becomes 2. And till the end of the
> loop
> >>> (i.e. in_position <= 3), out_position[2] will be updated (but with
> >>> out_is_valid_[2] set to 0). I belive at the end of the loop, following
> >>> would be the output.
> >>> out.data =  [1, 2, 4]
> >>> out.valid = [1, 1, 0]
> >>>
> >>> Wouldn't this be a problem?
> >>
> >> Indeed, you have to allocate N + 1 entries for the result array instead
> >> of N.  Probably not a big deal, IMHO.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >
> >
>


-- 
Niranda Perera
https://niranda.dev/
@n1r44 


Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Antoine Pitrou



Le 29/06/2021 à 17:58, Niranda Perera a écrit :

So, FWIU, in vector selection, the output array would always have a
non-null validity buffer, isn't it?


Why?




On Tue, Jun 29, 2021 at 11:54 AM Antoine Pitrou  wrote:




Le 29/06/2021 à 17:49, Niranda Perera a écrit :

Hi all,

I'm looking into this now, and I feel like there's a potential memory
corruption at the very end of the out_data_ array.

algo:
bool advance = BitUtil::GetBit(filter_data_, filter_offset_ +

in_position);

BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast here
++in_position;
say,
in.data =   [1, 2, 3, 4] // all valid
filt.data = [1, 1, 0, 0] // all valid
At in_position = 1, out_position becomes 2. And till the end of the loop
(i.e. in_position <= 3), out_position[2] will be updated (but with
out_is_valid_[2] set to 0). I belive at the end of the loop, following
would be the output.
out.data =  [1, 2, 4]
out.valid = [1, 1, 0]

Wouldn't this be a problem?


Indeed, you have to allocate N + 1 entries for the result array instead
of N.  Probably not a big deal, IMHO.

Regards

Antoine.






Re: Improving PR workload management for Arrow maintainers

2021-06-29 Thread Antoine Pitrou



Le 29/06/2021 à 15:25, Wes McKinney a écrit :

On Tue, Jun 29, 2021 at 3:10 PM Andrew Lamb  wrote:


The thing that would make me more efficient reviewing PRs is figuring out
which one of the open reviews are ready for additional feedback.


Yes, I think this would be the single most significant quality-of-life
improvement for reviewers.


Agreed as well.

The CPython project uses dedicated labels for that (some automatically 
set/unset) as well as a bot that pesters contributors to mention when 
their PR is ready for review again.  It helps assert that the labelled 
PR status reflects their actual status accurately.


See some examples here:
https://github.com/python/cpython/pull/26941#issuecomment-870643346
https://github.com/python/cpython/pull/26772#issuecomment-866020819
https://github.com/python/cpython/pull/26677#pullrequestreview-682724234

Regards

Antoine.





I think the idea of a webapp or something that shows active reviews would
be helpful (though I get most of that from appropriate email filters).

What about a system involving labels (for which there is already a basic
GUI in github)? Something low tech like

(Waiting for Review)
(Addressing Feedback)
(Approved, waiting for Merge)

With maybe some automation prompting people to add the "Waiting on Review"
label when they want feedback


I think it would have to be a bot that automatically sets the labels.
If it requires contributors to take some action outside of pushing new
work (new commits or a rebased version of the patch) to the PR and
leaving responses to comments on the PR, the system is likely to fail
some non-trivial percentage of the time.

Given the quality of off-the-shelf web app components nowadays (e.g.
https://material-ui.com), throwing together a read-only PR dashboard
that shows what has changed since you last interacted with them (along
with some other helpful things, like whether the build is passing) is
"probably" not a super heavy lift. I haven't done any frontend
development in years so while the backend part (writing Python code to
wrangle data from GitHub's REST API and put it in a SQLite database)
wouldn't take very long I would need some help on the front end
portion and setting it up for deployment on DigitalOcean or somewhere.


Andrew

On Tue, Jun 29, 2021 at 4:28 AM Wes McKinney  wrote:


hi folks,

I've noted that the volume of PRs for Arrow has been steadily
increasing (and will likely continue to increase), and while I've
personally had less time for development / maintenance / code reviews
over the last year, I would like to have a discussion about what we
could do to improve our tooling for maintainers to optimize the
efficiency of time spent tending to the PR queue. In my own
experience, I have felt that I have wasted a lot of time digging
around the queue looking for PRs that are awaiting feedback or need to
be merged.

I note first of all that around 70 out of 173 open PRs have been
updated in the last 7 days, so while there is some PR staleness, to
have nearly half of the PRs active is pretty good. That said, ~70
active PRs is a lot of PRs to tend to.

I scraped the project's code review comment history, and here are the
individuals who have left the most comments on PRs since genesis

pitrou6802
wesm  5023
emkornfield   3032
bkietz2834
kou   1489
nealrichardson1439
fsaintjacques 1356
kszucs1250
alamb 1133
jorisvandenbossche1094
liyafan82  831
lidavidm   816
westonpace 794
xhochy 770
nevi-me643
BryanCutler639
jorgecarleitao 635
cpcloud551
sunchao536
ianmcook   499

Since we're probably stuck using GitHub to receive code contributions
(as opposed to systems — Gerrit is one I'm familiar with — that
provide more structure for reviewers to track the patches they "own"
as well as the outgoing/incoming state of reviews), I am wondering
what kinds of tools we could create to make it easier for maintainers
to keep track of PRs they are shepherding through the contribution
process. Ideally this wouldn't involve maintainers having to engage in
some explicit action like assigning themselves as a PR reviewer.

Here's one idea: a web application that displays "your reviews", a
table of PRs that you have interacted with in any way (commented, left
code review, assigned as reviewer, someone mentioned you, etc.) sorted
either by last commit or last comment to assess "freshness". So if you
comment on a PR or leave a code review, it will automatically show up
in "your reviews". It could also indicate whether there has been
activity on the PR since the last time you interacted with it.

Having now used the GitHub API to pull comments from PRs for the above
analysis, there is certainly enough information available to help
create this kind of tool. I'd be willing to 

Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Niranda Perera
So, FWIU, in vector selection, the output array would always have a
non-null validity buffer, isn't it?

On Tue, Jun 29, 2021 at 11:54 AM Antoine Pitrou  wrote:

>
>
> Le 29/06/2021 à 17:49, Niranda Perera a écrit :
> > Hi all,
> >
> > I'm looking into this now, and I feel like there's a potential memory
> > corruption at the very end of the out_data_ array.
> >
> > algo:
> > bool advance = BitUtil::GetBit(filter_data_, filter_offset_ +
> in_position);
> > BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
> > out_data_[out_position_] = values_data_[in_position];
> > out_position_ += advance; // may need static_cast here
> > ++in_position;
> > say,
> > in.data =   [1, 2, 3, 4] // all valid
> > filt.data = [1, 1, 0, 0] // all valid
> > At in_position = 1, out_position becomes 2. And till the end of the loop
> > (i.e. in_position <= 3), out_position[2] will be updated (but with
> > out_is_valid_[2] set to 0). I belive at the end of the loop, following
> > would be the output.
> > out.data =  [1, 2, 4]
> > out.valid = [1, 1, 0]
> >
> > Wouldn't this be a problem?
>
> Indeed, you have to allocate N + 1 entries for the result array instead
> of N.  Probably not a big deal, IMHO.
>
> Regards
>
> Antoine.
>


-- 
Niranda Perera
https://niranda.dev/
@n1r44 


Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Antoine Pitrou




Le 29/06/2021 à 17:49, Niranda Perera a écrit :

Hi all,

I'm looking into this now, and I feel like there's a potential memory
corruption at the very end of the out_data_ array.

algo:
bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast here
++in_position;
say,
in.data =   [1, 2, 3, 4] // all valid
filt.data = [1, 1, 0, 0] // all valid
At in_position = 1, out_position becomes 2. And till the end of the loop
(i.e. in_position <= 3), out_position[2] will be updated (but with
out_is_valid_[2] set to 0). I belive at the end of the loop, following
would be the output.
out.data =  [1, 2, 4]
out.valid = [1, 1, 0]

Wouldn't this be a problem?


Indeed, you have to allocate N + 1 entries for the result array instead 
of N.  Probably not a big deal, IMHO.


Regards

Antoine.


Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Niranda Perera
Hi all,

I'm looking into this now, and I feel like there's a potential memory
corruption at the very end of the out_data_ array.

algo:
bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast here
++in_position;
say,
in.data =   [1, 2, 3, 4] // all valid
filt.data = [1, 1, 0, 0] // all valid
At in_position = 1, out_position becomes 2. And till the end of the loop
(i.e. in_position <= 3), out_position[2] will be updated (but with
out_is_valid_[2] set to 0). I belive at the end of the loop, following
would be the output.
out.data =  [1, 2, 4]
out.valid = [1, 1, 0]

Wouldn't this be a problem?


On Fri, Jun 25, 2021 at 12:19 AM Nate Bauernfeind <
natebauernfe...@deephaven.io> wrote:

> > Basically, it reset/set the borrow bit in eflag register based on the if
> condition, and runs `outpos = outpos - (-1) - borrow_bit`.
>
> That's clever, and I clearly didn't see that!
>
> On Thu, Jun 24, 2021 at 8:57 PM Yibo Cai  wrote:
>
> >
> >
> > On 6/25/21 6:58 AM, Nate Bauernfeind wrote:
> > > FYI, the bench was slightly broken; but the results stand.
> > >
> > >> benchmark::DoNotOptimize(output[rand()]);
> > > Since rand() has a domain of 0 to MAX_INT it blows past the output
> array
> > > (of length 4k). It segfaults in GCC; I'm not sure why the Clang
> benchmark
> > > is happy with that.
> > >
> > > I modified [1] it to:
> > >> benchmark::DoNotOptimize(output[rand() % N]);
> > >
> > > The benchmarks run:
> > > The Clang11 -O3 speed up is 2.3x.
> > > The GCC10.2 -O3 speed up is 2.6x.
> > >
> > > Interestingly, I added a second branching benchmark. The original
> > branching
> > > does this:
> > > ```
> > >if (bitmap[i/8] & (1 << (i%8))) {
> > >  output[outpos++] = input[i];
> > >}
> > >
> > ```
> > >
> > > My additional branching benchmark pulls the assignment outside of the
> > > branch:
> > > ```
> > >output[outpos] = input[i];
> > >if (bitmap[i/8] & (1 << (i%8))) {
> > >  ++outpos;
> > >}
> > > ```
> >
> >  From the disassembler, compiler optimizes this c code with below
> > branch-less assembly code.
> >
> > 10.06% sar%cl,%edx
> > 10.74% and$0x1,%edx
> > 9.93%  cmp$0x1,%edx
> > sbb$0x,%esi
> >
> > Basically, it reset/set the borrow bit in eflag register based on the if
> > condition, and runs `outpos = outpos - (-1) - borrow_bit`.
> >
> > >
> > > The benchmarks run:
> > > The GCC10.2 -O3 speed up compared to the original branching code is
> 2.3x.
> > > (are you as surprised as me?)
> > > The GCC10.2 -O3 speed up compared to the original non-branching code is
> > > 0.9x. (yes; it is slightly slower)
> > >
> > > Point: reducing the footprint of a false branch prediction is as
> > worthwhile
> > > an investment when you can't simply get rid of the conditional.
> > >
> > > Nate
> > > [1] https://quick-bench.com/q/kDFoF2pOuvPo9aVFufkVJMjWf-g
> > >
> > > On Thu, Jun 24, 2021 at 1:01 PM Niranda Perera <
> niranda.per...@gmail.com
> > >
> > > wrote:
> > >
> > >> I created a JIRA for this. I will do the changes in select kernels and
> > >> report back with benchmark results
> > >> https://issues.apache.org/jira/browse/ARROW-13170
> > >>
> > >>
> > >> On Thu, Jun 24, 2021 at 12:27 AM Yibo Cai  wrote:
> > >>
> > >>> Did a quick test. For random bitmaps and my trivial test code, the
> > >>> branch-less code is 3.5x faster than branch one.
> > >>> https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro
> > >>>
> > >>> On 6/23/21 11:21 PM, Wes McKinney wrote:
> >  One project I was interested in getting to but haven't had the time
> >  was introducing branch-free code into vector_selection.cc and
> reducing
> >  the use of if-statements to try to improve performance.
> > 
> >  One way to do this is to take code that looks like this:
> > 
> >  if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) {
> >  BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_);
> >  out_data_[out_position_++] = values_data_[in_position];
> >  }
> >  ++in_position;
> > 
> >  and change it to a branch-free version
> > 
> >  bool advance = BitUtil::GetBit(filter_data_, filter_offset_ +
> > >>> in_position);
> >  BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_,
> > advance);
> >  out_data_[out_position_] = values_data_[in_position];
> >  out_position_ += advance; // may need static_cast here
> >  ++in_position;
> > 
> >  Since more people are working on kernels and computing now, I
> thought
> >  this might be an interesting project for someone to explore and see
> >  what improvements are possible (and what the differences between
> e.g.
> >  x86 and ARM architecture are like when it comes to reducing
> >  branching). Another thing to look

Re: Improving PR workload management for Arrow maintainers

2021-06-29 Thread Brian Hulette
I review a decent number of PRs for Apache Beam, and I've built some of my
own tooling to help keep track of open PRs. I wrote a script that pulls
metadata about all relevant PRs and uses some heuristics to categorize them
into:
- incoming review
- outgoing review
- "CC'd" - where I've been mentioned but am not the reviewer or author

In the first two cases I try to highlight the ones that need my
attention, simply by detecting if I'm the person who took the most recent
action or not. This works reasonably well but gets tripped up on several
edge cases:
1) The author might push multiple commits before they're actually ready for
more feedback.
2) A PR might need feedback from multiple reviewers (e.g. people with
domain knowledge of certain areas).

I've been planning to make my script stateful so that I can mark a PR as
"not my turn" (i.e. unhighlight this until there is more activity), and
maybe "never my turn" (i.e. I've finished reviewing this, it's waiting on
someone else), to handle these cases.

The idea of a "Addressing Feedback" -> "Waiting on Review" label that is
automatically transitioned when there is activity would run into these same
edge cases.
If a reviewer had the ability to bump the label back to "Addressing
Feedback", that would at least address #1.

I think Wes's proposal (a read-only web UI) would likely also run into
these edge cases since it stores no state of its own to deconflict in those
situations.

Brian

On Tue, Jun 29, 2021 at 6:26 AM Wes McKinney  wrote:

> On Tue, Jun 29, 2021 at 3:10 PM Andrew Lamb  wrote:
> >
> > The thing that would make me more efficient reviewing PRs is figuring out
> > which one of the open reviews are ready for additional feedback.
>
> Yes, I think this would be the single most significant quality-of-life
> improvement for reviewers.
>
> > I think the idea of a webapp or something that shows active reviews would
> > be helpful (though I get most of that from appropriate email filters).
> >
> > What about a system involving labels (for which there is already a basic
> > GUI in github)? Something low tech like
> >
> > (Waiting for Review)
> > (Addressing Feedback)
> > (Approved, waiting for Merge)
> >
> > With maybe some automation prompting people to add the "Waiting on
> Review"
> > label when they want feedback
>
> I think it would have to be a bot that automatically sets the labels.
> If it requires contributors to take some action outside of pushing new
> work (new commits or a rebased version of the patch) to the PR and
> leaving responses to comments on the PR, the system is likely to fail
> some non-trivial percentage of the time.


> Given the quality of off-the-shelf web app components nowadays (e.g.
> https://material-ui.com), throwing together a read-only PR dashboard
> that shows what has changed since you last interacted with them (along
> with some other helpful things, like whether the build is passing) is
> "probably" not a super heavy lift. I haven't done any frontend
> development in years so while the backend part (writing Python code to
> wrangle data from GitHub's REST API and put it in a SQLite database)
> wouldn't take very long I would need some help on the front end
> portion and setting it up for deployment on DigitalOcean or somewhere.
>
> > Andrew
> >
> > On Tue, Jun 29, 2021 at 4:28 AM Wes McKinney 
> wrote:
> >
> > > hi folks,
> > >
> > > I've noted that the volume of PRs for Arrow has been steadily
> > > increasing (and will likely continue to increase), and while I've
> > > personally had less time for development / maintenance / code reviews
> > > over the last year, I would like to have a discussion about what we
> > > could do to improve our tooling for maintainers to optimize the
> > > efficiency of time spent tending to the PR queue. In my own
> > > experience, I have felt that I have wasted a lot of time digging
> > > around the queue looking for PRs that are awaiting feedback or need to
> > > be merged.
> > >
> > > I note first of all that around 70 out of 173 open PRs have been
> > > updated in the last 7 days, so while there is some PR staleness, to
> > > have nearly half of the PRs active is pretty good. That said, ~70
> > > active PRs is a lot of PRs to tend to.
> > >
> > > I scraped the project's code review comment history, and here are the
> > > individuals who have left the most comments on PRs since genesis
> > >
> > > pitrou6802
> > > wesm  5023
> > > emkornfield   3032
> > > bkietz2834
> > > kou   1489
> > > nealrichardson1439
> > > fsaintjacques 1356
> > > kszucs1250
> > > alamb 1133
> > > jorisvandenbossche1094
> > > liyafan82  831
> > > lidavidm   816
> > > westonpace 794
> > > xhochy 770
> > > nevi-me643
> > > BryanCutler639
> > > jorgecarleitao 635
> > > cpcloud551
> > > sunc

Re: Improving PR workload management for Arrow maintainers

2021-06-29 Thread Jorge Cardoso Leitão
I just had a quick chat over the ASF's slack with Daniel Gruno from the
infra team and they are rolling out the "triage role" [1] for
non-committers, which AFAIK offers useful tools in this context:

* add/remove labels
* assign reviewees
* mark duplicates
* close, open and assign to issues and PRs

One does not disregard the other, just though it could be useful
information to this topic, as maybe this cover some ground?

Best,
Jorge

[1]
https://docs.github.com/en/organizations/managing-access-to-your-organizations-repositories/repository-permission-levels-for-an-organization


On Tue, Jun 29, 2021 at 3:10 PM Andrew Lamb  wrote:

> The thing that would make me more efficient reviewing PRs is figuring out
> which one of the open reviews are ready for additional feedback.
>
> I think the idea of a webapp or something that shows active reviews would
> be helpful (though I get most of that from appropriate email filters).
>
> What about a system involving labels (for which there is already a basic
> GUI in github)? Something low tech like
>
> (Waiting for Review)
> (Addressing Feedback)
> (Approved, waiting for Merge)
>
> With maybe some automation prompting people to add the "Waiting on Review"
> label when they want feedback
>
> Andrew
>
> On Tue, Jun 29, 2021 at 4:28 AM Wes McKinney  wrote:
>
> > hi folks,
> >
> > I've noted that the volume of PRs for Arrow has been steadily
> > increasing (and will likely continue to increase), and while I've
> > personally had less time for development / maintenance / code reviews
> > over the last year, I would like to have a discussion about what we
> > could do to improve our tooling for maintainers to optimize the
> > efficiency of time spent tending to the PR queue. In my own
> > experience, I have felt that I have wasted a lot of time digging
> > around the queue looking for PRs that are awaiting feedback or need to
> > be merged.
> >
> > I note first of all that around 70 out of 173 open PRs have been
> > updated in the last 7 days, so while there is some PR staleness, to
> > have nearly half of the PRs active is pretty good. That said, ~70
> > active PRs is a lot of PRs to tend to.
> >
> > I scraped the project's code review comment history, and here are the
> > individuals who have left the most comments on PRs since genesis
> >
> > pitrou6802
> > wesm  5023
> > emkornfield   3032
> > bkietz2834
> > kou   1489
> > nealrichardson1439
> > fsaintjacques 1356
> > kszucs1250
> > alamb 1133
> > jorisvandenbossche1094
> > liyafan82  831
> > lidavidm   816
> > westonpace 794
> > xhochy 770
> > nevi-me643
> > BryanCutler639
> > jorgecarleitao 635
> > cpcloud551
> > sunchao536
> > ianmcook   499
> >
> > Since we're probably stuck using GitHub to receive code contributions
> > (as opposed to systems — Gerrit is one I'm familiar with — that
> > provide more structure for reviewers to track the patches they "own"
> > as well as the outgoing/incoming state of reviews), I am wondering
> > what kinds of tools we could create to make it easier for maintainers
> > to keep track of PRs they are shepherding through the contribution
> > process. Ideally this wouldn't involve maintainers having to engage in
> > some explicit action like assigning themselves as a PR reviewer.
> >
> > Here's one idea: a web application that displays "your reviews", a
> > table of PRs that you have interacted with in any way (commented, left
> > code review, assigned as reviewer, someone mentioned you, etc.) sorted
> > either by last commit or last comment to assess "freshness". So if you
> > comment on a PR or leave a code review, it will automatically show up
> > in "your reviews". It could also indicate whether there has been
> > activity on the PR since the last time you interacted with it.
> >
> > Having now used the GitHub API to pull comments from PRs for the above
> > analysis, there is certainly enough information available to help
> > create this kind of tool. I'd be willing to contribute to building the
> > backend of such a web application.
> >
> > This is just one idea, but I am curious to hear from others who are
> > spending a lot of time doing code review / PR merging to see what
> > might help them use their time more effectively.
> >
> > Thanks,
> > Wes
> >
>


Re: Improving PR workload management for Arrow maintainers

2021-06-29 Thread Wes McKinney
On Tue, Jun 29, 2021 at 3:10 PM Andrew Lamb  wrote:
>
> The thing that would make me more efficient reviewing PRs is figuring out
> which one of the open reviews are ready for additional feedback.

Yes, I think this would be the single most significant quality-of-life
improvement for reviewers.

> I think the idea of a webapp or something that shows active reviews would
> be helpful (though I get most of that from appropriate email filters).
>
> What about a system involving labels (for which there is already a basic
> GUI in github)? Something low tech like
>
> (Waiting for Review)
> (Addressing Feedback)
> (Approved, waiting for Merge)
>
> With maybe some automation prompting people to add the "Waiting on Review"
> label when they want feedback

I think it would have to be a bot that automatically sets the labels.
If it requires contributors to take some action outside of pushing new
work (new commits or a rebased version of the patch) to the PR and
leaving responses to comments on the PR, the system is likely to fail
some non-trivial percentage of the time.

Given the quality of off-the-shelf web app components nowadays (e.g.
https://material-ui.com), throwing together a read-only PR dashboard
that shows what has changed since you last interacted with them (along
with some other helpful things, like whether the build is passing) is
"probably" not a super heavy lift. I haven't done any frontend
development in years so while the backend part (writing Python code to
wrangle data from GitHub's REST API and put it in a SQLite database)
wouldn't take very long I would need some help on the front end
portion and setting it up for deployment on DigitalOcean or somewhere.

> Andrew
>
> On Tue, Jun 29, 2021 at 4:28 AM Wes McKinney  wrote:
>
> > hi folks,
> >
> > I've noted that the volume of PRs for Arrow has been steadily
> > increasing (and will likely continue to increase), and while I've
> > personally had less time for development / maintenance / code reviews
> > over the last year, I would like to have a discussion about what we
> > could do to improve our tooling for maintainers to optimize the
> > efficiency of time spent tending to the PR queue. In my own
> > experience, I have felt that I have wasted a lot of time digging
> > around the queue looking for PRs that are awaiting feedback or need to
> > be merged.
> >
> > I note first of all that around 70 out of 173 open PRs have been
> > updated in the last 7 days, so while there is some PR staleness, to
> > have nearly half of the PRs active is pretty good. That said, ~70
> > active PRs is a lot of PRs to tend to.
> >
> > I scraped the project's code review comment history, and here are the
> > individuals who have left the most comments on PRs since genesis
> >
> > pitrou6802
> > wesm  5023
> > emkornfield   3032
> > bkietz2834
> > kou   1489
> > nealrichardson1439
> > fsaintjacques 1356
> > kszucs1250
> > alamb 1133
> > jorisvandenbossche1094
> > liyafan82  831
> > lidavidm   816
> > westonpace 794
> > xhochy 770
> > nevi-me643
> > BryanCutler639
> > jorgecarleitao 635
> > cpcloud551
> > sunchao536
> > ianmcook   499
> >
> > Since we're probably stuck using GitHub to receive code contributions
> > (as opposed to systems — Gerrit is one I'm familiar with — that
> > provide more structure for reviewers to track the patches they "own"
> > as well as the outgoing/incoming state of reviews), I am wondering
> > what kinds of tools we could create to make it easier for maintainers
> > to keep track of PRs they are shepherding through the contribution
> > process. Ideally this wouldn't involve maintainers having to engage in
> > some explicit action like assigning themselves as a PR reviewer.
> >
> > Here's one idea: a web application that displays "your reviews", a
> > table of PRs that you have interacted with in any way (commented, left
> > code review, assigned as reviewer, someone mentioned you, etc.) sorted
> > either by last commit or last comment to assess "freshness". So if you
> > comment on a PR or leave a code review, it will automatically show up
> > in "your reviews". It could also indicate whether there has been
> > activity on the PR since the last time you interacted with it.
> >
> > Having now used the GitHub API to pull comments from PRs for the above
> > analysis, there is certainly enough information available to help
> > create this kind of tool. I'd be willing to contribute to building the
> > backend of such a web application.
> >
> > This is just one idea, but I am curious to hear from others who are
> > spending a lot of time doing code review / PR merging to see what
> > might help them use their time more effectively.
> >
> > Thanks,
> > Wes
> >


Re: Improving PR workload management for Arrow maintainers

2021-06-29 Thread Andrew Lamb
The thing that would make me more efficient reviewing PRs is figuring out
which one of the open reviews are ready for additional feedback.

I think the idea of a webapp or something that shows active reviews would
be helpful (though I get most of that from appropriate email filters).

What about a system involving labels (for which there is already a basic
GUI in github)? Something low tech like

(Waiting for Review)
(Addressing Feedback)
(Approved, waiting for Merge)

With maybe some automation prompting people to add the "Waiting on Review"
label when they want feedback

Andrew

On Tue, Jun 29, 2021 at 4:28 AM Wes McKinney  wrote:

> hi folks,
>
> I've noted that the volume of PRs for Arrow has been steadily
> increasing (and will likely continue to increase), and while I've
> personally had less time for development / maintenance / code reviews
> over the last year, I would like to have a discussion about what we
> could do to improve our tooling for maintainers to optimize the
> efficiency of time spent tending to the PR queue. In my own
> experience, I have felt that I have wasted a lot of time digging
> around the queue looking for PRs that are awaiting feedback or need to
> be merged.
>
> I note first of all that around 70 out of 173 open PRs have been
> updated in the last 7 days, so while there is some PR staleness, to
> have nearly half of the PRs active is pretty good. That said, ~70
> active PRs is a lot of PRs to tend to.
>
> I scraped the project's code review comment history, and here are the
> individuals who have left the most comments on PRs since genesis
>
> pitrou6802
> wesm  5023
> emkornfield   3032
> bkietz2834
> kou   1489
> nealrichardson1439
> fsaintjacques 1356
> kszucs1250
> alamb 1133
> jorisvandenbossche1094
> liyafan82  831
> lidavidm   816
> westonpace 794
> xhochy 770
> nevi-me643
> BryanCutler639
> jorgecarleitao 635
> cpcloud551
> sunchao536
> ianmcook   499
>
> Since we're probably stuck using GitHub to receive code contributions
> (as opposed to systems — Gerrit is one I'm familiar with — that
> provide more structure for reviewers to track the patches they "own"
> as well as the outgoing/incoming state of reviews), I am wondering
> what kinds of tools we could create to make it easier for maintainers
> to keep track of PRs they are shepherding through the contribution
> process. Ideally this wouldn't involve maintainers having to engage in
> some explicit action like assigning themselves as a PR reviewer.
>
> Here's one idea: a web application that displays "your reviews", a
> table of PRs that you have interacted with in any way (commented, left
> code review, assigned as reviewer, someone mentioned you, etc.) sorted
> either by last commit or last comment to assess "freshness". So if you
> comment on a PR or leave a code review, it will automatically show up
> in "your reviews". It could also indicate whether there has been
> activity on the PR since the last time you interacted with it.
>
> Having now used the GitHub API to pull comments from PRs for the above
> analysis, there is certainly enough information available to help
> create this kind of tool. I'd be willing to contribute to building the
> backend of such a web application.
>
> This is just one idea, but I am curious to hear from others who are
> spending a lot of time doing code review / PR merging to see what
> might help them use their time more effectively.
>
> Thanks,
> Wes
>


Re: [VOTE] Donation of rust arrow2 and parquet2

2021-06-29 Thread Krisztián Szűcs
+1 (binding)

On Mon, Jun 28, 2021 at 10:57 PM Neal Richardson
 wrote:
>
> +1
>
> On Mon, Jun 28, 2021 at 1:29 PM Andrew Lamb  wrote:
>
> > +1
> >
> > On Mon, Jun 28, 2021 at 1:13 PM QP Hou  wrote:
> >
> > > +1 (non binding)
> > >
> > > Really exciting stuff, amazing work Jorge.
> > >
> > > On Mon, Jun 28, 2021 at 8:32 AM Antoine Pitrou 
> > wrote:
> > > >
> > > > +1 as well (binding)
> > > >
> > > >
> > > > Le 28/06/2021 à 17:28, Ben Kietzman a écrit :
> > > > > +1 (binding)
> > > > >
> > > > > On Mon, Jun 28, 2021 at 5:35 AM Wes McKinney 
> > > wrote:
> > > > >
> > > > >> +1 (binding)
> > > > >>
> > > > >> On Mon, Jun 28, 2021 at 11:08 AM Daniël Heres <
> > danielhe...@gmail.com>
> > > > >> wrote:
> > > > >>>
> > > > >>> +1 (non binding)
> > > > >>>
> > > > >>> Great work Jorge!
> > > > >>>
> > > > >>> On Mon, Jun 28, 2021, 10:26 Weston Steimel <
> > weston.stei...@gmail.com
> > > >
> > > > >> wrote:
> > > > >>>
> > > >  +1
> > > > 
> > > >  On Sun, 27 Jun 2021, 07:41 Jorge Cardoso Leitão, <
> > > > >> jorgecarlei...@gmail.com
> > > > >
> > > >  wrote:
> > > > 
> > > > > Hi,
> > > > >
> > > > > I would like to bring to this mailing list a proposal to donate
> > the
> > > >  source
> > > > > code of arrow2 [1] and parquet2 [2] as experimental repositories
> > > [3]
> > > >  within
> > > > > Apache Arrow, conditional on IP clearance.
> > > > >
> > > > > The specific PRs are:
> > > > >
> > > > > * https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
> > > > > *
> > https://github.com/apache/arrow-experimental-rs-parquet2/pull/1
> > > > >
> > > > > The source code contains rewrites of the arrow and parquet crates
> > > > >> with
> > > > > safety and security in mind. In particular,
> > > > >
> > > > > * no buffer transmutes
> > > > > * no unsafe APIs marked as safe
> > > > > * parquet's implementation is unsafe free
> > > > >
> > > > > There are many other important features, such as big endian
> > support
> > > > >> and
> > > >  IPC
> > > > > 2.0 support. There is one regression over latest: support nested
> > > > >> types in
> > > > > parquet read and write. I observe no negative impact on
> > > performance.
> > > > >
> > > > > See a longer discussion in [4] over the reasons why the current
> > > rust
> > > > > implementation is susceptible to safety violations. In
> > particular,
> > > > >> many
> > > > > core APIs of the crate are considered security vulnerabilities
> > > under
> > > > > RustSec's [5] definitions, and are difficult to address on its
> > > > >> current
> > > > > design.
> > > > >
> > > > > I validated that it is possible to migrate DataFusion [6] and
> > > Polars
> > > > >> [7]
> > > > > without further code changes.
> > > > >
> > > > > The vote will be open for at least 72 hours.
> > > > >
> > > > > [ ] +1 Accept the code donation as experimental repos.
> > > > > [ ] +0
> > > > > [ ] -1 Do not accept the code donation as experimental repos
> > > > >> because...
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > > 
> > > > >>
> > >
> > https://github.com/apache/arrow/blob/master/docs/source/developers/experimental_repos.rst
> > > > > [2] https://github.com/jorgecarleitao/arrow2
> > > > > [3] https://github.com/jorgecarleitao/parquet2
> > > > > [4] https://github.com/jorgecarleitao/arrow2#faq
> > > > > [5] https://rustsec.org/
> > > > > [6] https://github.com/apache/arrow-datafusion/pull/68
> > > > > [7] https://github.com/pola-rs/polars
> > > > >
> > > > 
> > > > >>
> > > > >
> > >
> >


Improving PR workload management for Arrow maintainers

2021-06-29 Thread Wes McKinney
hi folks,

I've noted that the volume of PRs for Arrow has been steadily
increasing (and will likely continue to increase), and while I've
personally had less time for development / maintenance / code reviews
over the last year, I would like to have a discussion about what we
could do to improve our tooling for maintainers to optimize the
efficiency of time spent tending to the PR queue. In my own
experience, I have felt that I have wasted a lot of time digging
around the queue looking for PRs that are awaiting feedback or need to
be merged.

I note first of all that around 70 out of 173 open PRs have been
updated in the last 7 days, so while there is some PR staleness, to
have nearly half of the PRs active is pretty good. That said, ~70
active PRs is a lot of PRs to tend to.

I scraped the project's code review comment history, and here are the
individuals who have left the most comments on PRs since genesis

pitrou6802
wesm  5023
emkornfield   3032
bkietz2834
kou   1489
nealrichardson1439
fsaintjacques 1356
kszucs1250
alamb 1133
jorisvandenbossche1094
liyafan82  831
lidavidm   816
westonpace 794
xhochy 770
nevi-me643
BryanCutler639
jorgecarleitao 635
cpcloud551
sunchao536
ianmcook   499

Since we're probably stuck using GitHub to receive code contributions
(as opposed to systems — Gerrit is one I'm familiar with — that
provide more structure for reviewers to track the patches they "own"
as well as the outgoing/incoming state of reviews), I am wondering
what kinds of tools we could create to make it easier for maintainers
to keep track of PRs they are shepherding through the contribution
process. Ideally this wouldn't involve maintainers having to engage in
some explicit action like assigning themselves as a PR reviewer.

Here's one idea: a web application that displays "your reviews", a
table of PRs that you have interacted with in any way (commented, left
code review, assigned as reviewer, someone mentioned you, etc.) sorted
either by last commit or last comment to assess "freshness". So if you
comment on a PR or leave a code review, it will automatically show up
in "your reviews". It could also indicate whether there has been
activity on the PR since the last time you interacted with it.

Having now used the GitHub API to pull comments from PRs for the above
analysis, there is certainly enough information available to help
create this kind of tool. I'd be willing to contribute to building the
backend of such a web application.

This is just one idea, but I am curious to hear from others who are
spending a lot of time doing code review / PR merging to see what
might help them use their time more effectively.

Thanks,
Wes