Re: [VOTE] Add RLE Arrays to Arrow Format

2022-12-19 Thread Matthew Topol
Huzzah!

That brings us to 3 +1 (binding) votes, and 1 +1 (non-binding) vote!

The vote passes! I've updated the PR for the format changes (on their own)
here: https://github.com/apache/arrow/pull/14176 and will follow it up with
updating the other PRs as I can. If anyone could comment / approve that PR,
I'll merge it to kick this off and start getting the other PRs ready for
review.

Thanks everyone!

On Mon, Dec 19, 2022 at 4:59 PM Ian Cook wrote:

> @Matt Topol: Yes, a change of the name to "run-end encoding" changes
> my (non-binding) vote to a +1.
>
> On Mon, Dec 19, 2022 at 3:32 PM Matthew Topol wrote:
> >
> > Okay, slight edit to my previous email: It was brought to my attention that
> > we need at least 3 +1 binding votes, so this vote is still open for the
> > moment.
> >
> > @IanCook: With the change of the name to RunEndEncoding is that sufficient
> > to change your vote to a +1?
> >
> > On Mon, Dec 19, 2022 at 12:57 PM Matt Topol 
> wrote:
> >
> > > That leaves us with a total vote of +1.5 so the vote carries with the
> > > caveat of changing the name to be Run End Encoded rather than Run
> Length
> > > Encoded (unless this means I need to do a new vote with the changed
> name?
> > > This is my first time doing one of these so please correct me if I
> need to
> > > do a new vote!)
> > >
> > > Thanks everyone for your feedback and comments!
> > >
> > > I'm going to go update the Go and Format specific PRs to make them
> regular
> > > PR's (instead of drafts) and get this all moving. Thanks in advance to
> > > anyone who reviews the upcoming PRs!
> > >
> > > --Matt
> > >
> > > On Fri, Dec 16, 2022 at 8:24 PM Weston Pace 
> wrote:
> > >
> > > > +1
> > > >
> > > > I agree that run-end encoding makes more sense but also don't see it
> > > > as a deal breaker.
> > > >
> > > > The most compelling counter-argument I've seen for new types is to
> > > > avoid a schism where some implementations do not support the newer
> > > > types.  However, for the type proposed here I think the risk is low
> > > > because data can be losslessly converted to existing formats for
> > > > compatibility with any system that doesn't support the type.
> > > >
> > > > Another argument I've seen is that we should introduce a more formal
> > > > distinction between "layouts" and "types" (with dictionary and
> > > > run-end-encoding being layouts).  However, this seems like an
> > > > impractical change at this point.  In addition, given that we have
> > > > dictionary as an array type the cat is already out of the bag.
> > > > Furthermore, systems and implementations are still welcome to make
> > > > this distinction themselves.  The spec only needs to specify what the
> > > > buffer layouts should be.  If a particular library chooses to group
> > > > those layouts into two different categories I think that would still
> > > > be feasible.
> > > >
> > > > -Weston
> > > >
> > > > On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb 
> > > wrote:
> > > > >
> > > > > +1 on the proposal as written
> > > > >
> > > > > I think it makes sense and offers exciting opportunities for faster
> > > > > computation (especially for cases where parquet files can be
> decoded
> > > > > directly into such an array and avoid unpacking. RLE encoded
> dictionary
> > > > are
> > > > > quite compelling)
> > > > >
> > > > > I would prefer to use the term Run-End-Encoding (which would also
> > > follow
> > > > > the naming of the internal fields) but I don't view that as a deal
> > > > blocker.
> > > > >
> > > > > Thank you for all your work in this matter,
> > > > > Andrew
> > > > >
> > > > > > On Wed, Dec 14, 2022 at 5:08 PM Matt Topol wrote:
> > > > >
> > > > > > I'm not at all opposed to renaming it as `Run-End-Encoding` if
> that
> > > > would
> > > > > > be preferable. Hopefully others will chime in with their
> feedback.
> > > > > >
> > > > > > --Matt
> > > > > >
> > > > > > > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook wrote:
> > > > > >
> > > > > > > Thank you Matt, Tobias, and others for the great work on this.
> > > > > > >
> > > > > > > I am -0.5 on this proposal in its current form because (pardon
> the
> > > > > > > pedantry) what we have implemented here is not run-length
> encoding;
> > > > it
> > > > > > > is run-end encoding. Based on community input, the choice was
> made
> > > to
> > > > > > > store run ends instead of run lengths because this enables
> > > O(log(N))
> > > > > > > random access as opposed to O(N). This is a sensible choice,
> but it
> > > > > > > comes with some trade-offs including limitations in array
> length
> > > > > > > > (which may not really be a problem in practice) and lack of
> > > > bit-for-bit
> > > > > > > equivalence with RLE encodings that use run lengths like
> Velox's
> > > > > > > SequenceVector encoding (which I think is a more serious
> problem in
> > > > > > > practice).
> > > > > > >
> > > > > > > I believe that we should either:
> > > > > > > (a) rename this to "run-end encoding

Re: [VOTE] Add RLE Arrays to Arrow Format

2022-12-19 Thread Ian Cook
@Matt Topol: Yes, a change of the name to "run-end encoding" changes
my (non-binding) vote to a +1.
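
For anyone skimming the thread, the distinction behind the rename can be
sketched in a few lines of Python. This is only an illustration on plain
lists, not the Arrow implementation: the function names (run_end_encode,
ree_get, rle_get, run_end_decode) are made up for this example and are not
part of any Arrow library. It shows why storing run ends allows binary
search for random access, while storing run lengths needs a linear walk,
and that the encoding can be expanded back to a plain array without loss.

```python
from bisect import bisect_right


def run_end_encode(data):
    """Encode a list as (run_ends, values); run ends are exclusive offsets."""
    run_ends, values = [], []
    for i, item in enumerate(data):
        if values and values[-1] == item:
            run_ends[-1] = i + 1  # extend the current run
        else:
            values.append(item)
            run_ends.append(i + 1)  # a new run ends just after position i
    return run_ends, values


def ree_get(run_ends, values, index):
    """O(log n) lookup: binary search for the first run end greater than index."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= index < length:
        raise IndexError(index)
    return values[bisect_right(run_ends, index)]


def rle_get(run_lengths, values, index):
    """O(n) lookup with run lengths: walk the runs until index is covered."""
    if index < 0:
        raise IndexError(index)
    for length, value in zip(run_lengths, values):
        if index < length:
            return value
        index -= length
    raise IndexError("index out of range")


def run_end_decode(run_ends, values):
    """Expand back to a plain list (lossless conversion)."""
    out, start = [], 0
    for end, value in zip(run_ends, values):
        out.extend([value] * (end - start))
        start = end
    return out


if __name__ == "__main__":
    data = ["a", "a", "a", "b", "c", "c"]
    run_ends, values = run_end_encode(data)
    print(run_ends, values)
    print(ree_get(run_ends, values, 3))
    print(run_end_decode(run_ends, values) == data)
```

The run ends are cumulative, so they are sorted by construction, which is
what makes the binary search valid. Run lengths would have to be summed
first to get the same property.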

On Mon, Dec 19, 2022 at 3:32 PM Matthew Topol wrote:
>
> Okay, slight edit to my previous email: It was brought to my attention that
> we need at least 3 +1 binding votes, so this vote is still open for the
> moment.
>
> @IanCook: With the change of the name to RunEndEncoding is that sufficient
> to change your vote to a +1?
>
> On Mon, Dec 19, 2022 at 12:57 PM Matt Topol  wrote:
>
> > That leaves us with a total vote of +1.5 so the vote carries with the
> > caveat of changing the name to be Run End Encoded rather than Run Length
> > Encoded (unless this means I need to do a new vote with the changed name?
> > This is my first time doing one of these so please correct me if I need to
> > do a new vote!)
> >
> > Thanks everyone for your feedback and comments!
> >
> > I'm going to go update the Go and Format specific PRs to make them regular
> > PR's (instead of drafts) and get this all moving. Thanks in advance to
> > anyone who reviews the upcoming PRs!
> >
> > --Matt
> >
> > On Fri, Dec 16, 2022 at 8:24 PM Weston Pace  wrote:
> >
> > > +1
> > >
> > > I agree that run-end encoding makes more sense but also don't see it
> > > as a deal breaker.
> > >
> > > The most compelling counter-argument I've seen for new types is to
> > > avoid a schism where some implementations do not support the newer
> > > types.  However, for the type proposed here I think the risk is low
> > > because data can be losslessly converted to existing formats for
> > > compatibility with any system that doesn't support the type.
> > >
> > > Another argument I've seen is that we should introduce a more formal
> > > distinction between "layouts" and "types" (with dictionary and
> > > run-end-encoding being layouts).  However, this seems like an
> > > impractical change at this point.  In addition, given that we have
> > > dictionary as an array type the cat is already out of the bag.
> > > Furthermore, systems and implementations are still welcome to make
> > > this distinction themselves.  The spec only needs to specify what the
> > > buffer layouts should be.  If a particular library chooses to group
> > > those layouts into two different categories I think that would still
> > > be feasible.
> > >
> > > -Weston
> > >
> > > On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb 
> > wrote:
> > > >
> > > > +1 on the proposal as written
> > > >
> > > > I think it makes sense and offers exciting opportunities for faster
> > > > computation (especially for cases where parquet files can be decoded
> > > > directly into such an array and avoid unpacking. RLE encoded dictionary
> > > are
> > > > quite compelling)
> > > >
> > > > I would prefer to use the term Run-End-Encoding (which would also
> > follow
> > > > the naming of the internal fields) but I don't view that as a deal
> > > blocker.
> > > >
> > > > Thank you for all your work in this matter,
> > > > Andrew
> > > >
> > > > On Wed, Dec 14, 2022 at 5:08 PM Matt Topol 
> > > wrote:
> > > >
> > > > > I'm not at all opposed to renaming it as `Run-End-Encoding` if that
> > > would
> > > > > be preferable. Hopefully others will chime in with their feedback.
> > > > >
> > > > > --Matt
> > > > >
> > > > > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook 
> > > wrote:
> > > > >
> > > > > > Thank you Matt, Tobias, and others for the great work on this.
> > > > > >
> > > > > > I am -0.5 on this proposal in its current form because (pardon the
> > > > > > pedantry) what we have implemented here is not run-length encoding;
> > > it
> > > > > > is run-end encoding. Based on community input, the choice was made
> > to
> > > > > > store run ends instead of run lengths because this enables
> > O(log(N))
> > > > > > random access as opposed to O(N). This is a sensible choice, but it
> > > > > > comes with some trade-offs including limitations in array length
> > > > > > > (which may not really be a problem in practice) and lack of
> > > bit-for-bit
> > > > > > equivalence with RLE encodings that use run lengths like Velox's
> > > > > > SequenceVector encoding (which I think is a more serious problem in
> > > > > > practice).
> > > > > >
> > > > > > I believe that we should either:
> > > > > > (a) rename this to "run-end encoding"
> > > > > > (b) change this to a parameterized type called "run encoding" that
> > > > > > takes a Boolean parameter specifying whether run lengths or run
> > ends
> > > > > > are stored.
> > > > > >
> > > > > > Ian
> > > > > >
> > > > > > On Wed, Dec 14, 2022 at 11:27 AM Matt Topol <
> > zotthewiz...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I'd like to propose adding the RLE type based on earlier
> > > > > > discussions[1][2]
> > > > > > > to the Arrow format:
> > > > > > > - Columnar Format description:
> > > > > > >
> > > > > >
> > > > >
> > >
> > https://github.com/apache/arrow/pull/1/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb5

Re: [VOTE] Add RLE Arrays to Arrow Format

2022-12-19 Thread Jorge Cardoso Leitão
+1

Thanks a lot for all this. Really exciting!!

On Mon, 19 Dec 2022, 17:56 Matt Topol,  wrote:

> That leaves us with a total vote of +1.5 so the vote carries with the
> caveat of changing the name to be Run End Encoded rather than Run Length
> Encoded (unless this means I need to do a new vote with the changed name?
> This is my first time doing one of these so please correct me if I need to
> do a new vote!)
>
> Thanks everyone for your feedback and comments!
>
> I'm going to go update the Go and Format specific PRs to make them regular
> PR's (instead of drafts) and get this all moving. Thanks in advance to
> anyone who reviews the upcoming PRs!
>
> --Matt
>
> On Fri, Dec 16, 2022 at 8:24 PM Weston Pace  wrote:
>
> > +1
> >
> > I agree that run-end encoding makes more sense but also don't see it
> > as a deal breaker.
> >
> > The most compelling counter-argument I've seen for new types is to
> > avoid a schism where some implementations do not support the newer
> > types.  However, for the type proposed here I think the risk is low
> > because data can be losslessly converted to existing formats for
> > compatibility with any system that doesn't support the type.
> >
> > Another argument I've seen is that we should introduce a more formal
> > distinction between "layouts" and "types" (with dictionary and
> > run-end-encoding being layouts).  However, this seems like an
> > impractical change at this point.  In addition, given that we have
> > dictionary as an array type the cat is already out of the bag.
> > Furthermore, systems and implementations are still welcome to make
> > this distinction themselves.  The spec only needs to specify what the
> > buffer layouts should be.  If a particular library chooses to group
> > those layouts into two different categories I think that would still
> > be feasible.
> >
> > -Weston
> >
> > On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb 
> wrote:
> > >
> > > +1 on the proposal as written
> > >
> > > I think it makes sense and offers exciting opportunities for faster
> > > computation (especially for cases where parquet files can be decoded
> > > directly into such an array and avoid unpacking. RLE encoded dictionary
> > are
> > > quite compelling)
> > >
> > > I would prefer to use the term Run-End-Encoding (which would also
> follow
> > > the naming of the internal fields) but I don't view that as a deal
> > blocker.
> > >
> > > Thank you for all your work in this matter,
> > > Andrew
> > >
> > > On Wed, Dec 14, 2022 at 5:08 PM Matt Topol 
> > wrote:
> > >
> > > > I'm not at all opposed to renaming it as `Run-End-Encoding` if that
> > would
> > > > be preferable. Hopefully others will chime in with their feedback.
> > > >
> > > > --Matt
> > > >
> > > > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook 
> > wrote:
> > > >
> > > > > Thank you Matt, Tobias, and others for the great work on this.
> > > > >
> > > > > I am -0.5 on this proposal in its current form because (pardon the
> > > > > pedantry) what we have implemented here is not run-length encoding;
> > it
> > > > > is run-end encoding. Based on community input, the choice was made
> to
> > > > > store run ends instead of run lengths because this enables
> O(log(N))
> > > > > random access as opposed to O(N). This is a sensible choice, but it
> > > > > comes with some trade-offs including limitations in array length
> > > > > > (which may not really be a problem in practice) and lack of
> > bit-for-bit
> > > > > equivalence with RLE encodings that use run lengths like Velox's
> > > > > SequenceVector encoding (which I think is a more serious problem in
> > > > > practice).
> > > > >
> > > > > I believe that we should either:
> > > > > (a) rename this to "run-end encoding"
> > > > > (b) change this to a parameterized type called "run encoding" that
> > > > > takes a Boolean parameter specifying whether run lengths or run
> ends
> > > > > are stored.
> > > > >
> > > > > Ian
> > > > >
> > > > > On Wed, Dec 14, 2022 at 11:27 AM Matt Topol <
> zotthewiz...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I'd like to propose adding the RLE type based on earlier
> > > > > discussions[1][2]
> > > > > > to the Arrow format:
> > > > > > - Columnar Format description:
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/pull/1/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> > > > > > - Flatbuffers changes:
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
> > > > > >
> > > > > > There is a proposed implementation available in both C++ (written
> > by
> > > > > Tobias
> > > > > > Zagorni) and Go[3][4]. Both implementations have mostly the same
> > tests
> > > > > > implemented and were tested to be compatible over IPC with an
> > archery
> > > > > test.
> > > > > > In both cases, the implementations are split out among several
> > Draft
> > > > PRs

Re: [VOTE] Add RLE Arrays to Arrow Format

2022-12-19 Thread Matthew Topol
Okay, slight edit to my previous email: It was brought to my attention that
we need at least 3 +1 binding votes, so this vote is still open for the
moment.

@IanCook: With the change of the name to RunEndEncoding is that sufficient
to change your vote to a +1?

On Mon, Dec 19, 2022 at 12:57 PM Matt Topol  wrote:

> That leaves us with a total vote of +1.5 so the vote carries with the
> caveat of changing the name to be Run End Encoded rather than Run Length
> Encoded (unless this means I need to do a new vote with the changed name?
> This is my first time doing one of these so please correct me if I need to
> do a new vote!)
>
> Thanks everyone for your feedback and comments!
>
> I'm going to go update the Go and Format specific PRs to make them regular
> PR's (instead of drafts) and get this all moving. Thanks in advance to
> anyone who reviews the upcoming PRs!
>
> --Matt
>
> On Fri, Dec 16, 2022 at 8:24 PM Weston Pace  wrote:
>
> > +1
> >
> > I agree that run-end encoding makes more sense but also don't see it
> > as a deal breaker.
> >
> > The most compelling counter-argument I've seen for new types is to
> > avoid a schism where some implementations do not support the newer
> > types.  However, for the type proposed here I think the risk is low
> > because data can be losslessly converted to existing formats for
> > compatibility with any system that doesn't support the type.
> >
> > Another argument I've seen is that we should introduce a more formal
> > distinction between "layouts" and "types" (with dictionary and
> > run-end-encoding being layouts).  However, this seems like an
> > impractical change at this point.  In addition, given that we have
> > dictionary as an array type the cat is already out of the bag.
> > Furthermore, systems and implementations are still welcome to make
> > this distinction themselves.  The spec only needs to specify what the
> > buffer layouts should be.  If a particular library chooses to group
> > those layouts into two different categories I think that would still
> > be feasible.
> >
> > -Weston
> >
> > On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb 
> wrote:
> > >
> > > +1 on the proposal as written
> > >
> > > I think it makes sense and offers exciting opportunities for faster
> > > computation (especially for cases where parquet files can be decoded
> > > directly into such an array and avoid unpacking. RLE encoded dictionary
> > are
> > > quite compelling)
> > >
> > > I would prefer to use the term Run-End-Encoding (which would also
> follow
> > > the naming of the internal fields) but I don't view that as a deal
> > blocker.
> > >
> > > Thank you for all your work in this matter,
> > > Andrew
> > >
> > > On Wed, Dec 14, 2022 at 5:08 PM Matt Topol 
> > wrote:
> > >
> > > > I'm not at all opposed to renaming it as `Run-End-Encoding` if that
> > would
> > > > be preferable. Hopefully others will chime in with their feedback.
> > > >
> > > > --Matt
> > > >
> > > > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook 
> > wrote:
> > > >
> > > > > Thank you Matt, Tobias, and others for the great work on this.
> > > > >
> > > > > I am -0.5 on this proposal in its current form because (pardon the
> > > > > pedantry) what we have implemented here is not run-length encoding;
> > it
> > > > > is run-end encoding. Based on community input, the choice was made
> to
> > > > > store run ends instead of run lengths because this enables
> O(log(N))
> > > > > random access as opposed to O(N). This is a sensible choice, but it
> > > > > comes with some trade-offs including limitations in array length
> > > > > > (which may not really be a problem in practice) and lack of
> > bit-for-bit
> > > > > equivalence with RLE encodings that use run lengths like Velox's
> > > > > SequenceVector encoding (which I think is a more serious problem in
> > > > > practice).
> > > > >
> > > > > I believe that we should either:
> > > > > (a) rename this to "run-end encoding"
> > > > > (b) change this to a parameterized type called "run encoding" that
> > > > > takes a Boolean parameter specifying whether run lengths or run
> ends
> > > > > are stored.
> > > > >
> > > > > Ian
> > > > >
> > > > > On Wed, Dec 14, 2022 at 11:27 AM Matt Topol <
> zotthewiz...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I'd like to propose adding the RLE type based on earlier
> > > > > discussions[1][2]
> > > > > > to the Arrow format:
> > > > > > - Columnar Format description:
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/pull/1/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> > > > > > - Flatbuffers changes:
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
> > > > > >
> > > > > > There is a proposed implementation available in both C++ (written
> > by
> > > > > Tobias
> > > > > > Zagorni) and Go[3][4]. Both implementations have mostly t

[WEBSITE] Website merge script is outdated

2022-12-19 Thread Rok Mihevc
The current website PR merge script is outdated [1] and should either be
updated or replaced by merging with the GitHub merge button.
I came across this issue when merging website changes related to the Jira
-> GitHub migration [2] and had to use the merge button.

As things stand, we'll eventually update the merge script, but we could
also decide to use the button. Thoughts?

[1] https://github.com/apache/arrow-site/issues/285
[2] https://github.com/apache/arrow-site/pull/286

Rok


Re: [VOTE] Add RLE Arrays to Arrow Format

2022-12-19 Thread Matt Topol
That leaves us with a total vote of +1.5 so the vote carries with the
caveat of changing the name to be Run End Encoded rather than Run Length
Encoded (unless this means I need to do a new vote with the changed name?
This is my first time doing one of these so please correct me if I need to
do a new vote!)

Thanks everyone for your feedback and comments!

I'm going to go update the Go and Format specific PRs to make them regular
PR's (instead of drafts) and get this all moving. Thanks in advance to
anyone who reviews the upcoming PRs!

--Matt

On Fri, Dec 16, 2022 at 8:24 PM Weston Pace  wrote:

> +1
>
> I agree that run-end encoding makes more sense but also don't see it
> as a deal breaker.
>
> The most compelling counter-argument I've seen for new types is to
> avoid a schism where some implementations do not support the newer
> types.  However, for the type proposed here I think the risk is low
> because data can be losslessly converted to existing formats for
> compatibility with any system that doesn't support the type.
>
> Another argument I've seen is that we should introduce a more formal
> distinction between "layouts" and "types" (with dictionary and
> run-end-encoding being layouts).  However, this seems like an
> impractical change at this point.  In addition, given that we have
> dictionary as an array type the cat is already out of the bag.
> Furthermore, systems and implementations are still welcome to make
> this distinction themselves.  The spec only needs to specify what the
> buffer layouts should be.  If a particular library chooses to group
> those layouts into two different categories I think that would still
> be feasible.
>
> -Weston
>
> On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb  wrote:
> >
> > +1 on the proposal as written
> >
> > I think it makes sense and offers exciting opportunities for faster
> > computation (especially for cases where parquet files can be decoded
> > directly into such an array and avoid unpacking. RLE encoded dictionary
> are
> > quite compelling)
> >
> > I would prefer to use the term Run-End-Encoding (which would also follow
> > the naming of the internal fields) but I don't view that as a deal
> blocker.
> >
> > Thank you for all your work in this matter,
> > Andrew
> >
> > On Wed, Dec 14, 2022 at 5:08 PM Matt Topol 
> wrote:
> >
> > > I'm not at all opposed to renaming it as `Run-End-Encoding` if that
> would
> > > be preferable. Hopefully others will chime in with their feedback.
> > >
> > > --Matt
> > >
> > > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook 
> wrote:
> > >
> > > > Thank you Matt, Tobias, and others for the great work on this.
> > > >
> > > > I am -0.5 on this proposal in its current form because (pardon the
> > > > pedantry) what we have implemented here is not run-length encoding;
> it
> > > > is run-end encoding. Based on community input, the choice was made to
> > > > store run ends instead of run lengths because this enables O(log(N))
> > > > random access as opposed to O(N). This is a sensible choice, but it
> > > > comes with some trade-offs including limitations in array length
> > > > > (which may not really be a problem in practice) and lack of
> bit-for-bit
> > > > equivalence with RLE encodings that use run lengths like Velox's
> > > > SequenceVector encoding (which I think is a more serious problem in
> > > > practice).
> > > >
> > > > I believe that we should either:
> > > > (a) rename this to "run-end encoding"
> > > > (b) change this to a parameterized type called "run encoding" that
> > > > takes a Boolean parameter specifying whether run lengths or run ends
> > > > are stored.
> > > >
> > > > Ian
> > > >
> > > > On Wed, Dec 14, 2022 at 11:27 AM Matt Topol 
> > > > wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I'd like to propose adding the RLE type based on earlier
> > > > discussions[1][2]
> > > > > to the Arrow format:
> > > > > - Columnar Format description:
> > > > >
> > > >
> > >
> https://github.com/apache/arrow/pull/1/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> > > > > - Flatbuffers changes:
> > > > >
> > > >
> > >
> https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
> > > > >
> > > > > There is a proposed implementation available in both C++ (written
> by
> > > > Tobias
> > > > > Zagorni) and Go[3][4]. Both implementations have mostly the same
> tests
> > > > > implemented and were tested to be compatible over IPC with an
> archery
> > > > test.
> > > > > In both cases, the implementations are split out among several
> Draft
> > > PRs
> > > > so
> > > > > that they can be easily reviewed piecemeal if the vote is approved,
> > > with
> > > > > each Draft PR including the changes of the one before it. The links
> > > > > provided are the Draft PRs with the entirety of the changes
> included.
> > > > >
> > > > > The vote will be open for at least 72 hours.
> > > > >
> > > > > [ ] +1 add the proposed RLE type 

Re: [DISC] Self-Hosted Runners for Arrow

2022-12-19 Thread Jacob Wujciak
Jarek, thank you for the glowing review :)

Yes, we will have monitoring set up in the instance we are going to host to
protect against abuse like that, but as we use a non-FOSS tool for
monitoring internally, there is no code included for this at this time.

I would like to give a shout out to Álvaro Maldonado Mateos and Ian Flores
Siaca who have been doing the work of implementing this and are available
for detailed technical questions or suggestions via the issues of the repo
[1]!


[1]: https://github.com/voltrondata-labs/gha-controller-infra/issues


On Sun, Dec 18, 2022 at 4:40 PM Jarek Potiuk  wrote:

> Comment from outside - I looked briefly at the implementation and docs and
> the GHA controller looks very clear and straightforward to implement.
> Fantastic job Jacob and big shoutout to Voltron Data for implementing and
> open-sourcing it.
>
> I am going to try it out in Apache Airflow very soon. We were waiting for
> something that GitHub Actions is cooking up
> https://github.com/orgs/github/projects/4247 but it just moved from Q4 2022
> to Q1 2023, so you never know :).
>
> One small comment for the security of hosting your self-hosted runners that
> you might want to take into account.
>
> While it is great that these are ephemeral runners (they provide all the
> necessary security boundaries; escaping from a container in K8S is not an
> easy feat), there is still one case where allowing any PRs to run code in
> your self-hosted runners is potentially problematic - i.e. possibility of
> using the processing power of your machines by anyone to do any kind of
> jobs (and do it with your donated credits or money). For example
> cryptomining. This is not an academic problem - this has already happened
> in the past
>
> https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/
> and that's why GitHub Actions introduced mandatory "Approval" for
> first-time users' pull requests: the bad actors were actually
> abusing GitHub's public runners to mine crypto.
>
> The approval workflow actually protects against the "mass abuse" - i.e.
> creating new accounts and using them to exploit this on multiple repos, but
> it does not protect you against the case that some collaborators will use
> your self-hosted runners to do any kind of computing. There are likely ways
> to mitigate it like limiting the maximum time container can run, and of
> course attempts to do so might be caught during reviews (and the offending
> user can be called out) - but I think if you want powerful CI and have a
> lot of contributors, this might slip under the radar easily unless you have
> some monitoring in place. The fact that it is not mass-exploitable by new
> users, makes it less likely to occur (because the regular users might lose
> their reputation if they attempt to do it), but it is still a possibility.
>
> It's up to you if you would like to protect against it in some ways (in
> Airflow we will likely continue using https://github.com/ashb/runner and
> limit the self-hosted workflows to "main" workflows and to maintainer's
> PRs) and it is not a blocker, but I wanted you to be aware of this
> potential abuse scenario.
>
> J.
>
>
>
> On Fri, Dec 16, 2022 at 7:27 PM Jacob Wujciak
> 
> wrote:
>
> > No news with regards to arrow specific S390x machines but apparently IBM
> > has donated a number of S390x VMs to the ASF which we should be able to
> use
> > but I have not had the time yet to investigate this option.
> >
> >
> > On Fri, Dec 16, 2022 at 17:01 Matt Topol wrote:
> >
> > > These are awesome! Has there been any luck in reaching out to IBM to
> see
> > if
> > > they could donate one or more s390x VMs to use as runners for testing
> the
> > > s390x builds? That is probably my only concern with Travis going away
> at
> > > EOY, since we don't have a way currently to test those builds on GH
> > > Actions.
> > >
> > > --Matt
> > >
> > > On Fri, Dec 16, 2022 at 8:46 AM Jacob Wujciak
> > > 
> > > wrote:
> > >
> > > > I would like to propose the addition of a self-hosted runner system
> to
> > > the
> > > > arrow repository to add speciality runners (arm64 and CUDA). This
> will
> > > > allow us to compensate for the arm64 jobs that previously ran on
> > Travis,
> > > > which will be turned off EOY[1].
> > > >
> > > > The migration to GitHub Issues will require a significant extension
> of
> > > our
> > > > existing “comment bot”-workflows (e.g. assigning and labeling issues
> > for
> > > > non-committers, see [3]), with such a system we could add reserved
> > > runners
> > > > that only pick up these “comment bot”-jobs to guarantee a smooth
> > > developer
> > > > experience, regardless of the state of the ASF CI resources.
> > > >
> > > > As the allocation of GitHub-hosted runners for the Apache software
> > > > foundation was recently increased, the queue times are currently low,
> > but
> > > > this will inevitably change and such a system would enable us to
> react
> > 

Re: [VOTE] Disable ASF Jira issue reporting

2022-12-19 Thread Rok Mihevc
New issue reporting on Jira has just been disabled.
Thank you all for participating and Todd for setting this up.

Rok

On Fri, Dec 16, 2022 at 5:02 PM Rok Mihevc  wrote:

> Raul opened these issues to track required changes to the release scripts:
> * [Release][Archery] Update archery release curate to support GitHub
> issues [1]
> * [Release][Archery] Update archery release changelog to support GitHub
> issues [2]
> * [Release][Archery] Update archery release cherry-pick to support GitHub
> issues [3]
>
> We also had a chat on Zulip and Raul pointed out that it would make sense
> to do full refactoring of these scripts after Jira has been completely
> locked down as all of the Jira logic could be removed from these scripts at
> that point.
>
> [1] https://github.com/apache/arrow/issues/14997
> [2] https://github.com/apache/arrow/issues/14999
> [3] https://github.com/apache/arrow/issues/15002
>
> On Fri, Dec 16, 2022 at 3:31 PM Rok Mihevc  wrote:
>
>> Thanks for bringing that point up Raul!
>> Would a good workaround be to open the required Jira issues now, before
>> we lock the Jira?
>>
>>
>> On Fri, Dec 16, 2022 at 1:42 PM Raúl Cumplido wrote:
>>
>>> Thanks Rok for looking into this.
>>>
>>> I am ok migrating to GitHub but I want to mention that there are some
>>> archery-related commands like `archery release curate`, `archery release
>>> changelog` and more importantly `archery release cherry-pick` that must be
>>> updated in order to work with GitHub.
>>>
>>> For the release curate and changelog we could "easily" extract the GH
>>> issues by parsing the commit titles, but at the moment the cherry-pick
>>> command won't work. This is used to cherry-pick the commits that have been
>>> tagged on JIRA with the corresponding version into the maintenance branch
>>> once the code freeze has been done. At the moment the cherry-pick command is
>>> only able to cherry-pick commits that are tagged with the corresponding
>>> version on JIRA, not on GitHub, and are not already on the maintenance
>>> branch. If we fully migrate to GitHub before this is updated, the Release
>>> Manager will have to take that into account on the release: cherry-picking
>>> commits from GH will have to be done manually. I am happy to work on
>>> updating those scripts, I already have them on my TODO list for next year,
>>> but I am probably not going to have time before the next release.
>>>
>>> I also think this is not a blocker for moving to GitHub but worth
>>> mentioning as it will require some extra effort until this is fixed.
>>>
>>>
>>> On Fri, Dec 16, 2022 at 8:28, Alenka Frim wrote:
>>>
>>> > Thank you for working on this Rok 🙏
>>> >
>>> > On Fri, 16 Dec 2022 at 01:21, Rok Mihevc wrote:
>>> >
>>> > > The vote is now 8 +1 votes, 1 +1 "when the merge scripts are ready" and 1
>>> > > -1 vote "until the labels are ready".
>>> > >
>>> > > Please correct me if I'm wrong, but I believe merge scripts and labels are
>>> > > now ready. If that is the case we can tally this vote as 10 +1 votes and
>>> > > proceed with disabling ASF Jira issue reporting. I'll wait 24 hours in case
>>> > > there are objections and then ask Infra to disable creating new issues.
>>> > >
>>> > > Rok
>>> > >
>>> > > On Mon, Nov 28, 2022 at 4:58 PM Matthew Topol wrote:
>>> > >
>>> > > > +1
>>> > > >
>>> > > > On Fri, Nov 25, 2022 at 10:31 AM Alessandro Molina wrote:
>>> > > >
>>> > > > > +1 as long as by "now" we actually mean "as soon as the necessary scripts
>>> > > > > have been ported to github"
>>> > > > >
>>> > > > > I mean, I doubt the plan is to disable jira before we can actually ship PRs
>>> > > > > from github issues and thus block development.
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > On Wed, Nov 23, 2022 at 22:37, Todd Farmer wrote:
>>> > > > >
>>> > > > > > Hello,
>>> > > > > >
>>> > > > > > I would like to propose that issue reporting in ASF Jira for the Apache
>>> > > > > > Arrow project be disabled, and all users directed to use GitHub issues for
>>> > > > > > reporting going forward. GitHub issue reporting is now enabled [1] in
>>> > > > > > response to a recent Infra policy change eliminating self-service user
>>> > > > > > registration for ASF Jira accounts. The Apache Arrow project has already
>>> > > > > > voted in support of migrating issue tracking from ASF Jira to GitHub issues
>>> > > > > > [2], and migration work is ongoing [3].
>>> > > > > >
>>> > > > > > Disabling ASF Jira issue reporting will move all such work to GitHub
>>> > > > > > issues. I expect that usage of this new platform by all participants - not
>>> > > > > > just new community members lacking ASF Jira accounts - will expedite
>>> > > > > > further discovery and improvements to this platf