Re: Arrow sync call August 3 at 12:00 US/Eastern, 16:00 UTC

2021-08-04 Thread Jonathan Keane
Notes from the meeting, which was relatively short and sparsely attended
this fortnight:

Attendees:
* David Li
* Jonathan Keane
* Nic Crane
* Neal Richardson

Topics discussed
* Compute IR proposal: There's been some discussion, check it out
* CRAN resubmission, we have the fixes we need, will send the
resubmission shortly

Thanks to all who attended, see y'all in a fortnight.

-Jon

On Tue, Aug 3, 2021 at 4:20 PM Jonathan Keane  wrote:
>
> Hello everyone,
>
> Our biweekly sync call is tomorrow (3 August) at 12:00 noon Eastern time.
>
> For today's call, let's please use this Google Meet URL (different from the
> usual one):
> https://meet.google.com/vbq-yufg-zwr?authuser=0
>
> All are welcome to join. Notes will be shared with the mailing list
> afterward.
>
> Thanks,
> -Jon


Re: Support for Co-authored-by tag on individual commits when integrating pull requests

2021-08-04 Thread Sutou Kouhei
Hi Kevin and Fiona,

Sorry for not noticing it on the merge and thanks for
opening a JIRA issue for this. Please ping me on GitHub when
a pull request for the issue is created. I can review it.


Thanks,
-- 
kou

In "Re: Support for Co-authored-by tag on individual commits when integrating
pull requests" on Wed, 4 Aug 2021 20:11:38 +,
  Fiona La  wrote:

> Thanks Wes and Kevin!
> 
> I have opened a Jira ticket for tracking this work: 
> https://issues.apache.org/jira/browse/ARROW-13564.
> 
> Regards,
> Fiona
> 
> From: Kevin Gurney 
> Date: Wednesday, August 4, 2021 at 4:00 PM
> To: dev 
> Cc: Fiona La 
> Subject: Re: Support for Co-authored-by tag on individual commits when 
> integrating pull requests
> Hi Wes,
> 
> Thank you for the quick response!
> 
> No need to apologize! The Co-authored-by workflow is new to us, so we are 
> learning what works as we go.
> 
> In terms of adding Fiona's name to the pull request that's already been 
> integrated, we appreciate your consideration, but understand if this is too 
> difficult to fix in the main branch at this point.
> 
> To prevent this issue from occurring in the future, we will open a pull 
> request to modify the merge_arrow_pr.py script to scrape "Co-authored-by" 
> tags as suggested.
> 
> Thank you!
> 
> Kevin
> 
> 
> From: Wes McKinney 
> Sent: Wednesday, August 4, 2021 11:02 AM
> To: dev 
> Cc: Fiona La 
> Subject: Re: Support for Co-authored-by tag on individual commits when 
> integrating pull requests
> 
> hi Kevin,
> 
> Unfortunately, I don't think it's possible to amend the existing
> commit logs because that would require force-pushing the main branch.
> I suppose we could revert the commit and push a new commit with the
> commit message fixed.
> 
>> We realized after the pull request was integrated that Fiona may have gotten 
>> credit if she pushed at least one commit from a separate GitHub account. 
>> Although, we aren't 100% sure if this is true.
> 
> Indeed, if Fiona's e-mail address was in the git Author field for any
> commit in the PR, the PR merge script would have added a
> "Co-authored-by:" message to the squashed commit message.
> 
> I think the next step here is to modify the PR merge script to scrape
> any "Co-authored-by:" lines from the individual commit messages so
> they can all be listed in the combined PR message.
> 
> Sorry about this, this is the first incidence of this particular issue
> occurring to my knowledge.
> 
> Thanks
> Wes
> 
> On Wed, Aug 4, 2021 at 9:46 AM Kevin Gurney  wrote:
>>
>> Hi All,
>>
>> Fiona La (Cc'd) and I recently worked together with Kou to integrate some 
>> changes to the MATLAB interface (pull request: 
>> https://github.com/apache/arrow/pull/10614).
>>  Fiona and I pair programmed the implementation together on "one machine", 
>> using my GitHub account to push commits. We used GitHub's support for 
>> Co-authored-by tags 
>> (https://docs.github.com/en/github/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors)
>>  to include Fiona's name on every commit. We thought this would be 
>> sufficient to ensure that her name was included in the main Apache Arrow git 
>> history after the commits were squashed and integrated by Kou. 
>> Unfortunately, it looks like her name was dropped from the list of 
>> Co-authors during integration.
>>
>> In order to ensure that all contributors to the project get credit:
>>
>> 1. Is there an existing, recommended best practice for pair programming on 
>> pull requests that ensures all contributors get credit?
>> * We realized after the pull request was integrated that Fiona may have 
>> gotten credit if she pushed at least one commit from a separate GitHub 
>> account. Although, we aren't 100% sure if this is true.
>> 2. It looks like 
>> https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py
>>  does not support the Co-authored-by tag workflow on individual commits 
>> described above.
>> * We are interested in opening a pull request to modify merge_arrow_pr.py to 
>> add support for this workflow.
>> 3. Is there a way to retroactively add Fiona's name to the git history for 
>> https://github.com/apache/arrow/pull/10614
>>  so she receives credit?
>>
>> Thank you!
>>
>> Kevin Gurney


Re: Support for Co-authored-by tag on individual commits when integrating pull requests

2021-08-04 Thread Fiona La
Thanks Wes and Kevin!

I have opened a Jira ticket for tracking this work: 
https://issues.apache.org/jira/browse/ARROW-13564.

Regards,
Fiona

From: Kevin Gurney 
Date: Wednesday, August 4, 2021 at 4:00 PM
To: dev 
Cc: Fiona La 
Subject: Re: Support for Co-authored-by tag on individual commits when 
integrating pull requests
Hi Wes,

Thank you for the quick response!

No need to apologize! The Co-authored-by workflow is new to us, so we are 
learning what works as we go.

In terms of adding Fiona's name to the pull request that's already been 
integrated, we appreciate your consideration, but understand if this is too 
difficult to fix in the main branch at this point.

To prevent this issue from occurring in the future, we will open a pull request 
to modify the merge_arrow_pr.py script to scrape "Co-authored-by" tags as 
suggested.

Thank you!

Kevin


From: Wes McKinney 
Sent: Wednesday, August 4, 2021 11:02 AM
To: dev 
Cc: Fiona La 
Subject: Re: Support for Co-authored-by tag on individual commits when 
integrating pull requests

hi Kevin,

Unfortunately, I don't think it's possible to amend the existing
commit logs because that would require force-pushing the main branch.
I suppose we could revert the commit and push a new commit with the
commit message fixed.

> We realized after the pull request was integrated that Fiona may have gotten 
> credit if she pushed at least one commit from a separate GitHub account. 
> Although, we aren't 100% sure if this is true.

Indeed, if Fiona's e-mail address was in the git Author field for any
commit in the PR, the PR merge script would have added a
"Co-authored-by:" message to the squashed commit message.

I think the next step here is to modify the PR merge script to scrape
any "Co-authored-by:" lines from the individual commit messages so
they can all be listed in the combined PR message.

Sorry about this, this is the first incidence of this particular issue
occurring to my knowledge.

Thanks
Wes

On Wed, Aug 4, 2021 at 9:46 AM Kevin Gurney  wrote:
>
> Hi All,
>
> Fiona La (Cc'd) and I recently worked together with Kou to integrate some 
> changes to the MATLAB interface (pull request: 
> https://github.com/apache/arrow/pull/10614).
>  Fiona and I pair programmed the implementation together on "one machine", 
> using my GitHub account to push commits. We used GitHub's support for 
> Co-authored-by tags 
> (https://docs.github.com/en/github/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors)
>  to include Fiona's name on every commit. We thought this would be sufficient 
> to ensure that her name was included in the main Apache Arrow git history 
> after the commits were squashed and integrated by Kou. Unfortunately, it 
> looks like her name was dropped from the list of Co-authors during 
> integration.
>
> In order to ensure that all contributors to the project get credit:
>
> 1. Is there an existing, recommended best practice for pair programming on 
> pull requests that ensures all contributors get credit?
> * We realized after the pull request was integrated that Fiona may have 
> gotten credit if she pushed at least one commit from a separate GitHub 
> account. Although, we aren't 100% sure if this is true.
> 2. It looks like 
> https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py
>  does not support the Co-authored-by tag workflow on individual commits 
> described above.
> * We are interested in opening a pull request to modify merge_arrow_pr.py to 
> add support for this workflow.
> 3. Is there a way to retroactively add Fiona's name to the git history for 
> https://github.com/apache/arrow/pull/10614
>  so she receives credit?
>
> Thank you!
>
> Kevin Gurney


Re: Support for Co-authored-by tag on individual commits when integrating pull requests

2021-08-04 Thread Kevin Gurney
Hi Wes,

Thank you for the quick response!

No need to apologize! The Co-authored-by workflow is new to us, so we are 
learning what works as we go.

In terms of adding Fiona's name to the pull request that's already been 
integrated, we appreciate your consideration, but understand if this is too 
difficult to fix in the main branch at this point.

To prevent this issue from occurring in the future, we will open a pull request 
to modify the merge_arrow_pr.py script to scrape "Co-authored-by" tags as 
suggested.

Thank you!

Kevin


From: Wes McKinney 
Sent: Wednesday, August 4, 2021 11:02 AM
To: dev 
Cc: Fiona La 
Subject: Re: Support for Co-authored-by tag on individual commits when 
integrating pull requests

hi Kevin,

Unfortunately, I don't think it's possible to amend the existing
commit logs because that would require force-pushing the main branch.
I suppose we could revert the commit and push a new commit with the
commit message fixed.

> We realized after the pull request was integrated that Fiona may have gotten 
> credit if she pushed at least one commit from a separate GitHub account. 
> Although, we aren't 100% sure if this is true.

Indeed, if Fiona's e-mail address was in the git Author field for any
commit in the PR, the PR merge script would have added a
"Co-authored-by:" message to the squashed commit message.

I think the next step here is to modify the PR merge script to scrape
any "Co-authored-by:" lines from the individual commit messages so
they can all be listed in the combined PR message.

Sorry about this, this is the first incidence of this particular issue
occurring to my knowledge.

Thanks
Wes

On Wed, Aug 4, 2021 at 9:46 AM Kevin Gurney  wrote:
>
> Hi All,
>
> Fiona La (Cc'd) and I recently worked together with Kou to integrate some 
> changes to the MATLAB interface (pull request: 
> https://github.com/apache/arrow/pull/10614).
>  Fiona and I pair programmed the implementation together on "one machine", 
> using my GitHub account to push commits. We used GitHub's support for 
> Co-authored-by tags 
> (https://docs.github.com/en/github/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors)
>  to include Fiona's name on every commit. We thought this would be sufficient 
> to ensure that her name was included in the main Apache Arrow git history 
> after the commits were squashed and integrated by Kou. Unfortunately, it 
> looks like her name was dropped from the list of Co-authors during 
> integration.
>
> In order to ensure that all contributors to the project get credit:
>
> 1. Is there an existing, recommended best practice for pair programming on 
> pull requests that ensures all contributors get credit?
> * We realized after the pull request was integrated that Fiona may have 
> gotten credit if she pushed at least one commit from a separate GitHub 
> account. Although, we aren't 100% sure if this is true.
> 2. It looks like 
> https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py
>  does not support the Co-authored-by tag workflow on individual commits 
> described above.
> * We are interested in opening a pull request to modify merge_arrow_pr.py to 
> add support for this workflow.
> 3. Is there a way to retroactively add Fiona's name to the git history for 
> https://github.com/apache/arrow/pull/10614
>  so she receives credit?
>
> Thank you!
>
> Kevin Gurney


Re: Review request for Dataset Java API PRs

2021-08-04 Thread Paul Whalen
I would love to see a component or some standardization around Java using the C 
Data Interface. I’ve been prototyping JNI bindings for DataFusion in the last 
week or so with some success, and was getting ready to ask where/how such a 
thing might fit in. I’ll be sure to watch that JIRA. 

(I also prototyped with a Panama EA build, but obviously we’re a long way from 
using that)

Paul

> On Aug 4, 2021, at 11:11 AM, Antoine Pitrou  wrote:
> 
> 
> I don't know about the rest of these tasks, but sharing data between Arrow 
> Java and C++ should definitely use the C data interface.
> 
> It seems there's work in progress here, feel free to collaborate:
> https://issues.apache.org/jira/browse/ARROW-12965
> 
> Regards
> 
> Antoine.
> 
> 
>> Le 04/08/2021 à 17:45, Micah Kornfield a écrit :
>> Hi Hongze,
>> Sorry I started taking a look at these a while ago, but my focus has been
>> elsewhere with the time I have available to contribute to the project.  One
>> thing that can also help is if there is a way to divide any of the PRs into
>> smaller standalone components it would likely help get them merged sooner
>> (I seem to recall at least one PR redid both how memory management was
>> working between C++ and Java as well as adding more functionality for
>> datasets, apologies if I am misremembering).
>>  If other people have time to review that would be great.
>> Thanks,
>> Micah
>>> On Wed, Aug 4, 2021 at 6:11 AM Wes McKinney  wrote:
>>> hi Hongze — I am not sure who will be able to review these, but in the
>>> future feel free to raise your Java PRs on the mailing list even
>>> sooner, no need to wait for more than a month. There are far fewer
>>> active Java developers vs. C++ or Rust, so it can help to get people's
>>> attention on your work.
>>> 
>>> - Wes
>>> 
>>> On Tue, Aug 3, 2021 at 9:44 PM Hongze Zhang  wrote:
 
>>>> Hi,
>>>>
>>>> I have some PRs to improve the Dataset API's Java implementation that
>>>> have not been reviewed for months. Could someone help me to review
>>>> them? Thanks in advance.
>>>>
>>>> 1. https://github.com/apache/arrow/pull/10201
>>>> ARROW-11776: [Java][Dataset] Support writing to files within dataset
>>>> scanner via JNI
>>>> 2. https://github.com/apache/arrow/pull/10333
>>>> ARROW-12607: [Website] Doc section for Dataset Java bindings
>>>> 3. https://github.com/apache/arrow/pull/10114
>>>> ARROW-12480: [Java][Dataset] FileSystemDataset: Support reading from a
>>>> directory
>>>> 4. https://github.com/apache/arrow/pull/10652
>>>> ARROW-13257: [Java][Dataset] Allow passing empty columns for projection
>>>>
>>>> One of the most critical changes among the PRs is to add write support
>>>> to Java API (The first in the list). This also includes some work that
>>>> builds a common way to share Arrow data between C++ and Java over JNI.
>>>> Also this work was pretty close to the proposal in ARROW-7272[1].
>>>>
>>>> Other PRs are minor improvements like the second one to create Java
>>>> Dataset doc page on Arrow website. It also received some review
>>>> comments already.
>>>>
>>>> Thanks,
>>>> Hongze
>>>>
>>>> [1] https://issues.apache.org/jira/browse/ARROW-7272
 
>>> 


Re: Review request for Dataset Java API PRs

2021-08-04 Thread Antoine Pitrou



I don't know about the rest of these tasks, but sharing data between 
Arrow Java and C++ should definitely use the C data interface.


It seems there's work in progress here, feel free to collaborate:
https://issues.apache.org/jira/browse/ARROW-12965

Regards

Antoine.


Le 04/08/2021 à 17:45, Micah Kornfield a écrit :

Hi Hongze,
Sorry I started taking a look at these a while ago, but my focus has been
elsewhere with the time I have available to contribute to the project.  One
thing that can also help is if there is a way to divide any of the PRs into
smaller standalone components it would likely help get them merged sooner
(I seem to recall at least one PR redid both how memory management was
working between C++ and Java as well as adding more functionality for
datasets, apologies if I am misremembering).

  If other people have time to review that would be great.

Thanks,
Micah

On Wed, Aug 4, 2021 at 6:11 AM Wes McKinney  wrote:


hi Hongze — I am not sure who will be able to review these, but in the
future feel free to raise your Java PRs on the mailing list even
sooner, no need to wait for more than a month. There are far fewer
active Java developers vs. C++ or Rust, so it can help to get people's
attention on your work.

- Wes

On Tue, Aug 3, 2021 at 9:44 PM Hongze Zhang  wrote:


Hi,

I have some PRs to improve the Dataset API's Java implementation that
have not been reviewed for months. Could someone help me to review
them? Thanks in advance.

1. https://github.com/apache/arrow/pull/10201
ARROW-11776: [Java][Dataset] Support writing to files within dataset
scanner via JNI
2. https://github.com/apache/arrow/pull/10333
ARROW-12607: [Website] Doc section for Dataset Java bindings
3. https://github.com/apache/arrow/pull/10114
ARROW-12480: [Java][Dataset] FileSystemDataset: Support reading from a
directory
4. https://github.com/apache/arrow/pull/10652
ARROW-13257: [Java][Dataset] Allow passing empty columns for projection

One of the most critical changes among the PRs is to add write support
to Java API (The first in the list). This also includes some work that
builds a common way to share Arrow data between C++ and Java over JNI.
Also this work was pretty close to the proposal in ARROW-7272[1].

Other PRs are minor improvements like the second one to create Java
Dataset doc page on Arrow website. It also received some review
comments already.

Thanks,
Hongze

[1] https://issues.apache.org/jira/browse/ARROW-7272







Re: Review request for Dataset Java API PRs

2021-08-04 Thread Micah Kornfield
Hi Hongze,
Sorry I started taking a look at these a while ago, but my focus has been
elsewhere with the time I have available to contribute to the project.  One
thing that can also help is if there is a way to divide any of the PRs into
smaller standalone components it would likely help get them merged sooner
(I seem to recall at least one PR redid both how memory management was
working between C++ and Java as well as adding more functionality for
datasets, apologies if I am misremembering).

 If other people have time to review that would be great.

Thanks,
Micah

On Wed, Aug 4, 2021 at 6:11 AM Wes McKinney  wrote:

> hi Hongze — I am not sure who will be able to review these, but in the
> future feel free to raise your Java PRs on the mailing list even
> sooner, no need to wait for more than a month. There are far fewer
> active Java developers vs. C++ or Rust, so it can help to get people's
> attention on your work.
>
> - Wes
>
> On Tue, Aug 3, 2021 at 9:44 PM Hongze Zhang  wrote:
> >
> > Hi,
> >
> > I have some PRs to improve the Dataset API's Java implementation that
> > have not been reviewed for months. Could someone help me to review
> > them? Thanks in advance.
> >
> > 1. https://github.com/apache/arrow/pull/10201
> > ARROW-11776: [Java][Dataset] Support writing to files within dataset
> > scanner via JNI
> > 2. https://github.com/apache/arrow/pull/10333
> > ARROW-12607: [Website] Doc section for Dataset Java bindings
> > 3. https://github.com/apache/arrow/pull/10114
> > ARROW-12480: [Java][Dataset] FileSystemDataset: Support reading from a
> > directory
> > 4. https://github.com/apache/arrow/pull/10652
> > ARROW-13257: [Java][Dataset] Allow passing empty columns for projection
> >
> > One of the most critical changes among the PRs is to add write support
> > to Java API (The first in the list). This also includes some work that
> > builds a common way to share Arrow data between C++ and Java over JNI.
> > Also this work was pretty close to the proposal in ARROW-7272[1].
> >
> > Other PRs are minor improvements like the second one to create Java
> > Dataset doc page on Arrow website. It also received some review
> > comments already.
> >
> > Thanks,
> > Hongze
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-7272
> >
>


Re: Support for Co-authored-by tag on individual commits when integrating pull requests

2021-08-04 Thread Wes McKinney
hi Kevin,

Unfortunately, I don't think it's possible to amend the existing
commit logs because that would require force-pushing the main branch.
I suppose we could revert the commit and push a new commit with the
commit message fixed.

> We realized after the pull request was integrated that Fiona may have gotten 
> credit if she pushed at least one commit from a separate GitHub account. 
> Although, we aren't 100% sure if this is true.

Indeed, if Fiona's e-mail address was in the git Author field for any
commit in the PR, the PR merge script would have added a
"Co-authored-by:" message to the squashed commit message.

I think the next step here is to modify the PR merge script to scrape
any "Co-authored-by:" lines from the individual commit messages so
they can all be listed in the combined PR message.
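
A minimal sketch of that scraping step, in Python since merge_arrow_pr.py
is a Python script. This is not the script's actual code: the function
names collect_authors and squashed_message, the "Authored-by:" line, the
example title, and the example.com addresses are all assumptions made for
illustration. It takes (author, message) pairs, as they might be read from
git log for the commits in a PR, counts each git Author field, and also
counts every "Co-authored-by:" trailer found in the individual messages:

```python
import re
from collections import Counter

# Matches a trailer line such as
# "Co-authored-by: Jane Doe <jane@example.com>" (case-insensitive).
CO_AUTHOR_RE = re.compile(
    r"^co-authored-by:[ \t]*(?P<name>[^<\n]+?)[ \t]*<(?P<email>[^>\n]+)>[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def collect_authors(commits):
    """Return distinct "Name <email>" strings, most frequent first.

    `commits` is a list of (author, message) pairs, where `author` is the
    "Name <email>" string from the git Author field and `message` is the
    full commit message. Both shapes are assumptions of this sketch.
    """
    counts = Counter()
    for author, message in commits:
        counts[author] += 1
        # Credit everyone named in a Co-authored-by trailer as well.
        for match in CO_AUTHOR_RE.finditer(message):
            counts[f"{match['name']} <{match['email']}>"] += 1
    # most_common() sorts by count; ties keep first-seen order.
    return [author for author, _ in counts.most_common()]


def squashed_message(title, commits):
    """Build a squash commit message that lists every collected author."""
    primary, *others = collect_authors(commits)
    lines = [title, "", f"Authored-by: {primary}"]
    lines += [f"Co-authored-by: {other}" for other in others]
    return "\n".join(lines)
```

With two commits pushed from one account, each carrying a
"Co-authored-by:" trailer for a second person, squashed_message() would
list the pushing account first and the second person on a
"Co-authored-by:" line, which is the credit that was lost in this case.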

Sorry about this, this is the first incidence of this particular issue
occurring to my knowledge.

Thanks
Wes

On Wed, Aug 4, 2021 at 9:46 AM Kevin Gurney  wrote:
>
> Hi All,
>
> Fiona La (Cc'd) and I recently worked together with Kou to integrate some 
> changes to the MATLAB interface (pull request: 
> https://github.com/apache/arrow/pull/10614). Fiona and I pair programmed the 
> implementation together on "one machine", using my GitHub account to push 
> commits. We used GitHub's support for Co-authored-by tags 
> (https://docs.github.com/en/github/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors)
>  to include Fiona's name on every commit. We thought this would be sufficient 
> to ensure that her name was included in the main Apache Arrow git history 
> after the commits were squashed and integrated by Kou. Unfortunately, it 
> looks like her name was dropped from the list of Co-authors during 
> integration.
>
> In order to ensure that all contributors to the project get credit:
>
>   1.  Is there an existing, recommended best practice for pair programming on 
> pull requests that ensures all contributors get credit?
>  *   We realized after the pull request was integrated that Fiona may 
> have gotten credit if she pushed at least one commit from a separate GitHub 
> account. Although, we aren't 100% sure if this is true.
>   2.  It looks like 
> https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py does not 
> support the Co-authored-by tag workflow on individual commits described above.
>  *   We are interested in opening a pull request to modify 
> merge_arrow_pr.py to add support for this workflow.
>   3.  Is there a way to retroactively add Fiona's name to the git history for 
> https://github.com/apache/arrow/pull/10614 so she receives credit?
>
> Thank you!
>
> Kevin Gurney


Support for Co-authored-by tag on individual commits when integrating pull requests

2021-08-04 Thread Kevin Gurney
Hi All,

Fiona La (Cc'd) and I recently worked together with Kou to integrate some 
changes to the MATLAB interface (pull request: 
https://github.com/apache/arrow/pull/10614). Fiona and I pair programmed the 
implementation together on "one machine", using my GitHub account to push 
commits. We used GitHub's support for Co-authored-by tags 
(https://docs.github.com/en/github/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors)
 to include Fiona's name on every commit. We thought this would be sufficient 
to ensure that her name was included in the main Apache Arrow git history after 
the commits were squashed and integrated by Kou. Unfortunately, it looks like 
her name was dropped from the list of Co-authors during integration.

In order to ensure that all contributors to the project get credit:

  1.  Is there an existing, recommended best practice for pair programming on 
pull requests that ensures all contributors get credit?
 *   We realized after the pull request was integrated that Fiona may have 
gotten credit if she pushed at least one commit from a separate GitHub account. 
Although, we aren't 100% sure if this is true.
  2.  It looks like 
https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py does not 
support the Co-authored-by tag workflow on individual commits described above.
 *   We are interested in opening a pull request to modify 
merge_arrow_pr.py to add support for this workflow.
  3.  Is there a way to retroactively add Fiona's name to the git history for 
https://github.com/apache/arrow/pull/10614 so she receives credit?

Thank you!

Kevin Gurney


Re: Review request for Dataset Java API PRs

2021-08-04 Thread Wes McKinney
hi Hongze — I am not sure who will be able to review these, but in the
future feel free to raise your Java PRs on the mailing list even
sooner, no need to wait for more than a month. There are far fewer
active Java developers vs. C++ or Rust, so it can help to get people's
attention on your work.

- Wes

On Tue, Aug 3, 2021 at 9:44 PM Hongze Zhang  wrote:
>
> Hi,
>
> I have some PRs to improve the Dataset API's Java implementation that
> have not been reviewed for months. Could someone help me to review
> them? Thanks in advance.
>
> 1. https://github.com/apache/arrow/pull/10201
> ARROW-11776: [Java][Dataset] Support writing to files within dataset
> scanner via JNI
> 2. https://github.com/apache/arrow/pull/10333
> ARROW-12607: [Website] Doc section for Dataset Java bindings
> 3. https://github.com/apache/arrow/pull/10114
> ARROW-12480: [Java][Dataset] FileSystemDataset: Support reading from a
> directory
> 4. https://github.com/apache/arrow/pull/10652
> ARROW-13257: [Java][Dataset] Allow passing empty columns for projection
>
> One of the most critical changes among the PRs is to add write support
> to Java API (The first in the list). This also includes some work that
> builds a common way to share Arrow data between C++ and Java over JNI.
> Also this work was pretty close to the proposal in ARROW-7272[1].
>
> Other PRs are minor improvements like the second one to create Java
> Dataset doc page on Arrow website. It also received some review
> comments already.
>
> Thanks,
> Hongze
>
> [1] https://issues.apache.org/jira/browse/ARROW-7272
>


Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-04 Thread QP Hou
Just my two cents.

I think we all have the same goal here, which is to accelerate the
transitioning of arrow to arrow2 as the official arrow rust
implementation.

In my opinion, the biggest gain we can get from merging two projects
into one repo is to have some kind of a policy to enforce that every
new feature/test added to the current arrow implementation also  needs
to be added to the arrow2 implementation. This way, we can make sure
the gap between arrow and arrow2 is closing on every iteration.
Without this, I tend to agree with Jorge that merging two repos would
add more overhead to his work and slow him down.

For those who want to contribute to arrow2 to accelerate the
transition, I don't think they would have a problem sending PRs to the
arrow2 repo. For those who are not interested in contributing to
arrow2, merging the arrow2 code base into the current arrow-rs repo
won't incentivize them to contribute. Merging arrow2 into current
arrow-rs repo could help with discovery. But I think this can be
achieved by adding a big note in the current arrow-rs README to
encourage contributions to the arrow2 repo as well.

At the end of the day, Jorge is currently the sole active contributor
to the arrow2 implementation, so I think he would have the most say on
what's the most productive way to push arrow2 forward. The only
concern I have with regard to merging arrow2 into arrow-rs right now
is that Jorge could spend all the effort to do the merge, only for it
to turn out that he is still the only active contributor to arrow2
within arrow-rs, but with more overhead to deal with.

As for maintaining semantic versioning for arrow2, Andy had a good
point that we could still release arrow2 with its own versioning even
if we merge it into the arrow-rs repo. So I don't think we should
worry/focus too much about versioning in our discussion. Velocity to
close the gap between arrow-rs and arrow2 is the most important thing.

Lastly, I do agree with Andrew that it would be good to only maintain
a single arrow crate in crates.io in the long run. As he mentioned,
when the current arrow2 code base becomes stable, we could still
release it under the arrow namespace in crates.io with a major version
bump. The absolute value in the major version doesn't really matter as
long as we stick to the convention that a breaking change will result in
a major version bump.
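
As a rough illustration of that convention, here is a small Python sketch.
The function name next_version and the version numbers used with it are
made up for the example, and it deliberately ignores pre-release tags and
the looser rules that apply below 1.0:

```python
def next_version(current, breaking):
    """Return the next "major.minor.patch" version string.

    A breaking change bumps the major number and resets the rest;
    anything else is treated here as a minor bump. This is only a
    sketch of the convention, not a full semantic-versioning parser.
    """
    major, minor, _patch = (int(part) for part in current.split("."))
    if breaking:
        return f"{major + 1}.0.0"
    return f"{major}.{minor + 1}.0"
```

Under this rule the size of the major number carries no meaning of its
own; it only records how many breaking releases there have been.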

Thanks,
QP



On Tue, Aug 3, 2021 at 5:31 PM paddy horan  wrote:
>
> Hi Jorge,
>
> I see value in consolidating development in a single repo and releasing under 
> the existing arrow crate.  Regarding versioning, I think once we follow 
> semantic versioning we are fine.  I don't think it's worth migrating to a 
> different repo and crate to comply with the de-facto standard you mention.
>
> Just one person's opinion though,
> Paddy
>
>
> -Original Message-
> From: Jorge Cardoso Leitão 
> Sent: Tuesday, August 3, 2021 5:23 PM
> To: dev@arrow.apache.org
> Subject: Re: [Discuss] [Rust] Arrow2/parquet2 going foward
>
> Hi Paddy,
>
> > What do you think about moving Arrow2 into the main Arrow repo where it
> > is only enabled via an "experimental" feature flag?
>
> AFAIK this is already possible:
> * add `arrow2 = { version = "0.2.0", optional = true }` to Cargo.toml
> * add `#[cfg(feature = "arrow2")]\npub mod arrow2;\n` to lib.rs
>
> We do this kind of thing to expose APIs from non-arrow crates such as parts 
> of the parquet-format-rs crate, and it is generally the way to go when a crate
> wants to expose a third-party API.
>
> I would not recommend doing this, though: by exposing arrow2 from arrow, we 
> double the compilation time and binary size of all dependencies that activate 
> the flag. Furthermore, there are users of arrow2 that do not need the arrow 
> crate, which this model would not support.
>
> AFAIK where development happens is unrelated to this aspect, Rust enables 
> this by design.
>
> > but also this would be a clear signal that Arrow2 is <1.0.
> > the experimental flag will be a clear signal to the existing Arrow
> > community that Arrow2 is the future but that it is <1.0
>
> arrow2 is already <1.0.
> My argument is that the arrow/arrow-flight/parquet crates are not versioned
> according to the Rust community standards: It is a de facto practice in Rust 
> to delay major releases until the API is stable. Tokio's blog post about 
> their 1.0 
>