Re: post-release tasks (4.0.1)

2021-06-18 Thread Jorge Cardoso Leitão
Sutou was able to push them (we needed to log in via yarn, not npm).

I think that we are all good; the last item was done. I have created a
short post [1] for it.

Do we usually announce it anywhere else?

Best,
Jorge


[1] https://github.com/apache/arrow-site/pull/122




On Sat, Jun 12, 2021 at 5:57 AM Jorge Cardoso Leitão
 wrote:
>
> Thanks a lot, Krisztian.
>
> The JS packages are still missing. I already have access to npm (thanks 
> Sutou). As part of the npm-release.sh in 4.0.1, we require all tests to pass 
> [1]. However, there are tests failing on my computer [2], which blocks the 
> release.
>
> What is the procedure when this happens? We have voted on the release and we 
> have shipped the other packages, so my feeling is that this should not block 
> an in-progress release; in my opinion we should not run the tests prior to 
> publication, as there is nothing we can do at that point in time.
>
> One idea is to manually comment out the test run for this release and open an 
> issue proposing that we stop running the tests prior to the "publish" script, 
> since at that point it is too late to gate the release; that gating should 
> have happened before we reached this point.
>
> [1] https://github.com/apache/arrow/blob/release-4.0.1/js/npm-release.sh#L23
> [2] https://issues.apache.org/jira/browse/ARROW-13046
>
> Best,
> Jorge
>
>
> On Thu, Jun 10, 2021 at 1:15 PM Krisztián Szűcs  
> wrote:
>>
>> On Thu, Jun 10, 2021 at 6:57 AM Jorge Cardoso Leitão
>>  wrote:
>> >
>> > I have been unable to generate the docs on either of my two machines (my
>> > MacBook and a VM on Azure), and I do not think we should delay this
>> > further. Could someone kindly create a PR with the generated docs to the
>> > website?
>> Hi!
>>
>> I'm going to update the docs.
>> Are there any remaining post-release tasks?
>>
>> Thanks, Krisztian
>> >
>> > I think that the command amounts to "dev/release/post-09-docs.sh 4.0.1".
>> >
>> > Best,
>> > Jorge
>> >
>> >
>> >
>> >
>> >
>> > On Sun, Jun 6, 2021 at 6:39 AM Jorge Cardoso Leitão <
>> > jorgecarlei...@gmail.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > Sorry for the delay on this, but it has not been easy to build the docs
>> > > [1-5], which is why this is taking some time. It seems that our CI caches
>> > > Docker layers when testing, which causes it to miss errors in those
>> > > layers that only surface when the image is built from scratch.
>> > >
>> > > Best,
>> > > Jorge
>> > >
>> > > [1] https://issues.apache.org/jira/browse/ARROW-12971
>> > > [2] https://issues.apache.org/jira/browse/ARROW-12846
>> > > [3] https://issues.apache.org/jira/browse/ARROW-12909
>> > > [4] https://issues.apache.org/jira/browse/ARROW-12915
>> > > [5] https://issues.apache.org/jira/browse/ARROW-12954
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, May 31, 2021 at 1:03 PM Jorge Cardoso Leitão <
>> > > jorgecarlei...@gmail.com> wrote:
>> > >
>> > >> Thanks a lot, both.
>> > >>
>> > >> Accepted. Will upload this evening.
>> > >>
>> > >> Best,
>> > >> Jorge
>> > >>
>> > >>
>> > >> On Mon, May 31, 2021 at 12:55 PM Krisztián Szűcs <
>> > >> szucs.kriszt...@gmail.com> wrote:
>> > >>
>> > >>> On Sun, May 30, 2021 at 7:37 PM Jorge Cardoso Leitão
>> > >>>  wrote:
>> > >>> >
>> > >>> > Hi,
>> > >>> >
>> > >>> > Sorry for the delay here.
>> > >>> >
>> > >>> > Below is the list of post-release tasks:
>> > >>> >
>> > >>> 1.  [Krisztian] open a pull request to bump the version numbers in the
>> > >>> source code
>> > >>> 2.  [x] upload source
>> > >>> 3.  [x] upload binaries
>> > >>> 4.  [x] update website
>> > >>> 5.  [x] upload ruby gems
>> > >>> 6.  [Jorge] upload js packages
>> > >>> 8.  [x] upload C# packages
>> > >>> 9.  [Won't do] upload rust crates
>> > >>> 10. [x] update conda recipes
>> > >>> 11. [x] upload wheels/sdist to pypi
>> > >>> 12. [x] update homebrew packages
>> > >>> 13. [x] update maven artifacts
>> > >>> 14. [x] update msys2
>> > >>> 15. [x] update R packages
>> > >>> 16. [Jorge] update docs (in progress, Jorge)
>> > >>> >
>> > >>> > * Could someone add me to npmjs so that I can publish it there
>> > >>> > (jorgecarleitao [1])?
>> > >>> Sent you an invite, could you please check it?
>> > >>> >
>> > >>> > Thank you for your patience,
>> > >>> > Jorge
>> > >>> >
>> > >>> > [1] https://www.npmjs.com/~jorgecarleitao
>> > >>>
>> > >>


Re: [C++] Apache Arrow C++ Variadic Kernels Design

2021-06-18 Thread Wes McKinney
COUNT(DISTINCT varargs...) can be used either as a scalar aggregate
function or a group aggregate function. For example

SELECT COUNT(DISTINCT expr1, expr2, ...)
FROM TABLE;

returns a single value. It can be used with GROUP BY to produce a
distinct count per group. I think it would be useful to have it available
as a scalar aggregate function. Either way, it is good to know that our
aggregation exprs will need to support varargs.

SELECT DISTINCT is equivalent to our Unique. So

SELECT DISTINCT expr1, expr2, ...
FROM TABLE;

could be implemented by internally grouping the exprs into a
StructArray and calling Unique on the struct array. We could also simply
call the aggregation machinery with no aggregate exprs.

We might want to make Jira issues for the above if there are not some already.

On Fri, Jun 18, 2021 at 4:37 PM Ian Cook  wrote:
>
> > Aren't SELECT DISTINCT and COUNT DISTINCT just condensed variants of a 
> > GROUP BY query? Do they need to be exposed as standalone kernels?
>
> I listed SELECT DISTINCT and COUNT DISTINCT in the document only as
> examples of SQL statements that take a variable number of arguments,
> not to imply that these should be exposed as compute kernels in Arrow.
> But I think you are right to suggest that they do not really belong in
> this list, because as you say it is probably best to think of them as
> shortcut SQL syntax for obtaining results that could instead be
> obtained through a GROUP BY query. I have removed them.
>
> Thank you,
> Ian
>
> On Fri, Jun 18, 2021 at 2:26 PM Antoine Pitrou  wrote:
> >
> >
> > Aren't SELECT DISTINCT and COUNT DISTINCT just condensed variants of a
> > GROUP BY query? Do they need to be exposed as standalone kernels?
> >
> >
> > Le 18/06/2021 à 00:58, Ian Cook a écrit :
> > > Arrow developers,
> > >
> > > A couple of recent PRs have added new variadic scalar kernels to the
> > > Arrow C++ library (ARROW-12751, ARROW-12709). There were some
> > > questions raised in comments on Jira and GitHub about whether these
> > > could instead be implemented as unary or binary kernels that take
> > > ListArray or StructArray input. Since I believe we plan to add at
> > > least a few more variadic kernels, I wrote a document [1] with help
> > > from some colleagues at Ursa to describe the rationale behind why we
> > > believe it is best to implement these as variadic kernels. Feedback is
> > > welcome.
> > >
> > > Thank you,
> > > Ian
> > >
> > > [1] 
> > > https://docs.google.com/document/d/1ExysJ43WpjZ_P6vnfx6dzCRSvM-3qlqpc6gPjy9cNXM/
> > >


Re: [C++] Apache Arrow C++ Variadic Kernels Design

2021-06-18 Thread Ian Cook
> Aren't SELECT DISTINCT and COUNT DISTINCT just condensed variants of a GROUP 
> BY query? Do they need to be exposed as standalone kernels?

I listed SELECT DISTINCT and COUNT DISTINCT in the document only as
examples of SQL statements that take a variable number of arguments,
not to imply that these should be exposed as compute kernels in Arrow.
But I think you are right to suggest that they do not really belong in
this list, because as you say it is probably best to think of them as
shortcut SQL syntax for obtaining results that could instead be
obtained through a GROUP BY query. I have removed them.

Thank you,
Ian

On Fri, Jun 18, 2021 at 2:26 PM Antoine Pitrou  wrote:
>
>
> Aren't SELECT DISTINCT and COUNT DISTINCT just condensed variants of a
> GROUP BY query? Do they need to be exposed as standalone kernels?
>
>
> Le 18/06/2021 à 00:58, Ian Cook a écrit :
> > Arrow developers,
> >
> > A couple of recent PRs have added new variadic scalar kernels to the
> > Arrow C++ library (ARROW-12751, ARROW-12709). There were some
> > questions raised in comments on Jira and GitHub about whether these
> > could instead be implemented as unary or binary kernels that take
> > ListArray or StructArray input. Since I believe we plan to add at
> > least a few more variadic kernels, I wrote a document [1] with help
> > from some colleagues at Ursa to describe the rationale behind why we
> > believe it is best to implement these as variadic kernels. Feedback is
> > welcome.
> >
> > Thank you,
> > Ian
> >
> > [1] 
> > https://docs.google.com/document/d/1ExysJ43WpjZ_P6vnfx6dzCRSvM-3qlqpc6gPjy9cNXM/
> >


Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-18 Thread Gosh Arzumanyan
Hi David,

Thanks for poking me on this. I have been thinking about it but have not
gotten around to crafting a doc. Let me put together a rough proposal this
weekend. Afterwards I will need your help to bring it to a reviewable state.

Cheers,
Gosh

On Fri., 18 Jun. 2021, 18:11 David Li,  wrote:

> Following up here - Gosh, did you get a chance to put something together?
> Do you need/want help on this? This would also potentially be useful for
> FlightSQL. (See the discussion on GitHub:
> https://github.com/apache/arrow/pull/9368#discussion_r572941765)
>
> Best,
> David
>
> On Fri, Apr 16, 2021, at 10:59, Gosh Arzumanyan wrote:
> > Hi guys!
> >
> > Thanks for the feedback/info.
> > Let me try to put a proposal together. Though I guess I'll need some
> > assistance on crafting it both in terms of the structure of a proposal
> > expected in the Arrow community as well as technical guidance.
> >
> > Will share a doc with some ideas shortly so that we can start to iterate
> > over it.
> >
> > Cheers,
> > Gosh
> >
> > On Tue, Apr 13, 2021 at 6:55 PM Nate Bauernfeind <
> > natebauernfe...@deephaven.io> wrote:
> >
> > > > possibly in coordination with the Deephaven/Barrage team, if they're
> > > also still interested
> > >
> > > Good opportunity for me to chime in =). I think we still have interest in
> > > this feature. On the other thread, it took a little cajoling, but I've
> > > come around to agree with the conclusions of taking a RecordBatch and
> > > splitting it up (a set of RecordBatches for added rows followed by a set
> > > of RecordBatches for modifications). In this case I think it's best not to
> > > evolve the schema between added row RecordBatches and modified row
> > > RecordBatches (sending empty buffer nodes and field nodes will be
> > > significantly cheaper). However, the schema evolution would be very useful
> > > for when the rpc client changes the set of columns that they are
> > > subscribed to (which is relatively rare compared to when the subscribed
> > > table itself ticks).
> > >
> > > That said, schema evolution is not yet particularly high in our queue.
> > >
> > > On Tue, Apr 13, 2021 at 9:12 AM David Li  wrote:
> > >
> > > > Thanks for the details. I'll note a few things, but adding schema
> > > > evolution to Flight is reasonable, if you'd like to put together a
> > > > proposal for discussion (possibly in coordination with the
> > > > Deephaven/Barrage team, if they're also still interested).
> > > >
> > > >    3. Assume that there is a strong reason to query A1,..,AK together.
> > > >
> > > > While I don't know the details here, at least with Flight/gRPC, it's
> > > > not necessarily expensive to make several requests to the same server,
> > > > as gRPC will consolidate them into the same underlying network
> > > > connection. You could issue one GetFlightInfo request for all streams
> > > > at once, and get back a list of endpoints for each individual
> > > > subquery, which you could then issue separate DoGet requests for.
> > > >
> > > > There's a slight mismatch there in that GetFlightInfo returns a
> > > > FlightInfo, which assumes all endpoints have the same schema. But for
> > > > a specific application, you could ignore that field (nothing in Flight
> > > > checks that schema against the actual data).
> > > >
> > > > Of course, if said strong reason is that all the data is really
> > > > retrieved together despite being distinct datasets, then this would
> > > > complicate the server side implementation quite a bit. But it's one
> > > > option.
> > > >
> > > > > A potential way to address this (with the existing tools) could be
> > > > > having a union schema of all fields across all entities (potentially
> > > > > prefixed with the field name just like in sql joins) and setting the
> > > > > values to NA which do not belong to an entity.
> > > >
> > > > I had a similar use case in the past, and it was suggested to use
> > > > Arrow's Union type which handles this directly. A Union of Struct
> > > > types essentially lets you have multiple distinct schemas all encoded
> > > > in the same overall table, with explicit information about which
> > > > schema is currently in use. But as you point out this isn't helpful if
> > > > you don't know all the schemas up front.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On 2021/04/13 11:21:20, Gosh Arzumanyan  wrote:
> > > > > Hi David,
> > > > >
> > > > > Thanks for sharing the link!
> > > > >
> > > > > Here is how a potential use case might look:
> > > > >
> > > > >    1. Assume that we have a service S which accepts expressions in
> > > > >    some language X.
> > > > >    2. Assume that a typical query to this service requests entities
> > > > >    A_1, A_2,..,A_K. Each of those entities generates a stream of
> > > > >    record batches. Record batches for a single A_I share the same
> > > > >    schema, yet there is no guarantee that schemas are equal across
> > > > >    all streams.

Re: [C++] Apache Arrow C++ Variadic Kernels Design

2021-06-18 Thread Antoine Pitrou



Aren't SELECT DISTINCT and COUNT DISTINCT just condensed variants of a 
GROUP BY query? Do they need to be exposed as standalone kernels?



Le 18/06/2021 à 00:58, Ian Cook a écrit :

Arrow developers,

A couple of recent PRs have added new variadic scalar kernels to the
Arrow C++ library (ARROW-12751, ARROW-12709). There were some
questions raised in comments on Jira and GitHub about whether these
could instead be implemented as unary or binary kernels that take
ListArray or StructArray input. Since I believe we plan to add at
least a few more variadic kernels, I wrote a document [1] with help
from some colleagues at Ursa to describe the rationale behind why we
believe it is best to implement these as variadic kernels. Feedback is
welcome.

Thank you,
Ian

[1] 
https://docs.google.com/document/d/1ExysJ43WpjZ_P6vnfx6dzCRSvM-3qlqpc6gPjy9cNXM/



Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-18 Thread David Li
Following up here - Gosh, did you get a chance to put something together? Do 
you need/want help on this? This would also potentially be useful for 
FlightSQL. (See the discussion on GitHub: 
https://github.com/apache/arrow/pull/9368#discussion_r572941765)

Best,
David

On Fri, Apr 16, 2021, at 10:59, Gosh Arzumanyan wrote:
> Hi guys!
> 
> Thanks for the feedback/info.
> Let me try to put a proposal together. Though I guess I'll need some
> assistance on crafting it both in terms of the structure of a proposal
> expected in the Arrow community as well as technical guidance.
> 
> Will share a doc with some ideas shortly so that we can start to iterate
> over it.
> 
> Cheers,
> Gosh
> 
> On Tue, Apr 13, 2021 at 6:55 PM Nate Bauernfeind <
> natebauernfe...@deephaven.io> wrote:
> 
> > > possibly in coordination with the Deephaven/Barrage team, if they're also
> > still interested
> >
> > Good opportunity for me to chime in =). I think we still have interest in
> > this feature. On the other thread, it took a little cajoling, but I've come
> > around to agree with the conclusions of taking a RecordBatch and splitting
> > it up (a set of RecordBatches for added rows followed by a set of
> > RecordBatches for modifications). In this case I think it's best not to
> > evolve the schema between added row RecordBatches and modified row
> > RecordBatches (sending empty buffer nodes and field nodes will be
> > significantly cheaper). However, the schema evolution would be very useful
> > for when the rpc client changes the set of columns that they are subscribed
> > to (which is relatively rare compared to when the subscribed table itself
> > ticks).
> >
> > That said, schema evolution is not yet particularly high in our queue.
> >
> > On Tue, Apr 13, 2021 at 9:12 AM David Li  wrote:
> >
> > > Thanks for the details. I'll note a few things, but adding schema
> > > evolution to Flight is reasonable, if you'd like to put together a
> > > proposal for discussion (possibly in coordination with the
> > > Deephaven/Barrage team, if they're also still interested).
> > >
> > > >3. Assume that there is a strong reason to query A1,..,AK together.
> > >
> > > While I don't know the details here, at least with Flight/gRPC, it's
> > > not necessarily expensive to make several requests to the same server,
> > > as gRPC will consolidate them into the same underlying network
> > > connection. You could issue one GetFlightInfo request for all streams
> > > at once, and get back a list of endpoints for each individual
> > > subquery, which you could then issue separate DoGet requests for.
> > >
> > > There's a slight mismatch there in that GetFlightInfo returns a
> > > FlightInfo, which assumes all endpoints have the same schema. But for
> > > a specific application, you could ignore that field (nothing in Flight
> > > checks that schema against the actual data).
> > >
> > > Of course, if said strong reason is that all the data is really
> > > retrieved together despite being distinct datasets, then this would
> > > complicate the server side implementation quite a bit. But it's one
> > > option.
> > >
> > > > A potential way to address this (with the existing tools) could be
> > > > having a union schema of all fields across all entities (potentially
> > > > prefixed with the field name just like in sql joins) and setting the
> > > > values to NA which do not belong to an entity.
> > >
> > > I had a similar use case in the past, and it was suggested to use
> > > Arrow's Union type which handles this directly. A Union of Struct
> > > types essentially lets you have multiple distinct schemas all encoded
> > > in the same overall table, with explicit information about which
> > > schema is currently in use. But as you point out this isn't helpful if
> > > you don't know all the schemas up front.
> > >
> > > Best,
> > > David
> > >
> > > On 2021/04/13 11:21:20, Gosh Arzumanyan  wrote:
> > > > Hi David,
> > > >
> > > > Thanks for sharing the link!
> > > >
> > > > Here is how a potential use case might look:
> > > >
> > > >    1. Assume that we have a service S which accepts expressions in some
> > > >    language X.
> > > >    2. Assume that a typical query to this service requests entities A_1,
> > > >    A_2,..,A_K. Each of those entities generates a stream of record
> > > >    batches. Record batches for a single A_I share the same schema, yet
> > > >    there is no guarantee that schemas are equal across all streams.
> > > >    3. Assume that there is a strong reason to query A1,..,AK together.
> > > >    4. Service generates record batches (concurrently), tags those (e.g.
> > > >    with schema level metadata) and sends them over.
> > > >
> > > > A potential way to address this (with the existing tools) could be
> > > > having a union schema of all fields across all entities (potentially
> > > > prefixed with the field name just like in sql joins) and setting the
> > > > values to NA which do not belong to an entity.

Re: Future of Rust sync call

2021-06-18 Thread Wes McKinney
Sounds great! Thanks

On Fri, Jun 18, 2021 at 9:29 AM Jorge Cardoso Leitão
 wrote:
>
> Hi Wes,
>
> Yes, on ASF Slack, #arrow-rust. Andy advertised it here some time ago.
>
> Most relevant topics there end up either as a github issue or over
> this mailing list. On this note, hat tip to Andrew, who has been doing
> a lot of the curation.
>
> There are other informal discussions, more about Rust lang
> developments, projects using DataFusion, practical questions, etc.
> Being on both Slack and Ursa Labs' Zulip, I would say they are at the
> same level: some initial discussion over some idea => move it to issue
> tracker / mailing list.
>
> Best,
> Jorge
>
>
>
>
>
>
> On Fri, Jun 18, 2021 at 4:07 PM Wes McKinney  wrote:
> >
> > hi Jorge — there is a Rust Slack channel? On that, I would just say to
> > be vigilant about what communication takes place there (since Slack is
> > semi-private) versus on channels that are being archived / mirrored to
> > mailing lists. It's useful for coordination and quick questions but
> > not a place to make make decisions.
> >
> > Thanks,
> > Wes
> >
> > On Fri, Jun 18, 2021 at 12:14 AM Jorge Cardoso Leitão
> >  wrote:
> > >
> > > Hi,
> > >
> > > I agree that the communication improved a lot with moving the issues
> > > to Github and slack, which made the sync call less relevant.
> > >
> > > Best,
> > > Jorge
> > >
> > >
> > > On Thu, Jun 17, 2021 at 11:55 PM Andrew Lamb  wrote:
> > > >
> > > > I think dropping the Rust sync call and using the regular Arrow
> > > > sync call, should that be necessary, is a good idea.
> > > >
> > > > Andrew
> > > >
> > > >
> > > > On Thu, Jun 17, 2021 at 12:27 PM Andy Grove  
> > > > wrote:
> > > >
> > > > > I would like to propose canceling the bi-weekly Rust sync call, at 
> > > > > least in
> > > > > its current form.
> > > > >
> > > > > The call has not been very active since we moved to the new GitHub
> > > > > repositories and implemented some changes to the development process. 
> > > > > It
> > > > > seems that the Rust community is communicating well without the need 
> > > > > for
> > > > > this Rust-specific sync call, and we can always join the regular 
> > > > > Arrow sync
> > > > > call if there are issues that need discussing.
> > > > >
> > > > > If there are no objections, I will create a PR soon to remove 
> > > > > references
> > > > > for this call from our documentation.
> > > > >
> > > > > Andy.
> > > > >


Re: [C++] Apache Arrow C++ Variadic Kernels Design

2021-06-18 Thread Wes McKinney
hi Ian — I agree with implementing these functions with
varargs/variadic inputs (this was my original intent when drafting
compute/kernel.h and related machinery last year).

As one nuance with the way that things work right now, the type
matching infrastructure isn't necessarily able to determine whether
varargs inputs are compatible with each other — the type matching rule
considers each argument independently

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.cc#L411

There may be other places to address this type correspondence, but if
it were deemed useful to better support variadic argument validation,
we could probably fairly easily generalize the TypeMatcher API so that
it can "see" the types of the other arguments.

Thanks,
Wes

On Thu, Jun 17, 2021 at 5:59 PM Ian Cook  wrote:
>
> Arrow developers,
>
> A couple of recent PRs have added new variadic scalar kernels to the
> Arrow C++ library (ARROW-12751, ARROW-12709). There were some
> questions raised in comments on Jira and GitHub about whether these
> could instead be implemented as unary or binary kernels that take
> ListArray or StructArray input. Since I believe we plan to add at
> least a few more variadic kernels, I wrote a document [1] with help
> from some colleagues at Ursa to describe the rationale behind why we
> believe it is best to implement these as variadic kernels. Feedback is
> welcome.
>
> Thank you,
> Ian
>
> [1] 
> https://docs.google.com/document/d/1ExysJ43WpjZ_P6vnfx6dzCRSvM-3qlqpc6gPjy9cNXM/


Re: Future of Rust sync call

2021-06-18 Thread Jorge Cardoso Leitão
Hi Wes,

Yes, on ASF Slack, #arrow-rust. Andy advertised it here some time ago.

Most relevant topics there end up either as a github issue or over
this mailing list. On this note, hat tip to Andrew, who has been doing
a lot of the curation.

There are other informal discussions, more about Rust lang
developments, projects using DataFusion, practical questions, etc.
Being on both Slack and Ursa Labs' Zulip, I would say they are at the
same level: some initial discussion over some idea => move it to issue
tracker / mailing list.

Best,
Jorge






On Fri, Jun 18, 2021 at 4:07 PM Wes McKinney  wrote:
>
> hi Jorge — there is a Rust Slack channel? On that, I would just say to
> be vigilant about what communication takes place there (since Slack is
> semi-private) versus on channels that are being archived / mirrored to
> mailing lists. It's useful for coordination and quick questions but
> not a place to make decisions.
>
> Thanks,
> Wes
>
> On Fri, Jun 18, 2021 at 12:14 AM Jorge Cardoso Leitão
>  wrote:
> >
> > Hi,
> >
> > I agree that the communication improved a lot with moving the issues
> > to Github and slack, which made the sync call less relevant.
> >
> > Best,
> > Jorge
> >
> >
> > On Thu, Jun 17, 2021 at 11:55 PM Andrew Lamb  wrote:
> > >
> > > I think dropping the Rust sync call and using the regular Arrow
> > > sync call, should that be necessary, is a good idea.
> > >
> > > Andrew
> > >
> > >
> > > On Thu, Jun 17, 2021 at 12:27 PM Andy Grove  wrote:
> > >
> > > > I would like to propose canceling the bi-weekly Rust sync call, at 
> > > > least in
> > > > its current form.
> > > >
> > > > The call has not been very active since we moved to the new GitHub
> > > > repositories and implemented some changes to the development process. It
> > > > seems that the Rust community is communicating well without the need for
> > > > this Rust-specific sync call, and we can always join the regular Arrow 
> > > > sync
> > > > call if there are issues that need discussing.
> > > >
> > > > If there are no objections, I will create a PR soon to remove references
> > > > for this call from our documentation.
> > > >
> > > > Andy.
> > > >


Re: [Question] Rationale for offsets instead of deltas

2021-06-18 Thread Wes McKinney
On Fri, Jun 18, 2021 at 1:12 AM Micah Kornfield  wrote:
>
> >
> > Is it to ensure O(1) random access (instead of having to sum all
> > deltas up to the index)?
>
>
> This is my understanding of why it was chosen.

Yes, that's the reason. For example, certain columnar query processing
patterns (e.g. selection vectors) depend on random access. We made the
stipulation that all Arrow data types would support O(1) random access
to broaden use cases.
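The O(1) difference can be seen in a tiny plain-Python sketch using the example arrays from the question (null handling via the validity bitmap is omitted):

```python
offsets = [0, 1, 1, 1, 3]   # offsets encoding of ["a", "", None, "ab"]
lengths = [1, 0, 0, 2]      # lengths encoding of the same values
values = "aab"

def get_by_offset(i):
    # Two lookups and a slice: O(1) regardless of i.
    return values[offsets[i]:offsets[i + 1]]

def get_by_length(i):
    # Must sum all preceding lengths just to find the start: O(i).
    start = sum(lengths[:i])
    return values[start:start + lengths[i]]

assert [get_by_offset(i) for i in range(4)] == ["a", "", "", "ab"]
assert [get_by_length(i) for i in range(4)] == ["a", "", "", "ab"]
```

A slice of an offsets-based array also stays O(1): it just narrows the window into the offsets buffer, whereas a lengths-based slice would still need a prefix sum to locate its first value.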

>
> On Thu, Jun 17, 2021 at 10:32 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > (this has no direction; I am just genuinely curious)
> >
> > I am wondering, what is the rationale for using "offsets" instead of
> > "lengths" to represent variable sized arrays?
> >
> > I.e. ["a", "", None, "ab"] is represented as
> >
> > offsets: [0, 1, 1, 1, 3]
> > values: "aab"
> >
> > what is the reasoning to use this over
> >
> > lengths: [1, 0, 0, 2]
> > values: "aab"
> >
> > I am asking this because I have seen people using the LargeUtf8 type,
> > or breaking record batches into chunks, to avoid hitting the i32 ceiling
> > with large arrays of strings.
> >
> > Is it to ensure O(1) random access (instead of having to sum all
> > deltas up to the index)?
> >
> > Best,
> > Jorge
> >


Re: Future of Rust sync call

2021-06-18 Thread Wes McKinney
hi Jorge — there is a Rust Slack channel? On that, I would just say to
be vigilant about what communication takes place there (since Slack is
semi-private) versus on channels that are being archived / mirrored to
mailing lists. It's useful for coordination and quick questions but
not a place to make decisions. It's useful for coordination and quick questions but

Thanks,
Wes

On Fri, Jun 18, 2021 at 12:14 AM Jorge Cardoso Leitão
 wrote:
>
> Hi,
>
> I agree that the communication improved a lot with moving the issues
> to Github and slack, which made the sync call less relevant.
>
> Best,
> Jorge
>
>
> On Thu, Jun 17, 2021 at 11:55 PM Andrew Lamb  wrote:
> >
> > I think dropping the Rust sync call and using the regular Arrow
> > sync call, should that be necessary, is a good idea.
> >
> > Andrew
> >
> >
> > On Thu, Jun 17, 2021 at 12:27 PM Andy Grove  wrote:
> >
> > > I would like to propose canceling the bi-weekly Rust sync call, at least 
> > > in
> > > its current form.
> > >
> > > The call has not been very active since we moved to the new GitHub
> > > repositories and implemented some changes to the development process. It
> > > seems that the Rust community is communicating well without the need for
> > > this Rust-specific sync call, and we can always join the regular Arrow 
> > > sync
> > > call if there are issues that need discussing.
> > >
> > > If there are no objections, I will create a PR soon to remove references
> > > for this call from our documentation.
> > >
> > > Andy.
> > >


Re: [Question] Rationale for offsets instead of deltas

2021-06-18 Thread Micah Kornfield
>
> Is it to ensure O(1) random access (instead of having to sum all
> deltas up to the index)?


This is my understanding of why it was chosen.

On Thu, Jun 17, 2021 at 10:32 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> (this has no direction; I am just genuinely curious)
>
> I am wondering, what is the rationale for using "offsets" instead of
> "lengths" to represent variable sized arrays?
>
> I.e. ["a", "", None, "ab"] is represented as
>
> offsets: [0, 1, 1, 1, 3]
> values: "aab"
>
> what is the reasoning to use this over
>
> lengths: [1, 0, 0, 2]
> values: "aab"
>
> I am asking this because I have seen people using the LargeUtf8 type,
> or breaking record batches into chunks, to avoid hitting the i32 ceiling
> with large arrays of strings.
>
> Is it to ensure O(1) random access (instead of having to sum all
> deltas up to the index)?
>
> Best,
> Jorge
>