Re: Graph model in arrow

2020-11-18 Thread Fan Liya
Hi Leo,

For the graph data model, I can think of two popular representations:
1) adjacency matrix: an n x n matrix A (where n is the number of vertices),
where Aij = 1 indicates an arc from vertex i to vertex j.
2) adjacency list: a table of head nodes, one per vertex, with a list per
vertex storing its arcs.

For adjacency matrices, it should be easy to use the existing Arrow library:
a matrix is just a vector of vectors.

For adjacency lists, it should be straightforward to represent the head-node
table and the edge lists as Arrow vectors.
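
For example, here is a minimal sketch of the adjacency-list idea, storing the
edge lists as a single list<int32> array with one list entry per vertex (the
choice of list<int32> and the function name are just illustrative, not an
established convention):

#include <arrow/api.h>
#include <memory>

// Adjacency list for 3 vertices: 0 -> {1, 2}, 1 -> {2}, 2 -> {} as list<int32>.
arrow::Status BuildAdjacencyList(std::shared_ptr<arrow::Array>* out) {
  auto pool = arrow::default_memory_pool();
  auto values = std::make_shared<arrow::Int32Builder>(pool);
  arrow::ListBuilder lists(pool, values);

  ARROW_RETURN_NOT_OK(lists.Append());                 // vertex 0
  ARROW_RETURN_NOT_OK(values->AppendValues({1, 2}));
  ARROW_RETURN_NOT_OK(lists.Append());                 // vertex 1
  ARROW_RETURN_NOT_OK(values->AppendValues({2}));
  ARROW_RETURN_NOT_OK(lists.Append());                 // vertex 2 (no arcs)

  return lists.Finish(out);
}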

Best,
Liya Fan


On Wed, Nov 18, 2020 at 12:03 PM Leonidus Bhai 
wrote:

> Hi,
>
> I am thinking of building out a Query system using Arrow. I have a graph
> data model with objects having bidirectional relationships to each other.
> Objects are persisted in an OLTP system with normalized schema. Queries are
> scan-like queries across multiple object types having nested relationships.
>
> Looked at the Tensor types, but it's still unclear how to model relationships
> with Arrow types.
>
> Any thoughts? ideas? Thanks
>
> Leo
>


Re: [Discuss] Arrow Release Schedule

2020-11-18 Thread Wes McKinney
From searching for "java maven" in our Jira issues:

* https://issues.apache.org/jira/browse/ARROW-6103
* https://issues.apache.org/jira/browse/ARROW-1234

I just created

https://issues.apache.org/jira/browse/ARROW-10648

On Wed, Nov 18, 2020 at 3:40 PM Keerat Singh  wrote:
>
> Hi Neal,
>
> Do you have any information on the status of the tickets?
>
> Regards,
> Keerat
>
> On Mon, Nov 16, 2020 at 11:56 AM Keerat Singh  wrote:
>
> > Thank you, Kou and Wes, for your responses.
> >
> > As per the discussions in the last sync call [11-Nov], there was talk about
> > releasing more frequently, and help is needed with the build process.
> > There was also a discussion on creating specific tickets on which help is
> > needed from the community.
> > I just wanted to confirm whether those tickets were created; if not, who is
> > the best person to create those tickets?
> >
> > Regards,
> > Keerat
> >
> >
> > On Tue, Nov 10, 2020 at 7:53 PM Wes McKinney  wrote:
> >
> > > +1 to everything that Kou said.
> > >
> > > On Tue, Nov 10, 2020 at 8:53 PM Sutou Kouhei  wrote:
> > >
> > > > Hi,
> > > >
> > > > I agree with Wes. We can automate more release-related
> > > > tasks. I hope that we produce and verify release artifacts
> > > > nightly.
> > > >
> > > > See also:
> > > > [DISCUSS] Release cadence and release vote conventions
> > > >
> > > >
> > >
> > https://lists.apache.org/thread.html/5e93b0d79a5d3a31cee6f2100c94221de72cb4d5acb1d92b8681e9a6%40%3Cdev.arrow.apache.org%3E
> > > >
> > > > I think that the main blocker is the Java build system
> > > > mentioned by Wes in this thread too. For example, it
> > > > requires tagging in the release process, which is not suitable
> > > > for building nightly release artifacts.
> > > > (See the above thread for details. I described this more.)
> > > >
> > > >
> > > > Keeping all nightly builds green is also very helpful.
> > > > We're receiving a "[NIGHTLY]" report every day from dev@.
> > > > See also:
> > > >
> > >
> > https://lists.apache.org/list.html?dev@arrow.apache.org:lte=1M:%5BNIGHTLY%5D
> > > >
> > > > There are some failures every day.
> > > > If we have any failures, we can't produce release artifacts.
> > > > In recent releases, we have fixed these failures only when we
> > > > release a new version. If we keep all nightly builds green, we
> > > > will be able to release a new version quickly whenever we want to.
> > > >
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> > > >
> > > > In  > >
> > > >   "Re: [Discuss] Arrow Release Schedule" on Tue, 10 Nov 2020 18:23:16
> > > > -0600,
> > > >   Wes McKinney  wrote:
> > > >
> > > > > We do need a PMC member to sign the release artifacts. Aside from
> > > > > that, IMHO there is a lot that can be done to improve the automation
> > > > > around producing the release artifacts and preparing the release
> > > > > branch.
> > > > >
> > > > > As Krisztian can attest, producing a release currently requires a
> > > > > _lot_ of human time (and time is money). Now that we've gone through
> > > > > this process to produce 28 major and patch releases, I think it is
> > > > > time (and probably well overdue) to improve the release artifact
> > > > > "stamping" process to be more fully automated so that all that's
> > > > > required of a PMC member is to obtain the staged artifacts from a
> > > > > secure location, sign them, and then push them to ASF dist.
> > > > >
> > > > > On Tue, Nov 10, 2020 at 3:47 PM Keerat Singh <
> > keer...@bitquilltech.com
> > > >
> > > > wrote:
> > > > >>
> > > > >> Hi Wes,
> > > > >>
> > > > >> Is it only the PMC members that can volunteer to drive this or can
> > > > someone from the community volunteer and drive as well if they desire
> > to
> > > > have a release sooner?
> > > > >>
> > > > >> I see that the release process has a fairly comprehensive checklist
> > of
> > > > tasks here(
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
> > > ),
> > > > but there are certain requirements, which not all community members
> > will
> > > be
> > > > able to satisfy, e.g.:
> > > > >>
> > > > >> Being a committer to be able to push to dist and maven repository
> > > > >> A GPG key in the Apache Web of Trust (cross-signed by other Apache
> > > > committers/PMC members) to sign the artifacts
> > > > >>
> > > > >> What parts of the release could a community volunteer help with,
> > given
> > > > they are not able to satisfy certain release requirements?
> > > > >>
> > > > >> Regards,
> > > > >> Keerat
> > > > >>
> > > > >> On Tue, Nov 3, 2020 at 5:27 AM Wes McKinney 
> > > > wrote:
> > > > >>>
> > > > >>> I think to release more often, a few things are necessary:
> > > > >>>
> > > > >>> - Other organizations / PMC members must volunteer more time to
> > drive
> > > > >>> releases and the process around them. My team (and Krisztian in
> > > > particular)
> > > > >>> together with Kou and Uwe have done the majority of this work the
> > > last
> > > > >>> couple of years.
> > > > >>>
> > > > >>> 

Re: [Discuss] Arrow Release Schedule

2020-11-18 Thread Keerat Singh
Hi Neal,

Do you have any information on the status of the tickets?

Regards,
Keerat

On Mon, Nov 16, 2020 at 11:56 AM Keerat Singh  wrote:

> Thank you, Kou and Wes, for your responses.
>
> As per the discussions in the last sync call [11-Nov], there was talk about
> releasing more frequently, and help is needed with the build process.
> There was also a discussion on creating specific tickets on which help is
> needed from the community.
> I just wanted to confirm whether those tickets were created; if not, who is
> the best person to create those tickets?
>
> Regards,
> Keerat
>
>
> On Tue, Nov 10, 2020 at 7:53 PM Wes McKinney  wrote:
>
> > +1 to everything that Kou said.
> >
> > On Tue, Nov 10, 2020 at 8:53 PM Sutou Kouhei  wrote:
> >
> > > Hi,
> > >
> > > I agree with Wes. We can automate more release-related
> > > tasks. I hope that we produce and verify release artifacts
> > > nightly.
> > >
> > > See also:
> > > [DISCUSS] Release cadence and release vote conventions
> > >
> > >
> >
> https://lists.apache.org/thread.html/5e93b0d79a5d3a31cee6f2100c94221de72cb4d5acb1d92b8681e9a6%40%3Cdev.arrow.apache.org%3E
> > >
> > > I think that the main blocker is the Java build system
> > > mentioned by Wes in this thread too. For example, it
> > > requires tagging in the release process, which is not suitable
> > > for building nightly release artifacts.
> > > (See the above thread for details. I described this more.)
> > >
> > >
> > > Keeping all nightly builds green is also very helpful.
> > > We're receiving a "[NIGHTLY]" report every day from dev@.
> > > See also:
> > >
> >
> https://lists.apache.org/list.html?dev@arrow.apache.org:lte=1M:%5BNIGHTLY%5D
> > >
> > > There are some failures every day.
> > > If we have any failures, we can't produce release artifacts.
> > > In recent releases, we have fixed these failures only when we
> > > release a new version. If we keep all nightly builds green, we
> > > will be able to release a new version quickly whenever we want to.
> > >
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > > In  >
> > >   "Re: [Discuss] Arrow Release Schedule" on Tue, 10 Nov 2020 18:23:16
> > > -0600,
> > >   Wes McKinney  wrote:
> > >
> > > > We do need a PMC member to sign the release artifacts. Aside from
> > > > that, IMHO there is a lot that can be done to improve the automation
> > > > around producing the release artifacts and preparing the release
> > > > branch.
> > > >
> > > > As Krisztian can attest, producing a release currently requires a
> > > > _lot_ of human time (and time is money). Now that we've gone through
> > > > this process to produce 28 major and patch releases, I think it is
> > > > time (and probably well overdue) to improve the release artifact
> > > > "stamping" process to be more fully automated so that all that's
> > > > required of a PMC member is to obtain the staged artifacts from a
> > > > secure location, sign them, and then push them to ASF dist.
> > > >
> > > > On Tue, Nov 10, 2020 at 3:47 PM Keerat Singh <
> keer...@bitquilltech.com
> > >
> > > wrote:
> > > >>
> > > >> Hi Wes,
> > > >>
> > > >> Is it only the PMC members that can volunteer to drive this or can
> > > someone from the community volunteer and drive as well if they desire
> to
> > > have a release sooner?
> > > >>
> > > >> I see that the release process has a fairly comprehensive checklist
> of
> > > tasks here(
> > >
> >
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
> > ),
> > > but there are certain requirements, which not all community members
> will
> > be
> > > able to satisfy, e.g.:
> > > >>
> > > >> Being a committer to be able to push to dist and maven repository
> > > >> A GPG key in the Apache Web of Trust (cross-signed by other Apache
> > > committers/PMC members) to sign the artifacts
> > > >>
> > > >> What parts of the release could a community volunteer help with,
> given
> > > they are not able to satisfy certain release requirements?
> > > >>
> > > >> Regards,
> > > >> Keerat
> > > >>
> > > >> On Tue, Nov 3, 2020 at 5:27 AM Wes McKinney 
> > > wrote:
> > > >>>
> > > >>> I think to release more often, a few things are necessary:
> > > >>>
> > > >>> - Other organizations / PMC members must volunteer more time to
> drive
> > > >>> releases and the process around them. My team (and Krisztian in
> > > particular)
> > > >>> together with Kou and Uwe have done the majority of this work the
> > last
> > > >>> couple of years.
> > > >>>
> > > >>> - Some investments in improving the release tooling to be more
> > > automated
> > > >>> and less error prone must be made. We’ve talked for example about
> > > tearing
> > > >>> out the Maven release machinery for Java, that would be a
> significant
> > > >>> benefit.
> > > >>>
> > > >>> On Tue, Nov 3, 2020 at 12:45 AM Micah Kornfield <
> > emkornfi...@gmail.com
> > > >
> > > >>> wrote:
> > > >>>
> > > >>> > >
> > > >>> > > Are there any plans for a more frequent release cadence?
> > > >>> >
> > > >>> > Not to my knowledge.  The release 

Re: [C++] 0x00 in Binary type

2020-11-18 Thread Micah Kornfield
+1 to what Francois said. For this case, you want to use either the overload
that takes an explicit length or the one that takes a string_view:
https://github.com/apache/arrow/blob/843e8bb556a03f0e4c18841a623d1a0e9c236ee5/cpp/src/arrow/array/builder_binary.h#L72
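
As a minimal sketch of that fix (my own code, inside a function returning
arrow::Status; it is not taken from the linked header), appending bytes that
contain an embedded 0x00 via the (pointer, length) overload:

#include <arrow/api.h>
#include <memory>

// Append 4 bytes starting with 0x00 using an explicit length, so that no
// strlen-based truncation can occur.
arrow::Status AppendWithEmbeddedNul(std::shared_ptr<arrow::Array>* out) {
  arrow::BinaryBuilder builder;
  const uint8_t raw[] = {0x00, 0x01, 0xbf, 0x5b};
  ARROW_RETURN_NOT_OK(builder.Append(raw, sizeof(raw)));  // explicit length
  return builder.Finish(out);
}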

On Wed, Nov 18, 2020 at 11:05 AM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> I would say at first sight that it's due to your usage of char[] and
> builder.Append(d) implicitly does a strlen.
>
> François
>
> On Wed, Nov 18, 2020 at 2:00 PM Ying Zhou  wrote:
> >
> > Sure!
> >
> > BinaryBuilder builder;
> > char d[] = "\x00\x01\xbf\x5b";
> > (void)(builder.Append(d));
> > std::shared_ptr<Array> array;
> > (void)(builder.Finish(&array));
> > int32_t dataLength = 0;
> > auto aarray = std::static_pointer_cast<BinaryArray>(array);
> > const uint8_t* data = aarray->GetValue(0, &dataLength);
> > data = aarray->GetValue(3, &dataLength);
> > RecordProperty("l3", dataLength);
> > RecordProperty("30", data[0]);
> > RecordProperty("31", data[1]);
> > RecordProperty("32", data[2]);
> > RecordProperty("33", data[3]);
> >
> > We need Google Test to use RecordProperty. dataLength is 0 instead of 4
> and data[i] are 255, 0, 0 and 0 respectively.
> >
> > My JIRA ID is yingzhou474.
> >
> >
> > > On Nov 18, 2020, at 1:49 PM, Antoine Pitrou 
> wrote:
> > >
> > >
> > > Hello,
> > >
> > > Le 18/11/2020 à 19:06, Ying Zhou a écrit :
> > >>
> > >> According to the documentation BINARY is "Variable-length bytes (no
> guarantee of UTF8-ness)”. However in practice if I embed 0x00 in the middle
> of a char array and Append it to a BinaryBuilder the 0x00 is converted to
> 0xff, everything after it is not appended and the length is computed as if
> the 0x00 and everything after it don’t exist (i.e. standard STRING
> behavior).
> > >
> > > Can you post some code showing how you build the array?
> > >
> > >> P.S. Please allow me to assign Jira tickets to myself. Really thanks!
> > >
> > > What is your JIRA id?
> > >
> > > Regards
> > >
> > > Antoine.
> >
>


Re: [C++] 0x00 in Binary type

2020-11-18 Thread Francois Saint-Jacques
I would say at first sight that it's due to your use of char[]:
builder.Append(d) implicitly does a strlen.
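
To illustrate (a standalone sketch of my own, not code from the Arrow
sources): a strlen, or the implicit string_view conversion, stops at the
first NUL byte, so the appended value looks empty:

#include <cstdio>
#include <cstring>

int main() {
  char d[] = "\x00\x01\xbf\x5b";
  // strlen stops at the leading 0x00, while sizeof(d) is 5
  // (the four bytes plus the trailing '\0' of the string literal).
  std::printf("strlen(d) = %zu, sizeof(d) = %zu\n", std::strlen(d), sizeof(d));
  return 0;
}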

François

On Wed, Nov 18, 2020 at 2:00 PM Ying Zhou  wrote:
>
> Sure!
>
> BinaryBuilder builder;
> char d[] = "\x00\x01\xbf\x5b";
> (void)(builder.Append(d));
> std::shared_ptr<Array> array;
> (void)(builder.Finish(&array));
> int32_t dataLength = 0;
> auto aarray = std::static_pointer_cast<BinaryArray>(array);
> const uint8_t* data = aarray->GetValue(0, &dataLength);
> data = aarray->GetValue(3, &dataLength);
> RecordProperty("l3", dataLength);
> RecordProperty("30", data[0]);
> RecordProperty("31", data[1]);
> RecordProperty("32", data[2]);
> RecordProperty("33", data[3]);
>
> We need Google Test to use RecordProperty. dataLength is 0 instead of 4 and 
> data[i] are 255, 0, 0 and 0 respectively.
>
> My JIRA ID is yingzhou474.
>
>
> > On Nov 18, 2020, at 1:49 PM, Antoine Pitrou  wrote:
> >
> >
> > Hello,
> >
> > Le 18/11/2020 à 19:06, Ying Zhou a écrit :
> >>
> >> According to the documentation BINARY is "Variable-length bytes (no 
> >> guarantee of UTF8-ness)”. However in practice if I embed 0x00 in the 
> >> middle of a char array and Append it to a BinaryBuilder the 0x00 is 
> >> converted to 0xff, everything after it is not appended and the length is 
> >> computed as if the 0x00 and everything after it don’t exist (i.e. standard 
> >> STRING behavior).
> >
> > Can you post some code showing how you build the array?
> >
> >> P.S. Please allow me to assign Jira tickets to myself. Really thanks!
> >
> > What is your JIRA id?
> >
> > Regards
> >
> > Antoine.
>


Re: [C++] 0x00 in Binary type

2020-11-18 Thread Ying Zhou
Sure!

BinaryBuilder builder;
char d[] = "\x00\x01\xbf\x5b";
(void)(builder.Append(d));
std::shared_ptr<Array> array;
(void)(builder.Finish(&array));
int32_t dataLength = 0;
auto aarray = std::static_pointer_cast<BinaryArray>(array);
const uint8_t* data = aarray->GetValue(0, &dataLength);
data = aarray->GetValue(3, &dataLength);
RecordProperty("l3", dataLength);
RecordProperty("30", data[0]);
RecordProperty("31", data[1]);
RecordProperty("32", data[2]);
RecordProperty("33", data[3]);

We need Google Test to use RecordProperty. dataLength is 0 instead of 4 and 
data[i] are 255, 0, 0 and 0 respectively. 

My JIRA ID is yingzhou474.


> On Nov 18, 2020, at 1:49 PM, Antoine Pitrou  wrote:
> 
> 
> Hello,
> 
> Le 18/11/2020 à 19:06, Ying Zhou a écrit :
>> 
>> According to the documentation BINARY is "Variable-length bytes (no 
>> guarantee of UTF8-ness)”. However in practice if I embed 0x00 in the middle 
>> of a char array and Append it to a BinaryBuilder the 0x00 is converted to 
>> 0xff, everything after it is not appended and the length is computed as if 
>> the 0x00 and everything after it don’t exist (i.e. standard STRING behavior).
> 
> Can you post some code showing how you build the array?
> 
>> P.S. Please allow me to assign Jira tickets to myself. Really thanks!
> 
> What is your JIRA id?
> 
> Regards
> 
> Antoine.



Re: [C++] 0x00 in Binary type

2020-11-18 Thread Antoine Pitrou


Hello,

Le 18/11/2020 à 19:06, Ying Zhou a écrit :
> 
> According to the documentation BINARY is "Variable-length bytes (no guarantee 
> of UTF8-ness)”. However in practice if I embed 0x00 in the middle of a char 
> array and Append it to a BinaryBuilder the 0x00 is converted to 0xff, 
> everything after it is not appended and the length is computed as if the 0x00 
> and everything after it don’t exist (i.e. standard STRING behavior).

Can you post some code showing how you build the array?

> P.S. Please allow me to assign Jira tickets to myself. Really thanks!

What is your JIRA id?

Regards

Antoine.


[C++] 0x00 in Binary type

2020-11-18 Thread Ying Zhou
Hello,

According to the documentation, BINARY is "Variable-length bytes (no guarantee
of UTF8-ness)". However, in practice, if I embed 0x00 in the middle of a char
array and Append it to a BinaryBuilder, the 0x00 is converted to 0xff,
everything after it is not appended, and the length is computed as if the 0x00
and everything after it don't exist (i.e. standard STRING behavior). I would
like to know whether this is intended. If it is, then we should change the
documentation and explicitly state that 0x00 is not allowed. Otherwise we need
to change the implementation to allow it.

Thanks,
Ying

P.S. Please allow me to assign Jira tickets to myself. Really thanks!

Re: Using arrow/compute/kernels/*internal.h headers

2020-11-18 Thread Wes McKinney
On #2, I think this discussion might be overly speculative until a
full-fledged multithreaded hash aggregation is implemented in the
Apache Arrow C++ library. There are other analytic database systems in
the wild which might provide a blueprint for the way that we should
approach this, and I don't know that it should be constrained by the
particular details of the existing kernel machinery we have
implemented.

To be clear, it would be great to see this work happen in the Apache
project rather than in third party projects.

On Wed, Nov 18, 2020 at 10:20 AM Benjamin Kietzman  wrote:
>
> 1: Excellent!
>
> 2: The open JIRA for grouped aggregation is
> https://issues.apache.org/jira/browse/ARROW-4124
> (though it's out of date since it predates the addition of
> ScalarAggregateKernel).
> To summarize: for *grouped* aggregation we want the kernel to do the work
> of evaluating group
> condition(s) and the corresponding KernelState should include any hash
> tables and a columnar
> (for mean, this would be an array of struct) store of
> pre-finalized results.
> Pushing this into the kernel yields significant perf wins since layout of
> state can be controlled
> more freely and we avoid talking across a virtual/type erased interface
> inside a hot loop.
> A sketch of how ARROW-4124 could be resolved within the
> compute::{Function,Kernel} interface:
> - add compute functions like "grouped_mean" which are binary.
> - the first argument is the array to aggregate while the second is the
> grouping condition
> - compute kernels for these functions can probably still be
> ScalarAggregateKernels (though after
>   this we may need to rename them); the interface to implement would be
> ScalarAggregator, which
>   is in compute/kernels/aggregate_internal.h
> - The new ScalarAggregator would be able to reuse existing arrow internals
> such as our HashTable
>   and could store pre-finalized results in a mutable ArrayData (which would
> make finalizing pretty
>   trivial).
>
> This returns to your original question about how to get access to arrow
> internals like ScalarAggregator
> outside the arrow repo. However, hopefully after this discussion it seems
> feasible and preferable to
> add the functionality you need upstream.
>
> On Tue, Nov 17, 2020 at 9:58 PM Niranda Perera 
> wrote:
>
> > 1. This is great. I will follow this JIRA. (better yet, I'll see if I can
> > make that contribution)
> >
> > 2. If we forget about the multithreading case for a moment, this
> > requirement came up while implementing a "groupby + aggregation" operation
> > (single-threaded). Let's assume that a table is not sorted. So, the
> > simplest approach would be to keep the intermediate state in a container
> > (map/vector) and update the state while traversing the table. This approach
> > becomes important when there is a large number of groups and there aren't
> > enough rows with the same key to use 'consume' vector aggregation (on a
> > sorted table).
> >
> >
> >
> > On Tue, Nov 17, 2020 at 10:54 AM Benjamin Kietzman 
> > wrote:
> >
> >> Hi Niranda,
> >>
> >> hastebin: That looks generally correct, though I should warn you that a
> >> recent PR
> >> ( https://github.com/apache/arrow/pull/8574 ) changed the return type of
> >> DispatchExact to Kernel so you'll need to insert an explicit cast to
> >> ScalarAggregateKernel.
> >>
> >> 1: This seems like a feature which might be generally useful to consumers
> >> of
> >> the compute module, so it's probably worth adding to the KernelState
> >> interface
> >> in some way rather than exposing individual implementations. I've opened
> >> https://issues.apache.org/jira/browse/ARROW-10630 to track this feature
> >>
> >> 2: I would not expect your container to contain a large number of
> >> KernelStates.
> >> Specifically: why would you need more than one per local thread?
> >> Additionally
> >> for the specific case of aggregation I'd expect that KernelStates not
> >> actively in
> >> use would be `merge`d and no longer stored. With small numbers of
> >> instances
> >> I would expect the memory overhead due to polymorphism to be negligible.
> >>
> >> On Mon, Nov 16, 2020 at 7:03 PM Niranda Perera 
> >> wrote:
> >>
> >>> Hi Ben and Wes,
> >>> Based on our discussion, I did the following.
> >>> https://hastebin.com/ajadonados.cpp
> >>>
> >>> It seems to be working fine. Would love to get your feedback on this!
> >>> :-)
> >>>
> >>> But I have a couple of concerns.
> >>> 1. Say I want to communicate the intermediate state data across multiple
> >>> processes. Unfortunately, KernelState struct does not expose the data
> >>> pointer to the outside. If say SumState is exposed, we could have accessed
> >>> that data, isn't it? WDYT?
> >>> 2. Polymorphism and virtual functions - Intuitively, a mean aggregation
> >>> intermediate state would be a {T, int64} tuple. But I believe the size of
> >>> struct "SumImpl : public ScalarAggregator (:public KernelState)" would be
> >>> sizeof(T) + sizeof(int64) + 

Re: Using arrow/compute/kernels/*internal.h headers

2020-11-18 Thread Benjamin Kietzman
1: Excellent!

2: The open JIRA for grouped aggregation is
https://issues.apache.org/jira/browse/ARROW-4124
(though it's out of date since it predates the addition of
ScalarAggregateKernel).
To summarize: for *grouped* aggregation we want the kernel to do the work
of evaluating group
condition(s) and the corresponding KernelState should include any hash
tables and a columnar
(for mean, this would be an array of struct) store of
pre-finalized results.
Pushing this into the kernel yields significant perf wins since layout of
state can be controlled
more freely and we avoid talking across a virtual/type erased interface
inside a hot loop.
A sketch of how ARROW-4124 could be resolved within the
compute::{Function,Kernel} interface:
- add compute functions like "grouped_mean" which are binary.
- the first argument is the array to aggregate while the second is the
grouping condition
- compute kernels for these functions can probably still be
ScalarAggregateKernels (though after
  this we may need to rename them); the interface to implement would be
ScalarAggregator, which
  is in compute/kernels/aggregate_internal.h
- The new ScalarAggregator would be able to reuse existing arrow internals
such as our HashTable
  and could store pre-finalized results in a mutable ArrayData (which would
make finalizing pretty
  trivial).
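
As a rough standalone illustration of that state layout (this is not the
actual arrow::compute KernelState / ScalarAggregator interface; the names and
signatures below are mine), a grouped-mean state keyed by group id could look
like:

#include <cstdint>
#include <unordered_map>
#include <vector>

struct GroupedMeanState {
  struct Acc { double sum = 0.0; int64_t count = 0; };
  std::unordered_map<int64_t, Acc> groups;  // group key -> pre-finalized state

  // Consume one batch: values[i] belongs to group keys[i].
  void Consume(const std::vector<double>& values,
               const std::vector<int64_t>& keys) {
    for (size_t i = 0; i < values.size(); ++i) {
      Acc& acc = groups[keys[i]];
      acc.sum += values[i];
      acc.count += 1;
    }
  }

  // Merge state produced by another thread or partition.
  void Merge(const GroupedMeanState& other) {
    for (const auto& kv : other.groups) {
      Acc& acc = groups[kv.first];
      acc.sum += kv.second.sum;
      acc.count += kv.second.count;
    }
  }

  // Finalize into per-group means.
  std::unordered_map<int64_t, double> Finalize() const {
    std::unordered_map<int64_t, double> means;
    for (const auto& kv : groups) {
      means[kv.first] = kv.second.sum / static_cast<double>(kv.second.count);
    }
    return means;
  }
};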

This returns to your original question about how to get access to arrow
internals like ScalarAggregator
outside the arrow repo. However, hopefully after this discussion it seems
feasible and preferable to
add the functionality you need upstream.

On Tue, Nov 17, 2020 at 9:58 PM Niranda Perera 
wrote:

> 1. This is great. I will follow this JIRA. (better yet, I'll see if I can
> make that contribution)
>
> 2. If we forget about the multithreading case for a moment, this
> requirement came up while implementing a "groupby + aggregation" operation
> (single-threaded). Let's assume that a table is not sorted. So, the
> simplest approach would be to keep the intermediate state in a container
> (map/vector) and update the state while traversing the table. This approach
> becomes important when there is a large number of groups and there aren't
> enough rows with the same key to use 'consume' vector aggregation (on a
> sorted table).
>
>
>
> On Tue, Nov 17, 2020 at 10:54 AM Benjamin Kietzman 
> wrote:
>
>> Hi Niranda,
>>
>> hastebin: That looks generally correct, though I should warn you that a
>> recent PR
>> ( https://github.com/apache/arrow/pull/8574 ) changed the return type of
>> DispatchExact to Kernel so you'll need to insert an explicit cast to
>> ScalarAggregateKernel.
>>
>> 1: This seems like a feature which might be generally useful to consumers
>> of
>> the compute module, so it's probably worth adding to the KernelState
>> interface
>> in some way rather than exposing individual implementations. I've opened
>> https://issues.apache.org/jira/browse/ARROW-10630 to track this feature
>>
>> 2: I would not expect your container to contain a large number of
>> KernelStates.
>> Specifically: why would you need more than one per local thread?
>> Additionally
>> for the specific case of aggregation I'd expect that KernelStates not
>> actively in
>> use would be `merge`d and no longer stored. With small numbers of
>> instances
>> I would expect the memory overhead due to polymorphism to be negligible.
>>
>> On Mon, Nov 16, 2020 at 7:03 PM Niranda Perera 
>> wrote:
>>
>>> Hi Ben and Wes,
>>> Based on our discussion, I did the following.
>>> https://hastebin.com/ajadonados.cpp
>>>
>>> It seems to be working fine. Would love to get your feedback on this!
>>> :-)
>>>
>>> But I have a couple of concerns.
>>> 1. Say I want to communicate the intermediate state data across multiple
>>> processes. Unfortunately, KernelState struct does not expose the data
>>> pointer to the outside. If say SumState is exposed, we could have accessed
>>> that data, isn't it? WDYT?
>>> 2. Polymorphism and virtual functions - Intuitively, a mean aggregation
>>> intermediate state would be a {T, int64} tuple. But I believe the size of
>>> struct "SumImpl : public ScalarAggregator (:public KernelState)" would be
>>> sizeof(T) + sizeof(int64) + sizeof(ScalarAggregator) + sizeof(vptr),
>>> isn't it? So, if I am managing a compute state container, this means that
>>> my memory requirement would be higher than simply using a {T, int64} tuple.
>>> Please correct me if I am wrong. I am not sure if there is a better
>>> solution to this, but just want to discuss it with you.
>>>
>>>
>>> On Tue, Nov 10, 2020 at 9:44 AM Wes McKinney 
>>> wrote:
>>>
 Yes, open a Jira and propose a PR implementing the changes you need

 On Mon, Nov 9, 2020 at 8:31 PM Niranda Perera 
 wrote:
 >
 > @wes How should I proceed with this nevertheless? should I open a
 JIRA?
 >
 > On Mon, Nov 9, 2020 at 11:09 AM Wes McKinney 
 wrote:
 >
 > > On Mon, Nov 9, 2020 at 9:32 AM Niranda Perera <
 

Re: Travis CI jobs gummed up on Arrow PRs?

2020-11-18 Thread Andrew Lamb
Thanks for the follow-up.

Reading between the lines, this sounds like it was simply a capacity issue
at Travis CI.

And indeed, when I looked at the original job that I reported as stuck, it
had indeed run (and failed, as the upstream revision was gone).

Thanks again for taking the time to look into this,
Andrew



On Wed, Nov 18, 2020 at 12:04 AM Kazuaki Ishizaki 
wrote:

> We got a response at
>
> https://travis-ci.community/t/s390x-jobs-are-stuck-in-the-received-state-for-days/10581/3?u=kiszk
> . Now, this problem has been solved.
>
> An interesting comment in the post is as follows:
> > If you would like to have an increased build capacity, we are happy to
> discuss the plans with you. Please send email to supp...@travis-ci.com.
>
> Regards,
> Kazuaki Ishizaki
>
>
>
>
> From:   Andrew Lamb 
> To: dev@arrow.apache.org
> Date:   2020/11/16 19:37
> Subject:[EXTERNAL] Re: Travis CI jobs gummed up on Arrow PRs?
>
>
>
> Thank you!
>
> On Sun, Nov 15, 2020 at 8:30 PM Kazuaki Ishizaki 
> wrote:
>
> > I have just reported this issue at the TravisCI forum.
> >
> >
> >
>
> https://travis-ci.community/t/s390x-jobs-have-not-been-almost-executed/10581
>
> >
> > Regards,
> > Kazuaki Ishizaki,
> >
> > Sutou Kouhei  wrote on 2020/11/16 10:02:18:
> >
> > > From: Sutou Kouhei 
> > > To: dev@arrow.apache.org
> > > Date: 2020/11/16 10:02
> > > Subject: [EXTERNAL] Re: Travis CI jobs gummed up on Arrow PRs?
> > >
> > > Hi,
> > >
> > > > 1. Does anyone else know about these failures?
> > >
> > > "these failures" means that the Travis CI jobs aren't ran,
> > > right? (It doesn't mean that the Travis CI jobs reports
> > > "failure".)
> > >
> > > This may be a Travis CI bug.
> > >
> > > > 2. Should we look into disabling these checks for PRs that only
> touch
> > rust
> > > > code? I can do this, but am not sure of the history
> > >
> > > One approach is adding "[travis skip]" to the commit message.
> > > We can't use path based conditional build on Travis CI
> > > because Travis CI doesn't provide the feature.
> > >
> > > See also:
> > >
> > >   * https://docs.travis-ci.com/user/conditions-v1
> > >   * https://docs.travis-ci.com/user/conditions-testing
> > >
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > > In
> 
> > >   "Travis CI jobs gummed up on Arrow PRs?" on Sun, 15 Nov 2020
> 07:55:27
> > -0500,
> > >   Andrew Lamb  wrote:
> > >
> > > > Sorry if this has already been discussed.
> > > >
> > > > There seems to be something wrong with the Travis CI jobs on some
> > Arrow PRs
> > > > -- for example,
> > > > https://github.com/apache/arrow/pull/8662/checks?check_run_id=1400052607 .
> > > > They go into the "pending" state and never seem to actually run.
> > > >
> > > > Since these appear to be checks of the C++ implementation on s390 or
> > ARM,
> > > > which aren't relevant to the Rust implementation, it isn't really
> > blocking
> > > > anything.
> > > >
> > > > I am wondering
> > > > 1. Does anyone else know about these failures?
> > > > 2. Should we look into disabling these checks for PRs that only
> touch
> > rust
> > > > code? I can do this, but am not sure of the history
> > > >
> > > > Thank you,
> > > > Andrew
> > > >
> > > > p.s. here is another example of a non rust PR that seems to show the
> > same
> > > > issue:
> > > > https://github.com/apache/arrow/pull/8647
> > > >
> > > > p.p.s. Here is a screenshot showing the travis CI UI for such a
> gummed
> > up
> > > > test
> > > >
> > > > [image: Screen Shot 2020-11-15 at 7.50.27 AM.png]
> > >
> >
> >
> >
>
>
>
>


[NIGHTLY] Arrow Build Report for Job nightly-2020-11-18-0

2020-11-18 Thread Crossbow


Arrow Build Report for Job nightly-2020-11-18-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0

Failed Tasks:
- conda-osx-clang-py36-r36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-osx-clang-py36-r36
- conda-osx-clang-py37-r40:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-osx-clang-py37-r40
- conda-win-vs2017-py36-r36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-win-vs2017-py36-r36
- conda-win-vs2017-py37-r40:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-win-vs2017-py37-r40
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-github-test-conda-cpp
- test-conda-python-3.7-spark-branch-3.0:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-github-test-conda-python-3.7-spark-branch-3.0
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-github-test-conda-python-3.8-jpype
- test-ubuntu-18.04-docs:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-test-ubuntu-18.04-docs

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-clean
- conda-linux-gcc-py36-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-drone-conda-linux-gcc-py36-aarch64
- conda-linux-gcc-py36-cpu-r36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-linux-gcc-py36-cpu-r36
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-drone-conda-linux-gcc-py37-aarch64
- conda-linux-gcc-py37-cpu-r40:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-linux-gcc-py37-cpu-r40
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-drone-conda-linux-gcc-py38-aarch64
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-linux-gcc-py38-cuda
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-osx-clang-py38
- conda-win-vs2017-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-azure-conda-win-vs2017-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-travis-debian-stretch-arm64
- example-cpp-minimal-build-static-system-dependency:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-github-example-cpp-minimal-build-static-system-dependency
- example-cpp-minimal-build-static:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-github-example-cpp-minimal-build-static
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-18-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: