change in pyarrow scalar equality?

2020-08-05 Thread Bryan Cutler
Hi all, I came across a behavior change from 0.17.1 when comparing array scalar values with Python objects. This used to work in 0.17.1 and before, but in 1.0.0 equals always returns false. I saw there was a previous discussion on Python equality semantics, but I'm not sure if the conclusion is the

Re: [DISCUSS] Support of higher bit-width Decimal type

2020-08-05 Thread Micah Kornfield
> > Sounds fine to me. I guess one question is what needs to be formalized > in the Schema.fbs files or elsewhere in the columnar format > documentation (and we will need to hold an associated vote for that I > think) Yes, I think we will need to hold a vote for it. Since this is essentially a

Re: [DISCUSS][C++] Group by operation for RecordBatch and Table

2020-08-05 Thread Wes McKinney
I see there's a bunch of additional aggregation code in Dremio that might serve as inspiration (some of which is related to distributed aggregation, so may not be relevant) https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/aggregate Maybe Andy or one

Re: [DISCUSS] Support of higher bit-width Decimal type

2020-08-05 Thread Wes McKinney
Sounds fine to me. I guess one question is what needs to be formalized in the Schema.fbs files or elsewhere in the columnar format documentation (and we will need to hold an associated vote for that I think) On Mon, Aug 3, 2020 at 11:30 PM Micah Kornfield wrote: > > Given no objections, we'll go

Re: [DISCUSS][C++] Group by operation for RecordBatch and Table

2020-08-05 Thread Wes McKinney
hi Kenta, Yes, I think it only makes sense to implement this in the context of the query engine project. Here's a list of assorted thoughts about it: * I have been mentally planning to follow the Vectorwise-type query engine architecture that's discussed in [1] [2] and many other academic

Re: Arrow sync call August 5 at 12:00 US/Eastern, 16:00 UTC

2020-08-05 Thread Neal Richardson
Attendees: Projjal Chanda, Fred Gan, Andy Grove, Todd Hendricks, Jörn Horstmann, Ben Kietzman, Rok Mihevc, Neal Richardson, Paul Taylor, Andrew Wieteska. Discussion: * 1.0.1 * Andy: Rust packaging issue, need to test on published crate * Timing: week of August 17 * Bug in dictionary batches in device

[DISCUSS][C++] Group by operation for RecordBatch and Table

2020-08-05 Thread Kenta Murata
Hi folks, Red Arrow, the Ruby binding of Arrow GLib, implements grouped aggregation features for RecordBatch and Table. Because these features are written in Ruby, they are too slow for large data. We need to make them much faster. To improve their calculation speed, they should be

Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-08-05 Thread Radu Teodorescu
> I will have a closer look and comment most likely next week. Thank you! > > Unfortunately, having code developed in external repositories increases the > complexity of importing that code back into the Apache project. Not sure if > you’re interested in preemptively following the project’s

Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-08-05 Thread Wes McKinney
I will have a closer look and comment most likely next week. Unfortunately, having code developed in external repositories increases the complexity of importing that code back into the Apache project. Not sure if you’re interested in preemptively following the project’s style guide (file naming,

Re: [DISCUSS] How to extended time value range for Timestamp type?

2020-08-05 Thread Wes McKinney
I also am not sure there is a good case for a new built-in type since it introduces a good deal of complexity, particularly when there is the extension type option. We’ve been living with 64-bit nanoseconds in pandas for a decade, for example (and without the option for lower resolutions!!), and

Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-08-05 Thread Radu Teodorescu
Wes & crew, Congratulations and thank you for the successful 1.0 rollout, it is certainly making a huge difference for my day job! Is it a good time now to revive the conversation below? (and https://github.com/apache/arrow/pull/7548) I have also gone ahead and released a prototype that covers

[NIGHTLY] Arrow Build Report for Job nightly-2020-08-05-0

2020-08-05 Thread Crossbow
Arrow Build Report for Job nightly-2020-08-05-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0 Failed Tasks: - conda-linux-gcc-py36-cpu: URL:

Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

2020-08-05 Thread Ji Liu
Hi Liya, thanks for your careful review. It is a typo; the order of getBuffers is wrong. On Wed, Aug 5, 2020 at 2:14 PM, Fan Liya wrote: > Hi Ji, > > IMO, for the correct order, the validity buffer should precede the offset > buffer (e.g. this is the order used by BaseVariableWidthVector & >

Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

2020-08-05 Thread Fan Liya
Hi Ji, IMO, for the correct order, the validity buffer should precede the offset buffer (e.g. this is the order used by BaseVariableWidthVector & BaseLargeVariableWidthVector). In ListVector#getBuffers, the offset buffer precedes the validity buffer, so I am a little confused why you say the