Re: Any standard way for min/max values per record-batch?

2021-02-17 Thread Micah Kornfield
> > What is the parallel-list means?
Something like:
table RecordBatch {
  nodes: [FieldNode];
  // Statistics related to the data represented by each FieldNode
  // This field is either length=0 or has the same length as nodes.
  statistics: [Statistic];
}
On Wed, Feb 17, 2021 at 8:34 PM Kohei KaiGai

Re: [Python] A user friendly way to filter parquet partitions

2021-02-17 Thread Micah Kornfield
Hi Weiyang, The library looks interesting, and for python certainly seems like it might add a better user experience. I'm not super active in python maintenance (others who are can hopefully chime in). But my impression is we try to keep dependencies minimal in general. Furthermore, the goal of

Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Micah Kornfield
> > I don't think any notion of threading should be present in the > implementation, except for the required locks around shared structures. I seem to recall the debate was how to model some class interactions to determine what should be considered shared structures and what should not. On Wed,

Re: Any standard way for min/max values per record-batch?

2021-02-17 Thread Kohei KaiGai
Thanks for the clarification. > There is key-value metadata available on Message which might be able to > work in the short term (some sort of encoded message). I think > standardizing how we store statistics per batch does make sense. > For example, JSON array of min/max values as a key-value
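The short-term idea discussed in this thread (a JSON array of min/max values carried as a key-value metadata entry) could look roughly like the sketch below. It uses only the Python standard library and plain dicts in place of real record batches. The key name `min_max_stats` and the JSON layout are assumptions made for illustration; Arrow does not define a standard for either.

```python
import json

# Hypothetical metadata key; Arrow does not define a standard key for this.
STATS_KEY = "min_max_stats"


def encode_min_max(columns):
    """Build a key-value metadata entry holding per-field min/max as a JSON array."""
    stats = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]  # nulls are ignored
        stats.append({
            "field": name,
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        })
    # Key-value metadata is string-to-string, so the array is serialized to JSON.
    return {STATS_KEY: json.dumps(stats)}


def decode_min_max(metadata):
    """Recover the list of per-field statistics from the metadata entry."""
    return json.loads(metadata[STATS_KEY])


# A stand-in for one record batch: field name -> column values.
batch = {"id": [3, 1, 2], "price": [9.5, None, 4.0]}
metadata = encode_min_max(batch)
stats = decode_min_max(metadata)
```

A reader that does not know the key simply ignores it, which is why this approach does not break compatibility, at the cost of every consumer having to agree on the encoding out of band.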

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread Micah Kornfield
> > I didn’t find any page/documentation on how to do RFC in Arrow protocol, > so can anyone point me to it or PR with email will be enough? That is enough to start discussion. Before formal acceptance and merging of the PR there need to be Java and C++ implementations for the type that pass

Re: Cross-endianness IPC support in Arrow C++

2021-02-17 Thread Micah Kornfield
Congrats! On Wed, Feb 17, 2021 at 4:12 PM Wes McKinney wrote: > This is great news! Congrats to everyone who worked on this to make it > possible. I know that the cross-endianness question was something that > came up periodically (even though BE systems are increasingly exotic > nowadays) so

Re: Any standard way for min/max values per record-batch?

2021-02-17 Thread Micah Kornfield
There is key-value metadata available on Message which might be able to work in the short term (some sort of encoded message). I think standardizing how we store statistics per batch does make sense. We unfortunately can't add anything to field-node without breaking compatibility. But another

Any standard way for min/max values per record-batch?

2021-02-17 Thread Kohei KaiGai
Hello, Does Apache Arrow have any standard way to embed min/max values of the fields on a per-record-batch basis? It looks like FieldNode supports neither a dedicated min/max attribute nor custom metadata. https://github.com/apache/arrow/blob/master/format/Message.fbs#L28 If we embed an array of min/max

Re: Cross-endianness IPC support in Arrow C++

2021-02-17 Thread Wes McKinney
This is great news! Congrats to everyone who worked on this to make it possible. I know that the cross-endianness question was something that came up periodically (even though BE systems are increasingly exotic nowadays) so it's great that we now have a robust answer. On Wed, Feb 17, 2021 at 8:48

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread Andrew Lamb
That is a great suggestion Wes, thank you. I wonder if we could get away with a 128 bit representation that is the concatenation of the two existing interval types (YearMonth)(DayTime). Or maybe even define a `struct` type with those fields that is used by DataFusion. Basically, given our

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread Wes McKinney
On Wed, Feb 17, 2021 at 3:46 PM wrote: > > > It's unclear to me that this needs to be introduced into the top-level > > Similar thing to columnar format, How to store interval like 1 month 1 day 1 > hour? It’s not possible to do it without converting 1 month to 30 days, which > is a bad way. >

Re: Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Wes McKinney
Read more (this is one ASF member's interpretation of the Openness tenet of the Apache Way) about this: http://theapacheway.com/open/ On Wed, Feb 17, 2021 at 3:46 PM Wes McKinney wrote: > > For trivial PRs that do not merit mention in the changelog you could > preface the issue title with

Re: Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Andrew Lamb
I like the idea of encouraging regular contributors to use JIRA more consistently On Wed, Feb 17, 2021 at 4:47 PM Wes McKinney wrote: > For trivial PRs that do not merit mention in the changelog you could > preface the issue title with something like "ARROW-XXX" and we can > modify the merge

Re: Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Wes McKinney
For trivial PRs that do not merit mention in the changelog you could preface the issue title with something like "ARROW-XXX" and we can modify the merge tool to bypass the consistency check for these. I think some other Apache projects do this. I can understand how it might seem like a nuisance to

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread talk
> It's unclear to me that this needs to be introduced into the top-level Similar thing to columnar format, How to store interval like 1 month 1 day 1 hour? It’s not possible to do it without converting 1 month to 30 days, which is a bad way. > On 17 Feb 2021, at 21:02, Wes McKinney wrote: >

Re: Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Andrew Lamb
Thanks for the background Wes. This is exactly what I was looking for. I think using JIRA for the single source of truth / project management has lots of value and I don't want to propose changing that. I am trying to lower the barrier to contributing to Arrow even more. While I agree creating

Re: Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Wes McKinney
hi Andrew, There isn't a hard requirement. It's a culture thing where the purpose of Jira issues is to create a changelog and for developers to communicate publicly what work they are proposing to perform in the project. We decided by consensus (essentially) that having a single point of truth

Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Andrew Lamb
Can someone tell me / point me at what the actual "requirements" for using JIRA in Apache Arrow are? Specifically, I would like to know: 1. Where does the requirement for each commit to have a JIRA ticket come from? (Is that Apache Arrow specific, or is it a more general Apache governance

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread Wes McKinney
It's unclear to me that this needs to be introduced into the top-level columnar format without more analysis — have you considered implementing this for DataFusion as an extension type for the time being? On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me wrote: > > Hi, > > For now, There are only

[Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread t...@dmtry.me
Hi, For now, there are only two types of IntervalUnit inside Arrow: - YearMonth - months stored as int32 - DayTime - days as int32 and time in milliseconds as int32 (64 bits in total). Since DF (DataFusion) is using Arrow, it’s not possible to store “Complex” intervals such as 1 MONTH 1 DAY 1 HOUR. I think, the
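To make the problem concrete, here is a minimal sketch of an interval value that keeps the months, days and milliseconds components side by side, so that 1 MONTH 1 DAY 1 HOUR is stored without converting the month into a fixed number of days. The class name `CompoundInterval` and the packed layout (three little-endian int32 values) are assumptions for illustration only, not a layout defined by Arrow or DataFusion.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class CompoundInterval:
    """Hypothetical interval combining the YearMonth and DayTime components."""

    months: int        # YearMonth part, int32
    days: int          # DayTime part, days, int32
    milliseconds: int  # DayTime part, time in milliseconds, int32

    def pack(self) -> bytes:
        # Three little-endian int32 values, 12 bytes in total (assumed layout).
        return struct.pack("<iii", self.months, self.days, self.milliseconds)

    @classmethod
    def unpack(cls, raw: bytes) -> "CompoundInterval":
        return cls(*struct.unpack("<iii", raw))


# 1 MONTH 1 DAY 1 HOUR, with the month kept as a month.
interval = CompoundInterval(months=1, days=1, milliseconds=60 * 60 * 1000)
raw = interval.pack()
restored = CompoundInterval.unpack(raw)
```

Keeping the three components separate matters because a month has no fixed length in days; collapsing it to 30 days changes the result of adding the interval to most dates.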

Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Gidon Gershinsky
This certainly sounds good to me. Cheers, Gidon On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou wrote: > > I don't think any notion of threading should be present in the > implementation, except for the required locks around shared structures. > I don't know where the idea of a "main thread"

Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Antoine Pitrou
I don't think any notion of threading should be present in the implementation, except for the required locks around shared structures. I don't know where the idea of a "main thread" comes from, but it probably shouldn't exist in a C++ library. Regards Antoine. On 17/02/2021 at 18:34, Gidon

Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Gidon Gershinsky
Just to clarify. There are two options, which one do you refer to? A design with a main thread that handles projections and the keys (relevant for the projected columns); or the current code with any thread allowed to handle full file reading, inc the footer, column projections and their keys? Can

Re: "2.0.1" and "3.0.1" versions on JIRA

2021-02-17 Thread Wes McKinney
I think 2.0.1 can be removed. I doubt that a 3.0.1 patch release is going to happen either but it can be removed later. On Wed, Feb 17, 2021 at 9:41 AM Antoine Pitrou wrote: > > Hi, > > There are versions named "2.0.1" and "3.0.1" on JIRA, they are tagged > with a number of issues: >

"2.0.1" and "3.0.1" versions on JIRA

2021-02-17 Thread Antoine Pitrou
Hi, There are versions named "2.0.1" and "3.0.1" on JIRA, they are tagged with a number of issues: https://issues.apache.org/jira/projects/ARROW/versions/12349263 https://issues.apache.org/jira/projects/ARROW/versions/12349610 What should we do with them? It seems that "2.0.1" at least should

Cross-endianness IPC support in Arrow C++

2021-02-17 Thread Antoine Pitrou
Hello, I would like to announce that we have just merged https://github.com/apache/arrow/pull/7507, which implements - on the C++ side - endianness conversion when reading IPC data with non-native endianness. This means that IPC and Flight communication using Arrow C++ should be possible
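As a generic illustration of what endianness conversion on read means, the sketch below decodes a buffer of int32 values that was written in a known byte order, regardless of the byte order of the machine doing the reading. It uses only the Python standard library; the helper names are hypothetical, and this is not the Arrow C++ implementation or its API.

```python
import struct
import sys


def read_int32_buffer(raw: bytes, big_endian: bool) -> list:
    """Decode a buffer of int32 values written with the given byte order."""
    count = len(raw) // 4
    fmt = (">" if big_endian else "<") + f"{count}i"
    return list(struct.unpack(fmt, raw))


def needs_swap(big_endian: bool) -> bool:
    """True when the data's byte order differs from the host's native order."""
    return big_endian != (sys.byteorder == "big")


values = [1, -2, 300]
big = struct.pack(">3i", *values)  # as a big-endian writer would produce it
decoded = read_int32_buffer(big, big_endian=True)
```

The reader has to learn the writer's byte order from somewhere; in this sketch it is passed in explicitly as the `big_endian` flag.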

Re: Arrow sync call February 17 at 12:00 US/Eastern, 17:00 UTC

2021-02-17 Thread Antoine Pitrou
On 17/02/2021 at 12:07, Andrew Lamb wrote: > *Proposal*: Allow Rust and other implementations to release additional point > / maintenance versions at different cadences, out of lockstep with the > major arrow releases. We could still release the Rust library as part of > the major Arrow

Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Gidon Gershinsky
From the doc, "To maintain consistency with the style of parquet-cpp, the above structures should not be explicitly synchronized with individual mutexes. In the case of a parquet::arrow::FileReader, the request to read a given selection of row groups and columns is issued from a single main

Re: Arrow sync call February 17 at 12:00 US/Eastern, 17:00 UTC

2021-02-17 Thread Andrew Lamb
I have two items I would like to propose for the agenda if there is time: 1. Manual creation of JIRA Tickets *Background/Issue*: Currently all contributors are required to make a JIRA account and do some mechanical JIRA creation to create well formed Arrow PRs. This is mindless work and people

Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Antoine Pitrou
I'm not sure a threading model is expected for an encryption layer. Am I missing something? Regards Antoine. On 17/02/2021 at 06:59, Gidon Gershinsky wrote: > Precisely, the main change is in the threading model. Afaik, the document > proposes a model that fits pandas, but might be

Re: Threading Improvements Proposal

2021-02-17 Thread Antoine Pitrou
On 17/02/2021 at 05:20, Micah Kornfield wrote: >> >> If a method could potentially run some kind of long term blocking I/O >> wait then yes. So reading / writing tables & datasets, IPC, >> filesystem APIs, etc. will all need to adapt. It doesn't have to be >> all at once. CPU only functions

[NIGHTLY] Arrow Build Report for Job nightly-2021-02-17-0

2021-02-17 Thread Crossbow
Arrow Build Report for Job nightly-2021-02-17-0 All tasks: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0 Failed Tasks: - conda-linux-gcc-py36-aarch64: URL: