>
> What does "parallel-list" mean?
Something like:
table RecordBatch {
  nodes: [FieldNode];
  // Statistics related to the data represented by each FieldNode.
  // This field is either length=0 or has the same length as nodes.
  statistics: [Statistic];
}
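
To make the parallel-list idea concrete, here is a minimal Python sketch of the invariant described in the schema comment above: `statistics` is either empty or has exactly one entry per `FieldNode`. The `Statistic` fields (`min_value`, `max_value`) and the `stats_for` helper are hypothetical names chosen for illustration; nothing here is part of the Arrow format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldNode:
    length: int
    null_count: int


@dataclass
class Statistic:
    # Hypothetical contents; the thread does not define what a Statistic holds.
    min_value: Optional[int] = None
    max_value: Optional[int] = None


@dataclass
class RecordBatch:
    nodes: List[FieldNode]
    # Parallel list: empty, or one Statistic per entry in `nodes`.
    statistics: List[Statistic] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.statistics and len(self.statistics) != len(self.nodes):
            raise ValueError("statistics must be empty or match nodes in length")

    def stats_for(self, index: int) -> Optional[Statistic]:
        """Return the Statistic for nodes[index], or None if none were written."""
        return self.statistics[index] if self.statistics else None
```

Because an empty list is allowed, a writer that knows nothing about statistics still produces a valid batch, and a reader simply gets `None` back.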
On Wed, Feb 17, 2021 at 8:34 PM Kohei KaiGai
Hi Weiyang,
The library looks interesting, and for python certainly seems like it might
add a better user experience.
I'm not super active in python maintenance (others who are can hopefully
chime in). But my impression is we try to keep dependencies minimal in
general.
Furthermore, the goal of
>
> I don't think any notion of threading should be present in the
> implementation, except for the required locks around shared structures.
I seem to recall the debate was how to model some class interactions to
determine what should be considered shared structures and what should not.
On Wed,
Thanks for the clarification.
> There is key-value metadata available on Message which might be able to
> work in the short term (some sort of encoded message). I think
> standardizing how we store statistics per batch does make sense.
>
For example, JSON array of min/max values as a key-value me
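
The short-term workaround discussed here, a JSON array of min/max values stored as one key-value metadata entry, might look roughly like the sketch below. The key name `min_max_values`, the JSON layout, and both helper functions are assumptions made for illustration; the thread does not specify an encoding.

```python
import json

# Hypothetical metadata key; not defined by Arrow.
STATS_KEY = "min_max_values"


def encode_min_max(columns):
    """Build one key-value metadata entry from a list of per-field value lists."""
    stats = [{"min": min(values), "max": max(values)} for values in columns]
    return {STATS_KEY: json.dumps(stats)}


def may_contain(metadata, field_index, value):
    """Return False only when the stored range proves the value is absent."""
    raw = metadata.get(STATS_KEY)
    if raw is None:
        return True  # no statistics were written, so the batch cannot be skipped
    entry = json.loads(raw)[field_index]
    return entry["min"] <= value <= entry["max"]
```

A reader that does not recognize the key ignores it, which is why this approach does not break compatibility.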
>
> I didn’t find any page/documentation on how to do an RFC for the Arrow protocol,
> so can anyone point me to it, or will a PR with an email be enough?
That is enough to start a discussion. Before formal acceptance and merging
of the PR there need to be Java and C++ implementations for the type
that pass i
Congrats!
On Wed, Feb 17, 2021 at 4:12 PM Wes McKinney wrote:
> This is great news! Congrats to everyone who worked on this to make it
> possible. I know that the cross-endianness question was something that
> came up periodically (even though BE systems are increasingly exotic
> nowadays) so it
There is key-value metadata available on Message which might be able to
work in the short term (some sort of encoded message). I think
standardizing how we store statistics per batch does make sense.
We unfortunately can't add anything to field-node without breaking
compatibility. But another o
Hello,
Does Apache Arrow have any standard way to embed min/max values of the fields
on a per-record-batch basis?
It looks like FieldNode supports neither a dedicated min/max attribute nor
custom metadata.
https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
If we embed an array of min/max valu
This is great news! Congrats to everyone who worked on this to make it
possible. I know that the cross-endianness question was something that
came up periodically (even though BE systems are increasingly exotic
nowadays) so it's great that we now have a robust answer
On Wed, Feb 17, 2021 at 8:48 A
That is a great suggestion Wes, thank you.
I wonder if we could get away with a 128 bit representation that is the
concatenation of the two existing interval types (YearMonth)(DayTime). Or
maybe even define a `struct` type with those fields that is used by
DataFusion.
Basically, given our reading
On Wed, Feb 17, 2021 at 3:46 PM wrote:
>
> > It's unclear to me that this needs to be introduced into the top-level
>
> It is a similar problem for the columnar format: how do you store an interval like
> 1 month 1 day 1 hour? It is not possible without converting 1 month to 30 days,
> which is a poor approach.
>
Read more (this is one ASF member's interpretation of the Openness
tenet of the Apache Way) about this:
http://theapacheway.com/open/
On Wed, Feb 17, 2021 at 3:46 PM Wes McKinney wrote:
>
> For trivial PRs that do not merit mention in the changelog you could
> preface the issue title with someth
I like the idea of encouraging regular contributors to use JIRA more
consistently
On Wed, Feb 17, 2021 at 4:47 PM Wes McKinney wrote:
> For trivial PRs that do not merit mention in the changelog you could
> preface the issue title with something like "ARROW-XXX" and we can
> modify the merge too
For trivial PRs that do not merit mention in the changelog you could
preface the issue title with something like "ARROW-XXX" and we can
modify the merge tool to bypass the consistency check for these. I
think some other Apache projects do this. I can understand how it
might seem like a nuisance to
> It's unclear to me that this needs to be introduced into the top-level
It is a similar problem for the columnar format: how do you store an interval like
1 month 1 day 1 hour? It is not possible without converting 1 month to 30 days,
which is a poor approach.
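
A small Python sketch shows why treating 1 month as 30 days loses information. The `add_months` helper is hypothetical, and clamping to the last day of the target month is only one common convention, assumed here for illustration.

```python
import calendar
from datetime import date, timedelta


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


start = date(2021, 2, 17)
one_calendar_month = add_months(start, 1)    # 2021-03-17
thirty_days = start + timedelta(days=30)     # 2021-03-19
```

The two results differ because February 2021 has 28 days, so the month component has to be stored separately from the day component.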
> On 17 Feb 2021, at 21:02, Wes McKinney wrote:
>
>
Thanks for the background Wes. This is exactly what I was looking for.
I think using JIRA for the single source of truth / project management has
lots of value and I don't want to propose changing that. I am trying to
lower the barrier to contributing to Arrow even more.
While I agree creating JI
hi Andrew,
There isn't a hard requirement. It's a culture thing where the purpose
of Jira issues is to create a changelog and for developers to
communicate publicly what work they are proposing to perform in the
project. We decided by consensus (essentially) that having a single
point of truth for
Can someone tell me / point me at what the actual "requirements" for using
JIRA in Apache Arrow are?
Specifically, I would like to know:
1. Where does the requirement for each commit to have a JIRA ticket come
from? (Is that Apache Arrow specific, or is it a more general Apache
governance require
It's unclear to me that this needs to be introduced into the top-level
columnar format without more analysis — have you considered
implementing this for DataFusion as an extension type for the time
being?
On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me wrote:
>
> Hi,
>
> For now, There are only tw
Hi,
For now, there are only two types of IntervalUnit inside Arrow:
- YearMonth - months stored as int32
- DayTime - days as int32 and time in milliseconds as int32 (64 bits in total)
Since DF is using Arrow, it’s not possible to store “complex” intervals such as 1
MONTH 1 DAY 1 HOUR.
I think, the b
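
As a rough illustration of the two layouts listed above, the following Python sketch packs them with the standard `struct` module. `MONTH_DAY_TIME` is a hypothetical combined layout, not an Arrow type; it is included only to show what storing all three components side by side could look like. Little-endian byte order is assumed.

```python
import struct

# Existing layouts as described in this message.
YEAR_MONTH = struct.Struct("<i")    # months as int32
DAY_TIME = struct.Struct("<ii")     # days as int32, milliseconds as int32

# Hypothetical combined layout (NOT an Arrow type): months, days, milliseconds.
MONTH_DAY_TIME = struct.Struct("<iii")

MS_PER_HOUR = 60 * 60 * 1000

# "1 MONTH 1 DAY 1 HOUR" needs both existing layouts at once...
month_only = YEAR_MONTH.pack(1)
day_and_hour = DAY_TIME.pack(1, MS_PER_HOUR)

# ...or a single value that carries all three components.
combined = MONTH_DAY_TIME.pack(1, 1, MS_PER_HOUR)
```

Neither existing unit can hold the whole interval on its own, which is the limitation this message describes.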
This certainly sounds good to me.
Cheers, Gidon
On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou wrote:
>
> I don't think any notion of threading should be present in the
> implementation, except for the required locks around shared structures.
> I don't know where the idea of a "main thread" c
I don't think any notion of threading should be present in the
implementation, except for the required locks around shared structures.
I don't know where the idea of a "main thread" comes from, but it
probably shouldn't exist in a C++ library.
Regards
Antoine.
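
One way to read "locks around shared structures" is sketched below in Python: a cache that any thread may call, with no notion of a main thread. The `KeyCache` class and its `fetch_key` callback are hypothetical and do not mirror the actual Parquet encryption code.

```python
import threading
from typing import Callable, Dict


class KeyCache:
    """Shared key cache guarded by a lock; callable from any thread."""

    def __init__(self, fetch_key: Callable[[str], bytes]) -> None:
        self._fetch_key = fetch_key          # hypothetical key-retrieval callback
        self._lock = threading.Lock()
        self._keys: Dict[str, bytes] = {}

    def get(self, key_id: str) -> bytes:
        # The lock protects only the shared dictionary; callers need no
        # knowledge of which thread they are running on.
        with self._lock:
            key = self._keys.get(key_id)
            if key is None:
                key = self._fetch_key(key_id)
                self._keys[key_id] = key
            return key
```

Holding the lock during the fetch keeps the sketch simple and guarantees a single fetch per key, at the cost of serializing lookups.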
On 17/02/2021 at 18:34, Gidon G
Just to clarify. There are two options, which one do you refer to? A design
with a main thread that handles projections and the keys (relevant for the
projected columns); or the current code with any thread allowed to handle
full file reading, including the footer, column projections and their keys? Can
I think 2.0.1 can be removed. I doubt that a 3.0.1 patch release is going
to happen either but it can be removed later.
On Wed, Feb 17, 2021 at 9:41 AM Antoine Pitrou wrote:
>
> Hi,
>
> There are versions named "2.0.1" and "3.0.1" on JIRA, they are tagged
> with a number of issues:
> https://iss
Hi,
There are versions named "2.0.1" and "3.0.1" on JIRA, they are tagged
with a number of issues:
https://issues.apache.org/jira/projects/ARROW/versions/12349263
https://issues.apache.org/jira/projects/ARROW/versions/12349610
What should we do with them? It seems that "2.0.1" at least should b
Hello,
I would like to announce that we have just merged
https://github.com/apache/arrow/pull/7507, which implements - on the C++
side - endianness conversion when reading IPC data with non-native
endianness.
This means that IPC and Flight communication using Arrow C++ should be
possible betwee
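
For readers unfamiliar with what endianness conversion involves, here is a minimal Python sketch of decoding an int32 buffer written with a known byte order. It is not how the merged C++ change is implemented; `decode_int32_buffer` and its `writer_is_little_endian` parameter are hypothetical names used only for illustration.

```python
import struct


def decode_int32_buffer(buf: bytes, writer_is_little_endian: bool) -> list:
    """Decode int32 values using the byte order the writer declared."""
    prefix = "<" if writer_is_little_endian else ">"
    count = len(buf) // 4
    return list(struct.unpack(f"{prefix}{count}i", buf[: count * 4]))


# Bytes as a big-endian writer would produce them.
big_endian_bytes = struct.pack(">3i", 1, 2, 3)

# Honoring the declared byte order recovers the original values.
values = decode_int32_buffer(big_endian_bytes, writer_is_little_endian=False)

# Ignoring it and assuming little-endian yields byte-swapped values.
misread = decode_int32_buffer(big_endian_bytes, writer_is_little_endian=True)
```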
On 17/02/2021 at 12:07, Andrew Lamb wrote:
> *Proposal*: Allow Rust and other implementations to release additional point
> / maintenance versions at different cadences, out of lockstep with the
> major arrow releases. We could still release the Rust library as part of
> the major Arrow releas
On 17/02/2021 at 12:47, Gidon Gershinsky wrote:
> From the doc,
> "To maintain consistency with the style of parquet-cpp, the above
> structures should not be explicitly synchronized with individual mutexes.
> In the case of a parquet::arrow::FileReader, the request to read a given
> selection
From the doc,
"To maintain consistency with the style of parquet-cpp, the above
structures should not be explicitly synchronized with individual mutexes.
In the case of a parquet::arrow::FileReader, the request to read a given
selection of row groups and columns is issued from a single main thread
I have two items I would like to propose for the agenda if there is time:
1. Manual creation of JIRA Tickets
*Background/Issue*: Currently all contributors are required to make a JIRA
account and do some mechanical JIRA creation to create well formed Arrow
PRs. This is mindless work and people who
I'm not sure a threading model is expected for an encryption layer. Am
I missing something?
Regards
Antoine.
On 17/02/2021 at 06:59, Gidon Gershinsky wrote:
> Precisely, the main change is in the threading model. Afaik, the document
> proposes a model that fits pandas, but might be problem
On 17/02/2021 at 05:20, Micah Kornfield wrote:
>>
>> If a method could potentially run some kind of long term blocking I/O
>> wait then yes. So reading / writing tables & datasets, IPC,
>> filesystem APIs, etc. will all need to adapt. It doesn't have to be
>> all at once. CPU only functions
Arrow Build Report for Job nightly-2021-02-17-0
All tasks:
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0
Failed Tasks:
- conda-linux-gcc-py36-aarch64:
URL:
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-drone-conda-linux