Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Jorge Cardoso Leitão
Hi, So, something like a human and computer readable standard for arrow schemas, e.g. via yaml or a json schema. We kind of do this in our integration tests / golden tests, where we have a non-official json representation of an arrow schema. The ask here is to standardize such a format in some

Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?

2024-06-05 Thread Jorge Cardoso Leitão
Hi This is c++ specific, but imo the question applies more broadly. I understood that the rationale for stats in compressed+encoded formats like parquet is that computing those stats has a high cost (io + decompress + decode + aggregate). This motivates the materialization of aggregates. In arro

Re: [VOTE] Move Arrow DataFusion Subproject to new Top Level Apache Project

2024-03-01 Thread Jorge Cardoso Leitão
+1 - great work!!! On Fri, Mar 1, 2024 at 5:49 PM Micah Kornfield wrote: > +1 (binding) > > On Friday, March 1, 2024, Uwe L. Korn wrote: > > > +1 (binding) > > > > On Fri, Mar 1, 2024, at 2:37 PM, Andy Grove wrote: > > > +1 (binding) > > > > > > On Fri, Mar 1, 2024 at 6:20 AM Weston Pace > > w

Re: [VOTE] Accept donation of Comet Spark native engine

2024-01-27 Thread Jorge Cardoso Leitão
+1 On Sun, 28 Jan 2024, 00:00 Wes McKinney, wrote: > +1 (binding) > > On Sat, Jan 27, 2024 at 12:26 PM Micah Kornfield > wrote: > > > +1 Binding > > > > On Sat, Jan 27, 2024 at 10:21 AM David Li wrote: > > > > > +1 (binding) > > > > > > On Sat, Jan 27, 2024, at 13:03, L. C. Hsieh wrote: > > >

Re: [VOTE] Add RLE Arrays to Arrow Format

2022-12-19 Thread Jorge Cardoso Leitão
+1 Thanks a lot for all this. Really exciting!! On Mon, 19 Dec 2022, 17:56 Matt Topol, wrote: > That leaves us with a total vote of +1.5 so the vote carries with the > caveat of changing the name to be Run End Encoded rather than Run Length > Encoded (unless this means I need to do a new vote w

Re: compressed feather v2 "slicing from the middle"

2022-09-21 Thread Jorge Cardoso Leitão
Hi, AFAIK compressed IPC arrow files do not support random access (like uncompressed counterparts) - you need to decompress the whole batch (or at least the columns you need). A "RecordBatch" is the compression unit of the file. Think of it like a parquet file whose every row group has a single da

Re: Usage of the name Feather?

2022-08-29 Thread Jorge Cardoso Leitão
I agree. I suspect that the most widely used API with "feather" is Pandas' read_feather. On Mon, 29 Aug 2022, 19:55 Weston Pace, wrote: > I agree as well. I think most lingering uses of the term "feather" > are in pyarrow and R however, so it might be good to hear from some of > those mainta

Re: [VOTE] Format: Rules and procedures for Canonical extension types

2022-08-29 Thread Jorge Cardoso Leitão
+1 Really well written, thanks for driving this! On Mon, 29 Aug 2022, 11:16 Antoine Pitrou, wrote: > > Hello, > > Just a heads up that more PMC votes are needed here. > > > > Le 24/08/2022 à 17:24, Antoine Pitrou a écrit : > > > > Hello, > > > > I would like to propose we vote for the following

Re: [QUESTION] How is mmap implemented for 8bit padded files?

2022-08-03 Thread Jorge Cardoso Leitão
made UB-safe by using the memcpy trick, which is correctly > optimized by production compilers: > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/ubsan.h#L55-L69 > > Regards > > Antoine. > > > Le 01/08/2022 à 18:55, Jorge Cardoso Leitão a écrit : > &

Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

2022-08-01 Thread Jorge Cardoso Leitão
I am +1 on either - imo: * it is important to have either available * both provide a non-trivial improvement over what we have * the trade-off is difficult to decide upon - I trust whomever is implementing it to experiment and decide which better fits Arrow and the ecosystem. Thank you so much fo

[QUESTION] How is mmap implemented for 8bit padded files?

2022-08-01 Thread Jorge Cardoso Leitão
Hi, I am trying to follow the C++ implementation with respect to mmap IPC files and reading them zero-copy, in the context of reproducing it in Rust. My understanding from reading the source code is that we essentially: * identify the memory regions (offset and length) of each of the buffers, via

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Jorge Cardoso Leitão
Hi Laurent, I agree that there is a common pattern in converting row-based formats to Arrow. Imho the difficult part is not to map the storage format to Arrow specifically - it is to map the storage format to any in-memory (row- or columnar- based) format, since it requires in-depth knowledge abo

Re: [VOTE] Mark C Stream Interface as Stable

2022-06-08 Thread Jorge Cardoso Leitão
Sorry, I got a bit confused on what we were voting on. Thank you for the clarification. +1 Best, Jorge On Wed, Jun 8, 2022 at 9:53 PM Antoine Pitrou wrote: > > Le 08/06/2022 à 20:55, Jorge Cardoso Leitão a écrit : > > 0 (binding) - imo there is some unclarity over what is ex

Re: [VOTE] Mark C Stream Interface as Stable

2022-06-08 Thread Jorge Cardoso Leitão
0 (binding) - imo there is some unclarity over what is expected to be passed over the C streaming interface - an Array or a StructArray. I think the spec claims the former, but the C++ implementation (which I assume is the reference here) expects the latter [1]. Would it be possible to clarify th

Re: [ANNOUNCE] New Arrow committer: Liang-Chi Hsieh

2022-04-29 Thread Jorge Cardoso Leitão
Congratulations, great work! On Sat, Apr 30, 2022 at 3:30 AM L. C. Hsieh wrote: > Thanks all! > > On Fri, Apr 29, 2022 at 7:19 PM Yijie Shen > wrote: > > > > Congrats Liang-Chi! > > > > > > On Thu, Apr 28, 2022 at 8:36 PM Vibhatha Abeykoon > > wrote: > > > > > Congratulations! > > > > > > On T

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Jorge Cardoso Leitão
ore > > This doesn't really answer my question, does it? > > > > > > On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou > wrote: > > > >> > >> Le 26/04/2022 à 16:18, Jorge Cardoso Leitão a écrit : > >>>> Would WASM be able to inte

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Jorge Cardoso Leitão
> Would WASM be able to interact in-process with non-WASM buffers safely? AFAIK yes. My understanding from playing with it in JS is that a WASM-backed udf execution would be something like: 1. compile the C++/Rust/etc UDF to WASM (a binary format) 2. provide a small WASM-compiled middleware of th

Re: [Question] Is it possible to write to IPC without an intermediary buffer?

2022-04-05 Thread Jorge Cardoso Leitão
ed to worry about possibly corrupting > the data. The challenging part is determining the exact locations that > need to be overwritten. > > -Micah > > On Mon, Apr 4, 2022 at 7:40 AM Jorge Cardoso Leitão < > jorgecarlei...@gmail.com> wrote: > > > Hi, > > >

[Question] Is it possible to write to IPC without an intermediary buffer?

2022-04-04 Thread Jorge Cardoso Leitão
Hi, Motivated by [1], I wonder if it is possible to write to IPC without writing the data to an intermediary buffer. The challenge is that the header of an IPC message [header][data] requires: * the positions of the buffers * the total length of the body For uncompressed data, we could compute
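The two header requirements listed above (buffer positions and total body length) can be sketched in plain Python for the uncompressed case. This is an illustration under stated assumptions, not the actual writer: `layout_body` is a hypothetical helper, and it assumes each buffer is padded to an 8-byte boundary before the next one starts.

```python
from typing import List, Tuple


def layout_body(buffer_lengths: List[int], alignment: int = 8) -> Tuple[List[Tuple[int, int]], int]:
    """Return ((offset, length) per buffer, total body length).

    Hypothetical helper: assumes every buffer is padded up to `alignment`
    bytes, so all positions are known before any buffer is written.
    """
    positions = []
    offset = 0
    for length in buffer_lengths:
        positions.append((offset, length))
        # Round the buffer length up to the next multiple of `alignment`.
        padded = (length + alignment - 1) // alignment * alignment
        offset += padded
    return positions, offset


# Three buffers of 3, 16 and 10 bytes.
positions, body_length = layout_body([3, 16, 10])
```

With compression this no longer works up front, because the compressed length of each buffer is only known after compressing it.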

Re: [ANNOUNCE] New Arrow committers: Raphael Taylor-Davies, Wang Xudong, Yijie Shen, and Kun Liu

2022-03-14 Thread Jorge Cardoso Leitão
Congrats to all of you - well deserved! On Mon, Mar 14, 2022, 20:47 Bryan Cutler wrote: > Congrats to all! > > On Thu, Mar 10, 2022 at 12:11 AM Alenka Frim > wrote: > > > Congratulations all! > > > > On Thu, Mar 10, 2022 at 1:55 AM Yang hao <1371656737...@gmail.com> > wrote: > > > > > Congratul

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-08 Thread Jorge Cardoso Leitão
n the two reference Arrow implementations (C++ and > > >> Java). However, our implementation landscape is now much richer than > it > > >> used to be (for example, there is a tremendous activity on the Rust > > >> side). Do we want to keep the historical "

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-07 Thread Jorge Cardoso Leitão
+1 adding 32 and 64 bit decimals. +0 to release it without integration tests - both IPC and the C data interface use a variable bit width to declare the appropriate size for decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low risk from an integration perspective, as implementatio

Re: [JavaScript] Appending Nulls to a Struct (Bug)

2022-02-18 Thread Jorge Cardoso Leitão
A change in the length of an array is equivalent to a change in at least one of its buffers (i.e. length is always physical). * Primitive arrays (i32, i64, etc.): the array's length is equal to the length of the buffer in bytes divided by the size of the type (e.g. buffer.len() = 8 and i32 <=> length = 2). *
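The primitive-array rule above can be checked with a few lines of plain Python. This is only an illustrative sketch: `primitive_length` is a hypothetical helper, not part of any Arrow library, and it ignores padding and slicing offsets.

```python
import struct


def primitive_length(buffer: bytes, item_size: int) -> int:
    """Number of slots in a primitive array backed by `buffer` (hypothetical helper)."""
    if len(buffer) % item_size != 0:
        raise ValueError("buffer length is not a multiple of the item size")
    return len(buffer) // item_size


# Two little-endian i32 values occupy 8 bytes.
values_buffer = struct.pack("<2i", 10, 20)
length = primitive_length(values_buffer, item_size=4)
```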

Re: [JavaScript] Appending Nulls to a Struct (Bug)

2022-02-18 Thread Jorge Cardoso Leitão
Isn't field-0 representing ["joe", None, None, "mark"]? validity is "1001" and offsets [0,3,3,3,7]. My reading is that the values buffer is "joemark" because we do not represent values in null slots. Best, Jorge On Fri, Feb 18, 2022 at 7:07 PM Phillip Cloud wrote: > My read of the spec for s
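A small sketch of the layout described in this message, in plain Python. `decode_utf8` is a hypothetical helper, not an Arrow API. Note that an array with four slots carries five offsets, so the sketch uses [0, 3, 3, 3, 7]: the two null slots have zero-length ranges and contribute nothing to the values buffer.

```python
from typing import List, Optional


def decode_utf8(validity: List[bool], offsets: List[int], values: bytes) -> List[Optional[str]]:
    """Decode a variable-size utf8 layout into Python values (hypothetical helper)."""
    # An array of N slots has N + 1 offsets.
    assert len(offsets) == len(validity) + 1
    out = []
    for i, is_valid in enumerate(validity):
        if is_valid:
            out.append(values[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out


validity = [True, False, False, True]  # "1001"
offsets = [0, 3, 3, 3, 7]
values = b"joemark"
decoded = decode_utf8(validity, offsets, values)
```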

Re: [JavaScript] Appending Nulls to a Struct (Bug)

2022-02-18 Thread Jorge Cardoso Leitão
Hi Dominik, That is my understanding - if it exists, the length of the validity must equal the length of each field. Otherwise, it would be difficult to iterate over the fields and validity together, since we would not have enough rows in the fields for the validity. I think that this is broader

Re: [Discuss] Best practice for storing key-value metadata for Extension Types

2022-02-08 Thread Jorge Cardoso Leitão
Hi, Great questions and write up. Thanks! imo dragging a JSON reader and writer to read official extension types' metadata seems overkill. The c data interface is expected to be quite low level. Imo we should aim for a (non-human readable) binary format. For non-official, imo you are spot on - us

Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-02-08 Thread Jorge Cardoso Leitão
Note that we do not have tests on tensor arrays, so testing the extension type on these may be hindered by divergences between implementations. I do not think we even have json integration files for them. If the focus is extension types, maybe it would be best to cover types whose physical represe

Re: [ANNOUNCE] New Arrow PMC chair: Kouhei Sutou

2022-01-25 Thread Jorge Cardoso Leitão
Thank you so much for all your contributions to open source and to Apache Arrow in particular, and for accepting taking this role. On Tue, Jan 25, 2022 at 7:10 PM QP Hou wrote: > Congrats Kou, very well deserved. > > On Tue, Jan 25, 2022 at 9:53 AM Benson Muite > wrote: > > > > Congratulations

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2022-01-19 Thread Jorge Cardoso Leitão
de whether a pointer > > is correct and doesn't reveal unrelated data?). > > > > I think we should discuss this with the DuckDB folks (and possibly the > > Velox folks, but I guess that it might socio-politically more difficult) > > so as to measure how much of an i

Re: [RUST][DataFusion][Arrow] Switching DataFusion to use arrow2 implementation and the future of arrow

2022-01-19 Thread Jorge Cardoso Leitão
Hi, Thank you for raising this here and for your comments. I am very humbled by the feedback and adoption that arrow2 got so far. My current hypothesis is that arrow2 will be donated to Apache Arrow, I just don't feel comfortable and have the energy doing so right now. Thank you for your underst

Re: [VOTE][RUST] Release Apache Arrow Rust 7.0.0 RC1

2022-01-11 Thread Jorge Cardoso Leitão
+1 On Tue, Jan 11, 2022 at 9:17 PM QP Hou wrote: > +1 (non-binding) > > On Mon, Jan 10, 2022 at 3:14 PM Andy Grove wrote: > > > > +1 (binding) > > > > Thanks, > > > > Andy. > > > > On Sat, Jan 8, 2022 at 3:43 AM Andrew Lamb wrote: > > > > > Hi, > > > > > > I would like to propose a release of

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2022-01-08 Thread Jorge Cardoso Leitão
Fair enough (wrt deprecation). I think that the sequence view is a replacement for our existing (that allows O(N) selections), but I agree with the sentiment that preserving compatibility is more important than a single way of doing it. Thanks for that angle! Imo the Arrow format is already compo

Re: [DataFusion] Question about Accumulator API and maybe potential bugs

2022-01-03 Thread Jorge Cardoso Leitão
Hi, The accumulator API is designed to accept multiple columns (e.g. the pearson correlation takes 2 columns, not one). &values[0] corresponds to the first column passed to the accumulator. All concrete implementations of accumulators in DataFusion atm only accept one column (Sum, Avg, Count, Min,

Re: [VOTE][RUST] Release Apache Arrow Rust 6.5.0 RC1

2021-12-28 Thread Jorge Cardoso Leitão
+1 Thanks, Jorge On Fri, Dec 24, 2021 at 3:21 AM Wang Xudong wrote: > +1 (non-binding) > > Happy holidays > > --- > xudong > > Andy Grove 于2021年12月24日周五 09:19写道: > > > +1 (binding) > > > > Thanks, > > > > Andy. > > > > On Thu, Dec 23, 2021 at 2:26 PM Andrew Lamb > wrote: > > > > > Hi, > > >

Re: [ANNOUNCE] New Arrow PMC member: Daniël Heres

2021-12-21 Thread Jorge Cardoso Leitão
Congratulations!! On Tue, Dec 21, 2021 at 5:24 PM Andrew Lamb wrote: > Congratulations Daniël ! Well deserved > > On Tue, Dec 21, 2021 at 12:18 PM Wes McKinney wrote: > > > The Project Management Committee (PMC) for Apache Arrow has invited > > Daniël Heres to become a PMC member and we are ple

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-15 Thread Jorge Cardoso Leitão
Hi, Thanks a lot for this initiative and the write up. I did a small bench for the sequence view and added a graph to the document as evidence of what Wes is writing wrt the performance of "selection / take / filter". Big +1 on replacing our current representation of variable-sized arrays by the

Re: [VOTE][RUST] Release Apache Arrow Rust 6.4.0 RC1

2021-12-13 Thread Jorge Cardoso Leitão
+1 Thank you! On Fri, Dec 10, 2021 at 9:50 PM Andy Grove wrote: > +1 (binding) > > Thank you, Andrew, and everyone else involved in the input validation work. > This definitely helps address one of the biggest criticisms of the crate. > > Andy. > > On Fri, Dec 10, 2021 at 12:30 PM Andrew Lamb

Re: [ANNOUNCE] New Arrow committer: Rémi Dattai

2021-12-07 Thread Jorge Cardoso Leitão
Congrats! On Wed, Dec 8, 2021 at 8:14 AM Daniël Heres wrote: > Congrats Rémi! > > On Wed, Dec 8, 2021, 04:27 Ian Joiner wrote: > > > Congrats! > > > > On Tuesday, December 7, 2021, Wes McKinney wrote: > > > > > On behalf of the Arrow PMC, I'm happy to announce that Rémi Dattai has > > > accept

Re: [ANNOUNCE] New Arrow PMC member: Joris Van den Bossche

2021-11-17 Thread Jorge Cardoso Leitão
Congratulations! On Thu, Nov 18, 2021 at 3:34 AM Ian Joiner wrote: > Congrats Joris and really thanks for your effort in integrating ORC and > dataset! > > Ian > > > On Nov 17, 2021, at 5:55 PM, Wes McKinney wrote: > > > > The Project Management Committee (PMC) for Apache Arrow has invited > >

Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-17 Thread Jorge Cardoso Leitão
What are the tradeoffs between a small and a large row group size? Is it that a small value allows for quicker random access (as we can seek row groups based on the number of rows they have), while a larger value allows for higher dict-encoding and compression ratios? Best, Jorge On Wed, Nov 17

Re: Synergies with Apache Avro?

2021-11-16 Thread Jorge Cardoso Leitão
to eat into any > gains. There is also non-zero engineering effort to implement the necessary > filter/selection push down APIs that most of them provide. That being > said, I'd love to see real world ETL pipeline benchmarks :) > > > On Tue, Nov 2, 2021 at 4:39 AM Jorge Cardoso

Re: Question about Arrow Mutable/Immutable Arrays choice

2021-11-03 Thread Jorge Cardoso Leitão
I think the c data interface requires the arrays to be immutable or two implementations will race when mutating/reading the shared regions, since we have no mechanism to synchronize read/write access across the boundary. Best, Jorge On Wed, Nov 3, 2021 at 1:50 PM Alessandro Molina < alessan...@u

Re: Synergies with Apache Avro?

2021-11-02 Thread Jorge Cardoso Leitão
> Just an idea: Do the Avro libs support different allocators? Maybe > using > > a > > > different one (e.g. mimalloc) would yield more similar results by > working > > > around the fragmentation you described. > > > > > > This wouldn't change

Synergies with Apache Avro?

2021-10-31 Thread Jorge Cardoso Leitão
Hi, I am reporting back a conclusion that I recently arrived at when adding support for reading Avro to Arrow. Avro is a storage format that does not have an associated in-memory format. In Rust, the official implementation deserializes to an enum, in Python to a vector of Object, and I suspect in J

Re: [Discuss] Single offset per array has a non-trivial performance implication

2021-10-27 Thread Jorge Cardoso Leitão
ARROW-14453 [2] https://github.com/pola-rs/polars [3] https://h2oai.github.io/db-benchmark/ On Wed, Oct 27, 2021 at 7:57 PM Antoine Pitrou wrote: > > Le 26/10/2021 à 21:30, Jorge Cardoso Leitão a écrit : > > Hi, > > > > One aspect of the design of "arrow2" is that it d

[Discuss] Single offset per array has a non-trivial performance implication

2021-10-26 Thread Jorge Cardoso Leitão
Hi, One aspect of the design of "arrow2" is that it deals with array slices differently from the rest of the implementations. Essentially, the offset is not stored in ArrayData, but on each individual Buffer. Some important consequences are: * people can work with buffers and bitmaps without havin

Re: [ANNOUNCE] New Arrow committer: Jiayu Liu

2021-10-07 Thread Jorge Cardoso Leitão
Congratulations!!! :) On Thu, Oct 7, 2021, 21:58 Weston Pace wrote: > Congratulations Jiayu Liu! > > On Thu, Oct 7, 2021 at 8:02 AM Yijie Shen > wrote: > > > > Congratulations Jianyu > > > > > > Micah Kornfield 于2021年10月8日 周五上午12:29写道: > > > > > A little late, but welcome and thank you for your

Re: [Question] Allocations along 64 byte cache lines

2021-09-09 Thread Jorge Cardoso Leitão
ench.com/q/NxyDu89azmKJmiVxF29Ei8FybWk > > > On 9/7/21 7:16 PM, Jorge Cardoso Leitão wrote: > > Thanks, > > > > I think that the alignment requirement in IPC is different from this one: > > we enforce 8/64 byte alignment when serializing for IPC, but we (only) &

Re: [ANNOUNCE] New Arrow committer: Nic Crane

2021-09-09 Thread Jorge Cardoso Leitão
Congrats!! =) On Thu, Sep 9, 2021, 20:12 Micah Kornfield wrote: > Congrats! > > On Thursday, September 9, 2021, Weston Pace wrote: > > > Congratulations Nic! > > > > On Thu, Sep 9, 2021 at 7:43 AM Antoine Pitrou > wrote: > > > > > > > > > Welcome on board Nic! > > > > > > > > > On Thu, 9 Sep 2

Re: [Question] Allocations along 64 byte cache lines

2021-09-07 Thread Jorge Cardoso Leitão
ld be > > for operations on wider types (Decimal128 and Decimal256). Another > place > > where I think alignment could help is when adding two primitive arrays > (it > > sounds like this was summing a single array?). > > > > [1] > > > https://lists.apache.o

Re: [Question] Allocations along 64 byte cache lines

2021-09-06 Thread Jorge Cardoso Leitão
Thanks a lot Antoine for the pointers. Much appreciated! > Generally, it should not hurt to align allocations to 64 bytes anyway, > since you are generally dealing with large enough data that the > (small) memory overhead doesn't matter. > Not for performance. However, 64 byte alignment in Rust req

[Question] Allocations along 64 byte cache lines

2021-09-06 Thread Jorge Cardoso Leitão
Hi, We have a whole section related to byte alignment (https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding) recommending 64 byte alignment and referring to Intel's manual. Do we have evidence that this alignment helps (besides Intel's claims)? I am asking because going

Re: Set of primitive physical types

2021-09-02 Thread Jorge Cardoso Leitão
: > > > > > > Hi Jorge, > > > Are there places in the docs that you think this would simplify? > > > There is an old JIRA [1] about introducing a c-struct type that I > > > think aligns with this observation [1] > > > > > >

Re: C++ Determine Size of RecordBatch

2021-09-01 Thread Jorge Cardoso Leitão
note that that would be an upper bound because buffers can be shared between arrays. On Wed, Sep 1, 2021 at 2:15 PM Antoine Pitrou wrote: > On Tue, 31 Aug 2021 21:46:23 -0700 > Rares Vernica wrote: > > > > I'm storing RecordBatch objects in a local cache to improve performance. > I > > want to

Set of primitive physical types

2021-08-30 Thread Jorge Cardoso Leitão
Hi, Just came across this curiosity that IMO may help us to design physical types in the future. Not sure if this was mentioned before, but it seems to me that `DaysMilliseconds` and `MonthDayNano` belong to a broader class of physical types "typed tuples" in that they are constructed by defining

Re: [VOTE][RUST] Release Apache Arrow Rust 5.3.0 RC1

2021-08-30 Thread Jorge Cardoso Leitão
for the avoidance of doubt, +1 on the vote: release :) On Sat, Aug 28, 2021 at 12:12 PM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > +1 > > Thanks, Andrew! > > On Sat, Aug 28, 2021 at 12:10 PM Andrew Lamb wrote: > >> Update here: the issue is tha

Re: [VOTE][RUST] Release Apache Arrow Rust 5.3.0 RC1

2021-08-28 Thread Jorge Cardoso Leitão
+1 Thanks, Andrew! On Sat, Aug 28, 2021 at 12:10 PM Andrew Lamb wrote: > Update here: the issue is that we made a change that is not compatible with > older rust toolchains. > > The consensus on the PR seems to be that since arrow-rs doesn't offer any > explicit compatibility for older rust too

Re: [VOTE][Format] Clarify allowed value range for the Time types

2021-08-20 Thread Jorge Cardoso Leitão
+1 On Fri, Aug 20, 2021 at 2:43 PM David Li wrote: > +1 > > On Thu, Aug 19, 2021, at 18:33, Weston Pace wrote: > > +1 > > > > On Thu, Aug 19, 2021 at 9:18 AM Wes McKinney > wrote: > > > > > > +1 > > > > > > On Thu, Aug 19, 2021 at 6:20 PM Antoine Pitrou > wrote: > > > > > > > > > > > > Hello,

Re: [VOTE][Format] Add in a new interval type can combines Month, Days and Nanoseconds

2021-08-17 Thread Jorge Cardoso Leitão
+1 On Tue, Aug 17, 2021 at 8:50 PM Micah Kornfield wrote: > Hello, > As discussed previously [1], I'd like to call a vote to add a new interval > type which is a triple of Month, Days, and nanoseconds. The formal > definition is defined in a PR [2] along with Java and C++ implementations > that

Re: [VOTE][RUST][Datafusion] Release Apache Arrow Datafusion 5.0.0 RC3

2021-08-13 Thread Jorge Cardoso Leitão
+1 Great work everyone! On Fri, Aug 13, 2021, 22:19 Daniël Heres wrote: > +1 (non binding). Looking good. > > > On Fri, Aug 13, 2021, 07:49 QP Hou wrote: > > > Good call Ruihang. I remember we used to have this toolchain file when > > we were still in the main arrow repo. I will take a look i

Re: [Question] what is the purpose of the typeids in the UnionArray?

2021-08-13 Thread Jorge Cardoso Leitão
%3E > > -Micah > > On Fri, Aug 13, 2021 at 10:57 AM Keith Kraus > wrote: > > > How would using the typeid directly work with arbitrary Extension types? > > > > -Keith > > > > On Fri, Aug 13, 2021 at 12:49 PM Jorge Cardoso Leitão < > > jorgeca

[Question] what is the purpose of the typeids in the UnionArray?

2021-08-13 Thread Jorge Cardoso Leitão
Hi, In the UnionArray, there is a level of indirection between types (buffer of i8s) -> typeId (i8) -> field. For example, the generated_union part of our integration tests has the data: types: [5, 5, 5, 5, 7, 7, 7, 7, 5, 5, 7] (len = 11) typeids: [5, 7] fields: [int32, utf8] My understanding is
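The indirection described above can be written out in plain Python using the data quoted from the integration test. This is a sketch only; the variable names are illustrative and not taken from any Arrow implementation.

```python
# Data quoted from the generated_union integration file.
types = [5, 5, 5, 5, 7, 7, 7, 7, 5, 5, 7]
type_ids = [5, 7]
fields = ["int32", "utf8"]

# First hop: a type id maps to the index of a child field.
type_id_to_field_index = {type_id: index for index, type_id in enumerate(type_ids)}

# Second hop: every slot's type id resolves to the field that holds its value.
field_per_slot = [fields[type_id_to_field_index[t]] for t in types]
```

Without the typeids list, the types buffer would have to hold field indices directly, i.e. [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1] for the same data.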

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Jorge Cardoso Leitão
ions" in spawn_blocking or something equivalent. Best, Jorge On Thu, Aug 12, 2021 at 10:03 PM Phillip Cloud wrote: > On Thu, Aug 12, 2021 at 1:03 PM Jorge Cardoso Leitão < > jorgecarlei...@gmail.com> wrote: > > > I agree with Antoine that we should weigh the pros and cons

Re: [Rust] Integration tests for recursive nested data?

2021-08-12 Thread Jorge Cardoso Leitão
Hi, The checkout of arrow-rs on the failed build is at fa5acd971c97, which until about 3 hours ago was master, so I think it is picking the right code. I did a quick investigation: * The integration tests on arrow-rs have not been running since June 30th. They stopped running after the merge of [1

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Jorge Cardoso Leitão
I agree with Antoine that we should weigh the pros and cons of flatbuffers (or protobuf or thrift for that matter) over a more human-friendly, simpler, format like json or MsgPack. I also struggle a bit to reason with the complexity of using flatbuffers for this. E.g. there is no async support for

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-11 Thread Jorge Cardoso Leitão
Couple of questions 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the operation "(IR,data) - engine -> result" MUST be the same for all "engine"? 2. if yes, imo we may need to worry about: * a definition of equality that implementations agree on. * agreement over what the sem

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-06 Thread Jorge Cardoso Leitão
> Thanks, > > QP > > > > > > > > On Tue, Aug 3, 2021 at 5:31 PM paddy horan > wrote: > > > > > > Hi Jorge, > > > > > > I see value in consolidating development in a single repo and releasing > > under the existing arrow crate. Re

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-03 Thread Jorge Cardoso Leitão
ting Arrow > community that Arrow2 is the future but that it is <1.0 > - existing users will be well supported in this transition > - In general, I think the longer that development proceeds in separate > repos the harder it will be to eventually merge the two in a way that >

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-02 Thread Jorge Cardoso Leitão
Hi, Sorry for the delay. If there is a path towards an official release under a <1.0.0 versioning schema aligned with the rest of the Rust ecosystem and in line with the stability of the API, then IMO we should move all development to within Apache experimental asap (I can handle this and the lik

Re: [ANNOUNCE] New Arrow PMC member: Neville Dipale

2021-07-29 Thread Jorge Cardoso Leitão
Congratulations, Neville :) On Fri, Jul 30, 2021 at 8:18 AM QP Hou wrote: > Well deserved, congratulations Neville! > > On Thu, Jul 29, 2021 at 3:20 PM Wes McKinney wrote: > > > > The Project Management Committee (PMC) for Apache Arrow has invited > > Neville Dipale to become a PMC member and w

Re: [ANNOUNCE] New Arrow committer: QP Hou

2021-07-26 Thread Jorge Cardoso Leitão
Congratulations and thank you for all the great work! It is a pleasure to work with you. Best, Jorge On Mon, Jul 26, 2021 at 7:38 PM Niranda Perera wrote: > Congrats QP! :-) > > On Mon, Jul 26, 2021 at 1:24 PM Micah Kornfield > wrote: > > > Congrats QP! > > > > On Mon, Jul 26, 2021 at 10:02 A

[DISCUSS] Release Python Datafusion 0.3.0

2021-07-20 Thread Jorge Cardoso Leitão
Hi, I would like to gauge your interest in a release of the Python bindings for DataFusion. There has been a tremendous amount of updates to it, including support for Python 3.9. This release is backward compatible and there are no blockers. This would be the first time a release of this is cut

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-07-20 Thread Jorge Cardoso Leitão
> and arrow 6.0.0 (or other future versions) is perfectly compatible with > semantic versioning and other software projects. > > Andrew > > On Mon, Jul 19, 2021 at 2:08 AM Jorge Cardoso Leitão < > jorgecarlei...@gmail.com> wrote: > > > Hi, > > > > W

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-07-18 Thread Jorge Cardoso Leitão
> > I think this approach wouldn't result in extra work (backporting the > > important changes to 5.1,5.2 release). It only shows the magnitude of > this > > change, the work would be done by you anyways, this would just make it > > clear this is a huge effort. >

[Discuss] [Rust] Arrow2/parquet2 going foward

2021-07-17 Thread Jorge Cardoso Leitão
Hi, Arrow2 and parquet2 have passed the IP clearance vote and are ready to be merged to apache/* repos. My plan is to merge them and PR to both of them to the latest updates on my own repo, so that I can temporarily (and hopefully permanently) archive the versions of my account and move developme

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

2021-07-13 Thread Jorge Cardoso Leitão
o gene...@incubator.apache.org like > > > https://lists.apache.org/thread.html/r319b85f0f24f9b0529865387ccfe1b2a00a16f394a48144ba25c3225%40%3Cgeneral.incubator.apache.org%3E > > On Sat, Jul 10, 2021 at 7:48 AM Jorge Cardoso Leitão > > wrote: > > > > Thanks a lot Wes,

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

2021-07-09 Thread Jorge Cardoso Leitão
, Jul 5, 2021 at 10:38 AM Wes McKinney wrote: > Great, thanks for the update and pushing this forward. Let us know if > you need help with anything. > > On Sun, Jul 4, 2021 at 8:26 PM Jorge Cardoso Leitão > wrote: > > > > Hi, > > > > Wes and Neils, > >

Re: [Rust] Eliminate Timezone field from Timestamp types?

2021-07-07 Thread Jorge Cardoso Leitão
Hi, AFAIK timezone is part of the spec. In Python, that would be [1] import pyarrow as pa dt1 = pa.timestamp("ms", "+00:10") dt2 = pa.timestamp("ms") arrow-rs is not very consistent with how it handles it. imo that is an artifact of being currently difficult (API wise) to create an array with a

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

2021-07-04 Thread Jorge Cardoso Leitão
al-rs-parquet2/pull/1 On Mon, Jun 7, 2021 at 11:55 PM Wes McKinney wrote: > On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão > wrote: > > > > Hi, > > > > Thanks a lot for your feedback. I agree with all the arguments put > forward, > > including Andrew

[RESULT] [VOTE] Donation of rust arrow2 and parquet2

2021-07-02 Thread Jorge Cardoso Leitão
With 10 +1, 3 +1 non-binding, and no 0 nor -1, the vote passed. Thank you all for your participation and for this clarification. I will start work with the incubator for the IP clearance. Best, Jorge

Re: [Discuss] Consider renaming "Arrow" in HO2 benchmarks?

2021-07-01 Thread Jorge Cardoso Leitão
PR to the benchmark repo that clarifies that > it's executing the query using the arrow R/C++ library, when in fact > the query is actually primarily handled by dplyr and not Arrow at all. > The benchmark is very misleading in its current form. > > On Fri, Jun 25, 2021 at 11:55

Re: Improving PR workload management for Arrow maintainers

2021-06-29 Thread Jorge Cardoso Leitão
I just had a quick chat over the ASF's Slack with Daniel Gruno from the infra team and they are rolling out the "triage role" [1] for non-committers, which AFAIK offers useful tools in this context: * add/remove labels * assign reviewers * mark duplicates * close, open and assign issues and PRs

[VOTE] Donation of rust arrow2 and parquet2

2021-06-26 Thread Jorge Cardoso Leitão
Hi, I would like to bring to this mailing list a proposal to donate the source code of arrow2 [1] and parquet2 [2] as experimental repositories [3] within Apache Arrow, conditional on IP clearance. The specific PRs are: * https://github.com/apache/arrow-experimental-rs-arrow2/pull/1 * https://gi

Re: [VOTE][RUST] Release Apache Arrow Rust 4.4.0 RC1

2021-06-25 Thread Jorge Cardoso Leitão
+1 Ran verification script on Apple intel. On Fri, Jun 25, 2021 at 12:16 AM Andrew Lamb wrote: > Hi, > > I would like to propose a release of Apache Arrow Rust Implementation, > version 4.4.0. > > This release candidate is based on commit: > 32b835e5bee228d8a52015190596f4c33765849a [1] > > The

Re: [VOTE] Clarify meaning of timestamp without time zone to equal the concept of "LocalDateTime"

2021-06-25 Thread Jorge Cardoso Leitão
+1 On Fri, Jun 25, 2021 at 7:47 PM Julian Hyde wrote: > +1 > > > On Jun 25, 2021, at 10:36 AM, Antoine Pitrou wrote: > > > > > > Le 24/06/2021 à 21:16, Weston Pace a écrit : > >> The discussion in [1] led to the following proposal which I would like > >> to submit for a vote. > >> --- > >> Arro

[Discuss] Consider renaming "Arrow" in HO2 benchmarks?

2021-06-25 Thread Jorge Cardoso Leitão
Hi, HO2 has a set of benchmarks comparing different query engines [1]. There is currently an implementation named "Arrow", backed by the Arrow R implementation [2]. This is one of the least performant implementations evaluated. I sense that this may negatively affect the Arrow format, as people

Re: [PAPER] Selection vectors and bitmaps for filter results

2021-06-22 Thread Jorge Cardoso Leitão
Thank you for sharing, Wes, an interesting paper indeed. In Rust we currently use a different strategy. We build an iterator over ranges [a_i, b_i[ to be selected from the filter bitmap, and filter the array based on those ranges. For a single filter, the ranges are iterated as they are being bui
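A minimal sketch of the range-based strategy described in this message, in Python rather than Rust for brevity. The function names `selected_ranges` and `filter_by_ranges` are hypothetical, and a plain list of booleans stands in for the filter bitmap; this is an illustration of the idea, not the arrow-rs code.

```python
from typing import Iterator, List, Sequence, Tuple


def selected_ranges(mask: Sequence[bool]) -> Iterator[Tuple[int, int]]:
    """Yield half-open ranges [start, end) covering each run of True entries."""
    start = None
    for i, keep in enumerate(mask):
        if keep and start is None:
            start = i  # a new run of selected slots begins here
        elif not keep and start is not None:
            yield (start, i)  # the run ended just before slot i
            start = None
    if start is not None:
        yield (start, len(mask))  # the last run reaches the end of the mask


def filter_by_ranges(values: Sequence, mask: Sequence[bool]) -> List:
    """Copy every selected run as one slice instead of slot by slot."""
    out: List = []
    for start, end in selected_ranges(mask):
        out.extend(values[start:end])
    return out
```

Because `selected_ranges` is a generator, the ranges are consumed while they are still being produced, so a single filter pass does not need to materialize them first.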

[ANNOUNCE] Apache 4.0.1 released

2021-06-21 Thread Jorge Cardoso Leitão
The Apache Arrow team is pleased to announce the 4.0.1 release. This release covers general bug fixes on the different implementations, notably C++, R, Python and JavaScript. The list is available [1], with the list of contributors [2] and changelog [3]. As usual, see the install page [4] for inst

[Rust] experimental parquet2 repo

2021-06-19 Thread Jorge Cardoso Leitão
Hi, I have created a new experimental repo [1] to lay foundations for the parquet2 work within Arrow. The way I proceeded so far: 1. pushed arrow-rs master to it (from commit 9f56a) 2. removed all arrow-related code and committed 3. removed all (rm -rf *) and committed 4. PRed parquet2, rebased

Re: post-release tasks (4.0.1)

2021-06-18 Thread Jorge Cardoso Leitão
5:57 AM Jorge Cardoso Leitão wrote: > > Thanks a lot, Krisztian. > > The JS packages are still missing. I already have access to npm (thanks > Sutou). As part of the npm-release.sh in 4.0.1, we require all tests to pass > [1]. However, there are tests failing on my computer [2]

Re: Future of Rust sync call

2021-06-18 Thread Jorge Cardoso Leitão
tion and quick questions but > not a place to make decisions. > > Thanks, > Wes > > On Fri, Jun 18, 2021 at 12:14 AM Jorge Cardoso Leitão > wrote: > > > > Hi, > > > > I agree that the communication improved a lot with moving the issues > > to Gi

[Question] Rational for offsets instead of deltas

2021-06-17 Thread Jorge Cardoso Leitão
Hi, (this has no direction; I am just genuinely curious) I am wondering, what is the rationale for using "offsets" instead of "lengths" to represent variable-sized arrays? I.e. ["a", "", None, "ab"] is represented as offsets: [0, 1, 1, 1, 3] values: "aab" what is the reasoning to use this over le
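To make the two encodings in the question concrete, here is a small Python sketch using the same example. The `validity` list and all function names are assumptions added for illustration. It shows one property of offsets, constant-time access to slot i; it is not meant as the thread's answer.

```python
from itertools import accumulate
from typing import List, Optional

# The example from the message: ["a", "", None, "ab"]
values = "aab"
offsets = [0, 1, 1, 1, 3]             # one more entry than there are slots
lengths = [1, 0, 0, 2]                # the alternative encoding asked about
validity = [True, True, False, True]  # assumed stand-in for the null bitmap


def get_with_offsets(i: int) -> Optional[str]:
    """Slot i is values[offsets[i]:offsets[i + 1]]: two lookups, no scan."""
    if not validity[i]:
        return None
    return values[offsets[i]:offsets[i + 1]]


def get_with_lengths(i: int) -> Optional[str]:
    """With lengths, the start of slot i is the sum of all earlier lengths."""
    if not validity[i]:
        return None
    start = sum(lengths[:i])  # linear in i
    return values[start:start + lengths[i]]


def lengths_to_offsets(lens: List[int]) -> List[int]:
    """Offsets are the running sum of lengths, starting at 0."""
    return [0, *accumulate(lens)]
```

Both encodings carry the same information; `lengths_to_offsets(lengths)` reproduces the offsets above.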

Re: Future of Rust sync call

2021-06-17 Thread Jorge Cardoso Leitão
Hi, I agree that the communication improved a lot with moving the issues to Github and slack, which made the sync call less relevant. Best, Jorge On Thu, Jun 17, 2021 at 11:55 PM Andrew Lamb wrote: > > I think dropping back from the Rust sync call and using the regular Arrow > Sync call should

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-17 Thread Jorge Cardoso Leitão
Thank you everyone for participating so far; really important and useful discussion. I think of this discussion as a set of test cases over behavior: parameterization: * Timestamp(ms, None) * Timestamp(ms, "00:00") * Timestamp(ms, "01:00") Cases: * its string representation equals to * add a dur

Re: post-release tasks (4.0.1)

2021-06-11 Thread Jorge Cardoso Leitão
ase.sh#L23 [2] https://issues.apache.org/jira/browse/ARROW-13046 Best, Jorge On Thu, Jun 10, 2021 at 1:15 PM Krisztián Szűcs wrote: > On Thu, Jun 10, 2021 at 6:57 AM Jorge Cardoso Leitão > wrote: > > > > I have been unable to generate the docs from any of my two machines (my > &

Re: Complex Number support in Arrow

2021-06-10 Thread Jorge Cardoso Leitão
Isn't an array of complexes represented by what arrow already supports? In particular, I see at least two valid in-memory representations to use, that depend on what we are going to do with it: * Struct[re, im] * FixedList[2] In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, ...]
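A rough Python sketch of the two layouts named in this message, with plain lists standing in for Arrow buffers. The function names are hypothetical; the point is only where the real and imaginary parts end up in memory.

```python
from typing import List, Tuple


def to_struct_layout(zs: List[complex]) -> Tuple[List[float], List[float]]:
    """Struct[re, im]: two separate buffers, [x0, x1, ...] and [y0, y1, ...]."""
    re = [z.real for z in zs]
    im = [z.imag for z in zs]
    return re, im


def to_fixed_list_layout(zs: List[complex]) -> List[float]:
    """FixedList[2]: one interleaved buffer, [x0, y0, x1, y1, ...]."""
    buf: List[float] = []
    for z in zs:
        buf.extend((z.real, z.imag))
    return buf


def get_from_struct(re: List[float], im: List[float], i: int) -> complex:
    """Element i reads one value from each of the two buffers."""
    return complex(re[i], im[i])


def get_from_fixed_list(buf: List[float], i: int) -> complex:
    """Element i reads two adjacent values from the single buffer."""
    return complex(buf[2 * i], buf[2 * i + 1])
```

The struct layout keeps each component contiguous, while the fixed-size list keeps each complex number contiguous; which one is preferable depends on the operation, as the message says.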

Re: Delta Lake support for DataFusion

2021-06-10 Thread Jorge Cardoso Leitão
Hi, I agree with all of you. ^_^ I created https://github.com/apache/arrow-datafusion/issues/533 to track this. I tried to encapsulate the three main use-cases for the SQL extension. Feel free to edit at will. Best, Jorge On Thu, Jun 10, 2021 at 8:37 AM QP Hou wrote: > Thanks Daniël for st

Re: post-release tasks (4.0.1)

2021-06-09 Thread Jorge Cardoso Leitão
4.0.1". Best, Jorge On Sun, Jun 6, 2021 at 6:39 AM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > Hi, > > Sorry for the delay on this, but it is not being easy to build the docs > [1-5], which is why this is taking some time. It seems that our CI is > cac

Re: Delta Lake support for DataFusion

2021-06-09 Thread Jorge Cardoso Leitão
Hi, Some questions that come to mind: 1. If we add vendor X to datafusion, will we be open to other vendor Y? How do we compare vendors? How do we draw the line of "not sufficiently relevant"? 2. How do we ensure that we do not distort the same level playing field that some people expect from Dat

Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

2021-06-08 Thread Jorge Cardoso Leitão
Semantically, a NaN is defined according to IEEE 754 for floating-point numbers, while a null represents a value that is undefined, unknown, etc. An important set of problems that arrow solves is that it has a native representation for null values (independent of NaNs): arrow's in-memory mod
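The distinction can be sketched in a few lines of Python. Here a list of booleans stands in for the validity bitmap, and the helper names are hypothetical; this is only meant to show that a NaN is a stored value while a null is the absence of one.

```python
import math
from typing import Sequence


def null_count(validity: Sequence[bool]) -> int:
    """A slot is null when its validity entry is False, whatever it stores."""
    return sum(1 for ok in validity if not ok)


def nan_count(values: Sequence[float], validity: Sequence[bool]) -> int:
    """A NaN is an ordinary, valid floating-point value; null slots are skipped."""
    return sum(1 for x, ok in zip(values, validity) if ok and math.isnan(x))


# Slot 1 holds a NaN (valid). Slot 2 is null; its stored 0.0 is a placeholder.
values = [1.0, math.nan, 0.0, 4.0]
validity = [True, True, False, True]
```

Since the validity information lives outside the values, the same mechanism works for types that have no NaN at all, such as integers or strings.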
