Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Jorge Cardoso Leitão
Hi,

So, something like a human- and computer-readable standard for Arrow
schemas, e.g. via YAML or a JSON schema.

We kind of do this in our integration tests / golden tests, where we have
a non-official JSON representation of an Arrow schema.

The ask here is to standardize such a format in some way.

Imo that makes sense.
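
For illustration, the machine-readable half of this already round-trips today
via the IPC Schema message (or an empty IPC stream); a minimal pyarrow sketch,
with field names borrowed from the example below:

import pyarrow as pa

# A schema like the one described below: four non-nullable float32 fields.
schema = pa.schema([
    pa.field("x", pa.float32(), nullable=False),
    pa.field("y", pa.float32(), nullable=False),
    pa.field("z", pa.float32(), nullable=False),
    pa.field("w", pa.float32(), nullable=False),
])

# Serialize the schema as an IPC Schema message and read it back; any
# implementation that speaks IPC can consume the same bytes.
buf = schema.serialize()
assert pa.ipc.read_schema(buf).equals(schema)

The missing piece is a standardized, human-readable spelling of the same thing.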

Best,
Jorge

On Mon, Jul 8, 2024, 20:06 Jeremy Leibs  wrote:

> That handles questions of machine-to-machine coordination, and lets me do
> things like validation, but it doesn't address questions of the kind of
> user-facing API documentation someone would need to practically form and/or
> process data when integrating a library into their code.
>
> I want to be able to document for a user the equivalent of: "The API
> contract of this interface is that you must submit an arrow payload that is
> a StructArray with fields x, y, z, w, each of which must be non-nullable
> floats." But I want to be able to do it in a way that is concise/formal. Right
> now I basically have to say something like:
>
> If you're a rust user, make sure your payload adheres to the following
> datatype:
>
> arrow2::datatypes::DataType::Struct(Arc::new(vec![
> arrow2::datatypes::Field::new("x", Float32, false),
> arrow2::datatypes::Field::new("y", Float32, false),
> arrow2::datatypes::Field::new("z", Float32, false),
> arrow2::datatypes::Field::new("w", Float32, false),
> ]))
>
> If you're a python user, make sure your payload adheres to the following
> datatype:
>
> pa.struct([
> pa.field("x", pa.float32(), nullable=False, metadata={}),
> pa.field("y", pa.float32(), nullable=False, metadata={}),
> pa.field("z", pa.float32(), nullable=False, metadata={}),
> pa.field("w", pa.float32(), nullable=False, metadata={}),
> ]),
>
> I'd like to just write that once in a way that any user can easily map into
> their own code and arrow library.
>
> On Mon, Jul 8, 2024 at 12:42 PM Weston Pace  wrote:
>
> > +1 for empty stream/file as schema serialization.  I have used this
> > approach myself on more than one occasion and it works well.  It can even
> > be useful for transmitting schemas between different arrow-native
> libraries
> > in the same language (e.g. rust->rust) since it allows the different
> > libraries to use different arrow versions.
> >
> > There is one other approach if you only need intra-process serialization
> > (e.g. between threads / libraries in the same process).  You can use the
> C
> > data interface (https://arrow.apache.org/docs/format/CDataInterface.html
> ).
> > It is maybe a slightly more complex API (because of the release callback)
> > and I think it is unlikely to be significantly faster (unless you have an
> > abnormally large schema).  However, it has the same advantages and might
> be
> > useful if you are already using the C data interface elsewhere.
> >
> >
> > On Mon, Jul 8, 2024 at 8:27 AM Matt Topol 
> wrote:
> >
> > > Hey Jeremy,
> > >
> > > Currently the first message of an IPC stream is a Schema message which
> > > consists solely of a flatbuffer message and defined in the Schema.fbs
> > file
> > > of the Arrow repo. All of the libraries that can read Arrow IPC should
> be
> > > able to also handle converting a single IPC schema message back into an
> > > Arrow schema without issue. Would that be sufficient for you?
> > >
> > > On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs  wrote:
> > >
> > > > I'm looking for any advice folks may have on a generic way to
> document
> > > and
> > > > represent expected arrow schemas as part of an interface definition.
> > > >
> > > > For context, our library provides a cross-language (python, c++,
> rust)
> > > SDK
> > > > for logging semantic multi-modal data (point clouds, images,
> geometric
> > > > transforms, bounding boxes, etc.). Each of these primitive types has
> an
> > > > associated arrow schema, but to date we have largely abstracted that
> > from
> > > > our users through language-native object types, and a bunch of
> > generated
> > > > code to "serialize" stuff into the arrow buffers before transmitting
> > via
> > > > our IPC.
> > > >
> > > > We're trying to take steps in the direction of making it easier for
> > > > advanced users to write and read data from the store directly using
> > > arrow,
> > > > without needing to go in-and-out of an intermediate object-oriented
> > > > representation. However, doing this means documenting to users, for
> > > > example: "This is the arrow schema to use when sending a point cloud
> > > with a
> > > > color channel".
> > > >
> > > > I would love it if, eventually, the arrow project had a way of
> > defining a
> > > > spec file similar to a .proto or a .fbs, with all libraries
> supporting
> > > > loading of a schema object by directly parsing the spec. Has anyone
> > taken
> > > > steps in this direction?
> > > >
> > > > The best alternative I have at the moment is to redundantly define
> the
> > > > schema for eac

Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?

2024-06-05 Thread Jorge Cardoso Leitão
Hi

This is C++-specific, but imo the question applies more broadly.

I understood that the rationale for stats in compressed+encoded formats
like Parquet is that computing those stats has a high cost (IO + decompress
+ decode + aggregate). This motivates materializing the aggregates.

In Arrow the data is already in an in-memory format (e.g. IPC+mmap, or on
the heap), so the cost is smaller (just the aggregation).

It could be useful to quantify how much is being saved vs how much
complexity is being added to the format + implementations.
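
For a sense of scale of the "aggregate" step, these statistics are already a
single pass over in-memory data with the existing compute kernels - a small
sketch with made-up values:

import pyarrow as pa
import pyarrow.compute as pc

# Statistics on already-decoded, in-memory data: no IO, decompression or
# decoding involved, only the aggregation pass.
arr = pa.array([3, 1, 4, 1, 5, None, 9])
print(pc.min_max(arr).as_py())  # {'min': 1, 'max': 9}
print(pc.count(arr).as_py())    # 6 non-null values
print(arr.null_count)           # 1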

Best,
Jorge


On Thu, Jun 6, 2024, 07:55 Micah Kornfield  wrote:

> Generally I think this is a good idea that has been proposed before but I
> don't think we could ever make progress on design.
>
> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei  wrote:
>
> > Hi,
> >
> > Related GitHub issue:
> > https://github.com/apache/arrow/issues/41909
> >
> > How about adding arrow::ArrayStatistics?
> >
> > Motivation:
> >
> > An Apache Arrow format data doesn't have statistics. (We can
> > add statistics as metadata but there isn't any standard way
> > for it.)
> >
> > But a source of an Apache Arrow format data such as Apache
> > Parquet format data may have statistics. We can get the
> > source statistics via source reader such as
> > parquet::ColumnChunkMetaData::statistics() but can't get
> > them from read Apache Arrow format data. If we want to use
> > the source statistics, we need to keep the source reader.
> >
> > Proposal:
> >
> > How about adding arrow::ArrayStatistics or something and
> > attaching source statistics to read arrow::Array? If source
> > statistics are attached to read arrow::Array, we don't need
> > to keep a source reader to get source statistics.
> >
> > What do you think about this idea?
> >
> >
> > NOTE: I haven't thought about the arrow::ArrayStatistics
> > details yet. We'll be able to use parquet::Statistics and
> > its family as a reference.
> > https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
> >
> >
> > Thanks,
> > --
> > kou
> >
>


Re: [VOTE] Move Arrow DataFusion Subproject to new Top Level Apache Project

2024-03-01 Thread Jorge Cardoso Leitão
+1 - great work!!!

On Fri, Mar 1, 2024 at 5:49 PM Micah Kornfield 
wrote:

> +1 (binding)
>
> On Friday, March 1, 2024, Uwe L. Korn  wrote:
>
> > +1 (binding)
> >
> > On Fri, Mar 1, 2024, at 2:37 PM, Andy Grove wrote:
> > > +1 (binding)
> > >
> > > On Fri, Mar 1, 2024 at 6:20 AM Weston Pace 
> > wrote:
> > >
> > >> +1 (binding)
> > >>
> > >> On Fri, Mar 1, 2024 at 3:33 AM Andrew Lamb 
> > wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > As we have discussed[1][2] I would like to vote on the proposal to
> > >> > create a new Apache Top Level Project for DataFusion. The text of
> the
> > >> > proposed resolution and background document is copy/pasted below
> > >> >
> > >> > If the community is in favor of this, we plan to submit the
> resolution
> > >> > to the ASF board for approval with the next Arrow report (for the
> > >> > April 2024 board meeting).
> > >> >
> > >> > The vote will be open for at least 7 days.
> > >> >
> > >> > [ ] +1 Accept this Proposal
> > >> > [ ] +0
> > >> > [ ] -1 Do not accept this proposal because...
> > >> >
> > >> > Andrew
> > >> >
> > >> > [1]
> https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
> > >> > [2] https://github.com/apache/arrow-datafusion/discussions/6475
> > >> >
> > >> > -- Proposed Resolution -
> > >> >
> > >> > Resolution to Create the Apache DataFusion Project from the Apache
> > >> > Arrow DataFusion Sub Project
> > >> >
> > >> > =
> > >> >
> > >> > X. Establish the Apache DataFusion Project
> > >> >
> > >> > WHEREAS, the Board of Directors deems it to be in the best
> > >> > interests of the Foundation and consistent with the
> > >> > Foundation's purpose to establish a Project Management
> > >> > Committee charged with the creation and maintenance of
> > >> > open-source software related to an extensible query engine
> > >> > for distribution at no charge to the public.
> > >> >
> > >> > NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> > >> > Committee (PMC), to be known as the "Apache DataFusion Project",
> > >> > be and hereby is established pursuant to Bylaws of the
> > >> > Foundation; and be it further
> > >> >
> > >> > RESOLVED, that the Apache DataFusion Project be and hereby is
> > >> > responsible for the creation and maintenance of software
> > >> > related to an extensible query engine; and be it further
> > >> >
> > >> > RESOLVED, that the office of "Vice President, Apache DataFusion" be
> > >> > and hereby is created, the person holding such office to
> > >> > serve at the direction of the Board of Directors as the chair
> > >> > of the Apache DataFusion Project, and to have primary responsibility
> > >> > for management of the projects within the scope of
> > >> > responsibility of the Apache DataFusion Project; and be it further
> > >> >
> > >> > RESOLVED, that the persons listed immediately below be and
> > >> > hereby are appointed to serve as the initial members of the
> > >> > Apache DataFusion Project:
> > >> >
> > >> > * Andy Grove (agr...@apache.org)
> > >> > * Andrew Lamb (al...@apache.org)
> > >> > * Daniël Heres (dhe...@apache.org)
> > >> > * Jie Wen (jake...@apache.org)
> > >> > * Kun Liu (liu...@apache.org)
> > >> > * Liang-Chi Hsieh (vii...@apache.org)
> > >> > * Qingping Hou: (ho...@apache.org)
> > >> > * Wes McKinney(w...@apache.org)
> > >> > * Will Jones (wjones...@apache.org)
> > >> >
> > >> > RESOLVED, that the Apache DataFusion Project be and hereby
> > >> > is tasked with the migration and rationalization of the Apache
> > >> > Arrow DataFusion sub-project; and be it further
> > >> >
> > >> > RESOLVED, that all responsibilities pertaining to the Apache
> > >> > Arrow DataFusion sub-project encumbered upon the
> > >> > Apache Arrow Project are hereafter discharged.
> > >> >
> > >> > NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrew Lamb
> > >> > be appointed to the office of Vice President, Apache DataFusion, to
> > >> > serve in accordance with and subject to the direction of the
> > >> > Board of Directors and the Bylaws of the Foundation until
> > >> > death, resignation, retirement, removal or disqualification,
> > >> > or until a successor is appointed.
> > >> > =
> > >> >
> > >> >
> > >> > ---
> > >> >
> > >> >
> > >> > Summary:
> > >> >
> > >> > We propose creating a new top level project, Apache DataFusion, from
> > >> > an existing sub project of Apache Arrow to facilitate additional
> > >> > community and project growth.
> > >> >
> > >> > Abstract
> > >> >
> > >> > Apache Arrow DataFusion[1]  is a very fast, extensible query engine
> > >> > for building high-quality data-centric systems in Rust, using the
> > >> > Apache Arrow in-memory format. DataFusion offers SQL and Dataframe
> > >> > APIs, excellent performance, built-in support for CSV, Parquet,
> JSON,
> > >> > and Avro, extensive customization, and a great community.
> > >> >
> > >> > [1] htt

Re: [VOTE] Accept donation of Comet Spark native engine

2024-01-27 Thread Jorge Cardoso Leitão
+1

On Sun, 28 Jan 2024, 00:00 Wes McKinney,  wrote:

> +1 (binding)
>
> On Sat, Jan 27, 2024 at 12:26 PM Micah Kornfield 
> wrote:
>
> > +1 Binding
> >
> > On Sat, Jan 27, 2024 at 10:21 AM David Li  wrote:
> >
> > > +1 (binding)
> > >
> > > On Sat, Jan 27, 2024, at 13:03, L. C. Hsieh wrote:
> > > > +1 (binding)
> > > >
> > > > On Sat, Jan 27, 2024 at 8:10 AM Andrew Lamb 
> > > wrote:
> > > >>
> > > >> +1 (binding)
> > > >>
> > > >> This is super exciting
> > > >>
> > > >> On Sat, Jan 27, 2024 at 11:00 AM Daniël Heres <
> danielhe...@gmail.com>
> > > wrote:
> > > >>
> > > >> > +1 (binding). Awesome addition to the DataFusion ecosystem!!!
> > > >> >
> > > >> > Daniël
> > > >> >
> > > >> >
> > > >> > On Sat, Jan 27, 2024, 16:57 vin jake 
> wrote:
> > > >> >
> > > >> > > +1 (binding)
> > > >> > >
> > > >> > > > Andy Grove  wrote on Saturday, Jan 27, 2024 at 11:43 PM:
> > > >> > >
> > > >> > > > Hello,
> > > >> > > >
> > > >> > > > This vote is to determine if the Arrow PMC is in favor of
> > > accepting the
> > > >> > > > donation of Comet (a Spark native engine that is powered by
> > > DataFusion
> > > >> > > and
> > > >> > > > the Rust implementation of Arrow).
> > > >> > > >
> > > >> > > > The donation was previously discussed on the mailing list [1].
> > > >> > > >
> > > >> > > > The proposed donation is at [2].
> > > >> > > >
> > > >> > > > The Arrow PMC will start the IP clearance process if the vote
> > > passes.
> > > >> > > There
> > > >> > > > is a Google document [3] where the community is working on the
> > > draft
> > > >> > > > contents for the IP clearance form.
> > > >> > > >
> > > >> > > > The vote will be open for at least 72 hours.
> > > >> > > >
> > > >> > > > [ ] +1 : Accept the donation
> > > >> > > > [ ] 0 : No opinion
> > > >> > > > [ ] -1 : Reject donation because...
> > > >> > > >
> > > >> > > > My vote: +1
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > >
> > > >> > > > Andy.
> > > >> > > >
> > > >> > > >
> > > >> > > > [1]
> > > https://lists.apache.org/thread/0q1rb11jtpopc7vt1ffdzro0omblsh0s
> > > >> > > > [2] https://github.com/apache/arrow-datafusion-comet/pull/1
> > > >> > > > [3]
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > >
> >
> https://docs.google.com/document/d/1azmxE1LERNUdnpzqDO5ortKTsPKrhNgQC4oZSmXa8x4/edit?usp=sharing
> > > >> > > >
> > > >> > >
> > > >> >
> > >
> >
>


Re: [VOTE] Add RLE Arrays to Arrow Format

2022-12-19 Thread Jorge Cardoso Leitão
+1

Thanks a lot for all this. Really exciting!!
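
For anyone catching up on the thread, a minimal sketch of the layout being
voted on, using pyarrow kernels from after the feature landed (values are
made up):

import pyarrow as pa
import pyarrow.compute as pc

# Encode a plain array into the run-end-encoded layout and decode it back.
plain = pa.array(["a", "a", "a", "b", "b", "c"])
ree = pc.run_end_encode(plain)

# Run *ends* are cumulative positions (3, 5, 6) rather than run lengths
# (3, 2, 1); storing ends is what enables O(log N) random access.
print(ree.type)      # run-end-encoded type with int32 run ends, string values
print(ree.run_ends)  # [3, 5, 6]
print(ree.values)    # ["a", "b", "c"]
assert pc.run_end_decode(ree).equals(plain)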

On Mon, 19 Dec 2022, 17:56 Matt Topol,  wrote:

> That leaves us with a total vote of +1.5 so the vote carries with the
> caveat of changing the name to be Run End Encoded rather than Run Length
> Encoded (unless this means I need to do a new vote with the changed name?
> This is my first time doing one of these so please correct me if I need to
> do a new vote!)
>
> Thanks everyone for your feedback and comments!
>
> I'm going to go update the Go and Format specific PRs to make them regular
> PR's (instead of drafts) and get this all moving. Thanks in advance to
> anyone who reviews the upcoming PRs!
>
> --Matt
>
> On Fri, Dec 16, 2022 at 8:24 PM Weston Pace  wrote:
>
> > +1
> >
> > I agree that run-end encoding makes more sense but also don't see it
> > as a deal breaker.
> >
> > The most compelling counter-argument I've seen for new types is to
> > avoid a schism where some implementations do not support the newer
> > types.  However, for the type proposed here I think the risk is low
> > because data can be losslessly converted to existing formats for
> > compatibility with any system that doesn't support the type.
> >
> > Another argument I've seen is that we should introduce a more formal
> > distinction between "layouts" and "types" (with dictionary and
> > run-end-encoding being layouts).  However, this seems like an
> > impractical change at this point.  In addition, given that we have
> > dictionary as an array type the cat is already out of the bag.
> > Furthermore, systems and implementations are still welcome to make
> > this distinction themselves.  The spec only needs to specify what the
> > buffer layouts should be.  If a particular library chooses to group
> > those layouts into two different categories I think that would still
> > be feasible.
> >
> > -Weston
> >
> > On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb 
> wrote:
> > >
> > > +1 on the proposal as written
> > >
> > > I think it makes sense and offers exciting opportunities for faster
> > > computation (especially for cases where parquet files can be decoded
> > > directly into such an array and avoid unpacking. RLE encoded dictionary
> > are
> > > quite compelling)
> > >
> > > I would prefer to use the term Run-End-Encoding (which would also
> follow
> > > the naming of the internal fields) but I don't view that as a deal
> > blocker.
> > >
> > > Thank you for all your work in this matter,
> > > Andrew
> > >
> > > On Wed, Dec 14, 2022 at 5:08 PM Matt Topol 
> > wrote:
> > >
> > > > I'm not at all opposed to renaming it as `Run-End-Encoding` if that
> > would
> > > > be preferable. Hopefully others will chime in with their feedback.
> > > >
> > > > --Matt
> > > >
> > > > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook 
> > wrote:
> > > >
> > > > > Thank you Matt, Tobias, and others for the great work on this.
> > > > >
> > > > > I am -0.5 on this proposal in its current form because (pardon the
> > > > > pedantry) what we have implemented here is not run-length encoding;
> > it
> > > > > is run-end encoding. Based on community input, the choice was made
> to
> > > > > store run ends instead of run lengths because this enables
> O(log(N))
> > > > > random access as opposed to O(N). This is a sensible choice, but it
> > > > > comes with some trade-offs including limitations in array length
> > > > > (which maybe not really a problem in practice) and lack of
> > bit-for-bit
> > > > > equivalence with RLE encodings that use run lengths like Velox's
> > > > > SequenceVector encoding (which I think is a more serious problem in
> > > > > practice).
> > > > >
> > > > > I believe that we should either:
> > > > > (a) rename this to "run-end encoding"
> > > > > (b) change this to a parameterized type called "run encoding" that
> > > > > takes a Boolean parameter specifying whether run lengths or run
> ends
> > > > > are stored.
> > > > >
> > > > > Ian
> > > > >
> > > > > On Wed, Dec 14, 2022 at 11:27 AM Matt Topol <
> zotthewiz...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I'd like to propose adding the RLE type based on earlier
> > > > > discussions[1][2]
> > > > > > to the Arrow format:
> > > > > > - Columnar Format description:
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/pull/1/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> > > > > > - Flatbuffers changes:
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
> > > > > >
> > > > > > There is a proposed implementation available in both C++ (written
> > by
> > > > > Tobias
> > > > > > Zagorni) and Go[3][4]. Both implementations have mostly the same
> > tests
> > > > > > implemented and were tested to be compatible over IPC with an
> > archery
> > > > > test.
> > > > > > In both cases, the implementations are split out among several
> > Draft
> > > > PRs

Re: compressed feather v2 "slicing from the middle"

2022-09-21 Thread Jorge Cardoso Leitão
Hi,

AFAIK compressed IPC Arrow files do not support random access (unlike their
uncompressed counterparts) - you need to decompress the whole batch (or at
least the columns you need). A "RecordBatch" is the compression unit of the
file. Think of it as a Parquet file in which every row group has a single
data page.

With that said, the file contains the number of rows in each record batch,
so you can compute which record batch to start from. This is available in
the RecordBatch message "length" [1]. Unfortunately you do need to read the
record batch message to know it, and these messages are located in the
middle of the file:
[header][record_batch1_message][record_batch1_data][record_batch2_message][record_batch2_data]...[footer].
The positions of the messages are declared in the file's footer's
"record_batches" field.

[1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L87
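
A small pyarrow sketch of that bookkeeping (file name and target row are made
up; note that get_batch() still decompresses every batch it touches):

import pyarrow as pa

target = 780_567_127
with pa.memory_map("data.feather", "r") as source:
    reader = pa.ipc.open_file(source)
    offset = 0
    for i in range(reader.num_record_batches):
        batch = reader.get_batch(i)  # decompresses the batch for a compressed file
        if offset <= target < offset + batch.num_rows:
            print(f"row {target} is in batch {i}, local index {target - offset}")
            break
        offset += batch.num_rows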

Best,
Jorge


On Thu, Sep 22, 2022 at 3:01 AM John Muehlhausen  wrote:

> Why aren't all the compressed batches the chunk size I specified in
> write_feather (700)?  How can I know which batch my slice resides in if
> this is not a constant?  Using pyarrow 9.0.0
>
> This file contains 1.5 billion rows.  I need a way to know where to look
> for, say, [780567127,922022522)
>
> 0.7492516040802002 done 0 len 700
> 1.7520167827606201 done 1 len 700
> 3.302407741546631 done 2 len 4995912
> 5.16457986831665 done 3 len 700
> 6.0424370765686035 done 4 len 4706276
> 7.58642315864563 done 5 len 700
> 7.719322681427002 done 6 len 289636
> 8.705692291259766 done 7 len 5698775
>
> On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen  wrote:
>
> > The following seems like good news... like I should be able to decompress
> > just one column of a RecordBatch in the middle of a compressed feather v2
> > file.  Is there a Python API for this kind of access?  C++?
> >
> > /// Provided for forward compatibility in case we need to support
> different
> > /// strategies for compressing the IPC message body (like whole-body
> > /// compression rather than buffer-level) in the future
> > enum BodyCompressionMethod:byte {
> >   /// Each constituent buffer is first compressed with the indicated
> >   /// compressor, and then written with the uncompressed length in the
> > first 8
> >   /// bytes as a 64-bit little-endian signed integer followed by the
> > compressed
> >   /// buffer bytes (and then padding as required by the protocol). The
> >   /// uncompressed length may be set to -1 to indicate that the data that
> >   /// follows is not compressed, which can be useful for cases where
> >   /// compression does not yield appreciable savings.
> >   BUFFER
> > }
> >
> > On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen  wrote:
> >
> >> ``Internal structure supports random access and slicing from the middle.
> >> This also means that you can read a large file chunk by chunk without
> >> having to pull the whole thing into memory.''
> >> https://ursalabs.org/blog/2020-feather-v2/
> >>
> >> For a compressed v2 file, can I decompress just one column of a batch in
> >> the middle, or is the entire batch with all of its columns compressed
> as a
> >> unit?
> >>
> >> Unfortunately reader.get_batch(i) seems like it is doing a lot of work.
> >> Like maybe decompressing all the columns?
> >>
> >> Thanks,
> >> John
> >>
> >
>


Re: Usage of the name Feather?

2022-08-29 Thread Jorge Cardoso Leitão
I agree.

I suspect that the most widely used API with "feather" is Pandas'
read_feather.
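
For reference, a tiny sketch of how that path looks (file name made up) - the
same file is an Arrow IPC file, just reached through the "feather" name:

import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"x": [1, 2, 3]})
feather.write_feather(df, "example.arrow")  # writes Feather V2, i.e. the Arrow IPC file format
print(pd.read_feather("example.arrow"))     # the "feather" spelling most users actually meet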



On Mon, 29 Aug 2022, 19:55 Weston Pace,  wrote:

> I agree as well.  I think most lingering uses of the term "feather"
> are in pyarrow and R however, so it might be good to hear from some of
> those maintainers.
>
>
>
> On Mon, Aug 29, 2022 at 9:35 AM Antoine Pitrou  wrote:
> >
> >
> > I agree with this as well.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Mon, 29 Aug 2022 11:29:45 -0400
> > Andrew Lamb  wrote:
> > > In the rust implementation we use the term "Arrow IPC" and I support
> your
> > > option 1:
> > >
> > > > The name Feather V2 is deprecated. Only the extension ".arrow" will
> be
> > > used for IPC files.
> > >
> > > Andrew
> > >
> > > On Mon, Aug 29, 2022 at 11:21 AM Matthew Topol
> 
> > > wrote:
> > >
> > > > When I wrote "In-Memory Analytics with Apache Arrow" I definitely
> > > > treated "Feather" as deprecated and mentioned it only in passing
> > > > specifically indicating "Arrow IPC" as the terminology to use. I only
> > > > even mentioned "Feather" at all because there are still methods in
> > > > pyarrow that reference it by name.
> > > >
> > > > That's just my opinion though...
> > > >
> > > > On Mon, Aug 29 2022 at 11:08:53 AM -0400, David Li
> > > >  wrote:
> > > > > This has come up before, e.g. see [1] [2] [3].
> > > > >
> > > > > I would say "Feather" is effectively deprecated and we are using
> > > > > "Arrow IPC" now but I am not sure what others think. (From that
> > > > > GitHub link, it seems to be mixed.) And ".arrow" is the official
> > > > > extension now (since it is registered as part of our MIME type).
> But
> > > > > there's existing documentation and not everything has been updated
> to
> > > > > be consistent (as you saw).
> > > > >
> > > > > [1]:
> > > > > 
> > > > > [2]:
> > > > > 
> > > > > [3]:
> > > > > <
> > > >
> https://stackoverflow.com/questions/67910612/arrow-ipc-vs-feather/67911190#67911190
> > > > >
> > > > >
> > > > > -David
> > > > >
> > > > > On Mon, Aug 29, 2022, at 10:50, 島 達也 wrote:
> > > > >>  Hi all.
> > > > >>
> > > > >>  I know the documentation (mainly pyarrow documentation) sometimes
> > > > >> refers
> > > > >>  to IPC files as Feather files, but are there any guidelines for
> > > > >> when to
> > > > >>  refer to an IPC file as a Feather file and when to refer to it as
> > > > >> an IPC
> > > > >>  file?
> > > > >>  I believe that calling the same file an Arrow IPC file at times
> and
> > > > >> a
> > > > >>  Feather file at other times is confusing to those unfamiliar with
> > > > >> Apache
> > > > >>  Arrow (myself included).
> > > > >>  Surprisingly, these files may even have completely different
> > > > >> extensions,
> > > > >>  ".arrow" and ".feather", which are not similar.
> > > > >>
> > > > >>  Perhaps there are several options for future use of the name
> > > > >> Feather,
> > > > >>  such as
> > > > >>
> > > > >>   1. The name Feather V2 is deprecated. Only the extension
> ".arrow"
> > > > >> will
> > > > >>  be used for IPC files.
> > > > >>   2. In some contexts(?), IPC files are referred to as Feather;
> only
> > > > >>  ".arrow" is used for the IPC file extension to clearly
> > > > >> distinguish
> > > > >>  it from Feather V1's ".feather".
> > > > >>   3. When an IPC file is called Feather by some rule, extension
> > > > >>  ".feather" is used, and when an IPC file is not called
> Feather,
> > > > >>  extension ".arrow" is used.
> > > > >>
> > > > >>  I mistakenly thought the current status was 2, but according to
> the
> > > > >>  discussion in this PR
> > > > >> (),
> > > > >>  apparently the current status seems 3. (However, there seems to
> be
> > > > >> no
> > > > >>  rule as to when an IPC file should be called a Feather)
> > > > >>
> > > > >>  I am not very familiar with Arrow and this is my first post to
> this
> > > > >>  mailing list so I apologize if I have done something wrong or
> > > > >> inappropriate.
> > > > >>
> > > > >>  Best,
> > > > >>  SHIMA Tatsuya
> > > >
> > > >
> > >
> >
> >
> >
>


Re: [VOTE] Format: Rules and procedures for Canonical extension types

2022-08-29 Thread Jorge Cardoso Leitão
+1

Really well written, thanks for driving this!
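
For context, a rough sketch of what the implementation side of such a type can
look like with pyarrow's extension-type API - the name and storage type below
are made up and not an actual canonical type:

import pyarrow as pa

class RationalType(pa.ExtensionType):
    """Hypothetical parameter-less extension type following requirements 1-3."""

    def __init__(self):
        # requirement 1: a well-defined name under "org.apache.arrow."
        super().__init__(pa.struct([("num", pa.int64()), ("den", pa.int64())]),
                         "org.apache.arrow.rational")

    def __arrow_ext_serialize__(self):
        # requirement 3: a trivial serialization (no parameters to encode here)
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(RationalType())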

On Mon, 29 Aug 2022, 11:16 Antoine Pitrou,  wrote:

>
> Hello,
>
> Just a heads up that more PMC votes are needed here.
>
>
>
> On 24/08/2022 at 17:24, Antoine Pitrou wrote:
> >
> > Hello,
> >
> > I would like to propose we vote for the following set of rules for
> > registering well-known ("canonical") extension types.
> >
> >
> > * Canonical extension types are described and maintained in a separate
> > document under the format specifications directory:
> > https://github.com/apache/arrow/tree/master/docs/source/format (note
> > this gets turned into HTML docs by Sphinx =>
> > https://arrow.apache.org/docs/index.html)
> >
> > * Each canonical extension type requires a separate discussion and vote
> > on the mailing-list
> >
> > * The specification text to be added *must* follow these requirements
> >
> > 1) It *must* have a well-defined name starting with
> "org.apache.arrow."
> > 2) Its parameters, if any, *must* be described in the proposal
> > 3) Its serialization *must* be described in the proposal and should
> > not require unduly work or unusual software dependencies (for example, a
> > trivial custom text format or JSON would be acceptable)
> > 4) Its expected semantics *should* be described as well and any
> > potential ambiguities or pain points addressed or at least mentioned
> >
> > * The extension type *should* have one implementation submitted;
> > preferably two if non-trivial (for example if parameterized)
> >
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Accept this proposal
> > [ ] +0
> > [ ] -1 Do not accept this proposal because...
> >
> >
> > Regards
> >
> > Antoine.
>


Re: [QUESTION] How is mmap implemented for 8bit padded files?

2022-08-03 Thread Jorge Cardoso Leitão
Hi Antoine,

Thanks a lot for your answer.

So, if I understand correctly (I may not have), we do not impose restrictions
on the alignment of the data when we get the pointer; only when we read from
it. Doesn't this require checking for alignment at runtime?
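
To make the question concrete, such a runtime check would look roughly like
this in pyarrow (file name made up):

import pyarrow as pa

with pa.memory_map("data.arrow", "r") as source:
    batch = pa.ipc.open_file(source).get_batch(0)
    for column in batch.columns:
        for buf in column.buffers():
            if buf is not None:
                # e.g. an i64/f64 values buffer needs address % 8 == 0 for aligned reads
                print(buf.address % 8 == 0, buf.address % 64 == 0)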

Best,
Jorge



On Tue, Aug 2, 2022 at 6:59 PM Antoine Pitrou  wrote:

>
> Hi Jorge,
>
> So there are two aspects to the answer:
>
> - ideally, the C++ implementation also works on non-aligned data (though
> this is poorly tested, if any)
>
> - when mmap'ing a file, you should get a page-aligned address
>
> As for int128 and int256, these usually don't exist at the hardware
> level anyway, so implementing those reads as a combination of 64-bit
> reads shouldn't hurt performance-wise.
>
> More generally, I don't know about Rust but in C++ unaligned access
> would be made UB-safe by using the memcpy trick, which is correctly
> optimized by production compilers:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/ubsan.h#L55-L69
>
> Regards
>
> Antoine.
>
>
> On 01/08/2022 at 18:55, Jorge Cardoso Leitão wrote:
> > Hi,
> >
> > I am trying to follow the C++ implementation with respect to mmap IPC
> files
> > and reading them zero-copy, in the context of reproducing it in Rust.
> >
> > My understanding from reading the source code is that we essentially:
> > * identify the memory regions (offset and length) of each of the buffers,
> > via IPC's flatbuffer "Node".
> > * cast the uint8 pointer to the corresponding type based on the datatype
> > (e.g. f32 for float32)
> >
> > I am struggling to understand how we ensure that the pointer is aligned
> > [2,3] to the type (e.g. f32) so that the uint8 pointer can be safely
> casted
> > to it.
> >
> > In other words, I would expect mmap to work when:
> > * the files' bit padding is 64 bits
> > * the target type is <= 64 bits
> > However,
> > * we have types with more than 64 bits (int128 and int256)
> > * a file can be 8-bit aligned
> >
> > The background is that Rust requires pointers to be aligned to the type
> for
> > safe casting (it is UB to read unaligned pointers), and the above
> naturally
> > poses a challenge when reading i128, i256 and 8-bit padded files.
> >
> > Best,
> > Jorge
> >
> > [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc
> > [2] https://en.wikipedia.org/wiki/Data_structure_alignment
> > [3] https://stackoverflow.com/a/4322950/931303
> >
>


Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

2022-08-01 Thread Jorge Cardoso Leitão
I am +1 on either - imo:

* it is important to have either available
* both provide a non-trivial improvement over what we have
* the trade-off is difficult to decide upon - I trust whoever is
implementing it to experiment and decide which better fits Arrow and the
ecosystem.

Thank you so much for driving this, Wes.

Best,
Jorge


On Mon, Aug 1, 2022 at 7:14 PM Wes McKinney  wrote:

> On Sun, Jul 31, 2022 at 8:05 AM Antoine Pitrou  wrote:
> >
> >
> > Hi Wes,
> >
> > On 31/07/2022 at 00:02, Wes McKinney wrote:
> > >
> > > I understand there are still some aspects of this project that cause
> > > some squeamishness (like having arbitrary memory addresses embedded
> > > within array values whose lifetime a C ABI consumer may not know about
> > > -- we already export memory addresses in the C ABI but fewer of them
> > > because they are only the buffers at the array level). We discussed
> > > some alternative approaches that address some of these questions, but
> > > each come with associated trade-offs.
> >
> > Are any of these trade-offs blocking?
> >
>
> They aren't blocking implementation work at least.
>
> I think the alternative designs / requirements that were discussed were
>
> * Attaching all referenced memory buffers by pointers in the C ABI or
> * Using offsets into an attached buffer instead of pointers
>
> I think that either of these pose conflicts with pooled allocators or
> tiered buffer management, since a single Arrow vector may reference
> many buffers within a memory pool (where different vectors may
> reference different memory chunks in the pool — so externalizing all
> referenced buffers is burdensome in the first case or would require an
> expensive "repack" operation in the latter case, defeating the goal of
> zero copy).
>
> You can see a discussion of how Umbra has three different storage
> tiers (persistent, transient, temporary) for out-of-line strings
>
> https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf
>
> It might be a good idea to look more carefully at how DuckDB and Velox
> do memory management for the out-of-line strings.
>
> If we start placing restrictions on how the out-of-line string buffers
> are managed and externalized, it risks undermining the zero-copy
> interoperability benefits that we're trying to achieve with this.
>


[QUESTION] How is mmap implemented for 8bit padded files?

2022-08-01 Thread Jorge Cardoso Leitão
Hi,

I am trying to follow the C++ implementation with respect to mmap IPC files
and reading them zero-copy, in the context of reproducing it in Rust.

My understanding from reading the source code is that we essentially:
* identify the memory regions (offset and length) of each of the buffers,
via IPC's flatbuffer "Node".
* cast the uint8 pointer to the corresponding type based on the datatype
(e.g. f32 for float32)

I am struggling to understand how we ensure that the pointer is aligned
[2,3] to the type (e.g. f32) so that the uint8 pointer can be safely cast
to it.

In other words, I would expect mmap to work when:
* the files' bit padding is 64 bits
* the target type is <= 64 bits
However,
* we have types with more than 64 bits (int128 and int256)
* a file can be 8-bit aligned

The background is that Rust requires pointers to be aligned to the type for
safe casting (it is UB to read unaligned pointers), and the above naturally
poses a challenge when reading i128, i256 and 8-bit padded files.

Best,
Jorge

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc
[2] https://en.wikipedia.org/wiki/Data_structure_alignment
[3] https://stackoverflow.com/a/4322950/931303


Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Jorge Cardoso Leitão
Hi Laurent,

I agree that there is a common pattern in converting row-based formats to
Arrow.

Imho the difficult part is not mapping the storage format to Arrow
specifically - it is mapping the storage format to any in-memory (row- or
columnar-based) format, since it requires in-depth knowledge of the two
formats (the source format and the target format).

- Understanding the Arrow API which can be challenging for complex cases of
> rows representing complex objects (list of struct, struct of struct, ...).
>

The developer would have the same problem - just shifted around - they now
need to convert their complex objects to the intermediary representation.
Whether it is more "difficult" or "complex" to learn than Arrow is an open
question, but we would essentially be shifting the problem from "learning
Arrow" to "learning the intermediate in-memory representation".

@Micah Kornfield, as described before my goal is not to define a memory
> layout specification but more to define an API and a translation mechanism
> able to take this intermediate representation (list of generic objects
> representing the entities to translate) and to convert it into one or more
> Arrow records.
>

imho a spec of "list of generic objects representing the entities" is
specified by an in-memory format (not by an API spec).

A second challenge I anticipate is that in-memory formats inherently "own"
the memory they describe (since by definition they specify how this memory
is laid out). An intermediate in-memory representation would be no
different. Since row-based formats usually require at least one allocation
per row (and often more for variable-length types), the transformation
(storage format -> row-based in-memory format -> Arrow) incurs a
significant cost (~2x slower last time I played with this problem in JSON
[1]).
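
As a rough illustration of that two-step path (storage format -> generic rows
-> Arrow) in pyarrow, with a made-up file name:

import json
import pyarrow as pa

# storage format -> row-based in-memory representation: one Python dict per
# row, plus further allocations for variable-length values.
with open("events.ndjson") as f:
    rows = [json.loads(line) for line in f]

# row-based representation -> Arrow: a second full pass over the data.
table = pa.Table.from_pylist(rows)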

A third challenge I anticipate is that given that we have 10+ languages, we
would eventually need to convert the intermediary representation across
languages, which imo just hints that we would need to formalize an agnostic
spec for such representation (so languages agree on its representation),
and thus essentially declare a new (row-based) format.

(none of this precludes efforts to invent an in-memory row format for
analytics workloads)

@Wes McKinney 

I still think having a canonical in-memory row format (and libraries
> to transform to and from Arrow columnar format) is a good idea — but
> there is the risk of ending up in the tar pit of reinventing Avro.
>

afaik Avro does not have O(1) random access to either its rows or its columns
- records are concatenated back to back, every record's column is
concatenated back to back within a record, and there is no indexing
information on how to access a particular row or column. There are blocks
of rows that reduce the cost of accessing large offsets, but imo it is far
from the O(1) offered by Arrow (and expected by analytics workloads).

[1] https://github.com/jorgecarleitao/arrow2/pull/1024

Best,
Jorge

On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel 
wrote:

> Let me clarify the proposal a bit before replying to the various previous
> feedbacks.
>
>
>
> It seems to me that the process of converting a row-oriented data source
> (row = set of fields or something more hierarchical) into an Arrow record
> repeatedly raises the same challenges. A developer who must perform this
> kind of transformation is confronted with the following questions and
> problems:
>
> - Understanding the Arrow API which can be challenging for complex cases of
> rows representing complex objects (list of struct, struct of struct, ...).
>
> - Decide which Arrow schema(s) will correspond to your data source. In some
> complex cases it can be advantageous to translate the same row-oriented
> data source into several Arrow schemas (e.g. OpenTelementry data sources).
>
> - Decide on the encoding of the columns to make the most of the
> column-oriented format and thus increase the compression rate (e.g. define
> the columns that should be represent as dictionaries).
>
>
>
> By experience, I can attest that this process is usually iterative. For
> non-trivial data sources, arriving at the arrow representation that offers
> the best compression ratio and is still perfectly usable and queryable is a
> long and tedious process.
>
>
>
> I see two approaches to ease this process and consequently increase the
> adoption of Apache Arrow:
>
> - Definition of a canonical in-memory row format specification that every
> row-oriented data source provider can progressively adopt to get an
> automatic translation into the Arrow format.
>
> - Definition of an integration library allowing to map any row-oriented
> source into a generic row-oriented source understood by the converter. It
> is not about defining a unique in-memory format but more about defining a
> standard API to represent row-oriented data.
>
>
>
> In my opinion these two approaches are complementary. The first option is a
> long-term approach targeting di

Re: [VOTE] Mark C Stream Interface as Stable

2022-06-08 Thread Jorge Cardoso Leitão
Sorry, I got a bit confused on what we were voting on. Thank you for the
clarification.

+1

Best,
Jorge


On Wed, Jun 8, 2022 at 9:53 PM Antoine Pitrou  wrote:

>
> On 08/06/2022 at 20:55, Jorge Cardoso Leitão wrote:
> > 0 (binding) - imo there is some unclarity over what is expected to be
> > passed over the C streaming interface - an Array or a StructArray.
> >
> > I think the spec claims the former, but the C++ implementation (which I
> > assume is the reference here) expects the latter [1].
>
> It is definitely the former, despite any limitation in the C++
> implementation.
>
> > Would it be possible to clarify this on either end so we do not clone the
> > spec and/or reference implementation with this unclarity?
>
> For the record, there is no reference implementation. The spec is the
> reference.
>
> Regards
>
> Antoine.
>


Re: [VOTE] Mark C Stream Interface as Stable

2022-06-08 Thread Jorge Cardoso Leitão
0 (binding) - imo there is some unclarity over what is expected to be
passed over the C streaming interface - an Array or a StructArray.

I think the spec claims the former, but the C++ implementation (which I
assume is the reference here) expects the latter [1].

Would it be possible to clarify this on either end so we do not clone the
spec and/or reference implementation with this unclarity?

+1 if we are voting on formalizing the spec as written, and consider
ARROW-15747 as something for C++/Python to improve upon. In other words, I
favor that the C stream interface is aligned with the C data interface with
respect to what it consumes (arbitrary arrays).

[1] https://issues.apache.org/jira/browse/ARROW-15747

Best,
Jorge



On Wed, Jun 8, 2022 at 8:29 PM Antoine Pitrou  wrote:

>
> +1 (binding)
>
>
> On 08/06/2022 at 20:15, Will Jones wrote:
> > Hi,
> >
> > Given all feedback to discussion [1] has been positive, I would like to
> > propose marking the C Stream Interface as stable.
> >
> > I have prepared PRs in apache/arrow [2] and apache/arrow-rs [3] to remove
> > all "experimental" markers from the interface and update the support grid
> > for the interface.
> >
> > Only votes from PMC members are binding, but all members of the community
> > are encouraged to provide input and vote with "(non-binding)". The vote
> > will run for at least 72 hours.
> >
> > [ ] +1 Remove all "experimental" marks from C Streaming Interface
> > [ ] +0
> > [ ] -1 Keep "experimental" warnings on C Streaming Interface
> >
> > [1] https://lists.apache.org/thread/0y604o9s3wkyty328wv8d21ol7s40q55
> > [2] https://github.com/apache/arrow/pull/13345
> > [3] https://github.com/apache/arrow-rs/pull/1821
> >
>


Re: [ANNOUNCE] New Arrow committer: Liang-Chi Hsieh

2022-04-29 Thread Jorge Cardoso Leitão
Congratulations, great work!

On Sat, Apr 30, 2022 at 3:30 AM L. C. Hsieh  wrote:

> Thanks all!
>
> On Fri, Apr 29, 2022 at 7:19 PM Yijie Shen 
> wrote:
> >
> > Congrats Liang-Chi!
> >
> >
> > On Thu, Apr 28, 2022 at 8:36 PM Vibhatha Abeykoon 
> > wrote:
> >
> > > Congratulations!
> > >
> > > On Thu, Apr 28, 2022 at 5:36 PM David Li  wrote:
> > >
> > > > Congrats and welcome Liang-Chi!
> > > >
> > > > On Thu, Apr 28, 2022, at 05:07, Rok Mihevc wrote:
> > > > > Congrats!
> > > > >
> > > > > Rok
> > > > >
> > > > > On Thu, Apr 28, 2022 at 7:49 AM L. C. Hsieh 
> wrote:
> > > > >>
> > > > >> Thanks all! Looking forward to working with you on the project!
> > > > >>
> > > > >> On Wed, Apr 27, 2022 at 9:26 PM Gidon Gershinsky <
> gg5...@gmail.com>
> > > > wrote:
> > > > >> >
> > > > >> > Congrats Liang-Chi!
> > > > >> >
> > > > >> > Cheers, Gidon
> > > > >> >
> > > > >> >
> > > > >> > On Thu, Apr 28, 2022 at 4:17 AM Yang hao <
> 1371656737...@gmail.com>
> > > > wrote:
> > > > >> >
> > > > >> > > Congratulations Liang-Chi!
> > > > >> > >
> > > > >> > > From: Weston Pace 
> > > > >> > > Date: Thursday, April 28, 2022 at 05:19
> > > > >> > > To: dev@arrow.apache.org 
> > > > >> > > Subject: Re: [ANNOUNCE] New Arrow committer: Liang-Chi Hsieh
> > > > >> > > Congratulations Liang-Chi!
> > > > >> > >
> > > > >> > > On Wed, Apr 27, 2022 at 9:54 AM Chao Sun 
> > > > wrote:
> > > > >> > > >
> > > > >> > > > Congrats Liang-Chi! well deserved!
> > > > >> > > >
> > > > >> > > > On Wed, Apr 27, 2022 at 12:49 PM L. C. Hsieh <
> vii...@gmail.com>
> > > > wrote:
> > > > >> > > > >
> > > > >> > > > > Thank you, Andrew and Bryan,
> > > > >> > > > > I'm pleased to become an Arrow committer. Looking forward
> to
> > > > >> > > > > contributing more on Apache Arrow!
> > > > >> > > > >
> > > > >> > > > > On Wed, Apr 27, 2022 at 12:34 PM Bryan Cutler <
> > > > cutl...@gmail.com>
> > > > >> > > wrote:
> > > > >> > > > > >
> > > > >> > > > > > Congratulations!! That's great news and really glad to
> have
> > > > you on
> > > > >> > > the
> > > > >> > > > > > project!
> > > > >> > > > > >
> > > > >> > > > > > On Wed, Apr 27, 2022, 11:44 AM Andrew Lamb <
> > > > al...@influxdata.com>
> > > > >> > > wrote:
> > > > >> > > > > >
> > > > >> > > > > > > On behalf of the Arrow PMC, I'm happy to announce that
> > > > Liang-Chi
> > > > >> > > Hsieh
> > > > >> > > > > > > has accepted an invitation to become a committer on
> Apache
> > > > >> > > > > > > Arrow. Welcome, and thank you for your contributions!
> > > > >> > > > > > >
> > > > >> > > > > > > Andrew
> > > > >> > > > > > >
> > > > >> > >
> > > >
> > > --
> > > Vibhatha Abeykoon
> > >
>


Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Jorge Cardoso Leitão
I need to correct myself here - it is currently not possible to pass memory
at zero cost between the engine and the WASM interpreter. This is related to
your point about safety - WASM provides memory safety guarantees because it
controls the memory region that it can read from and write to. Therefore,
passing data into and out of WASM currently requires a memcopy.

There is a proposal [1] to improve the situation, but it would still incur a
cost in the query engine, since we would need to memcopy the regions around.

I forgot that in my PoC I passed the Parquet file from JS to WASM and
de-serialized it to Arrow directly in WASM - so memory was already being
allocated from within the WASM sandbox, not JS. Sorry for the confusion.

[1] https://github.com/WebAssembly/design/issues/1439

Best,
Jorge



On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou  wrote:

>
> On 26/04/2022 at 16:30, Gavin Ray wrote:
> > Antoine, sandboxing comes into play from two places:
> >
> > 1) The WASM specification itself, which puts a bounds on the types of
> > behaviors possible
> > 2) The implementation of the WASM bytecode interpreter chosen, like Jorge
> > mentioned in the comment above
> >
> > The wasmtime docs have a pretty solid section covering the sandboxing
> > guarantees of WASM, and then the interpreter-specific behavior/abilities
> of
> > wasmtime FWIW:
> > https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
>
> This doesn't really answer my question, does it?
>
>
> >
> > On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou 
> wrote:
> >
> >>
> >> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
> >>>> Would WASM be able to interact in-process with non-WASM buffers
> safely?
> >>>
> >>> AFAIK yes. My understanding from playing with it in JS is that a
> >>> WASM-backed udf execution would be something like:
> >>>
> >>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> >>> 2. provide a small WASM-compiled middleware of the c data interface
> that
> >>> consumes (binary, c data interface pointers)
> >>> 3. ship a WASM interpreter as part of the query engine
> >>> 4. pass binary and c data interface pointers from the query engine
> >> program
> >>> to the interpreter with WASM-compiled middleware
> >>
> >> Ok, but the key word in my question was "safely". What mechanisms are in
> >> place such that the WASM user function will not access Arrow buffers out
> >> of bounds? Nothing really stands out in
> >> https://webassembly.github.io/spec/core/index.html, but it's the first
> >> time I try to have a look at the WebAssembly spec.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> Step 2 is necessary to read the buffers from FFI and output the result
> >> back
> >>> from the interpreter once the UDF is done, similar to what we do in
> >>> datafusion to run Python from Rust. In the case of datafusion the
> >> "binary"
> >>> is a Python function, which has security implications since the Python
> >>> interpreter allows everything by default.
> >>>
> >>> Best,
> >>> Jorge
> >>>
> >>>
> >>>
> >>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou 
> >> wrote:
> >>>
> >>>>
> >>>> Le 25/04/2022 à 23:04, David Li a écrit :
> >>>>> The WebAssembly documentation has a rundown of the techniques used:
> >>>> https://webassembly.org/docs/security/
> >>>>>
> >>>>> I think usually you would run WASM in-process, though we could indeed
> >>>> also put it in a subprocess to further isolate things.
> >>>>
> >>>> Would WASM be able to interact in-process with non-WASM buffers
> safely?
> >>>> It's not obvious from reading the page above.
> >>>>
> >>>>
> >>>>>
> >>>>> It would be interesting to define the Flight "harness" protocol.
> >>>> Handling heterogeneous arguments may require some evolution in Flight
> >> (e.g.
> >>>> if the function is non scalar and arguments are of different length -
> >> we'd
> >>>> need something like the ColumnBag proposal, so this might be a good
> >> reason
> >>>> to revive that).
> >>>>>
> >>>>> On Mon, Apr 25, 2022, at 16

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Jorge Cardoso Leitão
> Would WASM be able to interact in-process with non-WASM buffers safely?

AFAIK yes. My understanding from playing with it in JS is that a
WASM-backed udf execution would be something like:

1. compile the C++/Rust/etc. UDF to WASM (a binary format)
2. provide a small WASM-compiled middleware of the C data interface that
consumes (binary, C data interface pointers)
3. ship a WASM interpreter as part of the query engine
4. pass the binary and C data interface pointers from the query engine
program to the interpreter with the WASM-compiled middleware

Step 2 is necessary to read the buffers from the FFI and output the result
back from the interpreter once the UDF is done, similar to what we do in
DataFusion to run Python from Rust. In the case of DataFusion the "binary"
is a Python function, which has security implications since the Python
interpreter allows everything by default.
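
For reference, the DataFusion analogy looks roughly like this from the Python
side - the import path and udf() signature are from the DataFusion Python
bindings as I recall them, so treat them as assumptions rather than a
reference:

import pyarrow as pa
import pyarrow.compute as pc
from datafusion import SessionContext, udf  # assumed import path

# The "binary" handed to the engine is just a Python function over Arrow arrays.
def plus_one(array: pa.Array) -> pa.Array:
    return pc.add(array, 1)

ctx = SessionContext()
# assumed signature: udf(fn, input_types, return_type, volatility)
plus_one_udf = udf(plus_one, [pa.int64()], pa.int64(), "stable")
ctx.register_udf(plus_one_udf)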

Best,
Jorge



On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou  wrote:

>
> On 25/04/2022 at 23:04, David Li wrote:
> > The WebAssembly documentation has a rundown of the techniques used:
> https://webassembly.org/docs/security/
> >
> > I think usually you would run WASM in-process, though we could indeed
> also put it in a subprocess to further isolate things.
>
> Would WASM be able to interact in-process with non-WASM buffers safely?
> It's not obvious from reading the page above.
>
>
> >
> > It would be interesting to define the Flight "harness" protocol.
> Handling heterogeneous arguments may require some evolution in Flight (e.g.
> if the function is non scalar and arguments are of different length - we'd
> need something like the ColumnBag proposal, so this might be a good reason
> to revive that).
> >
> > On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> >> On 25/04/2022 at 22:19, Wes McKinney wrote:
> >>> I was going to reply to this e-mail thread on user@ but thought I
> >>> would start a new thread on dev@.
> >>>
> >>> Executing user-defined functions in memory, especially untrusted
> >>> functions, in general is unsafe. For "trusted" functions, having an
> >>> in-memory API for writing them in user languages is very useful. I
> >>> remember tinkering with adding UDFs in Impala with LLVM IR, which
> >>> would allow UDFs to have performance consistent with built-ins
> >>> (because built-in functions are all inlined into code-generated
> >>> expressions), but segfaults would bring down the server, so only
> >>> admins could be trusted to add new UDFs.
> >>>
> >>> However, I wonder if we should eventually define an "external UDF"
> >>> protocol and an example UDF "harness", using Flight to do RPC across
> >>> the process boundaries. So the idea is that an external local UDF
> >>> Flight execution service is spun up, and then data is sent to the UDF
> >>> in a DoExchange call.
> >>>
> >>> As Jacques pointed out in an interview 1], a compelling solution to
> >>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> >>> functions to be run safely in-process.
> >>
> >> How does the sandboxing work in this case? Is it simply executing in a
> >> separate process with restricted capabilities, or are other mechanisms
> >> put in place?
>


Re: [Question] Is it possible to write to IPC without an intermediary buffer?

2022-04-05 Thread Jorge Cardoso Leitão
Hi Micah,

Thank you for your reply. That is also my understanding - not possible in
streaming IPC, possible in file IPC with random access. The pseudo-code
could be something like:

start = writer.seek_current()                  # remember where the header will go
empty_locations = create_empty_header(schema)  # placeholder header with zeroed offsets/sizes
write_header(writer, empty_locations)          # must occupy the same number of bytes as the final header
locations = write_buffers(writer, batch)       # compress + write the buffers, recording the real offsets/sizes
end_buffers_position = writer.seek_current()
writer.seek(start)
write_header(writer, locations)                # overwrite the placeholder with the real values
writer.seek(end_buffers_position)

As far as I understand, this would cause writing to IPC to require O(N) where
N is the average size of the buffers, as opposed to O(N*B) where N is the
average size of the buffer and B is the number of buffers. I.e. it is still
quite a multiplicative factor.

I filed https://issues.apache.org/jira/browse/ARROW-16118 with the idea.

Best,
Jorge



On Mon, Apr 4, 2022 at 6:09 PM Micah Kornfield 
wrote:

> Hi Jorge,
> I don't think any implementation does this but I think it is technically
> possible, although it might be complicated to actually do.  It also
> requires random access files (the output can't be purely streaming).
>
> I think the approach you would need to take is to pre-write the header
> information with the values zeroed out at first. After you've
> compressed and written the physical bytes you would need to update the
> values in place, after you know them.  Since Flatbuffers doesn't do any
> variable length encoding, you don't need to worry about possibly corrupting
> the data.   The challenging part is determining the exact locations that
> need to be overwritten.
>
> -MIcah
>
> On Mon, Apr 4, 2022 at 7:40 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > Motivated by [1], I wonder if it is possible to write to IPC without
> > writing the data to an intermediary buffer.
> >
> > The challenge is that the header of an IPC message [header][data]
> requires:
> >
> > * the positions of the buffers
> > * the total length of the body
> >
> > For uncompressed data, we could compute these before-hand at `O(C)`
> where C
> > is the number of columns. However, I am unable to find a way of computing
> > these ahead of writing for compressed buffers: we need to compress the
> data
> > to know its compressed (and thus buffers) size.
> >
> > Is this understanding correct?
> >
> > Best,
> > Jorge
> >
> > [1] https://github.com/pola-rs/polars/issues/2639
> >
>


[Question] Is it possible to write to IPC without an intermediary buffer?

2022-04-04 Thread Jorge Cardoso Leitão
Hi,

Motivated by [1], I wonder if it is possible to write to IPC without
writing the data to an intermediary buffer.

The challenge is that the header of an IPC message [header][data] requires:

* the positions of the buffers
* the total length of the body

For uncompressed data, we could compute these before-hand at `O(C)` where C
is the number of columns. However, I am unable to find a way of computing
these ahead of writing for compressed buffers: we need to compress the data
to know its compressed (and thus buffers) size.

Is this understanding correct?

Best,
Jorge

[1] https://github.com/pola-rs/polars/issues/2639


Re: [ANNOUNCE] New Arrow committers: Raphael Taylor-Davies, Wang Xudong, Yijie Shen, and Kun Liu

2022-03-14 Thread Jorge Cardoso Leitão
Congrats to all of you - well deserved!

On Mon, Mar 14, 2022, 20:47 Bryan Cutler  wrote:

> Congrats to all!
>
> On Thu, Mar 10, 2022 at 12:11 AM Alenka Frim 
> wrote:
>
> > Congratulations all!
> >
> > On Thu, Mar 10, 2022 at 1:55 AM Yang hao <1371656737...@gmail.com>
> wrote:
> >
> > > Congratulations to all!
> > >
> > > From: Benson Muite 
> > > Date: Thursday, March 10, 2022 at 03:45
> > > To: dev@arrow.apache.org 
> > > Subject: Re: [ANNOUNCE] New Arrow committers: Raphael Taylor-Davies,
> Wang
> > > Xudong, Yijie Shen, and Kun Liu
> > > Congratulations!
> > >
> > > On 3/9/22 9:56 PM, David Li wrote:
> > > > Congrats everyone!
> > > >
> > > > On Wed, Mar 9, 2022, at 13:47, Rok Mihevc wrote:
> > > >> Congrats all!
> > > >>
> > > >> Rok
> > > >>
> > > >> On Wed, Mar 9, 2022 at 7:16 PM QP Hou  wrote:
> > > >>>
> > > >>> Congratulations to all, well deserved!
> > > >>>
> > > >>> On Wed, Mar 9, 2022 at 9:37 AM Daniël Heres 
> > > wrote:
> > > 
> > >  Congratulations!
> > > 
> > >  On Wed, Mar 9, 2022, 18:26 LM  wrote:
> > > 
> > > > Congrats to you all!
> > > >
> > > > On Wed, Mar 9, 2022 at 9:19 AM Chao Sun 
> > wrote:
> > > >
> > > >> Congrats all!
> > > >>
> > > >> On Wed, Mar 9, 2022 at 9:16 AM Micah Kornfield <
> > > emkornfi...@gmail.com>
> > > >> wrote:
> > > >>>
> > > >>> Congrats!
> > > >>>
> > > >>> On Wed, Mar 9, 2022 at 8:36 AM Weston Pace <
> > weston.p...@gmail.com>
> > > >> wrote:
> > > >>>
> > >  Congratulations to all of you!
> > > 
> > >  On Wed, Mar 9, 2022, 4:52 AM Matthew Turner <
> > > >> matthew.m.tur...@outlook.com>
> > >  wrote:
> > > 
> > > > Congrats all and thank you for your contributions! It's been
> > > great
> > > > to
> > >  work
> > > > with and learn from you all.
> > > >
> > > > -Original Message-
> > > > From: Andrew Lamb 
> > > > Sent: Wednesday, March 9, 2022 8:59 AM
> > > > To: dev 
> > > > Subject: [ANNOUNCE] New Arrow committers: Raphael
> > Taylor-Davies,
> > > > Wang
> > > > Xudong, Yijie Shen, and Kun Liu
> > > >
> > > > On behalf of the Arrow PMC, I'm happy to announce that
> > > >
> > > > Raphael Taylor-Davies
> > > > Wang Xudong
> > > > Yijie Shen
> > > > Kun Liu
> > > >
> > > > Have all accepted invitations to become committers on Apache
> > > Arrow!
> > > > Welcome, thank you for all your contributions so far, and we
> > look
> > > >> forward
> > > > to continuing to drive Apache Arrow forward to an even better
> > > place
> > > >> in
> > >  the
> > > > future.
> > > >
> > > > This exciting growth in committers mirrors the growth of the
> > > Arrow
> > > >> Rust
> > > > community.
> > > >
> > > > Andrew
> > > >
> > > > p.s. sorry for the somewhat impersonal email; I was trying to
> > > avoid
> > > > several very similar emails. I am truly excited for each of
> > these
> > > > individuals.
> > > >
> > > 
> > > >>
> > > >
> > >
> >
>


Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-08 Thread Jorge Cardoso Leitão
Agreed.

Also, I would like to revise my previous comment about the small risk.
While prototyping this I did hit some bumps. They primarily came from two
reasons:

* I was unable to find arrow/json files in the arrow-testing generated
files with a non-default decimal bitwidth (I think we only have the
on-the-fly generated file in archery)
* the FFI interface has a default decimal of 128 (`d:{precision}:{scale}`)
and implementations may not support the 256 case (e.g. Rust has no native
i256). For these cases, this could be the first non-default decimal
implementation.

So, maybe we follow the standard procedure?

Best,
Jorge



On Tue, Mar 8, 2022 at 9:22 PM Micah Kornfield 
wrote:

> >
> > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > it’ll help achieve better performance on TPC-H (and maybe other
> > benchmarks). The decimal columns need only 12 digits of precision, for
> > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > decimal to be faster.
>
>
> We should be careful here.  If this assumes loading from Parquet or other
> file formats currently in the library, arbitrarily changing the type to
> load the minimum data-length possible could break users; this should
> probably be a configuration option.  This also reminds me I think there is
> some technical debt with decimals and parquet.
>
> [1] https://issues.apache.org/jira/browse/ARROW-12022
>
> On Tue, Mar 8, 2022 at 11:05 AM Sasha Krassovsky <
> krassovskysa...@gmail.com>
> wrote:
>
> > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > it’ll help achieve better performance on TPC-H (and maybe other
> > benchmarks). The decimal columns need only 12 digits of precision, for
> > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > decimal to be faster.
> >
> > Sasha Krassovsky
> >
> > > On 8 March 2022, at 09:01, Micah Kornfield 
> > wrote:
> > >
> > > 
> > >>
> > >>
> > >> Do we want to keep the historical "C++ and Java" requirement or
> > >> do we want to make it a more flexible "two independent official
> > >> implementations", which could be for example C++ and Rust, Rust and
> > >> Java, etc.
> > >
> > >
> > > I think flexibility here is a good idea, I'd like to hear other
> opinions.
> > >
> > > For this particular case if there aren't volunteers to help out in
> > another
> > > implementation I'm willing to help with Java (I don't have bandwidth to
> > > do both C++ and Java).
> > >
> > > Cheers,
> > > -Micah
> > >
> > >> On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou 
> > wrote:
> > >>
> > >>
> > >> Le 07/03/2022 à 20:26, Micah Kornfield a écrit :
> > >>>>
> > >>>> Relaxing from {128,256} to {32,64,128,256} seems a low risk
> > >>>> from an integration perspective, as implementations already need to
> > read
> > >>>> the bitwidth to select the appropriate physical representation (if
> > they
> > >>>> support it).
> > >>>
> > >>> I think there are two reasons for having implementations first.
> > >>> 1.  Lower risk bugs in implementation/spec.
> > >>> 2.  A mechanism to ensure that there is some boot-strapped coverage
> in
> > >>> commonly used reference implementations.
> > >>
> > >> That sounds reasonable.
> > >>
> > >> Another question that came to my mind is: traditionally, we've
> mandated
> > >> implementations in the two reference Arrow implementations (C++ and
> > >> Java).  However, our implementation landscape is now much richer than
> it
> > >> used to be (for example, there is a tremendous activity on the Rust
> > >> side).  Do we want to keep the historical "C++ and Java" requirement
> or
> > >> do we want to make it a more flexible "two independent official
> > >> implementations", which could be for example C++ and Rust, Rust and
> > >> Java, etc.
> > >>
> > >> (by "independent" I mean that one should not be based on the other,
> for
> > >> example it should not be "C++ and Python" :-))
> > >>
> > >> Regards
> 

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-07 Thread Jorge Cardoso Leitão
+1 adding 32 and 64 bit decimals.

+0 to release it without integration tests - both IPC and the C data
interface use a variable bit width to declare the appropriate size for
decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low risk
from an integration perspective, as implementations already need to read
the bitwidth to select the appropriate physical representation (if they
support it).

Best,
Jorge




On Mon, Mar 7, 2022, 11:41 Antoine Pitrou  wrote:

>
> Le 03/03/2022 à 18:05, Micah Kornfield a écrit :
> > I think this makes sense to add these.  Typically when adding new types,
> > we've waited  on the official vote until there are two reference
> > implementations demonstrating compatibility.
>
> You are right, I had forgotten about that.  Though in this case, it
> might be argued we are just relaxing the constraints on an existing type.
>
> What do others think?
>
> Regards
>
> Antoine.
>
>
> >
> > On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou 
> wrote:
> >
> >>
> >> Hello,
> >>
> >> Currently, the Arrow format specification restricts the bitwidth of
> >> decimal numbers to either 128 or 256 bits.
> >>
> >> However, there is interest in allowing other bitwidths, at least 32 and
> >> 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
> >> datatype would allow for precisions of up to 18 digits (respectively 9
> >> digits), which are sufficient for some applications which are mainly
> >> looking for exact computations rather than sheer precision. Obviously,
> >> smaller datatypes are cheaper to store in memory and cheaper to run
> >> computations on.
> >>
> >> For example, the Spark documentation mentions that some decimal types
> >> may fit in a Java int (32 bits) or long (64 bits):
> >>
> >>
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
> >>
> >> ... and a draft PR had even been filed for initial support in the C++
> >> implementation (https://github.com/apache/arrow/pull/8578).
> >>
> >> I am therefore proposing that we relax the wording in the Arrow format
> >> specification to also allow 32- and 64-bit decimal types.
> >>
> >> This is a preliminary discussion to gather opinions and potential
> >> counter-arguments against this proposal. If no strong counter-argument
> >> emerges, we will probably run a vote in a week or two.
> >>
> >> Best regards
> >>
> >> Antoine.
> >>
> >
>


Re: [JavaScript] Appending Nulls to a Struct (Bug)

2022-02-18 Thread Jorge Cardoso Leitão
A change in the length of an array is equivalent to a change in at least
one of its buffers (i.e. length is always physical).

* Primitive arrays (i32, i64, etc.): the array's length is equal to the
length of the buffer divided by the size of the type (e.g. buffer.len() = 8
and i32 <=> length = 2)
* Variable length (binary, list, utf8): the array's length is equal to the
length of the offset buffer divided by the size of the offset type minus
one (e.g. buffer.len() = 12 and i32 <=> length = 2)
* StructArray: the array's length is equal to the length of any of its
fields.
* ...

When appending a slot to a StructArray (null or not), we need to append one
item to each of its fields:
* for a primitive array field, the values buffer is increased by the size of the
backing type (and, if it exists, its validity is increased by 1 bit)
* for a variable length array field, the offsets buffer is increased by the
size of the offset type (and, if it exists, its validity is increased by 1
bit)
* ...

What we append on each of its fields is underdetermined. Most
implementations append a null item, but anything is ok. For example, if the
field is a primitive array and has no validity, it may make more sense to
append a slot with value 0 to avoid allocating a validity. But if the field
itself is deeply nested, a null may be cheaper (fewer pushes on its
children).
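
As a minimal illustration (plain Vecs rather than an actual Arrow builder API; the layout is a hypothetical struct<name: utf8> with validity on both the struct and the child):

struct StructOfUtf8 {
    validity: Vec<bool>,     // struct-level validity, one entry per slot
    child_offsets: Vec<i32>, // child "name": utf8 offsets into `child_values`
    child_values: Vec<u8>,
    child_validity: Vec<bool>,
}

impl StructOfUtf8 {
    fn new() -> Self {
        Self {
            validity: vec![],
            child_offsets: vec![0], // offsets always start at 0
            child_values: vec![],
            child_validity: vec![],
        }
    }

    fn append(&mut self, name: &str) {
        self.validity.push(true);
        self.child_values.extend_from_slice(name.as_bytes());
        self.child_offsets.push(self.child_values.len() as i32);
        self.child_validity.push(true);
    }

    // Appending a null still pushes one slot onto the child: here the cheapest
    // choice is to repeat the last offset (an empty string), but any value is valid.
    fn append_null(&mut self) {
        self.validity.push(false);
        let last = *self.child_offsets.last().unwrap();
        self.child_offsets.push(last);
        self.child_validity.push(false);
        // invariant: validity.len() == child_validity.len() == child_offsets.len() - 1
    }
}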

Best,
Jorge



On Fri, Feb 18, 2022 at 8:02 PM Phillip Cloud  wrote:

> I think I'm confused by where this appended value lives. Is it only a
> logical value or does the value show up in memory?
> For example, appending another null to the name field is only going to
> change the validity map, offsets array and length, and there will not be any
> changes to the values buffer.
>
> The value is logically there, but there's no additional values-buffer
> memory.
>
> Is that correct?
>
> On Fri, Feb 18, 2022 at 1:44 PM Micah Kornfield 
> wrote:
>
> > >
> > > It is definitely required according to my understanding, and to how the
> > > C++ implementation works.  The validation functions in the C++
> > > implementation also check for this (if a child buffer is too small for
> > > the number of values advertised by the parent, it is an error).
> >
> > +1.
> >
> > I think the wording is confusing.   "While a struct does not have
> physical
> > storage for each of its semantic slots" refers to the fact that all
> fields
> > in the struct are stored in separate child arrays and not as buffers on
> the
> > Struct array itself.  The actual value used in the child Array isn't
> > important if the struct is null, but it must be appended so the length of
> the
> > struct is equal to the length of all of its children.
> >
> > -Micah
> >
> > On Fri, Feb 18, 2022 at 10:39 AM Antoine Pitrou 
> > wrote:
> >
> > >
> > > Le 18/02/2022 à 19:29, Phillip Cloud a écrit :
> > > >
> > > > The description underneath the example says:
> > > >
> > > >> While a struct does not have physical storage for each of its
> semantic
> > > > slots
> > > >> (i.e. each scalar C-like struct), an entire struct slot can be set
> to
> > > > null via the validity bitmap.
> > > >
> > > > To me this suggests that appending a sentinel value to the values
> > buffer
> > > > for a field is allowed,
> > > > but not required.
> > > >
> > > > Am I understanding this correctly?
> > >
> > > It is definitely required according to my understanding, and to how the
> > > C++ implementation works.  The validation functions in the C++
> > > implementation also check for this (if a child buffer is too small for
> > > the number of values advertised by the parent, it is an error).
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> >
>


Re: [JavaScript] Appending Nulls to a Struct (Bug)

2022-02-18 Thread Jorge Cardoso Leitão
Isn't field-0 representing ["joe", None, None, "mark"]? validity is
"1001" and offsets [0,3,3,3,7]. My reading is that the values buffer is
"joemark" because we do not represent values in null slots.

Best,
Jorge


On Fri, Feb 18, 2022 at 7:07 PM Phillip Cloud  wrote:

> My read of the spec for structs [1] is that there is no requirement to have
> a value in child arrays where there are nulls, which suggests the
> implementation conforms to the spec here.
>
> The example emphasizes this by showing the VarBinary column data as
> "joemark" as opposed to something with explicit placeholder values in the null slots.
>
> [1]: https://arrow.apache.org/docs/format/Columnar.html#struct-layout
>
> On Fri, Feb 18, 2022 at 12:53 PM Dominik Moritz 
> wrote:
>
> >  Can someone clarify whether the spec is clear about the behavior?
> >
> > On Feb 18, 2022 at 07:23:19, Alfie Mountfield  wrote:
> >
> > > Hello all,
> > > I've raised a JIRA ticket (
> > > https://issues.apache.org/jira/browse/ARROW-15705)
> > > for this, but I'm still uncertain on my reading of the spec so I
> thought
> > > I'd ask here to confirm I've understood it correctly.
> > >
> > > I believe that child arrays should always be the same length as the
> > struct
> > > array? It seems that in the JS implementation of Arrow though, if you
> > add a
> > > null value to a StructBuilder, it only modifies the null-bitmap and
> > doesn't
> > > actually try to append the null-value to the children arrays. I'm
> > guessing
> > > this is a bug.
> > >
> > > If so, is there anything I need to do to get the PR I've opened (
> > > https://github.com/apache/arrow/pull/12451) in?
> > >
> > > Cheers,
> > > Alfie
> > >
> > > --
> > >
> > >
> > >
> > >
> > >
> > > *HASH,
> > > Inc. *is a Delaware-registered corporation. *HASH, Ltd.* is a UK
> > (England)
> > > registered company (No. 13003048). This message contains information
> > which
> > > may be confidential and privileged. Unless you are the intended
> recipient
> > > (or authorized to receive this message for the intended recipient), you
> > > may
> > > not use, copy, disseminate or disclose to anyone the message or any
> > > information contained in the message. If you have received the message
> in
> > > error, please advise the sender by reply e-mail, and delete the
> message.
> > >
> > >
> > >
> > >
> >
>


Re: [JavaScript] Appending Nulls to a Struct (Bug)

2022-02-18 Thread Jorge Cardoso Leitão
Hi Dominik,

That is my understanding - if it exists, the length of the validity must
equal the length of each field. Otherwise, it would be difficult to iterate
over the fields and validity together, since we would not have enough rows
in the fields for the validity.

I think that this is broader - when a validity is present, its length must
be equal to the number of slots of the array.

As far as I understand, the invariants of a StructArray are:

* its fields must have the same length
* if a validity is present, its length must equal the length of each field
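
A minimal sketch of checking these two invariants (lengths only, not real Arrow types):

fn validate_struct(field_lengths: &[usize], validity_length: Option<usize>) -> Result<(), String> {
    let first = *field_lengths.first().ok_or("a StructArray must have at least one field")?;
    if field_lengths.iter().any(|&len| len != first) {
        return Err("all fields must have the same length".to_string());
    }
    if let Some(validity) = validity_length {
        if validity != first {
            return Err("the validity length must equal the fields' length".to_string());
        }
    }
    Ok(())
}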

Best,
Jorge



On Fri, Feb 18, 2022 at 6:53 PM Dominik Moritz  wrote:

>  Can someone clarify whether the spec is clear about the behavior?
>
> On Feb 18, 2022 at 07:23:19, Alfie Mountfield  wrote:
>
> > Hello all,
> > I've raised a JIRA ticket (
> > https://issues.apache.org/jira/browse/ARROW-15705)
> > for this, but I'm still uncertain on my reading of the spec so I thought
> > I'd ask here to confirm I've understood it correctly.
> >
> > I believe that child arrays should always be the same length as the
> struct
> > array? It seems that in the JS implementation of Arrow though, if you
> add a
> > null value to a StructBuilder, it only modifies the null-bitmap and
> doesn't
> > actually try to append the null-value to the children arrays. I'm
> guessing
> > this is a bug.
> >
> > If so, is there anything I need to do to get the PR I've opened (
> > https://github.com/apache/arrow/pull/12451) in?
> >
> > Cheers,
> > Alfie
> >
> > --
> >
> >
> >
> >
> >
> > *HASH,
> > Inc. *is a Delaware-registered corporation. *HASH, Ltd.* is a UK
> (England)
> > registered company (No. 13003048). This message contains information
> which
> > may be confidential and privileged. Unless you are the intended recipient
> > (or authorized to receive this message for the intended recipient), you
> > may
> > not use, copy, disseminate or disclose to anyone the message or any
> > information contained in the message. If you have received the message in
> > error, please advise the sender by reply e-mail, and delete the message.
> >
> >
> >
> >
>


Re: [Discuss] Best practice for storing key-value metadata for Extension Types

2022-02-08 Thread Jorge Cardoso Leitão
Hi,

Great questions and write up. Thanks!

imo dragging a JSON reader and writer to read official extension types'
metadata seems overkill. The c data interface is expected to be quite low
level. Imo we should aim for a (non-human readable) binary format. For
non-official, imo you are spot on - use what best fits to the use-case or
application. If the application is storing other metadata in json, json may
make sense, in Python pickle is another option, flatbuffers or something
like that is also ok imo.

Wrt to binary, imo the challenge is:
* we state that backward incompatible changes to the c data interface
require a new spec [1]
* we state that the metadata is a binary string [2]
* a valid string is a subset of all valid byte arrays and thus removing "
*string*" from the spec is backward incompatible

If we write invalid utf8 to it and a reader assumes utf8 when reading it,
we trigger undefined behavior.

I was a bit surprised by ARROW-15613 - my understanding is that the c++
implementation is not following the spec, and if we at arrow2 were not
checking for utf8, we would be exposing a vulnerability (at least according
to Rust's standards). We just checked it out of luck (it is O(1), so why
not).

What is the concern with string-encoding binary like base64?

Given that one of our reference implementations is not following the spec
and there is value in allowing arbitrary bytes on the metadata values, we
may as well just update the spec to align with the reference
implementation? If we do that, I would suggest that we do it both in the c
data interface and the IPC specification, since imo it is quite important
that an extension can flow all the way through IPC and c data interface.

An alternative approach is to consider ARROW-15613 a bug and do not change
the spec - require consumers to encode the binary data in a string
representation like base64.

I just think it is important that we are consistent between the IPC and the
c data interface.

For reference, Polars uses base64 encoding of Python blobs (pickle,
pointers, etc.) because we enforce the spec on arrow2.
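
For reference, a minimal sketch of that base64 workaround (assuming the `base64` crate with its Engine API; this is not Polars' actual code):

use base64::{engine::general_purpose::STANDARD, Engine as _};

// Encode arbitrary bytes so that the metadata value is a valid string,
// and decode them back on the consumer side.
fn encode_metadata_value(raw: &[u8]) -> String {
    STANDARD.encode(raw)
}

fn decode_metadata_value(value: &str) -> Result<Vec<u8>, base64::DecodeError> {
    STANDARD.decode(value)
}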

Best,
Jorge

[1]
https://arrow.apache.org/docs/format/CDataInterface.html#updating-this-specification
[2]
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
[ARROW-15613] https://issues.apache.org/jira/browse/ARROW-15613


Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-02-08 Thread Jorge Cardoso Leitão
Note that we do not have tests on tensor arrays, so testing the extension
type on these may be hindered by divergences between implementations. I do
not think we even have json integration files for them.

If the focus is extension types, maybe it would be best to cover types
whose physical representations are covered in e.g. IPC or c data interface
tests.

I do not know if we voted on a naming convention, but we may want to
reserve a namespace for us (e.g. "arrow").

Also, note that Rust's arrow2 supports extension types (tested as part of the
IPC and c data interface*), and Polars relies on it to allow Python generic
"object" in its machinery.

Best,
Jorge

* pending https://issues.apache.org/jira/browse/ARROW-15613



On Tue, Feb 8, 2022, 13:52 Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> On Mon, 7 Feb 2022 at 21:02, Rok Mihevc  wrote:
>
> > To follow up the discussion from the bi-weekly Arrow sync:
> >
> > - JSON seems the most suitable candidate for the extension metadata.
> > E.g.: TensorArray
> > {"key": "ARROW:extension:name", "value": "tensor<int64, shape=(3, 3, 4), strides=(12, 4, 1)>"},
> > {"key": "ARROW:extension:metadata", "value": "{'type': 'int64',
> > 'shape': [3, 3, 4], 'strides': [12, 4, 1]}"}
> >
>
> I will start a separate thread for the exact encoding of the metadata value
> (i.e. JSON or something else) if that's OK. I already started writing one
> last week anyway, and that keeps things a bit separated.
>
> For the name of the extension type:
> - We might want to use something like "arrow.tensor" to follow the
> recommendation at
> https://arrow.apache.org/docs/format/Columnar.html#extension-types to use
> a
> namespace. And so for "well known" extension types that are defined in the
> Arrow project itself, I think we can use the "arrow" namespace? (as
> example, for the extension types defined in pandas, I used the "pandas."
> namespace)
> - In general, I think it's best to keep the name itself simple, and leave
> any parametrization out of it (since this is included in the metadata). So
> in this case that would be just "tensor" instead of "tensor<..., shape=..., ..>".
> - Specifically for this extension type, we might want to use something like
> "fixed_size_tensor" instead of "tensor", to be able to differentiate in the
> future between the tensor type with constant shape vs variable shape
> (ARROW-1614 vs ARROW-8714). But that's something
> to discuss in the relevant JIRA issue / PR.
>
> - We want to start with at least one integration test pair. Potential
> > candidates are cpp, julia, go, rust.
> >
>
> Rust does not yet seem to support extension types? (
> https://github.com/apache/arrow-rs/issues/218)
>
>
> > - First well known extension type candidate is TensorArray but other
> > suggestions are welcome.
> >
>
> Others that I am aware of that have been brought up in the past are UUID
> (ARROW-2152), complex numbers (ARROW-638, this has a PR) and 8-bit boolean
> values (ARROW-1674). But I think we should
> mainly look at demand / someone wanting to implement this, and (for you)
> this seems to be Tensors, so it's fine to focus on that.
>
> Joris
>
>
> >
> > On Tue, Jan 25, 2022 at 10:34 AM Antoine Pitrou 
> > wrote:
> > >
> > >
> > > Le 25/01/2022 à 10:12, Joris Van den Bossche a écrit :
> > > > On Sat, 22 Jan 2022 at 20:27, Rok Mihevc 
> wrote:
> > > >>
> > > >> Thanks for the input Weston!
> > > >>
> > > >> How about arrow/experimental/format/ExtensionTypes.fbs or
> > > >> arrow/format/ExtensionTypes.fbs for language independent schema and
> > > >> loosely arrow/<language>/extensions for implementations?
> > > >>
> > > >> Having machine readable definitions could perhaps be useful for
> > > >> generating implementations in some cases.
> > > >
> > > > Is it useful to put this in a flatbuffer file? Based on the list from
> > > > Weston just below, I think this will mostly contain a *description*
> of
> > > > those different aspect (a specification of the extension type), and
> > > > there is no data that actually fits in a flatbuffer table? In that
> > > > case a plain text (eg markdown) file seems more fitting?
> > >
> > > I agree this is mostly a plain text (or, rather, reST :-))
> specification
> > > task.
> > >
> > > Regards
> > >
> > > Antoine.
> >
>


Re: [ANNOUNCE] New Arrow PMC chair: Kouhei Sutou

2022-01-25 Thread Jorge Cardoso Leitão
Thank you so much for all your contributions to open source and to Apache
Arrow in particular, and for accepting taking this role.

On Tue, Jan 25, 2022 at 7:10 PM QP Hou  wrote:

> Congrats Kou, very well deserved.
>
> On Tue, Jan 25, 2022 at 9:53 AM Benson Muite 
> wrote:
> >
> > Congratulations Kou!
> > On 1/25/22 8:44 PM, Vibhatha Abeykoon wrote:
> > > Congrats Kou!
> > >
> > >
> > > On Tue, Jan 25, 2022 at 11:13 PM Ian Joiner 
> wrote:
> > >
> > >> Congrats Kou!
> > >>
> > >> On Tuesday, January 25, 2022, Wes McKinney 
> wrote:
> > >>
> > >>> I am pleased to announce that we have a new PMC chair and VP as per
> > >>> our newly started tradition of rotating the chair once a year. I have
> > >>> resigned and Kouhei was duly elected by the PMC and approved
> > >>> unanimously by the board. Please join me in congratulating Kou!
> > >>>
> > >>> Thanks,
> > >>> Wes
> > >>>
> > >>
> > >
> >
>


Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2022-01-19 Thread Jorge Cardoso Leitão
I have prototyped the sequence views in Rust [1], and it seems a pretty
straightforward addition with a trivial representation in both IPC and FFI.

I did observe a performance difference between using signed (int64) and
unsigned (uint64) offsets/lengths:

take/sequence/20        time:   [20.491 ms 20.800 ms 21.125 ms]

take/sequence_signed/20 time:   [22.719 ms 23.142 ms 23.593 ms]

take/array/20   time:   [44.454 ms 45.056 ms 45.712 ms]

where 20 means 2^20 entries,
* array is our current array
* sequence is a sequence view of utf8 with uint64 indices, and
* sequence_signed is the same sequence view layout but with int64 indices

I.e. I observe a ~10% loss to support signed offsets/lengths. Details in
[2].

Best,
Jorge

[1] https://github.com/jorgecarleitao/arrow2/pull/784
[2] https://github.com/DataEngineeringLabs/arrow-string-view

On Wed, Jan 12, 2022 at 2:34 PM Andrew Lamb  wrote:

> I also agree that splitting the StringView proposal into its own thing
> would be beneficial for discussion clarity
>
> On Wed, Jan 12, 2022 at 5:34 AM Antoine Pitrou  wrote:
>
> >
> > Le 12/01/2022 à 01:49, Wes McKinney a écrit :
> > > hi all,
> > >
> > > Thank you for all the comments on this mailing list thread and in the
> > > Google document. There is definitely a lot of work to take some next
> > > steps from here, so I think it would make sense to fork off each of
> > > the proposed additions into dedicated discussions. The most
> > > contentious issue, it seems, is whether to maintain a 1-to-1
> > > relationship between the IPC format and the C ABI, which would make it
> > > rather difficult to implement the "string view" data type in a way
> > > that is flexible and useful to applications (for example, giving them
> > > control over their own memory management as opposed to forcing data to
> > > be "pre-serialized" into buffers that are referenced by offsets).
> > >
> > > I tend to be of the "practicality beats purity" mindset, where
> > > sufficiently beneficial changes to the in-memory format (and C ABI)
> > > may be worth breaking the implicit contract where the IPC format and
> > > the in-memory data structures have a strict 1-to-1 relationship. I
> > > suggest to help reach some consensus around this that I will create a
> > > new document focused only on the "string/binary view" type and the
> > > different implementation considerations (like what happens when you
> > > write it to the IPC format), as well as the different variants of the
> > > data structure itself that have been discussed with the associated
> > > trade-offs. Does this sound like a good approach?
> >
> > Indeed, this sounds like it will help making a decision.
> >
> > Personally, I am still very concerned by the idea of adding pointers to
> > the in-memory representation. Besides the loss of equivalence with the
> > IPC format, a representation using embedded pointers cannot be fully
> > validated for safety or correctness (how do you decide whether a pointer
> > is correct and doesn't reveal unrelated data?).
> >
> > I think we should discuss this with the DuckDB folks (and possibly the
> > Velox folks, but I guess that it might socio-politically more difficult)
> > so as to measure how much of an inconvenience it would be for them to
> > switch to a purely offsets-based approach.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > >
> > > Thanks,
> > > Wes
> > >
> > >
> > > On Sat, Jan 8, 2022 at 7:30 AM Jorge Cardoso Leitão
> > >  wrote:
> > >>
> > >> Fair enough (wrt to deprecation). Think that the sequence view is a
> > >> replacement for our existing (that allows O(N) selections), but I
> agree
> > >> with the sentiment that preserving compatibility is more important
> than
> > a
> > >> single way of doing it. Thanks for that angle!
> > >>
> > >> Imo the Arrow format is already composed of 3 specifications:
> > >>
> > >> * C data interface (intra-process communication)
> > >> * IPC format (inter-process communication)
> > >> * Flight (RPC protocol)
> > >>
> > >> E.g.
> > >> * IPC requires a `dict_id` in the fields declaration, but the C data
> > >> interface has no such requirement (because, pointers)
> > >> * IPC accepts endian and compression, the C data interface does not
> > >> * DataFusion does not support IPC (yet ^_^), but its Python bindings

Re: [RUST][DataFusion][Arrow] Switching DataFusion to use arrow2 implementation and the future of arrow

2022-01-19 Thread Jorge Cardoso Leitão
Hi,

Thank you for raising this here and for your comments. I am very humbled by
the feedback and adoption that arrow2 got so far.

My current hypothesis is that arrow2 will be donated to Apache Arrow, I
just don't feel comfortable and have the energy doing so right now.

Thank you for your understanding,
Jorge


On Mon, Jan 17, 2022 at 6:21 PM Wes McKinney  wrote:

> Sounds good, thanks all and look forward to hearing more about this.
>
> To second what Micah said, a reminder to please engage with civility.
> The ASF's code of conduct is found here [1]. We are all volunteering
> our time to try to do what is best for the developer and user
> communities' long-term health and success.
>
> [1]: https://www.apache.org/foundation/policies/conduct.html
>
>
> On Mon, Jan 17, 2022 at 6:07 AM Andrew Lamb  wrote:
> >
> > For what it is worth, I personally would likely spend much less time
> > maintaining arrow-rs if datafusion switched to arrow2. That discussion is
> > happening independently here [1].
> >
> > [1] https://github.com/apache/arrow-datafusion/issues/1532
> >
> > On Sun, Jan 16, 2022 at 11:17 PM Micah Kornfield 
> > wrote:
> >
> > > I agree, Jorge's point of view (and anyone else who has contributed to
> > > arrow2) is important here.
> > >
> > > One thing that isn't exactly clear to me from the linked issue is how
> much
> > > interest there is in the community for maintaining arrow-rs?  How much
> is a
> > > donation of arrow2 a factor here?
> > >
> > > Also, the last time we discussed this on the mailing list I think
> things
> > > got contentious.  Just a reminder please communicate with care.  [1]
> has
> > > some good advice on this.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] https://community.apache.org/contributors/etiquette
> > >
> > > On Sun, Jan 16, 2022 at 6:51 PM QP Hou  wrote:
> > >
> > > > Hi Wes,
> > > >
> > > > I believe what you mentioned is the plan, i.e. move arrow2 to ASF in
> > > > the long run when it stabilizes on its design/API and could benefit
> > > > from a more rigorous release process. From what I have seen, the
> > > > project is still undergoing major API changes on a monthly basis, so
> > > > quick releases and fast user feedback is quite valuable. But let's
> > > > hear Jorge's point of view on this first.
> > > >
> > > > On Sun, Jan 16, 2022 at 2:42 PM Wes McKinney 
> > > wrote:
> > > > >
> > > > > Is there a possibility of donating arrow2 to the Arrow project (at
> > > > > some point)? The main impact to development would be holding votes
> on
> > > > > releases, but this is probably a good thing long term from a
> > > > > governance standpoint. The answer may be "not right now" and that's
> > > > > fine. Having many of the same people split across projects with
> > > > > different governance structures is less than ideal.
> > > > >
> > > > > On Fri, Jan 14, 2022 at 1:15 PM Andrew Lamb 
> > > > wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I wanted to draw your attention to two issues of significance to
> the
> > > > Rust
> > > > > > Arrow implementations
> > > > > >
> > > > > > Discussion for switching DataFusion to use arrow2:
> > > > > > https://github.com/apache/arrow-datafusion/issues/1532
> > > > > >
> > > > > > Discussion for what to do with arrow if DataFusion switches to
> use
> > > > arrow2:
> > > > > > https://github.com/apache/arrow-rs/issues/1176
> > > > > >
> > > > > > The second is likely the most pertinent to people on this mailing
> > > > list, but
> > > > > > the first is the reason why the second has become important.
> > > > > >
> > > > > > Andrew
> > > >
> > >
>


Re: [VOTE][RUST] Release Apache Arrow Rust 7.0.0 RC1

2022-01-11 Thread Jorge Cardoso Leitão
+1

On Tue, Jan 11, 2022 at 9:17 PM QP Hou  wrote:

> +1 (non-binding)
>
> On Mon, Jan 10, 2022 at 3:14 PM Andy Grove  wrote:
> >
> > +1 (binding)
> >
> > Thanks,
> >
> > Andy.
> >
> > On Sat, Jan 8, 2022 at 3:43 AM Andrew Lamb  wrote:
> >
> > > Hi,
> > >
> > > I would like to propose a release of Apache Arrow Rust Implementation,
> > > version 7.0.0.
> > >
> > > This release candidate is based on commit:
> > > 719096b2d342dd3bf1f3f2226a26b93e19602852 [1]
> > >
> > > The proposed release tarball and signatures are hosted at [2].
> > >
> > > The changelog is located at [3].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. There is a script [4] that automates some of
> > > the verification.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow Rust
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > >
> > > [1]:
> > >
> > >
> https://github.com/apache/arrow-rs/tree/719096b2d342dd3bf1f3f2226a26b93e19602852
> > > [2]:
> > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-7.0.0-rc1
> > > [3]:
> > >
> > >
> https://github.com/apache/arrow-rs/blob/719096b2d342dd3bf1f3f2226a26b93e19602852/CHANGELOG.md
> > > [4]:
> > >
> > >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > >
>


Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2022-01-08 Thread Jorge Cardoso Leitão
Fair enough (wrt to deprecation). Think that the sequence view is a
replacement for our existing (that allows O(N) selections), but I agree
with the sentiment that preserving compatibility is more important than a
single way of doing it. Thanks for that angle!

Imo the Arrow format is already composed of 3 specifications:

* C data interface (intra-process communication)
* IPC format (inter-process communication)
* Flight (RPC protocol)

E.g.
* IPC requires a `dict_id` in the fields declaration, but the C data
interface has no such requirement (because, pointers)
* IPC accepts endian and compression, the C data interface does not
* DataFusion does not support IPC (yet ^_^), but its Python bindings
leverage the C data interface to pass data to pyarrow

This to say that imo as long as we document the different specifications
that compose Arrow and their intended purposes, it is ok. Because the c
data interface is the one with the highest constraints (zero-copy, higher
chance of out of bound reads, etc.), it makes sense for proposals (and
implementations) first be written against it.


I agree with Neal's point wrt to the IPC. For extra context, many `async`
implementations use cooperative scheduling, which are vulnerable to DOS if
they need to perform heavy CPU-bound tasks (as the p-thread is blocked and
can't switch). QP Hou and I have summarized a broader version of this
statement here [1].

In async contexts, if deserializing from IPC requires a significant amount
of compute, that task should be sent to a separate thread pool to avoid
blocking the p-threads assigned to the runtime's thread pool. If the format
is O(1) in CPU-bound work, its execution can be done in an async context
without a separate thread pool. Arrow's IPC format is quite unique there in
that it almost always requires O(1) CPU work to be loaded to memory (at the
expense of more disk usage).

I believe that atm we have two O(N) blocking tasks in reading IPC format
(decompression and byte swapping (big <-> little endian)), and three O(N)
blocking tasks in writing (compression, de-offset bitmaps, byte swapping).
The more prevalent O(N) CPU-bound tasks are at the IPC interface, the less
compelling it becomes vs e.g. parquet (file) or avro (stream), which have
an expectation of CPU-bound work. In this context, keeping the IPC format
compatible with the ABI spec is imo an important characteristic of Apache
Arrow that we should strive to preserve. Alternatively, we could also just
abandon this idea and say that the format expects CPU-bound tasks to
deserialize (even if considerably smaller than avro or parquet), so that
people can design the APIs accordingly.
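
To make the scheduling point concrete, a minimal sketch of the "separate thread pool" pattern (assuming tokio; the closure is a stand-in for the O(N) work, not Arrow's IPC reader):

// Offload a CPU-bound step (e.g. decompression or byte swapping) to the blocking
// pool so that the async worker threads can keep switching between tasks.
async fn decode_block(compressed: Vec<u8>) -> Result<Vec<u8>, tokio::task::JoinError> {
    tokio::task::spawn_blocking(move || {
        // stand-in for the CPU-bound deserialization described above
        compressed.iter().map(|b| b.wrapping_add(1)).collect()
    })
    .await
}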

Best,
Jorge

[1]
https://jorgecarleitao.medium.com/how-to-efficiently-load-data-to-memory-d65ee359196c


On Sun, Dec 26, 2021 at 5:31 PM Antoine Pitrou  wrote:

>
>
> Le 23/12/2021 à 17:59, Neal Richardson a écrit :
> >> I think in this particular case, we should consider the C ABI /
> >> in-memory representation and IPC format as separate beasts. If an
> >> implementation of Arrow does not want to use this string-view array
> >> type at all (for example, if it created memory safety issues in Rust),
> >> then it can choose to convert to the existing string array
> >> representation when receiving a C ABI payload. Whether or not there is
> >> an alternate IPC format for this data type seems like a separate
> >> question -- my preference actually would be to support this for
> >> in-memory / C ABI use but not to alter the IPC format.
> >>
> >
> > I think this idea deserves some clarification or at least more
> exposition.
> > On first reading, it was not clear to me that we might add things to the
> > in-memory Arrow format but not IPC, that that was even an option. I'm
> > guessing I'm not the only one who missed that.
> >
> > If these new types are only part of the Arrow in-memory format, then it's
> > not the case that reading/writing IPC files involves no serialization
> > overhead. I recognize that that's technically already the case since IPC
> > supports compression now, but it's not generally how we talk about the
> > relationship between the IPC and in-memory formats (see our own FAQ [1],
> > for example). If we go forward with these changes, it would be a good
> > opportunity for us to clarify in our docs/website that the "Arrow format"
> > is not a single thing.
>
> I'm worried that making the "Arrow format" polysemic/context-dependent
> would spread a lot of confusion among potential users of Arrow.
>
> Regards
>
> Antoine.
>


Re: [DataFusion] Question about Accumulator API and maybe potential bugs

2022-01-03 Thread Jorge Cardoso Leitão
Hi,

The accumulator API is designed to accept multiple columns (e.g. the
Pearson correlation takes 2 columns, not one). &values[0] corresponds to
the first column passed to the accumulator. All concrete implementations of
accumulators in DataFusion atm only accept one column (Sum, Avg, Count,
Min, Max), but the API is designed to work with multiple columns.

So, update_batch(&mut self, values: &[ArrayRef]) corresponds to updating the
accumulator from n columns. For sum, this would be 1, for Pearson
correlation this would be 2, and for e.g. an ML model whose weights are computed
over all columns, this would be the number of input columns N of the model.
For stddev, you should use 1, since stddev is a function of a single
column.

`update(&mut self, values: &[ScalarValue])` corresponds to updating the
state with intermediary states. In a HashAggregate, we reduce each
partition, and use `update` to compute the final value from the
intermediary (scalar) states.
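
As a toy illustration of the multi-column case (plain slices instead of ArrayRef/ScalarValue, and a made-up accumulator, so this does not implement the actual trait):

// Accumulates the sums needed for a Pearson-style statistic over two columns.
// `update_batch` receives one slice per input column: values[0] = x, values[1] = y.
#[derive(Default)]
struct CorrAccumulator {
    n: u64,
    sum_x: f64,
    sum_y: f64,
    sum_xy: f64,
}

impl CorrAccumulator {
    fn update_batch(&mut self, values: &[&[f64]]) {
        let (x, y) = (values[0], values[1]);
        for (x, y) in x.iter().zip(y.iter()) {
            self.n += 1;
            self.sum_x += x;
            self.sum_y += y;
            self.sum_xy += x * y;
        }
    }
}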

Hope this helps,
Jorge



On Tue, Jan 4, 2022 at 5:55 AM LM  wrote:

> Hi All,
>
> I just started looking into DataFusion and am considering using it as the
> platform for our next gen analytics solution. To get started, I tried to
> add a few functions such as stddev. While writing the code I noticed some
> discrepancies (it may also be my unfamiliarity of the code base) in the
> Accumulator API and the implementation of some functions. The API is
> defined as the following:
>
> pub trait Accumulator: Send + Sync + Debug {
>     /// Returns the state of the accumulator at the end of the accumulation.
>     // in the case of an average on which we track `sum` and `n`, this function
>     // should return a vector of two values, sum and n.
>     fn state(&self) -> Result<Vec<ScalarValue>>;
>
>     /// updates the accumulator's state from a vector of scalars.
>     fn update(&mut self, values: &[ScalarValue]) -> Result<()>;
>
>     /// updates the accumulator's state from a vector of arrays.
>     fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
>         if values.is_empty() {
>             return Ok(());
>         };
>         (0..values[0].len()).try_for_each(|index| {
>             let v = values
>                 .iter()
>                 .map(|array| ScalarValue::try_from_array(array, index))
>                 .collect::<Result<Vec<_>>>()?;
>             self.update(&v)
>         })
> I am only quoting the update and update_batch functions for brevity, same
> for the merge functions. So here it indicates that the update function
> takes a *vector* and update_batch takes a *vector of arrays*.
>
> When reading code for some actual implementation, for example *sum* and
> *average*,
> both implementations assume that when update is called *only one* value is
> passed in; and when update_batch is called *only one* array is passed in.
>
> impl Accumulator for AvgAccumulator {
>     fn state(&self) -> Result<Vec<ScalarValue>> {
>         Ok(vec![ScalarValue::from(self.count), self.sum.clone()])
>     }
>
>     fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
>         let values = &values[0];
>
>         self.count += (!values.is_null()) as u64;
>         self.sum = sum::sum(&self.sum, values)?;
>
>         Ok(())
>     }
>
>     fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
>         let values = &values[0];
>
>         self.count += (values.len() - values.data().null_count()) as u64;
>         self.sum = sum::sum(&self.sum, &sum::sum_batch(values)?)?;
>         Ok(())
>
> impl Accumulator for SumAccumulator {
>     fn state(&self) -> Result<Vec<ScalarValue>> {
>         Ok(vec![self.sum.clone()])
>     }
>
>     fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
>         // sum(v1, v2, v3) = v1 + v2 + v3
>         self.sum = sum(&self.sum, &values[0])?;
>         Ok(())
>     }
>
>     fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
>         let values = &values[0];
>         self.sum = sum(&self.sum, &sum_batch(values)?)?;
>         Ok(())
>     }
>
> Could someone shed some light in case I missed anything?
>
> Regards,
> Lin
>


Re: [VOTE][RUST] Release Apache Arrow Rust 6.5.0 RC1

2021-12-28 Thread Jorge Cardoso Leitão
+1

Thanks,
Jorge


On Fri, Dec 24, 2021 at 3:21 AM Wang Xudong  wrote:

> +1 (non-binding)
>
> Happy holidays
>
> ---
> xudong
>
> Andy Grove  wrote on Friday, December 24, 2021 at 09:19:
>
> > +1 (binding)
> >
> > Thanks,
> >
> > Andy.
> >
> > On Thu, Dec 23, 2021 at 2:26 PM Andrew Lamb 
> wrote:
> >
> > > Hi,
> > >
> > > Happy holidays to those of you who are celebrating. I would like to
> > propose
> > > a release of Apache Arrow Rust Implementation, version 6.5.0.
> > >
> > > This release candidate is based on commit:
> > > 70069c62f03b74d5e05ec75b808086edeefeecaf [1]
> > >
> > > The proposed release tarball and signatures are hosted at [2].
> > >
> > > The changelog is located at [3].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. There is a script [4] that automates some of
> > > the verification.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow Rust
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > >
> > > [1]:
> > >
> > >
> >
> https://github.com/apache/arrow-rs/tree/70069c62f03b74d5e05ec75b808086edeefeecaf
> > > [2]:
> > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-6.5.0-rc1
> > > [3]:
> > >
> > >
> >
> https://github.com/apache/arrow-rs/blob/70069c62f03b74d5e05ec75b808086edeefeecaf/CHANGELOG.md
> > > [4]:
> > >
> > >
> >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > > -
> > > Running rat license checker on
> > >
> > >
> >
> /Users/alamb/Software/arrow-rs/dev/dist/apache-arrow-rs-6.5.0-rc1/apache-arrow-rs-6.5.0.tar.gz
> > >
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Daniël Heres

2021-12-21 Thread Jorge Cardoso Leitão
Congratulations!!

On Tue, Dec 21, 2021 at 5:24 PM Andrew Lamb  wrote:

> Congratulations Daniël ! Well deserved
>
> On Tue, Dec 21, 2021 at 12:18 PM Wes McKinney  wrote:
>
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Daniël Heres to become a PMC member and we are pleased to announce
> > that Daniël has accepted.
> >
> > Congratulations and welcome!
> >
>


Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-15 Thread Jorge Cardoso Leitão
Hi,

Thanks a lot for this initiative and the write up.

I did a small bench for the sequence view and added a graph to the document
for evidence of what Wes is writing wrt to performance of "selection / take
/ filter".

Big +1 in replacing our current representation of variable-sized arrays by
the "sequence view". atm I am -0.5 in adding it without removing the
[Large]Utf8Array / Binary / List, as I see the advantages as sufficiently
large to break compatibility and deprecate the previous representations
(and do not enjoy maintaining multiple similar representations that solve
very similar problems).

Likewise, +1 for the RLE and -0.5 for the constant array, as the latter
seems redundant to me (it is an RLE).

Wrt to the string view: would like to run some benches on that too. Could
someone clarify what are the "good cases" for that one?

More generally, I second the point made by Antoine: there is already some
fragmentation over the types in the official implementations (see [1]), and
we do not even have a common integration test suite for the c data
interface. One approach to this dimension is to *deprecate*
representations, which goes into the direction mentioned above.

Wrt to design, we could consider a separate enum for the RLE vs plain
encoding, as they are not really semantic types (the dictionary is also not
a semantic type but it is represented as one in at least the Rust
implementation, unfortunately).

Wrt to Rust impl in particular, I do not think that the String View poses a
problem - Rust can layout according to the C representation. Here [2] is
the corresponding Rust code of the struct in the doc (generated via Rust's
bindgen [3]).

Thanks again for this, looking very much forward to it!

[1]
https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1546
[2]
https://github.com/DataEngineeringLabs/arrow-string-view/blob/main/src/string_view.rs
[3] https://rust-lang.github.io/rust-bindgen/command-line-usage.html


On Wed, Dec 15, 2021 at 3:15 AM Wes McKinney  wrote:

> Ultimately, the problem comes down to providing a means of O(#
> records) selection (take, filter) performance and memory use for
> non-numeric data (strings, arrays, maps, etc.).
>
> DuckDB and Velox are two projects which have designed themselves to be
> very nearly Arrow-compatible but have implemented alternative memory
> layouts to achieve O(# records) selections on all data types. I am
> proposing to adopt these innovations as additional memory layouts in
> Arrow with a target of zero-copy across the C ABI — how exactly they
> are translated to the IPC format seems less of an immediate benefit
> than enabling the in-memory performance/memory use optimization since
> query engines can accelerate performance with faster selections. If
> there are some alternative proposals to achieve O(# records) time and
> space complexity for selection operations, let's definitely look at
> them.
>
>
> On Tue, Dec 14, 2021 at 8:02 PM Weston Pace  wrote:
> >
> > Would it be simpler to change the spec so that child arrays can be
> > chunked?  This might reduce the data type growth and make the intent
> > more clear.
> >
> > This will add another dimension to performance analysis.  We pretty
> > regularly get issues/tickets from users that have unknowingly created
> > parquet files with poor row group resolution (e.g. 50 rows per row
> > group) and experience rotten performance as a result.  I suspect
> > something similar could happen here.  It sounds like arrays will
> > naturally subdivide over time.  Users might start seeing poor
> > performance without realizing the root cause is because their 1
> > million element array has been split into 10,000 allocations of 100
> > elements.  However, I suspect this is something that could be managed
> > with visibility and recompaction utilities.
> >
> >
> > On Tue, Dec 14, 2021 at 1:22 PM Wes McKinney 
> wrote:
> > >
> > > hi folks,
> > >
> > > A few things in the general discussion, before certain things will
> > > have to be split off into their own dedicated discussions.
> > >
> > > It seems that I didn't do a very good job of motivating the "sequence
> > > view" type. Let me take a step back and discuss one of the problems
> > > these new memory layouts are solving.
> > >
> > > In Arrow currently, selection operations ("take", "filter", or
> > > indirect sort — the equivalent of arr.take(argsort(something_else)) if
> > > you're coming from NumPy) have time complexity proportional to the
> > > number of records for primitive types and complexity proportional to
> > > the greater of max(# records, memory size) for nested types.
> > >
> > > So, for example:
> > >
> > > * Take(arr, indices) has O(# records) complexity for primitive types
> > > and does O(# records) memory allocation
> > > * Take(arr, indices) has O(max(# records, size of memory buffers /
> > > child arrays)) complexity for strings and nested types and does O(size
> > > of memory buffers) memory allo

Re: [VOTE][RUST] Release Apache Arrow Rust 6.4.0 RC1

2021-12-13 Thread Jorge Cardoso Leitão
+1

Thank you!

On Fri, Dec 10, 2021 at 9:50 PM Andy Grove  wrote:

> +1 (binding)
>
> Thank you, Andrew, and everyone else involved in the input validation work.
> This definitely helps address one of the biggest criticisms of the crate.
>
> Andy.
>
> On Fri, Dec 10, 2021 at 12:30 PM Andrew Lamb  wrote:
>
> > Hi,
> >
> > I would like to propose a release of Apache Arrow Rust Implementation,
> > version 6.4.0. I am personally excited about this release as it contains
> > the input validation needed to close out the currently open RUSTSEC
> issues
> > for Arrow [5].
> >
> > This release candidate is based on commit:
> > 7a0bca35239f1d4fc3a1dca410384a1e5e962147 [1].
> >
> > The proposed release tarball and signatures are hosted at [2].
> >
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. There is a script [4] that automates some of
> > the verification.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow Rust
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Rust  because...
> >
> > [1]:
> >
> >
> https://github.com/apache/arrow-rs/tree/7a0bca35239f1d4fc3a1dca410384a1e5e962147
> > [2]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-6.4.0-rc1
> > [3]:
> >
> >
> https://github.com/apache/arrow-rs/blob/7a0bca35239f1d4fc3a1dca410384a1e5e962147/CHANGELOG.md
> > [4]:
> >
> >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > [5]: https://github.com/rustsec/advisory-db/tree/main/crates/arrow
> >
>


Re: [ANNOUNCE] New Arrow committer: Rémi Dattai

2021-12-07 Thread Jorge Cardoso Leitão
Congrats!

On Wed, Dec 8, 2021 at 8:14 AM Daniël Heres  wrote:

> Congrats Rémi!
>
> On Wed, Dec 8, 2021, 04:27 Ian Joiner  wrote:
>
> > Congrats!
> >
> > On Tuesday, December 7, 2021, Wes McKinney  wrote:
> >
> > > On behalf of the Arrow PMC, I'm happy to announce that Rémi Dattai has
> > > accepted an invitation to become a committer on Apache Arrow. Welcome,
> > > and thank you for your contributions!
> > >
> > > Wes
> > >
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Joris Van den Bossche

2021-11-17 Thread Jorge Cardoso Leitão
Congratulations!

On Thu, Nov 18, 2021 at 3:34 AM Ian Joiner  wrote:

> Congrats Joris and really thanks for your effort in integrating ORC and
> dataset!
>
> Ian
>
> > On Nov 17, 2021, at 5:55 PM, Wes McKinney  wrote:
> >
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Joris Van den Bossche to become a PMC member and we are pleased to
> > announce that Joris has accepted.
> >
> > Congratulations and welcome!
>
>


Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-17 Thread Jorge Cardoso Leitão
What are the tradeoffs between a low and a large row group size?

Is it that a low value allows for quicker random access (as we can seek row
groups based on the number of rows they have), while a larger value allows
for higher dict-encoding and compression ratios?

Best,
Jorge




On Wed, Nov 17, 2021 at 9:11 PM Jonathan Keane  wrote:

> This doesn't address the large number of row groups ticket that was
> raised, but for some visibility: there is some work to change the row
> group sizing based on the size of data instead of a static number of
> rows [1] as well as exposing a few more knobs to tune [2]
>
> There is a bit of prior art in the R implementation for attempting to
> get a reasonable row group size based on the shape of the data
> (basically, aims to have row groups that have 250 Million cells in
> them). [3]
>
> [1] https://issues.apache.org/jira/browse/ARROW-4542
> [2] https://issues.apache.org/jira/browse/ARROW-14426 and
> https://issues.apache.org/jira/browse/ARROW-14427
> [3]
> https://github.com/apache/arrow/blob/641554b0bcce587549bfcfd0cde3cb4bc23054aa/r/R/parquet.R#L204-L222
>
> -Jon
>
> On Wed, Nov 17, 2021 at 4:35 AM Joris Van den Bossche
>  wrote:
> >
> > In addition, would it be useful to be able to change this
> max_row_group_length
> > from Python?
> > Currently that writer property can't be changed from Python, you can only
> > specify the row_group_size (chunk_size in C++)
> > when writing a table, but that's currently only useful to set it to
> > something that is smaller than the max_row_group_length.
> >
> > Joris
>


Re: Synergies with Apache Avro?

2021-11-16 Thread Jorge Cardoso Leitão
>
> I haven't looked at it for a while but my recollection, at least in java,
> is a streaming process for each step outlined rather than a batch process
> (i.e. decompress some bytes, then decode them lazily as "Next Row" is
> called).


Sorry for the late reply, it took me a bit to go through the relevant parts
of the Java implementation. I agree that the deserialization of items
within a block is done on a per-item basis, and can even re-use a previously
allocated item [1]. From what I can read, the blocks are still read to
memory as whole chunks via `nextRawBlock` [2]. I.e. from a row-oriented
processing perspective, the stream is still composed of blocks that are first
read into memory and then deserialized row by row (and item by item within a row).

Do you have a target system in mind?  As I said for columnar/arrow native
> query engines this obviously sounds like a win, but for row oriented
> processing engines, the transposition costs are going to eat into any gains.
>

I agree - I was thinking in terms of columnar query engines aiming at
leveraging simd and data locality.

That being said, I'd love to see real world ETL pipeline benchmarks :)
>

Definitely. This was an educational exercise.

[1]
https://github.com/apache/avro/blob/42822886c28ea74a744abb7e7a80a942c540faa5/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader12.java#L160
[2] https://github.com/apache/avro/blob/42822886c28ea74a744abb7e7a80a942c540faa5/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java#L213
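The loop shape discussed in this thread ("read block -> decompress block -> decode item by item -> read next block") can be sketched as follows; every type and helper below is a placeholder rather than the Avro Java or Rust API, and the sketch only illustrates where the whole-block materialization happens relative to the lazy per-row decoding:

```rust
// Placeholder for a block as read from the stream: a row count plus the
// compressed bytes of the whole block, materialized in memory.
struct RawBlock {
    row_count: usize,
    compressed: Vec<u8>,
}

fn next_raw_block(blocks: &mut impl Iterator<Item = RawBlock>) -> Option<RawBlock> {
    // Placeholder: a real reader parses the block header (row count, byte
    // length) and reads that many bytes from the input.
    blocks.next()
}

fn decompress(block: &RawBlock) -> Vec<u8> {
    block.compressed.clone() // placeholder: deflate/snappy/zstd would go here
}

fn decode_row(_bytes: &[u8], _row: usize) {
    // Placeholder: a real decoder walks the row's fields lazily, possibly
    // re-using a previously allocated record.
}

fn main() {
    let mut blocks =
        vec![RawBlock { row_count: 4000, compressed: vec![0u8; 16 * 1024] }].into_iter();
    while let Some(block) = next_raw_block(&mut blocks) {
        // The (compressed) block is materialized as a whole...
        let bytes = decompress(&block);
        for row in 0..block.row_count {
            // ...but rows within it are decoded one by one.
            decode_row(&bytes, row);
        }
    }
}
```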


On Tue, Nov 2, 2021 at 6:41 PM Micah Kornfield 
wrote:

> Wrt to row iterations and native rows: my understanding is that even
>> though most Avro APIs present themselves as iterators of rows, internally
>> they read a whole compressed serialized block into memory, decompress it,
>> and then deserialize item by item into a row ("read block -> decompress
>> block -> decode item by item into rows -> read next block"). Avro is based
>> on batches of rows (blocks) that are compressed individually (similar to
>> parquet pages, but all column chunks are serialized in a single page within
>> a row group).
>
>
> I haven't looked at it for a while but my recollection, at least in java,
> is a streaming process for each step outlined rather than a batch process
> (i.e. decompress some bytes, then decode them lazily as "Next Row" is
> called).
>
> My hypothesis (we can bench this) is that if the user wants to perform any
>> compute over the data, it is advantageous to load the block to arrow
>> (decompressed block -> RecordBatch), benefiting from arrow's analytics
>> performance instead, as opposed to using a native row-based format where we
>> can't leverage SIMD/cache hits/must allocate and deallocate on every item.
>> As usual, there are use-cases where this does not hold - I am thinking in
>> terms of traditional ETL / CPU intensive stuff.
>
>
> Do you have a target system in mind?  As I said for columnar/arrow native
> query engines this obviously sounds like a win, but for row oriented
> processing engines, the transposition costs are going to eat into any
> gains. There is also non-zero engineering effort to implement the necessary
> filter/selection push down APIs that most of them provide.  That being
> said, I'd love to see real world ETL pipeline benchmarks :)
>
>
> On Tue, Nov 2, 2021 at 4:39 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
>> Thank you all for all your comments.
>>
>> The first comments: thanks a lot for your suggestions. I tried with
>> mimalloc and there is indeed a -25% improvement for avro-rs. =)
>>
>> This sentence is a little bit hard to parse.  Is a row of 3 strings or a
>>> row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A
>>> lot
>>> of the complexity of parsing avro is the schema evolution rules, I
>>> haven't
>>> looked at whether the canonical implementations do any optimization for
>>> the
>>> happy case when reader and writer schema are the same.
>>>
>>
>> The graph was for a single column of constant strings of 3 bytes ("foo"),
>> divided into (avro) blocks of 4000 rows each (default block size of
>> 16kb). I also tried random strings of 3 bytes and 7 bytes, as well as an
>> integer column, and compressed blocks (deflate): with equal speedups.
>> Generic benchmarks are obviously catered for. I agree that schema evolution
>> adds extra CPU time, and that this is the happy case; I have not
>> benchmarked those yet.
>>
>> With respect to being a single column, I agree. The second bench that you
>> saw is still a 

Re: Question about Arrow Mutable/Immutable Arrays choice

2021-11-03 Thread Jorge Cardoso Leitão
I think the c data interface requires the arrays to be immutable or two
implementations will race when mutating/reading the shared regions, since
we have no mechanism to synchronize read/write access across the boundary.

Best,
Jorge


On Wed, Nov 3, 2021 at 1:50 PM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> I recently noticed that in the Java implementation we expose a set/setSafe
> function that allows to mutate Arrow Arrays [1]
>
> This seems to be at odds with the general design of the C++ (and by
> consequence Python and R) library where Arrays are immutable and can be
> modified only through compute functions returning copies.
>
> The Arrow Format documentation [2] seems to suggest that mutation of data
> structures is possible and left as an implementation detail, but given that
> some users might be willing to mutate existing structures (for example to
> avoid incurring in the memory cost of copies when dealing with big arrays)
> I think there might be reasons for both allowing mutation of Arrays and
> disallowing it. It probably makes sense to ensure that all the
> implementations agree on such a fundamental choice to avoid setting
> expectations on users' side that might not apply when they cross language
> barriers.
>
> [1]
>
> https://arrow.apache.org/docs/java/reference/org/apache/arrow/vector/SmallIntVector.html#setSafe-int-int-
> [2] https://arrow.apache.org/docs/format/Columnar.html
>


Re: Synergies with Apache Avro?

2021-11-02 Thread Jorge Cardoso Leitão
or
> > snmalloc, even without the library supporting to change the allocator. In
> > my experience this indeed helps with allocation heavy code (I have seen
> > changes of up to 30%).
> >
> > Best regards,
> >
> > Daniël
> >
> >
> > On Sun, Oct 31, 2021, 18:15 Adam Lippai  wrote:
> >
> > > Hi Jorge,
> > >
> > > Just an idea: Do the Avro libs support different allocators? Maybe
> using
> > a
> > > different one (e.g. mimalloc) would yield more similar results by
> working
> > > around the fragmentation you described.
> > >
> > > This wouldn't change the fact that they are relatively slow, however it
> > > could allow you better apples to apples comparison thus better CPU
> > > profiling and understanding of the nuances.
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > >
> > > On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am reporting back a conclusion that I recently arrived at when
> adding
> > > > support for reading Avro to Arrow.
> > > >
> > > > Avro is a storage format that does not have an associated in-memory
> > > > format. In Rust, the official implementation deserializes an enum, in
> > > > Python to a vector of Object, and I suspect in Java to an equivalent
> > > vector
> > > > of object. The important aspect is that all of them use fragmented
> > memory
> > > > regions (as opposed to what we do with e.g. one uint8 buffer for
> > > > StringArray).
> > > >
> > > > I benchmarked reading to arrow vs reading via the official Avro
> > > > implementations. The results are a bit surprising: reading 2^20 rows
> > of 3
> > > > byte strings is ~6x faster than the official Avro Rust implementation
> > and
> > > > ~20x faster vs "fastavro", a C implementation with bindings for
> Python
> > > (pip
> > > > install fastavro), all with a different slope (see graph below or
> > > numbers
> > > > and used code here [1]).
> > > > [image: avro_read.png]
> > > >
> > > > I found this a bit surprising because we need to read row by row and
> > > > perform a transpose of the data (from rows to columns) which is
> usually
> > > > expensive. Furthermore, reading strings can't be that much optimized
> > > after
> > > > all.
> > > >
> > > > To investigate the root cause, I drilled down to the flamegraphs for
> > both
> > > > the official avro rust implementation and the arrow2 implementation:
> > the
> > > > majority of the time in the Avro implementation is spent allocating
> > > > individual strings (to build the [str] - equivalents); the majority
> of
> > > the
> > > > time in arrow2 is equally divided between zigzag decoding (to get the
> > > > length of the item), reallocs, and utf8 validation.
> > > >
> > > > My hypothesis is that the difference in performance is unrelated to a
> > > > particular implementation of arrow or avro, but to a general concept
> of
> > > > reading to [str] vs arrow. Specifically, the item by item allocation
> > > > strategy is far worse than what we do in Arrow with a single region
> > which
> > > > we reallocate from time to time with exponential growth. In some
> > > > architectures we even benefit from the __memmove_avx_unaligned_erms
> > > > instruction that makes it even cheaper to reallocate.
> > > >
> > > > Has anyone else performed such benchmarks or played with Avro ->
> Arrow
> > > and
> > > > found supporting / opposing findings to this hypothesis?
> > > >
> > > > If this hypothesis holds (e.g. with a similar result against the Java
> > > > implementation of Avro), it imo puts arrow as a strong candidate for
> > the
> > > > default format of Avro implementations to deserialize into when using
> > it
> > > > in-memory, which could benefit both projects?
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > > [1] https://github.com/DataEngineeringLabs/arrow2-benches
> > > >
> > > >
> > > >
> > >
> >
>


Synergies with Apache Avro?

2021-10-31 Thread Jorge Cardoso Leitão
Hi,

I am reporting back a conclusion that I recently arrived at when adding
support for reading Avro to Arrow.

Avro is a storage format that does not have an associated in-memory format.
In Rust, the official implementation deserializes an enum, in Python to a
vector of Object, and I suspect in Java to an equivalent vector of object.
The important aspect is that all of them use fragmented memory regions (as
opposed to what we do with e.g. one uint8 buffer for StringArray).

I benchmarked reading to arrow vs reading via the official Avro
implementations. The results are a bit surprising: reading 2^20 rows of 3
byte strings is ~6x faster than the official Avro Rust implementation and
~20x faster vs "fastavro", a C implementation with bindings for Python (pip
install fastavro), all with a different slope (see graph below or numbers
and used code here [1]).
[image: avro_read.png]

I found this a bit surprising because we need to read row by row and
perform a transpose of the data (from rows to columns) which is usually
expensive. Furthermore, reading strings can't be that much optimized after
all.

To investigate the root cause, I drilled down to the flamegraphs for both
the official avro rust implementation and the arrow2 implementation: the
majority of the time in the Avro implementation is spent allocating
individual strings (to build the [str] - equivalents); the majority of the
time in arrow2 is equally divided between zigzag decoding (to get the
length of the item), reallocs, and utf8 validation.

My hypothesis is that the difference in performance is unrelated to a
particular implementation of arrow or avro, but to a general concept of
reading to [str] vs arrow. Specifically, the item by item allocation
strategy is far worse than what we do in Arrow with a single region which
we reallocate from time to time with exponential growth. In some
architectures we even benefit from the __memmove_avx_unaligned_erms
instruction that makes it even cheaper to reallocate.

Has anyone else performed such benchmarks or played with Avro -> Arrow and
found supporting / opposing findings to this hypothesis?

If this hypothesis holds (e.g. with a similar result against the Java
implementation of Avro), it imo puts arrow as a strong candidate for the
default format of Avro implementations to deserialize into when using it
in-memory, which could benefit both projects?

Best,
Jorge

[1] https://github.com/DataEngineeringLabs/arrow2-benches
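A self-contained sketch (not arrow2's actual code) of the two decoding targets compared above, for a column of short strings: one heap allocation per row versus a single growing values buffer plus an offsets buffer. The sizes in `main` mirror the 2^20 rows of 3-byte strings used in the benchmark.

```rust
// Row-oriented target: one heap allocation per string.
fn decode_to_rows(items: &[&str]) -> Vec<String> {
    items.iter().map(|s| s.to_string()).collect()
}

// Arrow-style target: one growing values buffer plus an offsets buffer.
fn decode_to_columnar(items: &[&str]) -> (Vec<u8>, Vec<i32>) {
    let mut values = Vec::new();
    let mut offsets = Vec::with_capacity(items.len() + 1);
    offsets.push(0i32);
    for s in items {
        values.extend_from_slice(s.as_bytes()); // amortized exponential growth
        offsets.push(values.len() as i32);
    }
    (values, offsets)
}

fn main() {
    let items = vec!["foo"; 1 << 20];
    let rows = decode_to_rows(&items); // ~1M small allocations
    let (values, offsets) = decode_to_columnar(&items); // 2 growing buffers
    assert_eq!(rows.len() + 1, offsets.len());
    assert_eq!(values.len(), 3 << 20);
}
```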


Re: [Discuss] Single offset per array has a non-trivial performance implication

2021-10-27 Thread Jorge Cardoso Leitão
Hi,

> A big +1 to this, covering all the edge cases with slices is pretty
complicated (there was at least one long standing bug related to this in
the 6.0 release).  I imagine there are potentially more lurking in the code
base.

Thanks for this observation, arrow-rs faces a similar issue: it is
relatively easy to hit bugs because of the offset.

> To be clear, this only comes into play for bit buffers (such as the
validity bitmap), right?  Otherwise, the offset can just be incorporated
into the buffer's base pointer.

Exactly. Except for the BooleanArray, the "offset" in the c data interface is
just an offset of the validity bitmap. I raised ARROW-14453 [1] to try to
mitigate this, but it does not solve it completely.

> This seems to assume that many or most arrays will have non-zero
offsets.  Is this something that commonly happens in the Rust Arrow
world?  In Arrow C++ I'm not sure non-zero offsets appear very frequently.

The main use-case I observe comes from Polars [2]. Polars is pretty fast
[3] and employs many techniques to extract performance from Arrow. Some
use-cases where slices are used (ccing Ritchie, who is the expert):
* users slice dataframes and series ad-hoc
* in group-bys and aggregations, polars slices arrays to parallelize the
workload when the chunks are large.
* "take" and "filter" of utf8 arrays is sufficiently expensive that it
slices the array and parallelizes the workload.
* group-bys with complex aggregations (rank, collect_list, reverse) it
builds a ListArray and then performs operations on the subitems. Given
[[None], [1, 2, 3, 4]], it operates on the subarray [1, 2, 3, 4] as "[None,
1, 2, 3, 4].slice(1, 4)", so, a slice per item.

[1] https://issues.apache.org/jira/browse/ARROW-14453
[2] https://github.com/pola-rs/polars
[3] https://h2oai.github.io/db-benchmark/


On Wed, Oct 27, 2021 at 7:57 PM Antoine Pitrou  wrote:

>
> Le 26/10/2021 à 21:30, Jorge Cardoso Leitão a écrit :
> > Hi,
> >
> > One aspect of the design of "arrow2" is that it deals with array slices
> > differently from the rest of the implementations. Essentially, the offset
> > is not stored in ArrayData, but on each individual Buffer. Some important
> > consequences are:
> >
> > * people can work with buffers and bitmaps without having to drag the
> > corresponding array offset with them (which is a common source of
> > unsoundness in the official Rust implementation)
> > * arrays can store buffers/bitmaps with independent offsets
> > * it does not roundtrip over the c data interface at zero cost, because
> the
> > c data interface only allows a single offset per array, not per
> > buffer/bitmap.
>
> To be clear, this only comes into play for bit buffers (such as the
> validity bitmap), right?  Otherwise, the offset can just be incorporated
> into the buffer's base pointer.
>
>  > I have been benchmarking the consequences of this design choice and
> reached
>  > the conclusion that storing the offset on a per buffer basis offers at
>  > least 15% improvement in compute (results vary on kernel and likely
>  > implementation).
>
> This seems to assume that many or most arrays will have non-zero
> offsets.  Is this something that commonly happens in the Rust Arrow
> world?  In Arrow C++ I'm not sure non-zero offsets appear very frequently.
>
> Regards
>
> Antoine.
>


[Discuss] Single offset per array has a non-trivial performance implication

2021-10-26 Thread Jorge Cardoso Leitão
Hi,

One aspect of the design of "arrow2" is that it deals with array slices
differently from the rest of the implementations. Essentially, the offset
is not stored in ArrayData, but on each individual Buffer. Some important
consequences are:

* people can work with buffers and bitmaps without having to drag the
corresponding array offset with them (which is a common source of
unsoundness in the official Rust implementation)
* arrays can store buffers/bitmaps with independent offsets
* it does not roundtrip over the c data interface at zero cost, because the
c data interface only allows a single offset per array, not per
buffer/bitmap.

I have been benchmarking the consequences of this design choice and reached
the conclusion that storing the offset on a per buffer basis offers at
least 15% improvement in compute (results vary on kernel and likely
implementation).

To understand why this is the case, consider comparing two boolean arrays
(a, b), where "a" has been sliced and has a validity and "b" does not. In
this case, we could compare the values of the arrays (taking into account
"a"'s offset), and clone "a"'s validity. However, this does not work
because the validity is "offsetted", while the result of the comparison of
the values is not. Thus, we need to create a shifted copy of the validity.
I measure 15% of the total compute time on my benches being done on
creating this shifted copy.

The root cause is that the C data interface declares an offset on the
ArrayData, as opposed to an offset on each of the buffers contained on it.
With an offset shared between buffers, we can't slice individual bitmap
buffers, which forbids us from leveraging the optimization of simply
cloning buffers instead of copying them.

I wonder whether this was discussed previously, or whether the "single
offset per array in the c data interface" considered this performance
implication.

Atm the solution we adopted is to incur the penalty cost of "de-offsetting"
buffers when passing offsetted arrays via the c data interface: this way users
benefit from faster compute kernels and only pay this cost when it is strictly
needed for the C data interface. But my understanding is that this design
choice affects the compute kernels of most implementations, since they all
perform a copy to de-offset the sliced buffers on every operation over sliced
arrays?

Best,
Jorge
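A self-contained sketch of the point above, assuming (as discussed in the reply earlier in this thread) that only bit-level buffers such as the validity bitmap are affected: a bitmap that carries its own bit offset can be sliced without copying, whereas a single array-level offset forces kernels to materialize the "shifted copy" whose cost is quoted at ~15%. The `Bitmap` type and its methods are illustrative, not arrow2's API.

```rust
use std::sync::Arc;

// Illustrative validity bitmap that carries its own offset in bits.
#[derive(Clone)]
struct Bitmap {
    bytes: Arc<Vec<u8>>,
    offset: usize, // in bits
    len: usize,    // in bits
}

impl Bitmap {
    fn get(&self, i: usize) -> bool {
        let bit = self.offset + i;
        (self.bytes[bit / 8] >> (bit % 8)) & 1 == 1
    }

    // Zero-copy: the backing bytes are shared; only the window moves.
    fn slice(&self, offset: usize, len: usize) -> Bitmap {
        Bitmap { bytes: self.bytes.clone(), offset: self.offset + offset, len }
    }

    // What a single array-level offset forces on kernels: materialize a copy
    // whose bits start at offset 0 so it can be combined with an un-offsetted
    // result (the "shifted copy" mentioned above).
    fn to_aligned(&self) -> Bitmap {
        let mut bytes = vec![0u8; (self.len + 7) / 8];
        for i in 0..self.len {
            if self.get(i) {
                bytes[i / 8] |= 1 << (i % 8);
            }
        }
        Bitmap { bytes: Arc::new(bytes), offset: 0, len: self.len }
    }
}

fn main() {
    let validity = Bitmap { bytes: Arc::new(vec![0b1010_1010]), offset: 0, len: 8 };
    let sliced = validity.slice(3, 4);  // free
    let aligned = sliced.to_aligned();  // pays the bit-by-bit copy
    let a: Vec<bool> = (0..4).map(|i| sliced.get(i)).collect();
    let b: Vec<bool> = (0..4).map(|i| aligned.get(i)).collect();
    assert_eq!(a, b);
}
```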


Re: [ANNOUNCE] New Arrow committer: Jiayu Liu

2021-10-07 Thread Jorge Cardoso Leitão
Congratulations!!! :)

On Thu, Oct 7, 2021, 21:58 Weston Pace  wrote:

> Congratulations Jiayu Liu!
>
> On Thu, Oct 7, 2021 at 8:02 AM Yijie Shen 
> wrote:
> >
> > Congratulations Jianyu
> >
> >
> > Micah Kornfield 于2021年10月8日 周五上午12:29写道:
> >
> > > A little late, but welcome and thank you for your contributions!
> > >
> > > On Thu, Oct 7, 2021 at 9:06 AM M G  wrote:
> > >
> > > > Hi everyone,
> > > > MG here,
> > > > Very nice to be in such a vibrant community and wonderful technology.
> > > >
> > > > Congrats dear Jiayu Liu and thanks for your contributions.
> > > >
> > > > I suggest
> > > > Best Regards
> > > > -
> > > > *M.Ghiasi*
> > > > Technology Architect
> > > > -
> > > > *"COVID-19 Kills! Wear a mask & regularly wash your hands to save the
> > > > world!"*
> > > >
> > > >
> > > > On Thu, Oct 7, 2021 at 2:27 PM Andrew Lamb 
> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > On behalf of the Arrow PMC, I'm happy to announce that
> > > > > Jiayu Liu has accepted an invitation to become a
> > > > > committer on Apache Arrow. Welcome, and thank you for your
> > > > > contributions!
> > > > >
> > > > >
> > > > > Andrew
> > > > >
> > > >
> > >
>


Re: [Question] Allocations along 64 byte cache lines

2021-09-09 Thread Jorge Cardoso Leitão
Thanks Yibo,

Yes, I expect aligned SIMD loads to be faster.

My understanding is that we do not need an alignment requirement for this,
though: split the buffer in 3, [unaligned][aligned][unaligned], use aligned
loads for the middle and un-aligned (or not even SIMD) for the prefix and
suffix. This is generic over the size of the SIMD and buffer slicing, where
alignment can be lost. Or am I missing something?

Best,
Jorge
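A sketch of the "[unaligned][aligned][unaligned]" split suggested above, computing the prefix length from the pointer address; the 64-byte target and the helper name are illustrative. A real kernel would run aligned SIMD loads over `middle` and scalar (or unaligned) code over the prefix and suffix.

```rust
// Split a byte slice into an unaligned prefix, a middle whose start address
// and length are multiples of `align`, and an unaligned suffix.
fn split_for_alignment(buf: &[u8], align: usize) -> (&[u8], &[u8], &[u8]) {
    let addr = buf.as_ptr() as usize;
    let prefix_len = ((align - addr % align) % align).min(buf.len());
    let (prefix, rest) = buf.split_at(prefix_len);
    let middle_len = rest.len() - rest.len() % align;
    let (middle, suffix) = rest.split_at(middle_len);
    (prefix, middle, suffix)
}

fn main() {
    let data = vec![1u8; 1000];
    // Deliberately start at an odd offset to lose any allocator alignment.
    let (prefix, middle, suffix) = split_for_alignment(&data[3..], 64);
    assert_eq!(prefix.len() + middle.len() + suffix.len(), 997);
    assert_eq!(middle.as_ptr() as usize % 64, 0);
    assert!(prefix.len() < 64 && suffix.len() < 64);
}
```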





On Wed, Sep 8, 2021 at 4:26 AM Yibo Cai  wrote:

> Thanks Jorge,
>
> I'm wondering if the 64 bytes alignment requirement is for cache or for
> simd register(avx512?).
>
> For simd, it looks like register-width alignment does help.
> E.g., _mm_load_si128 can only load 128 bits aligned data, it performs
> better than _mm_loadu_si128, which supports unaligned load.
>
> Again, be very skeptical of the benchmark :)
> https://quick-bench.com/q/NxyDu89azmKJmiVxF29Ei8FybWk
>
>
> On 9/7/21 7:16 PM, Jorge Cardoso Leitão wrote:
> > Thanks,
> >
> > I think that the alignment requirement in IPC is different from this one:
> > we enforce 8/64 byte alignment when serializing for IPC, but we (only)
> > recommend 64 byte alignment in memory addresses (at least this is my
> > understanding from the above link).
> >
> > I did test adding two arrays and the result is independent of the
> alignment
> > (on my machine, compiler, etc).
> >
> > Yibo, thanks a lot for that example. I am unsure whether it captures the
> > cache alignment concept, though: in the example we are reading a long (8
> > bytes) from a pointer that is not aligned with 8 bytes (63 % 8 != 0),
> which
> > is both slow and often undefined behavior. I think that the bench we want
> > is to change 63 to 64-8 (which is still not 64-bytes cache aligned but
> > aligned with a long), the difference vanishes (under the same gotchas
> that
> > you mentioned) https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE.
> > Alternatively, add an int32 with an offset of 4.
> >
> > I benched both with explicit (via intrinsics) SIMD and without (i.e. let
> > the compiler do it for us), and the alignment does not impact the
> benches.
> >
> > Best,
> > Jorge
> >
> > [1] https://stackoverflow.com/a/27184001/931303
> >
> >
> >
> >
> >
> > On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai  wrote:
> >
> >> Did a quick bench of accessing long buffer not 8 bytes aligned. Giving
> >> enough conditions, looks it does shows unaligned access has some penalty
> >> over aligned access. But I don't think this is an issue in practice.
> >>
> >> Please be very skeptical to this benchmark. It's hard to get it right
> >> given the complexity of hardware, compiler, benchmark tool and env.
> >>
> >> https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk
> >>
> >>
> >> On 9/7/21 7:55 AM, Micah Kornfield wrote:
> >>>>
> >>>> My own impression is that the emphasis may be slightly exaggerated. But
> >>>> perhaps some other benchmarks would prove differently.
> >>>
> >>>
> >>> This is probably true.  [1] is the original mailing list discussion.  I
> >>> think lack of measurable differences and high overhead for 64 byte
> >>> alignment was the reason for relaxing to 8 byte alignment.
> >>>
> >>> Specifically, I performed two types of tests, a "random sum" where we
> >>>> compute the sum of the values taken at random indices, and "sum",
> where
> >> we
> >>>> sum all values of the array (buffer[1] of the primitive array), both
> for
> >>>> array ranging from 2^10 to 2^25 elements. I was expecting that, at
> >> least in
> >>>> the latter, prefetching would help, but I do not observe any
> difference.
> >>>
> >>>
> >>> The most likely place I think where this could make a difference would
> be
> >>> for operations on wider types (Decimal128 and Decimal256).   Another
> >> place
> >>> where I think alignment could help is when adding two primitive arrays
> >> (it
> >>> sounds like this was summing a single array?).
> >>>
> >>> [1]
> >>>
> >>
> https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E
> >>>
> >>> On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou 
> >> wrote:
> >>>
> >>>>
> >>>> Le 06/09/2021 à 23:20, Jorge Card

Re: [ANNOUNCE] New Arrow committer: Nic Crane

2021-09-09 Thread Jorge Cardoso Leitão
Congrats!! =)

On Thu, Sep 9, 2021, 20:12 Micah Kornfield  wrote:

> Congrats!
>
> On Thursday, September 9, 2021, Weston Pace  wrote:
>
> > Congratulations Nic!
> >
> > On Thu, Sep 9, 2021 at 7:43 AM Antoine Pitrou 
> wrote:
> > >
> > >
> > > Welcome on board Nic!
> > >
> > >
> > > On Thu, 9 Sep 2021 11:47:16 -0400
> > > Neal Richardson  wrote:
> > > > On behalf of the Apache Arrow PMC, I'm happy to announce that Nic
> Crane
> > > > has accepted an invitation to become a committer on Apache Arrow.
> > > >
> > > > Welcome and thank you for your contributions!
> > > >
> > > > Neal
> > > >
> > >
> > >
> > >
> >
>


Re: [Question] Allocations along 64 byte cache lines

2021-09-07 Thread Jorge Cardoso Leitão
Thanks,

I think that the alignment requirement in IPC is different from this one:
we enforce 8/64 byte alignment when serializing for IPC, but we (only)
recommend 64 byte alignment in memory addresses (at least this is my
understanding from the above link).

I did test adding two arrays and the result is independent of the alignment
(on my machine, compiler, etc).

Yibo, thanks a lot for that example. I am unsure whether it captures the
cache alignment concept, though: in the example we are reading a long (8
bytes) from a pointer that is not aligned with 8 bytes (63 % 8 != 0), which
is both slow and often undefined behavior. I think that the bench we want
is to change 63 to 64-8 (which is still not 64-bytes cache aligned but
aligned with a long), the difference vanishes (under the same gotchas that
you mentioned) https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE.
Alternatively, add an int32 with an offset of 4.

I benched both with explicit (via intrinsics) SIMD and without (i.e. let
the compiler do it for us), and the alignment does not impact the benches.

Best,
Jorge

[1] https://stackoverflow.com/a/27184001/931303





On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai  wrote:

> Did a quick bench of accessing a long buffer not 8-byte aligned. Given
> enough conditions, it looks like it does show unaligned access has some penalty
> over aligned access. But I don't think this is an issue in practice.
>
> Please be very skeptical of this benchmark. It's hard to get it right
> given the complexity of hardware, compiler, benchmark tool and env.
>
> https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk
>
>
> On 9/7/21 7:55 AM, Micah Kornfield wrote:
> >>
> >> My own impression is that the emphasis may be slightly exaggerated. But
> >> perhaps some other benchmarks would prove differently.
> >
> >
> > This is probably true.  [1] is the original mailing list discussion.  I
> > think lack of measurable differences and high overhead for 64 byte
> > alignment was the reason for relaxing to 8 byte alignment.
> >
> > Specifically, I performed two types of tests, a "random sum" where we
> >> compute the sum of the values taken at random indices, and "sum", where
> we
> >> sum all values of the array (buffer[1] of the primitive array), both for
> >> array ranging from 2^10 to 2^25 elements. I was expecting that, at
> least in
> >> the latter, prefetching would help, but I do not observe any difference.
> >
> >
> > The most likely place I think where this could make a difference would be
> > for operations on wider types (Decimal128 and Decimal256).   Another
> place
> > where I think alignment could help is when adding two primitive arrays
> (it
> > sounds like this was summing a single array?).
> >
> > [1]
> >
> https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E
> >
> > On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou 
> wrote:
> >
> >>
> >> Le 06/09/2021 à 23:20, Jorge Cardoso Leitão a écrit :
> >>> Thanks a lot Antoine for the pointers. Much appreciated!
> >>>
> >>> Generally, it should not hurt to align allocations to 64 bytes anyway,
> >>>> since you are generally dealing with large enough data that the
> >>>> (small) memory overhead doesn't matter.
> >>>
> >>> Not for performance. However, 64 byte alignment in Rust requires
> >>> maintaining a custom container, a custom allocator, and the inability
> to
> >>> interoperate with `std::Vec` and the ecosystem that is based on it,
> since
> >>> std::Vec allocates with alignment T (e.g. int32), not 64 bytes. For
> >> anyone
> >>> interested, the background for this is this old PR [1] in this in
> arrow2
> >>> [2].
> >>
> >> I see. In the C++ implementation, we are not compatible with the default
> >> allocator either (but C++ allocators as defined by the standard library
> >> don't support resizing, which doesn't make them terribly useful for
> >> Arrow anyway).
> >>
> >>> Neither myself in micro benches nor Ritchie from polars (query engine)
> in
> >>> large scale benches observe any difference in the archs we have
> >> available.
> >>> This is not consistent with the emphasis we put on the memory
> alignments
> >>> discussion [3], and I am trying to understand the root cause for this
> >>> inconsistency.
> >>
> >> My own impression is that the emphasis may be slightly exaggerated. But
> >> perhaps some other benchmarks would prove differently.
> >>
> >>> By prefetching I mean implicit; no intrinsics involved.
> >>
> >> Well, I'm not aware that implicit prefetching depends on alignment.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >
>


Re: [Question] Allocations along 64 byte cache lines

2021-09-06 Thread Jorge Cardoso Leitão
Thanks a lot Antoine for the pointers. Much appreciated!

Generally, it should not hurt to align allocations to 64 bytes anyway,
> since you are generally dealing with large enough data that the
> (small) memory overhead doesn't matter.
>

Not for performance. However, 64 byte alignment in Rust requires
maintaining a custom container, a custom allocator, and the inability to
interoperate with `std::Vec` and the ecosystem that is based on it, since
std::Vec allocates with alignment T (e.g. int32), not 64 bytes. For anyone
interested, the background for this is this old PR [1] and this one in arrow2
[2].

Neither myself in micro benches nor Ritchie from polars (query engine) in
large scale benches observe any difference in the archs we have available.
This is not consistent with the emphasis we put on the memory alignments
discussion [3], and I am trying to understand the root cause for this
inconsistency.

By prefetching I mean implicit; no intrinsics involved.

Best,
Jorge

[1] https://github.com/apache/arrow/pull/8796
[2] https://github.com/jorgecarleitao/arrow2/pull/385
[3]
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding





On Mon, Sep 6, 2021 at 6:51 PM Antoine Pitrou  wrote:

>
> Le 06/09/2021 à 19:45, Antoine Pitrou a écrit :
> >
> >> Specifically, I performed two types of tests, a "random sum" where we
> >> compute the sum of the values taken at random indices, and "sum", where
> we
> >> sum all values of the array (buffer[1] of the primitive array), both for
> >> array ranging from 2^10 to 2^25 elements. I was expecting that, at
> least in
> >> the latter, prefetching would help, but I do not observe any difference.
> >
> > By prefetching, you mean explicit prefetching using intrinsics?
> > Modern CPUs are very good at implicit prefetching, they are able to
> > detect memory access patterns and optimize for them. Implicit
> > prefetching would only possibly help if your access pattern is
> > complicated (for example you're walking a chain of pointers).
>
> Oops: *explicit* prefecting would only possibly help sorry.
>
> Regards
>
> Antoine.
>
>
> > If your
> > access is sequential, there is zero reason to prefetch explicitly
> > nowadays, AFAIK.
> >
> > Regards
> >
> > Antoine.
> >
> >
>


[Question] Allocations along 64 byte cache lines

2021-09-06 Thread Jorge Cardoso Leitão
Hi,

We have a whole section related to byte alignment (
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding)
recommending 64 byte alignment and referring to intel's manual.

Do we have evidence that this alignment helps (besides intel claims)?

I am asking because, going through arrow-rs, I see that we use an alignment of 128
bytes (following the stream prefetch recommendation from intel [1]).

I recently experimented with changing it to 64 bytes and also to the native
alignment (i.e. i32 aligned to 4 bytes), and I observed no difference
in performance when compiled for "skylake-avx512".

Specifically, I performed two types of tests, a "random sum" where we
compute the sum of the values taken at random indices, and "sum", where we
sum all values of the array (buffer[1] of the primitive array), both for
array ranging from 2^10 to 2^25 elements. I was expecting that, at least in
the latter, prefetching would help, but I do not observe any difference.

I was wondering if anyone:

* has observed an equivalent behavior,
* knows a good benchmark where these things matter, or
* has an explanation.

Thanks a lot!

Best,
Jorge

[1]
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf,
sec. 3.7.3, page 162


Re: Set of primitive physical types

2021-09-02 Thread Jorge Cardoso Leitão
Hi,

Thanks a lot for the feedback. Atm I was really just trying to gauge whether
others also see these types as packed structs.

Wrt the extension type: I am not sure we can make it fast, though: the
interpretation of the bytes would need to be done dynamically (instead of
statically) because we can't compile the struct prior to receiving it (via
IPC or FFI). This interpretation would be part of hot loops (as we would
need to interpret the bytes on every element).

For this to work efficiently, IMO we would need some kind of "c extension"
whereby people could declare a c struct as part of the extension, which
consumers would compile to their own language for consumption. My
understanding is that in essence this is what we have been doing for the
interval types when we write things like

"A triple of the number of elapsed months, days, and nanoseconds.
//  The values are stored contiguously in 16 byte blocks. Months and
//  days are encoded as 32 bit integers and nanoseconds is encoded as a
//  64 bit integer. All integers are signed."

declare the struct, which implementations hard-code on their source code.

It is interesting that these resemble the idea of protobuf and thrift but
at the intra-process level (FFI).

Micah, I was thinking about the page with the memory layout [1],
specifically the primitive section, where some mental effort is required to
interpret the interval types as primitives (but not the FixedSizeBinary);
my understanding is that the former has a known packed struct while the
latter does not.

Best,
Jorge

[1]
https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout





On Thu, Sep 2, 2021 at 4:45 AM Micah Kornfield 
wrote:

> I agree, it is what I would have proposed for the interval type if there
> wasn't an interval type in Arrow already.  I think FixedSizeList has for
> better or worse solved a lot of the problems that a struct type would be
> used for (e.g. coordinates)
>
> Cheers,
> Micah
>
> On Tue, Aug 31, 2021 at 8:27 AM Wes McKinney  wrote:
>
> > I do still think that having a "packed C struct" type would be a
> > useful thing, but thus far no one has needed it enough to develop
> > something in the columnar format specification.
> >
> > On Tue, Aug 31, 2021 at 1:33 AM Micah Kornfield 
> > wrote:
> > >
> > > Hi Jorge,
> > > Are there places in the docs that you think this would simplify?
> > > There is an old JIRA [1] about introducing a c-struct type that I
> > > think aligns with this observation [1]
> > >
> > > -Micah
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-1790
> > >
> > > On Mon, Aug 30, 2021 at 2:57 PM Jorge Cardoso Leitão
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > Just came across this curiosity that IMO may help us to design
> physical
> > > > types in the future.
> > > >
> > > > Not sure if this was mentioned before, but it seems to me that
> > > > `DaysMilliseconds` and `MonthDayNano` belong to a broader class of
> > physical
> > > > types "typed tuples" in that they are constructed by defining the
> tuple
> > > > `(t_1,t_2,...,t_N)` where t_i (e.g. int32) is representable in memory
> > for a
> > > > given endianess, and each element of the array is written to the
> buffer
> > > > back to back as `<t1 at endianess><t2 at endianess>...<tN at endianess>`.
> > > >
> > > > Primitive arrays such as e.g. `Int32Array` are the extreme case where
> > the
> > > > tuple has a single entry (t1,), which leads to `<t1 at endianess>`.
> > The
> > > > others are:
> > > > * DaysMilliseconds = (int32, int32)
> > > > * MonthDayNano = (int32, int32, int64)
> > > >
> > > > In principle, we could re-write the in-memory layout page in these
> > terms
> > > > that places all the types above in the same "bucket".
> > > >
> > > > Best,
> > > > Jorge
> >
>


Re: C++ Determine Size of RecordBatch

2021-09-01 Thread Jorge Cardoso Leitão
note that that would be an upper bound because buffers can be shared
between arrays.

On Wed, Sep 1, 2021 at 2:15 PM Antoine Pitrou  wrote:

> On Tue, 31 Aug 2021 21:46:23 -0700
> Rares Vernica  wrote:
> >
> > I'm storing RecordBatch objects in a local cache to improve performance.
> I
> > want to keep track of the memory usage to stay within bounds. The arrays
> > stored in the batch are not nested.
> >
> > The best way I came up to compute the size of a RecordBatch is:
> >
> > size_t arrowSize = 0;
> > for (auto i = 0; i < arrowBatch->num_columns(); ++i) {
> > auto column = arrowBatch->column_data(i);
> > if (column->buffers[0])
> > arrowSize += column->buffers[0]->size();
> > if (column->buffers[1])
> > arrowSize += column->buffers[1]->size();
> > }
> >
> > Does this look reasonable? I guess we are over estimating a bit due to
> the
> > buffer alignment but that should be fine.
>
> Probably, but you should iterate over all buffers instead of
> selecting just buffers 0 and 1 (what if you have a string column?).
>
> So basically:
>
> ```
> size_t arrowSize = 0;
> for (const auto& column : batch->columns()) {
>   for (const auto& buffer : column->data()->buffers) {
> if (buffer)
>   arrowSize += buffer->size();
>   }
> }
> ```
>
> Regards
>
> Antoine.
>
>
>
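The caveat above (the per-buffer sum is an upper bound because buffers can be shared between arrays) can be tightened by de-duplicating buffers by address. The sketch below works over plain (pointer, length) pairs standing in for whatever buffer handles the Arrow library exposes; it is illustrative, not the C++ or Rust Arrow API.

```rust
use std::collections::HashSet;

// Sum buffer sizes, counting each distinct (address, length) pair only once.
fn estimate_unique_bytes(buffers: &[(*const u8, usize)]) -> usize {
    let mut seen = HashSet::new();
    buffers
        .iter()
        .filter(|&&(ptr, len)| seen.insert((ptr as usize, len)))
        .map(|&(_, len)| len)
        .sum()
}

fn main() {
    let shared = vec![0u8; 1024];
    let other = vec![0u8; 256];
    // Two "columns" referencing the same 1 KiB buffer plus one unique buffer.
    let buffers = [
        (shared.as_ptr(), shared.len()),
        (shared.as_ptr(), shared.len()),
        (other.as_ptr(), other.len()),
    ];
    assert_eq!(estimate_unique_bytes(&buffers), 1024 + 256);
}
```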


Set of primitive physical types

2021-08-30 Thread Jorge Cardoso Leitão
Hi,

Just came across this curiosity that IMO may help us to design physical
types in the future.

Not sure if this was mentioned before, but it seems to me that
`DaysMilliseconds` and `MonthDayNano` belong to a broader class of physical
types "typed tuples" in that they are constructed by defining the tuple
`(t_1,t_2,...,t_N)` where t_i (e.g. int32) is representable in memory for a
given endianess, and each element of the array is written to the buffer
back to back as `<t1 at endianess><t2 at endianess>...<tN at endianess>`.

Primitive arrays such as e.g. `Int32Array` are the extreme case where the
tuple has a single entry (t1,), which leads to `<t1 at endianess>`. The
others are:
* DaysMilliseconds = (int32, int32)
* MonthDayNano = (int32, int32, int64)

In principle, we could re-write the in-memory layout page in these terms
that places all the types above in the same "bucket".

Best,
Jorge
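A Rust sketch of the "typed tuple" view: each such physical type is a fixed-size record whose elements are written back to back in the values buffer. `#[repr(C)]` fixes the field order, and for these field widths there is no interior padding, so the sizes come out to the 8- and 16-byte layouts of the two interval types; the struct and field names are illustrative, not a library API.

```rust
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct DayMillisecond {
    days: i32,
    milliseconds: i32,
}

#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct MonthDayNano {
    months: i32,
    days: i32,
    nanoseconds: i64,
}

fn main() {
    assert_eq!(std::mem::size_of::<DayMillisecond>(), 8);
    assert_eq!(std::mem::size_of::<MonthDayNano>(), 16);
    // A values buffer of N elements is then N * size_of::<T>() bytes,
    // just like an Int32Array's values buffer is N * size_of::<i32>() bytes.
    let x = MonthDayNano { months: 1, days: 2, nanoseconds: 3 };
    println!("{:?}", x);
}
```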


Re: [VOTE][RUST] Release Apache Arrow Rust 5.3.0 RC1

2021-08-30 Thread Jorge Cardoso Leitão
for the avoidance of doubt, +1 on the vote: release :)

On Sat, Aug 28, 2021 at 12:12 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> +1
>
> Thanks, Andrew!
>
> On Sat, Aug 28, 2021 at 12:10 PM Andrew Lamb  wrote:
>
>> Update here: the issue is that we made a change that is not compatible
>> with
>> older rust toolchains.
>>
>> The consensus on the PR seems to be that since arrow-rs doesn't offer any
>> explicit compatibility for older rust toolchains, the change is OK. I also
>> made a PR [1] to clarify this point in the readme.
>>
>> Thus I would like to proceed with this RC unless there are objections.
>>
>> Please weigh in on the PR[2] if you have opinions
>>
>> [1] https://github.com/apache/arrow-rs/pull/726
>> [2] https://github.com/apache/arrow-rs/pull/714
>>
>> On Fri, Aug 27, 2021 at 3:11 PM Andrew Lamb  wrote:
>>
>> > I have just seen
>> > https://github.com/apache/arrow-rs/pull/714#discussion_r69683 which
>> > might require creating a new RC. Will update this thread with results
>> >
>> > On Fri, Aug 27, 2021 at 2:34 PM Andy Grove 
>> wrote:
>> >
>> >> +1 (binding)
>> >>
>> >> I verified signatures and ran the verification script
>> >>
>> >> On Thu, Aug 26, 2021 at 2:56 PM Andrew Lamb 
>> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > I would like to propose a release of Apache Arrow Rust
>> Implementation,
>> >> > version 5.3.0.
>> >> >
>> >> > This release candidate is based on commit:
>> >> > 9f7707cd2c0fe07e989dbd0afce4d4850921df4a [1]
>> >> >
>> >> > The proposed release tarball and signatures are hosted at [2].
>> >> >
>> >> > The changelog is located at [3].
>> >> >
>> >> > Please download, verify checksums and signatures, run the unit tests,
>> >> > and vote on the release. There is a script [4] that automates some of
>> >> > the verification.
>> >> >
>> >> > The vote will be open for at least 72 hours.
>> >> >
>> >> > [ ] +1 Release this as Apache Arrow Rust
>> >> > [ ] +0
>> >> > [ ] -1 Do not release this as Apache Arrow Rust  because...
>> >> >
>> >> > [1]:
>> >> >
>> >> >
>> >>
>> https://github.com/apache/arrow-rs/tree/9f7707cd2c0fe07e989dbd0afce4d4850921df4a
>> >> > [2]:
>> >> >
>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-5.3.0-rc1
>> >> > [3]:
>> >> >
>> >> >
>> >>
>> https://github.com/apache/arrow-rs/blob/9f7707cd2c0fe07e989dbd0afce4d4850921df4a/CHANGELOG.md
>> >> > [4]:
>> >> >
>> >> >
>> >>
>> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
>> >> >
>> >>
>> >
>>
>


Re: [VOTE][RUST] Release Apache Arrow Rust 5.3.0 RC1

2021-08-28 Thread Jorge Cardoso Leitão
+1

Thanks, Andrew!

On Sat, Aug 28, 2021 at 12:10 PM Andrew Lamb  wrote:

> Update here: the issue is that we made a change that is not compatible with
> older rust toolchains.
>
> The consensus on the PR seems to be that since arrow-rs doesn't offer any
> explicit compatibility for older rust toolchains, the change is OK. I also
> made a PR [1] to clarify this point in the readme.
>
> Thus I would like to proceed with this RC unless there are objections.
>
> Please weigh in on the PR[2] if you have opinions
>
> [1] https://github.com/apache/arrow-rs/pull/726
> [2] https://github.com/apache/arrow-rs/pull/714
>
> On Fri, Aug 27, 2021 at 3:11 PM Andrew Lamb  wrote:
>
> > I have just seen
> > https://github.com/apache/arrow-rs/pull/714#discussion_r69683 which
> > might require creating a new RC. Will update this thread with results
> >
> > On Fri, Aug 27, 2021 at 2:34 PM Andy Grove 
> wrote:
> >
> >> +1 (binding)
> >>
> >> I verified signatures and ran the verification script
> >>
> >> On Thu, Aug 26, 2021 at 2:56 PM Andrew Lamb 
> wrote:
> >>
> >> > Hi,
> >> >
> >> > I would like to propose a release of Apache Arrow Rust Implementation,
> >> > version 5.3.0.
> >> >
> >> > This release candidate is based on commit:
> >> > 9f7707cd2c0fe07e989dbd0afce4d4850921df4a [1]
> >> >
> >> > The proposed release tarball and signatures are hosted at [2].
> >> >
> >> > The changelog is located at [3].
> >> >
> >> > Please download, verify checksums and signatures, run the unit tests,
> >> > and vote on the release. There is a script [4] that automates some of
> >> > the verification.
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > [ ] +1 Release this as Apache Arrow Rust
> >> > [ ] +0
> >> > [ ] -1 Do not release this as Apache Arrow Rust  because...
> >> >
> >> > [1]:
> >> >
> >> >
> >>
> https://github.com/apache/arrow-rs/tree/9f7707cd2c0fe07e989dbd0afce4d4850921df4a
> >> > [2]:
> >> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-5.3.0-rc1
> >> > [3]:
> >> >
> >> >
> >>
> https://github.com/apache/arrow-rs/blob/9f7707cd2c0fe07e989dbd0afce4d4850921df4a/CHANGELOG.md
> >> > [4]:
> >> >
> >> >
> >>
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> >> >
> >>
> >
>


Re: [VOTE][Format] Clarify allowed value range for the Time types

2021-08-20 Thread Jorge Cardoso Leitão
+1

On Fri, Aug 20, 2021 at 2:43 PM David Li  wrote:

> +1
>
> On Thu, Aug 19, 2021, at 18:33, Weston Pace wrote:
> > +1
> >
> > On Thu, Aug 19, 2021 at 9:18 AM Wes McKinney 
> wrote:
> > >
> > > +1
> > >
> > > On Thu, Aug 19, 2021 at 6:20 PM Antoine Pitrou 
> wrote:
> > > >
> > > >
> > > > Hello,
> > > >
> > > > I would like to propose clarifying the allowed value range for the
> Time
> > > > types.  Specifically, I would propose that:
> > > >
> > > > 1) allowed values fall between 0 (included) and 86400 seconds
> > > > (excluded), adjusted for the time unit;
> > > >
> > > > 2) leap seconds cannot be represented (see above: 86400 is outside of
> > > > the range of allowed values).
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 Accept the proposed clarification
> > > > [ ] +0
> > > > [ ] -1 Do not accept the proposed clarification because...
> > > >
> > > > My vote is +1.
> > > >
> > > > If this proposal is accepted, I will submit a PR to enhance
> Schema.fbs
> > > > with additional comments.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> >
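The clarification being voted on reduces to a small range check; a minimal sketch, assuming the usual per-day multipliers for each time unit (the enum and function here are illustrative, not an Arrow library API):

```rust
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

// A Time value must fall in [0, 86_400) adjusted for its unit, so the leap
// second (86_400) is not representable.
fn is_valid_time(value: i64, unit: &TimeUnit) -> bool {
    let per_day: i64 = match unit {
        TimeUnit::Second => 86_400,
        TimeUnit::Millisecond => 86_400_000,
        TimeUnit::Microsecond => 86_400_000_000,
        TimeUnit::Nanosecond => 86_400_000_000_000,
    };
    (0..per_day).contains(&value)
}

fn main() {
    assert!(is_valid_time(86_399, &TimeUnit::Second));
    assert!(!is_valid_time(86_400, &TimeUnit::Second)); // leap second: rejected
    assert!(!is_valid_time(-1, &TimeUnit::Millisecond));
}
```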


Re: [VOTE][Format] Add in a new interval type can combines Month, Days and Nanoseconds

2021-08-17 Thread Jorge Cardoso Leitão
+1

On Tue, Aug 17, 2021 at 8:50 PM Micah Kornfield 
wrote:

> Hello,
> As discussed previously [1], I'd like to call a vote to add a new interval
> type which is a triple of Month, Days, and nanoseconds.  The formal
> definition is defined in a PR [2] along with Java and C++ implementations
> that have been verified with integration tests.
>
> The PR has gone through one round of code review comments and all have been
> addressed (more are welcome).
>
> If this vote passes I will follow-up with the following work items:
> - Add issues to track implementation in other languages
> - Extend the C-data interface to support the new type.
> - Add conversion from C++ to Python/Pandas
>
> The vote will remain open for at least 72 hours.
>
> Please record your vote:
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Accept the new type and merge the PR after all comments have been
> addressed
> [ ] +0
> [ ] -1 Do not accept the new type because ...
>
> My vote is +1.
>
> Thanks,
> Micah
>
> [1]
>
> https://lists.apache.org/thread.html/rd919c4ed8ad2f2827a2d4f665d8da99e545ba92ef992b2e557831751%40%3Cdev.arrow.apache.org%3E
> [2] https://github.com/apache/arrow/pull/10177
>


Re: [VOTE][RUST][Datafusion] Release Apache Arrow Datafusion 5.0.0 RC3

2021-08-13 Thread Jorge Cardoso Leitão
+1

Great work everyone!


On Fri, Aug 13, 2021, 22:19 Daniël Heres  wrote:

> +1 (non binding). Looking good.
>
>
> On Fri, Aug 13, 2021, 07:49 QP Hou  wrote:
>
> > Good call Ruihang. I remember we used to have this toolchain file when
> > we were still in the main arrow repo. I will take a look into that.
> >
> > On Wed, Aug 11, 2021 at 5:36 PM Wayne Xia  wrote:
> > >
> > > Hi QP,
> > >
> > > When running this script I noticed that this might be because I was not
> > > using a stable toolchain when testing.
> > > Those failures occur with nightly (which is my default toolchain). And
> > > everything works fine after switching to stable 1.54.
> > > So I think it's ok from my side to vote +1.
> > >
> > > BTW, I think we can add a toolchain file [1] to datafusion repo.
> > >
> > > [1]:
> > https://rust-lang.github.io/rustup/overrides.html#the-toolchain-file
> > >
> > > On Thu, Aug 12, 2021 at 2:14 AM QP Hou  wrote:
> > >
> > > > Hi Ruihang,
> > > >
> > > > Thanks for helping with the validation. It would certainly be helpful
> > > > if you could share the error log with me.
> > > >
> > > > I have also prepared an updated version of the verification script at
> > > >
> > > >
> >
> https://github.com/houqp/arrow-datafusion/blob/qp_release/dev/release/verify-release-candidate.sh
> > > > .
> > > > This script does a clean checkout of everything before running tests
> > > > and linting tools. Could you give that a try to see if you are
> getting
> > > > the same results?
> > > >
> > > > Thanks,
> > > > QP
> > > >
> > > > On Wed, Aug 11, 2021 at 6:38 AM Wayne Xia 
> > wrote:
> > > > >
> > > > > Thanks, QP!
> > > > >
> > > > > I verified the signature and checked shasum, but got 3 failed case
> > while
> > > > > testing:
> > > > >
> > > > > - execution_plans::shuffle_writer::tests::test
> > > > > - execution_plans::shuffle_writer::tests::test_partitioned
> > > > > -
> > > >
> >
> physical_plan::repartition::tests::repartition_with_dropping_output_stream
> > > > >
> > > > > I set up env `ARROW_TEST_DATA` and `PARQUET_TEST_DATA`, then run
> the
> > test
> > > > > with
> > > > > "cargo test --all --no-fail-fast" on Linux 5.13.6 with x86_64 chip.
> > > > >
> > > > > Did I miss something? I can paste the log here or file an issue if
> > > > needed.
> > > > >
> > > > > Ruihang
> > > > >
> > > > > QP Hou :
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to propose a release of Apache Arrow Datafusion
> > > > > > Implementation,
> > > > > > version 5.0.0.
> > > > > >
> > > > > > RC3 fixed a cargo publish issue discovered in RC1.
> > > > > >
> > > > > > This release candidate is based on commit:
> > > > > > deb929369c9aaba728ae0c2c49dcd05bfecc8bf8 [1]
> > > > > > The proposed release tarball and signatures are hosted at [2].
> > > > > > The changelog is located at [3].
> > > > > >
> > > > > > Please download, verify checksums and signatures, run the unit
> > tests,
> > > > and
> > > > > > vote
> > > > > > on the release. The vote will be open for at least 72 hours.
> > > > > >
> > > > > > [ ] +1 Release this as Apache Arrow Datafusion 5.0.0
> > > > > > [ ] +0
> > > > > > [ ] -1 Do not release this as Apache Arrow Datafusion 5.0.0
> > because...
> > > > > >
> > > > > > [1]:
> > > > > >
> > > >
> >
> https://github.com/apache/arrow-datafusion/tree/deb929369c9aaba728ae0c2c49dcd05bfecc8bf8
> > > > > > [2]:
> > > > > >
> > > >
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-5.0.0-rc3
> > > > > > [3]:
> > > > > >
> > > >
> >
> https://github.com/apache/arrow-datafusion/blob/deb929369c9aaba728ae0c2c49dcd05bfecc8bf8/CHANGELOG.md
> > > > > >
> > > > > > Thanks,
> > > > > > QP
> > > > > >
> > > >
> >
>


Re: [Question] what is the purpose of the typeids in the UnionArray?

2021-08-13 Thread Jorge Cardoso Leitão
Thanks a lot for the confirmation, Micah.

It still seems possible to remove the typeid without loss of generality
(not that I am advocating for it), as all fields are declared as children of
the union field, and it is thus possible to declare fields that the union does
not currently contain in the metadata, which are mapped to empty ArrayData
children (in the dense case).

A minor annoyance is that the typeid does not allow zero-copy over the C
data interface, as we need to build an intermediate hashmap to make the
lookup fast, or take the penalty of searching (linearly over the number of
fields) through the typeids for the type id.

Best,
Jorge




On Fri, Aug 13, 2021 at 7:07 PM Micah Kornfield 
wrote:

> Jorge,
> I think your analysis is correct.  Some historical context on why there is
> an indication  is covered on the original JIRA:
> https://issues.apache.org/jira/browse/ARROW-257
>
> Some other discussions:
>
> https://lists.apache.org/x/thread.html/75028183d54cb4f6ff588b043fe126f10b2cba8e373673fad6ba889d@%3Cdev.arrow.apache.org%3E
>
> https://lists.apache.org/x/thread.html/b219ef51dda71bef83dcdec94e68e2881d49f751b29a8c1251f653d5@%3Cdev.arrow.apache.org%3E
>
> -Micah
>
> On Fri, Aug 13, 2021 at 10:57 AM Keith Kraus 
> wrote:
>
> > How would using the typeid directly work with arbitrary Extension types?
> >
> > -Keith
> >
> > On Fri, Aug 13, 2021 at 12:49 PM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > In the UnionArray, there is a level of indirection between types
> (buffer
> > of
> > > i8s) -> typeId (i8) -> field. For example, the generated_union part of
> > our
> > > integration tests has the data:
> > >
> > > types: [5, 5, 5, 5, 7, 7, 7, 7, 5, 5, 7] (len = 11)
> > > typeids: [5, 7]
> > > fields: [int32, utf8]
> > >
> > > My understanding is that, to get the field of item 4, we read types[4]
> > (7),
> > > look for the index of it in typeids (1), and take the field of index 1
> > > (utf8), and then read the value (4 or other depending on sparseness).
> > >
> > > Does someone know the rationale for the intermediate typeid? I.e.
> > couldn't
> > > the types contain the index of the field directly [0, 0, 0, 0, 1, 1, 1,
> > 1,
> > > 0, 0,1] (replace 5 by 0, 7 by 1, and not use typeids)?
> > >
> > > Best,
> > > Jorge
> > >
> >
>


[Question] what is the purpose of the typeids in the UnionArray?

2021-08-13 Thread Jorge Cardoso Leitão
Hi,

In the UnionArray, there is a level of indirection between types (buffer of
i8s) -> typeId (i8) -> field. For example, the generated_union part of our
integration tests has the data:

types: [5, 5, 5, 5, 7, 7, 7, 7, 5, 5, 7] (len = 11)
typeids: [5, 7]
fields: [int32, utf8]

My understanding is that, to get the field of item 4, we read types[4] (7),
look for the index of it in typeids (1), and take the field of index 1
(utf8), and then read the value (4 or other depending on sparseness).

Does someone know the rationale for the intermediate typeid? I.e. couldn't
the types contain the index of the field directly [0, 0, 0, 0, 1, 1, 1, 1,
0, 0,1] (replace 5 by 0, 7 by 1, and not use typeids)?

Best,
Jorge
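A self-contained sketch of the indirection described above, using the integration-test data quoted in the message: the per-element type ids (5 and 7) must first be mapped to child indices before the field can be resolved, which is the extra lookup referred to elsewhere in this thread as an intermediate hashmap.

```rust
use std::collections::HashMap;

fn main() {
    let types: Vec<i8> = vec![5, 5, 5, 5, 7, 7, 7, 7, 5, 5, 7];
    let type_ids: Vec<i8> = vec![5, 7];
    let fields = ["int32", "utf8"];

    // type id -> child index, built once per array.
    let lookup: HashMap<i8, usize> =
        type_ids.iter().enumerate().map(|(i, &id)| (id, i)).collect();

    // Resolving the field of element 4: types[4] == 7 -> child index 1 -> utf8.
    let child = lookup[&types[4]];
    assert_eq!(fields[child], "utf8");

    // Without the extra level, `types` could store child indices directly:
    let direct: Vec<i8> = types.iter().map(|t| lookup[t] as i8).collect();
    assert_eq!(direct, vec![0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1]);
}
```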


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Jorge Cardoso Leitão
Hi,

The requirements for the compute IR as I see it are:
>
> * Implementations in IR producer and consumer languages.
> * Strongly typed or the ability to easily validate a payload
>

What about:

1. easy to read and write by a large number of programming languages
2. easy to read and write by humans
3. fast to validate by a large number of programming languages

I.e. make the ability to read and write by humans be more important than
speed of validation.

In this order, JSON/toml/yaml are preferred because they are supported by
more languages and more human readable than faster to validate.

-

My understanding is that for an async experience, we need the ability to
`.await` at any `read_X` call so that if the read_X requests more bytes
than are buffered, the `read_X(...).await` triggers a new (async) request
to fill the buffer (which puts the future on a Pending state). When a
library does not offer the async version of `read_X`, any read_X can force
a request to fill the buffer, which is now blocking the thread. One way
around this is to wrap those blocking calls in async (e.g. via
tokio::spawn_blocking). However, this forces users to use that runtime, or
to create a new independent thread pool for their own async work. Neither
are great for low-level libraries.

E.g. thrift does not offer async -> parquet-format-rs does not offer async
-> parquet does not offer async -> datafusion wraps all parquet "IO-bounded
and CPU-bounded operations" in spawn_blocking or something equivalent.

Best,
Jorge
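What "wrapping blocking calls in spawn_blocking" looks like in practice, as a minimal sketch assuming the tokio runtime (with the `rt-multi-thread` and `macros` features); `read_batch_blocking` and the file path are placeholders for e.g. a synchronous Parquet read, not a real API.

```rust
use tokio::task::spawn_blocking;

// Placeholder for a synchronous read exposed by a format library.
fn read_batch_blocking(path: String) -> std::io::Result<Vec<u8>> {
    std::fs::read(path)
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // The blocking call runs on tokio's blocking thread pool, so it does not
    // stall the async executor's worker threads; the caller, however, is now
    // tied to this particular runtime.
    let handle = spawn_blocking(|| read_batch_blocking("data.bin".into()));
    let _bytes = handle.await.expect("blocking task panicked")?;
    Ok(())
}
```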


On Thu, Aug 12, 2021 at 10:03 PM Phillip Cloud  wrote:

> On Thu, Aug 12, 2021 at 1:03 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > I agree with Antoine that we should weigh the pros and cons of
> flatbuffers
> > (or protobuf or thrift for that matter) over a more human-friendly,
> > simpler, format like json or MsgPack. I also struggle a bit to reason
> with
> > the complexity of using flatbuffers for this.
> >
>
> Ultimately I think different representations of the format will emerge if
> compute IR is successful,
> and people will implement JSON/proto/thrift/etc versions of the IR.
>
> The requirements for the compute IR as I see it are:
>
> * Implementations in IR producer and consumer languages.
> * Strongly typed or the ability to easily validate a payload
>
> It seems like Protobuf, Flatbuffers and JSON all meet the criteria here.
> Beyond that,
> there's precedence in the codebase for flatbuffers (which is just to say
> that flatbuffers
> is the devil we know).
>
> Can people list other concrete requirements for the format? A
> non-requirement might
> be that there be _idiomatic_ implementations for every language arrow
> supports, for example.
>
> I think without agreement on requirements we won't ever arrive at
> consensus.
>
> The compute IR spec itself doesn't really depend on the specific choice of
> format, but we
> need to get some consensus on the format.
>
>
> > E.g. there is no async support for thrift, flatbuffers nor protobuf in
> > Rust, which e.g. means that we can't read neither parquet nor arrow IPC
> > async atm. These problems are usually easier to work around in simpler
> > formats.
> >
>
> Can you elaborate a bit on the lack of async support here and what it would
> mean for
> a particular in-memory representation to support async, and why that
> prevents reading
> a parquet file using async?
>
> Looking at JSON as an example, most libraries in the Rust ecosystem use
> serde and serde_json
> to serialize and deserialize JSON, and any async concerns occur at the
> level of
> a client/server library like warp (or some transitive dependency thereof
> like Hyper).
>
> Are you referring to something like the functionality implemented in
> tokio-serde-json? If so,
> I think you could probably build something for these other formats assuming
> they have serde
> support (flatbuffers notably does _not_, partially because of its incessant
> need to own everything),
> since tokio_serde is doing most of the work in tokio-serde-json. In any
> case, I don't think
> it's a requirement for the compute IR that there be a streaming transport
> implementation for the
> format.
>
>
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Thu, Aug 12, 2021 at 2:43 PM Antoine Pitrou 
> wrote:
> >
> > >
> > > Le 12/08/2021 à 15:05, Wes McKinney a écrit :
> > > > It seems that one adjacent problem here is how to make it simpler for
> > > > third parties (especially ones that act as front end interfaces) to
> > > > build and serialize/deserialize the IR structures with some kind of
> > > > ready-to-go middleware library, written in a language like C++.

Re: [Rust] Integration tests for recursive nested data?

2021-08-12 Thread Jorge Cardoso Leitão
Hi,

The checkout of arrow-rs on the failed build is at fa5acd971c97, which until
about 3 hours ago was master, so I think it is picking up the right code.

Did a quick investigation:

* The integration tests on arrow-rs have not been running since June 30th. They
stopped running after the merge of [1], which broke the workflow [2].
* The integration tests have been running on apache/arrow - the latest run was
2 hrs ago against 1a5c26bc2af73 [3], which, as of writing, is master.
* The integration tests on apache/arrow have an error annotation [3]
(step Docker Push, failed authentication).

* Filed https://github.com/apache/arrow-rs/issues/690 for the broken CI on
arrow-rs
* Filed https://issues.apache.org/jira/browse/ARROW-13621 for annotation
error on apache/arrow

Coming to the actual error: given [3], that the interval branch did not
change anything related to nested data, and that all other consumers are ok, I
can only conclude that this is non-deterministic behavior on the arrow-rs
side.

[1]
https://github.com/apache/arrow-rs/commit/31bc052126abc4834edf9b3cd7cb72384f84ba3e
[2] https://github.com/apache/arrow-rs/actions/runs/993286543
[3] https://github.com/apache/arrow/runs/3314893335?check_suite_focus=true

Best,
Jorge




On Thu, Aug 12, 2021 at 12:57 PM Andrew Lamb  wrote:

> Hi Micah,
>
> There is no open issue that I know of, and while I may be mistaken it looks
> like the most recent run of the Integration Test on apache/master [3]
> passed successfully.
>
> The code [1] that is asserting on your branch doesn't look like it has been
> touched for a while (last touched in March 2021). Also, I reviewed the
> latest changes on arrow-rs master [2] and didn't see any obvious changes to
> the handling of nested structures.
>
> I wonder if it is picking up some old version of the crate or something
>
> Andrew
>
>
> [1]
>
> https://github.com/apache/arrow-rs/blob/fa5acd971c973161f17e69d5c6b50d6e77c7da03/arrow/src/array/equal/list.rs#L143
>
> [2] https://github.com/apache/arrow-rs/commits/master
>
> [3] https://github.com/apache/arrow/runs/3309489410
>
>
> On Wed, Aug 11, 2021 at 5:24 PM Micah Kornfield 
> wrote:
>
> > One of my PRs is showing integration test failures with Rust [1] for the
> > recursive nested test.
> >
> > With an error:
> >
> > Validating
> > /tmp/tmpg3zo83yh/de3ef975_generated_recursive_nested.json_as_file and
> > /tmp/arrow-integration-0bm7hcmd/generated_recursive_nested.json
> > Schemas match. JSON file has 2 batches.
> > thread 'main' panicked at 'called `Option::unwrap()` on a `None` value',
> > arrow/src/array/equal/list.rs:143:18
> > note: run with `RUST_BACKTRACE=1` environment variable to display a
> > backtrace
> >
> >
> > I don't think I did anything that would have touched this code for Rust,
> > and it seems like Rust is failing to consume for all producing languages:
> >
> > FAILED TEST: recursive_nested C++ producing,  Rust consuming
> > FAILED TEST: recursive_nested Java producing,  Rust consuming
> > FAILED TEST: recursive_nested JS producing,  Rust consuming
> > FAILED TEST: recursive_nested Rust producing,  Rust consuming
> >
> > Is this a known issue?  Anything I can try doing to fix this?
> >
> > Thanks,
> > Micah
> >
> >
> > [1]
> >
> https://github.com/apache/arrow/pull/10177/checks?check_run_id=3305682790
> >
>


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Jorge Cardoso Leitão
I agree with Antoine that we should weigh the pros and cons of flatbuffers
(or protobuf or thrift for that matter) over a more human-friendly,
simpler, format like json or MsgPack. I also struggle a bit to reason with
the complexity of using flatbuffers for this.

E.g. there is no async support for thrift, flatbuffers nor protobuf in
Rust, which e.g. means that we can't read neither parquet nor arrow IPC
async atm. These problems are usually easier to work around in simpler
formats.

Best,
Jorge



On Thu, Aug 12, 2021 at 2:43 PM Antoine Pitrou  wrote:

>
> Le 12/08/2021 à 15:05, Wes McKinney a écrit :
> > It seems that one adjacent problem here is how to make it simpler for
> > third parties (especially ones that act as front end interfaces) to
> > build and serialize/deserialize the IR structures with some kind of
> > ready-to-go middleware library, written in a language like C++.
>
> A C++ library sounds a bit complicated to deal with for Java, Rust, Go,
> etc. developers.
>
> I'm not sure which design decision and set of compromises would make the
> most sense.  But this is why I'm asking the question "why not JSON?" (+
> JSON-Schema if you want to ease validation by third parties).
>
> (note I have already mentioned MsgPack, but only in the case a binary
> encoding is really required; it doesn't have any other advantage that I
> know of over JSON, and it's less ubiquitous)
>
> Regards
>
> Antoine.
>
>
> > To do that, one would need the equivalent of arrow/type.h and related
> > Flatbuffers schema serialization code that lives in arrow/ipc. If you
> > want to be able to completely and accurately serialize Schemas, you
> > need quite a bit of code now.
> >
> > One possible approach (and not go crazy) would be to:
> >
> > * Move arrow/types.h and its dependencies into a standalone C++
> > library that can be vendored into the main apache/arrow C++ library. I
> > don't know how onerous arrow/types.h's transitive dependencies /
> > interactions are at this point (there's a lot of stuff going on in
> > type.cc [1] now)
> > * Make the namespaces exported by this library configurable, so any
> > library can vendor the Arrow types / IR builder APIs privately into
> > their project
> > * Maintain this "Arrow types and ComputeIR library" as an always
> > zero-dependency library to facilitate vendoring
> > * Lightweight bindings in languages we care about (like Python or R or
> > GLib/Ruby) could be built to the IR builder middleware library
> >
> > This seems like what is more at issue compared with rather projects
> > are copying the Flatbuffers files out of their project from
> > apache/arrow or apache/arrow-format.
> >
> > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
> >
> > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb 
> wrote:
> >>
> >> I support the idea of an independent repo that has the arrow flatbuffers
> >> format definition files.
> >>
> >> My rationale is that the Rust implementation has a copy of the `format`
> >> directory [1] and potential drift worries me (a bit). Having a single
> >> source of truth for the format that is not part of the large mono repo
> >> would be a good thing.
> >>
> >> Andrew
> >>
> >> [1] https://github.com/apache/arrow-rs/tree/master/format
> >>
> >> On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud 
> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I'd like to bring up an idea from a recent thread ([1]) about moving
> the
> >>> `format/` directory out of the primary apache/arrow repository.
> >>>
> >>> I understand from that thread there are some concerns about using
> >>> submodules,
> >>> and I definitely sympathize with those concerns.
> >>>
> >>> In talking with David Li (disclaimer: we work together at Voltron
> Data), he
> >>> has
> >>> a great idea that I think makes everyone happy: an
> `apache/arrow-format`
> >>> repository that is the official mirror for the flatbuffers IDL, that
> >>> library
> >>> authors should use as the source of truth.
> >>>
> >>> It doesn't require a submodule, yet it also allows external projects
> the
> >>> ability to access the IDL without having to interact with the main
> arrow
> >>> repository and is backwards compatible to boot.
> >>>
> >>> In this scenario, repositories that are currently copying in the
> >>> flatbuffers
> >>> IDL can migrate to this repository at their leisure.
> >>>
> >>> My motivation for this was around sharing data structures for the
> compute
> >>> IR
> >>> proposal ([2]).
> >>>
> >>> I can think of at least two ways for IR producers and consumers of all
> >>> languages to share the flatbuffers IDL:
> >>>
> >>> 1. A set of bindings built in some language that other languages can
> >>> integrate
> >>> with, likely C++, that allows library users to build IR using an
> API.
> >>>
> >>> The primary downside to this is that we'd have to deal with
> >>> building another library while working out any kinks in the IR design
> and
> >>> I'd
> >>> rather avoid that in the initial phases of this project.

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-11 Thread Jorge Cardoso Leitão
Couple of questions

1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
operation "(IR,data) - engine -> result" MUST be the same for all "engine"?
2. If yes, imo we may need to worry about:
* a definition of equality that implementations agree on.
* agreement over what the semantics look like. For example, do we use
Kleene logic for AND and OR?

To try to understand the gist, let's pick an aggregated count over a
column: engines often do partial counts over partitions followed by a final
"sum" over the partial counts. Is the idea that the query engine would
communicate with the compute engine via two IRs, where one is "count me these"
and the other is "sum me these"?
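
For illustration only (plain Rust, not the proposed IR), a minimal sketch of
that two-phase shape; partial_count and final_sum are hypothetical helpers:

fn partial_count(partition: &[Option<i64>]) -> u64 {
    // "count me these": count the non-null values of one partition
    partition.iter().filter(|v| v.is_some()).count() as u64
}

fn final_sum(partials: &[u64]) -> u64 {
    // "sum me these": combine the partial counts into the final result
    partials.iter().sum()
}

fn main() {
    let partitions = vec![vec![Some(1), None, Some(3)], vec![Some(4), Some(5)]];
    let partials: Vec<u64> = partitions.iter().map(|p| partial_count(p)).collect();
    assert_eq!(final_sum(&partials), 4);
}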

Best,
Jorge





On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud  wrote:

> Thanks Wes,
>
> Great to be back working on Arrow again and engaging with the community. I
> am really excited about this effort.
>
> I think there are a number of concerns I see as important to address in the
> compute IR proposal:
>
> 1. Requirement for output types.
>
> I think that so far there's been many reasons for requiring conforming IR
> producers and consumers to adhere to output types, but I haven't seen a
> strong rationale for keeping them optional (in the semantic sense, not WRT
> any particular serialization format's representation of optionality).
>
> I think a design that includes unambiguous semantics for output types (a
> consumer must produce a value of the requested type or it's an
> error/non-conforming) is simpler to reason about for producers, and
> provides a strong guarantee for end users (humans or machines constructing
> IR from an API and expecting the thing they ask for back from the IR
> consumer).
>
> 2. Flexibility
>
> The current PR is currently unable to support what I think are killer
> features of the IR: custom operators (relational or column) and UDFs. In my
> mind, on top of the generalized compute description that the IR offers, the
> ability for producers and consumers of IR to extend the IR without needing
> to modify Arrow or depend on anything except the format is itself something
> that is necessary to gain adoption.
>
> Developers will need to build custom relational operators (e.g., scans of
> backends that don't exist anywhere for which a user has code to implement)
> and custom functions (anything operating on a column that doesn't already
> exist, really). Furthermore, I think we can actually drive building an
> Arrow consumer using the same API that an end user would use to extend the
> IR.
>
> 3. Window Functions
>
> Window functions are, I think, an important part of the IR value
> proposition, as they are one of the more complex operators in databases. I
> think we need to have something in the initial IR proposal to support these
> operations.
>
> 4. Non relational Joins
>
> Things like as-of join and window join operators aren't yet fleshed out in
> the IR, and I'm not sure they should be in scope for the initial prototype.
> I think once we settle on a design, we can work the design of these
> particular operators out during the initial prototype. I think the
> specification of these operators should basically be PR #2 after the
> initial design lands.
>
> # Order of Work
>
> 1. Nail down the design. Anything else is a non-starter.
>
> 2. Prototype an IR producer using Ibis
>
> Ibis is IMO a good candidate for a first IR producer as it has a number of
> desirable properties that make prototyping faster and allow for us to
> refine the design of the IR as needed based on how the implementation goes:
> * It's written in Python so it has native support for nearly all of
> flatbuffers' features without having to create bindings to C++.
> * There's already a set of rules for type checking, as well as APIs for
> constructing expression trees, which means we don't need to worry about
> building a type checker for the prototype.
>
> 3. Prototype an IR consumer in C++
>
> I think in parallel to the producer prototype we can further inform the
> design from the consumer side by prototyping an IR consumer in C++ . I know
> Ben Kietzman has expressed interest in working on this.
>
> Very interested to hear others' thoughts.
>
> -Phillip
>
> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney  wrote:
>
> > Thank you for all the feedback and comments on the document. I'm on
> > vacation this week, so I'm delayed responding to everything, but I
> > will get to it as quickly as I can. I will be at VLDB in Copenhagen
> > next week if anyone would like to chat in person about it, and we can
> > relay the content of any discussions back to the document/PR/e-mail
> > thread.
> >
> > I know that Phillip Cloud expressed interest in working on the PR and
> > helping work through many of the details, so I'm glad to have the
> > help. If there are others who would like to work on the PR or dig into
> > the details, please let me know. We might need to figure out how to
> > accommodate "many cooks" by setting up the ComputeIR 

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-06 Thread Jorge Cardoso Leitão
 that merging two repos would
> > add more overhead to his work and slow him down.
> >
> > For those who want to contribute to arrow2 to accelerate the
> > transition, I don't think they would have problem sending PRs to the
> > arrow2 repo. For those who are not interested in contributing to
> > arrow2, merging the arrow2 code base into the current arrow-rs repo
> > won't incentivize them to contribute. Merging arrow2 into current
> > arrow-rs repo could help with discovery. But I think this can be
> > achieved by adding a big note in the current arrow-rs README to
> > encourage contributions to the arrow2 repo as well.
> >
> > At the end of the day, Jorge is currently the sole active contributor
> > to the arrow2 implementation, so I think he would have the most say on
> > what's the most productive way to push arrow2 forward. The only
> > concern I have with regards to merging arrow2 into arrow-rs right now
> > is Jorge spent all the efforts to do the merge, then it turned out
> > that he is still the only active contributor to arrow2 within
> > arrow-rs, but with more overhead that he has to deal with.
> >
> > As for maintaining semantic versioning for arrow2, Andy had a good
> > point that we could still release arrow2 with its own versioning even
> > if we merge it into the arrow-rs repo. So I don't think we should
> > worry/focus too much about versioning in our discussion. Velocity to
> > close the gap between arrow-rs and arrow2 is the most important thing.
> >
> > Lastly, I do agree with Andrew that it would be good to only maintain
> > a single arrow crate in crates.io in the long run. As he mentioned,
> > when the current arrow2 code base becomes stable, we could still
> > release it under the arrow namespace in crates.io with a major version
> > bump. The absolute value in the major version doesn't really matter as
> > long as we stick to the convention that breaking change will result in
> > a major version bump.
> >
> > Thanks,
> > QP
> >
> >
> >
> > On Tue, Aug 3, 2021 at 5:31 PM paddy horan 
> wrote:
> > >
> > > Hi Jorge,
> > >
> > > I see value in consolidating development in a single repo and releasing
> > under the existing arrow crate.  Regarding versioning, I think once we
> > follow semantic versioning we are fine.  I don't think it's worth
> migrating
> > to a different repo and crate to comply with the de-facto standard you
> > mention.
> > >
> > > Just one person's opinion though,
> > > Paddy
> > >
> > >
> > > -Original Message-
> > > From: Jorge Cardoso Leitão 
> > > Sent: Tuesday, August 3, 2021 5:23 PM
> > > To: dev@arrow.apache.org
> > > Subject: Re: [Discuss] [Rust] Arrow2/parquet2 going foward
> > >
> > > Hi Paddy,
> > >
> > > > What do you think about moving Arrow2 into the main Arrow repo where
> > > > it
> > > is only enabled via an "experimental" feature flag?
> > >
> > > AFAIK this is already possible:
> > > * add `arrow2 = { version = "0.2.0", optional = true }` to Cargo.toml
> > > * add `#[cfg(feature = "arrow2")]\npub mod arrow2;\n` to lib.rs
> > >
> > > We do this kind of thing to expose APIs from non-arrow crates such as
> > parts of the parquet-format-rs crate, and is generally the way to go
> when a
> > crate wants to expose a third-party API.
> > >
> > > I would not recommend doing this, though: by exposing arrow2 from
> arrow,
> > we double the compilation time and binary size of all dependencies that
> > activate the flag. Furthermore, there are users of arrow2 that do not
> need
> > the arrow crate, which this model would not support.
> > >
> > > AFAIK where development happens is unrelated to this aspect, Rust
> > enables this by design.
> > >
> > > > but also this would be a clear signal that Arrow2 is <1.0.
> > > > the experimental flag will be a clear signal to the existing Arrow
> > > community that Arrow2 is the future but that it is <1.0
> > >
> > > arrow2 is already <1.0 <https://crates.io/crates/arrow2>

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-03 Thread Jorge Cardoso Leitão
al that Arrow2 is <1.0.  When we feel ready (i.e. Arrow2 is 1.0)
> we can release it in the next main release with Arrow2 being the default
> and move the existing implementation behind a "legacy" feature flag.
>
> Here is why I think this might work well:
>  - People contributing to the Arrow project will naturally contribute to
> Arrow2.  At the moment, some people will still contribute to Arrow instead
> of Arrow2 just by virtue of it being the "official" implementation.
> However, if both are in one repo people will want to contribute to the
> "future", i.e. Arrow2.
>  - the experimental flag will be a clear signal to the existing Arrow
> community that Arrow2 is the future but that it is <1.0
>  - existing users will be well supported in this transition
>  - In general, I think the longer that development proceeds in separate
> repos the harder it will be to eventually merge the two in a way that
> supports existing users.
>
> Do you think would work?
>
> Paddy
>
> -Original Message-
> From: Jorge Cardoso Leitão 
> Sent: Monday, August 2, 2021 1:59 PM
> To: dev@arrow.apache.org
> Subject: Re: [Discuss] [Rust] Arrow2/parquet2 going foward
>
> Hi,
>
> Sorry for the delay.
>
> If there is a path towards an official release under a <1.0.0 versioning
> schema aligned with the rest of the Rust ecosystem and in line with the
> stability of the API, then IMO we should move all development to within
> Apache experimental asap (I can handle this and the likely IP clearance
> round). If we require a release >=1.X.Y to it and/or a schedule, then I
> prefer to keep expectations aligned and postpone any movement.
>
> Under the move scenario, I was thinking of something like the following:
>
> * gradually stop maintaining "arrow" in crates, offering a maintenance
> window over which we release patches (*)
> * work towards achieving feature parity on arrow2/parquet2 on the
> experimental repos.
> * keep releasing arrow2/parquet2 under a 0.X model during the step above
> (**)
> * migrate to arrow-rs and archive experimentals (***)
> * break arrow2 in smaller crates so that we can version the APIs at a
> different cadence
> * once a crate reaches some stability (this is always opinionated, but it
> is fine), we bump it to 1.0 and announce a maintenance plan ala tokio
> <https://tokio.rs/blog/2020-12-tokio-1-0>.
>
> (*) e.g. "we will continue to patch the arrow crate up to at least 6
> months starting after the first release of arrow2 that supports
> a) nested parquet read and write
> b) union array (including IPC integration tests)
> c) map array (including IPC integration tests)"
>
> (**) officially or un-officially (I would suggest officially so that we
> can acknowledge everyone's work on it, but no strong feelings)
>
> (***) something like:
> 1. place arrow2 on top of a clear arrow repo so that the full contribution
> history up to that point preserved 2. make arrow-rs the home of arrow2
> (i.e. we start releasing arrow2 from
> arrow-rs) and archive the experimental repos; create arrow-rs-parquet or
> something for parquet2.
>
> In summary, the core pain point for me is the current versioning of arrow,
> which I feel is incompatible with my goals for arrow2 and the ecosystem I
> envision it supporting :)
>
> Best,
> Jorge
>
> On Fri, Jul 30, 2021 at 8:44 PM Wes McKinney  wrote:
>
> > I think it would also be fine to push "beta" arrow2 crates out of a
> > repo under apache/ so long as they are not marked on crates.io as
> > being Apache-official releases. There's a possible slippery slope
> > there, but as long as we are on a path to formalizing the releases I
> think it is okay.
> >
> > On Fri, Jul 30, 2021 at 1:07 PM Andrew Lamb 
> wrote:
> >
> > > Jorge -- do you feel like we have a resolution on what to do with
> > > arrow2
> > in
> > > the near term?
> > >
> > > The current state of affairs seems to me that arrow2 is released
> > > from
> > >
> > > https://github.com/jorgecarleitao/arrow2 to crates.io (which is fine).

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-02 Thread Jorge Cardoso Leitão
Hi,

Sorry for the delay.

If there is a path towards an official release under a <1.0.0 versioning
schema aligned with the rest of the Rust ecosystem and in line with the
stability of the API, then IMO we should move all development to within
Apache experimental asap (I can handle this and the likely IP clearance
round). If we require a release >=1.X.Y to it and/or a schedule, then I
prefer to keep expectations aligned and postpone any movement.

Under the move scenario, I was thinking of something like the following:

* gradually stop maintaining "arrow" in crates, offering a maintenance
window over which we release patches (*)
* work towards achieving feature parity on arrow2/parquet2 on the
experimental repos.
* keep releasing arrow2/parquet2 under a 0.X model during the step above
(**)
* migrate to arrow-rs and archive experimentals (***)
* break arrow2 in smaller crates so that we can version the APIs at a
different cadence
* once a crate reaches some stability (this is always opinionated, but it
is fine), we bump it to 1.0 and announce a maintenance plan ala tokio
<https://tokio.rs/blog/2020-12-tokio-1-0>.

(*) e.g. "we will continue to patch the arrow crate up to at least 6 months
starting after the first release of arrow2 that supports
a) nested parquet read and write
b) union array (including IPC integration tests)
c) map array (including IPC integration tests)"

(**) officially or un-officially (I would suggest officially so that we can
acknowledge everyone's work on it, but no strong feelings)

(***) something like:
1. place arrow2 on top of a clear arrow repo so that the full contribution
history up to that point is preserved
2. make arrow-rs the home of arrow2 (i.e. we start releasing arrow2 from
arrow-rs) and archive the experimental repos; create arrow-rs-parquet or
something for parquet2.

In summary, the core pain point for me is the current versioning of arrow,
which I feel is incompatible with my goals for arrow2 and the ecosystem I
envision it supporting :)

Best,
Jorge

On Fri, Jul 30, 2021 at 8:44 PM Wes McKinney  wrote:

> I think it would also be fine to push “beta” arrow2 crates out of a repo
> under apache/ so long as they are not marked on crates.io as being
> Apache-official releases. There’s a possible slippery slope there, but as
> long as we are on a path to formalizing the releases I think it is okay.
>
> On Fri, Jul 30, 2021 at 1:07 PM Andrew Lamb  wrote:
>
> > Jorge -- do you feel like we have a resolution on what to do with arrow2
> in
> > the near term?
> >
> > The current state of affairs seems to me that arrow2 is released from
> > https://github.com/jorgecarleitao/arrow2 to crates.io (which is fine).
> > Are
> > you happy with keeping development in the jorgecarleitao repo where you
> > will retain maximal control and flexibility until it is ready to start
> > integrating?
> >
> > Or would you prefer to put it into one of the apache repos and subject
> its
> > development and release to the normal Arrow governance model (tarball,
> > vote, etc)?
> >
> > Since you are the primary author/architect I think you should have a
> > substantial say at this stage.
> >
> > Andrew
> >
> >
> > On Tue, Jul 27, 2021 at 7:16 PM Andrew Lamb 
> wrote:
> >
> > > I would be happy with this approach. Thank you for the suggestion
> > >
> > > This hybrid approach of both arrow and arrow2 in the same repo seems
> > > better to me than separate repos.
> > >
> > > What I really care about is ensuring we don't have two crates/APIs
> > > indefinitely -- as long as we are continually making progress towards
> > > unification that is what is important to me.
> > >
> > > Andrew
> > >
> > > On Tue, Jul 27, 2021 at 1:40 PM Andy Grove 
> > wrote:
> > >
> > >> Apologies for being late to this discussion.
> > >>
> > >> There is a hybrid option to consider here where we add the arrow2 code
> > >> into
> > >> the arrow crate as a separate module, so we release one crate
> containing
> > >> the "old" API (which we can mark as deprecated) as well as the new
> API.
> > >> Java did a similar thing a long time ago with "java.io" versus
> > "java.nio"
> > >> (new IO).
> > >>
> > >> I agree that the versioning wouldn't be ideal, but this seems like it
> > >> might
> > >> be a pragmatic compromise?
> > >>
> > >> Thanks,
> > >>
> > >> Andy.
> > >>
> > >>
> > >> On Tue, Jul 20, 2021 at 5:41 AM Andrew Lamb 
> > wrote:
> > >>
> > >> > What I meant is that when you decide arrow2 is suitable for release
> to
> > >> > existing arrow users, I stand ready to help you incorporate it into
> > >> arrow.
> > >> >
> > >> > All the feedback I have heard so far from the rest of the community
> is
> > >> that
> > >> > we are ready. One might even say we are anxious to do so :)
> > >> >
> > >> > Andrew
> > >> >
> > >>
> > >
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Neville Dipale

2021-07-29 Thread Jorge Cardoso Leitão
Congratulations, Neville :)

On Fri, Jul 30, 2021 at 8:18 AM QP Hou  wrote:

> Well deserved, congratulations Neville!
>
> On Thu, Jul 29, 2021 at 3:20 PM Wes McKinney  wrote:
> >
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Neville Dipale to become a PMC member and we are pleased to announce
> > that Neville has accepted.
> >
> > Congratulations and welcome!
>


Re: [ANNOUNCE] New Arrow committer: QP Hou

2021-07-26 Thread Jorge Cardoso Leitão
Congratulations and thank you for all the great work! It is a pleasure to
work with you.

Best,
Jorge


On Mon, Jul 26, 2021 at 7:38 PM Niranda Perera 
wrote:

> Congrats QP! :-)
>
> On Mon, Jul 26, 2021 at 1:24 PM Micah Kornfield 
> wrote:
>
> > Congrats QP!
> >
> > On Mon, Jul 26, 2021 at 10:02 AM Andrew Lamb 
> wrote:
> >
> > > Congratulations QP!
> > >
> > > On Mon, Jul 26, 2021 at 10:41 AM Jarek Potiuk 
> wrote:
> > >
> > > > Congrats QP :). I see you are in more Apache projects now :).
> > > >
> > > > J.
> > > >
> > > > On Mon, Jul 26, 2021 at 4:10 PM Daniël Heres 
> > > > wrote:
> > > >
> > > > > Welcome QP!
> > > > >
> > > > > Thanks for the work you are doing on DataFusion / arrow-rs and
> > > delta-rs!
> > > > >
> > > > > Daniël
> > > > >
> > > > > Op ma 26 jul. 2021 om 16:05 schreef Wes McKinney <
> > wesmck...@gmail.com
> > > >:
> > > > >
> > > > > > On behalf of the Arrow PMC, I'm happy to announce that QP has
> > > accepted
> > > > an
> > > > > > invitation to become a committer on Apache Arrow. Welcome, and
> > thank
> > > > you
> > > > > > for your contributions!
> > > > > >
> > > > > > Wes
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Daniël Heres
> > > > >
> > > >
> > > >
> > > > --
> > > > +48 660 796 129
> > > >
> > >
> >
>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 
>


[DISCUSS] Release Python Datafusion 0.3.0

2021-07-20 Thread Jorge Cardoso Leitão
Hi,

I would like to gauge your interest in a release of the Python bindings for
DataFusion.

There have been a tremendous number of updates to it, including support for
Python 3.9.

This release is backward compatible and there are no blockers.

This would be the first time a release of this is cut since it was donated.
This is a binary release, as DataFusion / Rust must be compiled for Windows,
macOS and many Linux targets (see
https://pypi.org/project/datafusion/#files for the previous version's
offering).

My suggestion is to perform a set of operations equivalent to what we do in
apache/arrow:

1. build the different binaries in the CI and store them as CI artifacts
2. download the binaries, verify RAT of source, and sign
3. push to apache dev and create changelog
4. vote
5. push to apache
6. push to pypi
7. create tag in repo
8. announce

I am happy to build the CI tooling for this, but I would like to know your
thoughts or if anyone has a better idea.

Best,
Jorge


Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-07-20 Thread Jorge Cardoso Leitão
Hi,

I meant to stop releasing "arrow" in crates.io and start releasing it as
"arrow2" under a different versioning schema; like "psycopg" -> "psycopg2"
in pypi and others that suffered from large architectural changes that
required a different versioning that better represents the state of the new
API.

> The only thing preventing that movement is for you to decide you are
ready to release it to the wider audience and then let us help you do that.

Uhm? imo that is a bit misleading, but there you go:
https://crates.io/crates/arrow2 and https://crates.io/crates/parquet2 : now
they are available to the wider audience. imo the disagreement here is over
how we version arrow2.

Since there is no consensus, I propose that we postpone this to a later
point when the APIs are mature enough to be released under Arrow's stable
versioning schema. Until then, I need them in crates.io to be able to
gather feedback about the API, its usability, missing stuff, etc.

A bit of a bummer, since I have been holding back releases to crates.io for the
past 6 months over this and other Apache-related bureaucracies, but life goes on.

Best,
Jorge


On Mon, Jul 19, 2021 at 3:05 PM Andrew Lamb  wrote:

> > If we do indeed have the expectation of stability over its whole public
> surface,
>
> I certainly do not have this expectation between major releases. Who does?
>
> I believe it is a disservice to the overall community to release two API
> incompatible Rust implementations of Arrow to crates.io. It will
> 1. potentially confuse new users
> 2. split development effort
> 3. encourage writing more code that relies on the old API.
>
> The Rust Arrow community has been *more than supportive* of the changes you
> are proposing in arrow2 -- there is strong support for switching; The only
> thing preventing that movement is for you to decide you are ready to
> release it to the wider audience and then let us help you do that.
>
> Making major public API changes for additional benefit between arrow 5.0.0
> and arrow 6.0.0 (or other future versions) is perfectly compatible with
> semantic versioning and other software projects.
>
> Andrew
>
> On Mon, Jul 19, 2021 at 2:08 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > Whatever its technical faults may be, projects that rely on arrow (such
> as
> > > anything based on DataFusion, like my own) need to be supported as they
> > > have made the bet on Rust Arrow.
> > >
> >
> > 1.X versioning in Apache Arrow was never meant to represent stability of
> > their individual libraries, but only the stability of the C++/Python and
> > the spec. It is a misconception that Rust implementation is stable and/or
> > ready for production; its version is aligned with Apache Arrow general
> > versioning simply for historical reasons. Requiring arrow2 to also be
> > marked as stable is imo just dragging this onwards.
> >
> > As primary developer of arrow2 and a contributor of some of the major
> > pieces of arrow-rs, I am saying that:
> >
> > * arrow-rs does not have a stable API: it requires large
> incompatible
> > changes to even make it *safe*
> > * arrow2 does not have a stable API: it requires incompatible changes to
> > improve UX, performance, and functionality
> > * using arrow2 core API results in faster, safer, and less error-prone
> code
> >
> > The main difference is that arrow-rs requires API changes to its core
> > modules (buffer, bytes, etc), while arrow2 requires changes to its
> > peripheral modules (compute and IO). This is why imo we can make arrow2
> > available: expected changes will only break a small surface of the public
> > API which, while incompatible, are easy to address.
> >
> > Which is the gist of my proposal:
> >
> >- Arrow2 starts its release in cargo.io as 0.1
> >- A major release (e.g. 0.16.2 -> 1.0.0):
> >   - must be voted
> >   - may be backward incompatible
> >- Minor releases (e.g. 0.16.1 -> 0.17.0):
> >   - must be voted
> >   - may be backward incompatible
> >   - may have new features
> >- Patch releases (e.g. 0.16.1 -> 0.16.2):
> >   - may be voted
> >   - must not be backward compatible
> >   - may have new features
> >- Minor releases may have a maintenance period (e.g. 3+ months) over
> >which we guarantee security patches and feature backports.
> >- Major releases have a maintenance period over which we guarantee
> >security patches and feature backports according to semver 2.0.
> >
> > So that:
> >
> >- It alig

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-07-18 Thread Jorge Cardoso Leitão
Hi,

Whatever its technical faults may be, projects that rely on arrow (such as
> anything based on DataFusion, like my own) need to be supported as they
> have made the bet on Rust Arrow.
>

1.X versioning in Apache Arrow was never meant to represent the stability of
the individual libraries, but only the stability of the C++/Python libraries
and the spec. It is a misconception that the Rust implementation is stable
and/or ready for production; its version is aligned with Apache Arrow's general
versioning simply for historical reasons. Requiring arrow2 to also be
marked as stable is imo just dragging this onwards.

As primary developer of arrow2 and a contributor of some of the major
pieces of arrow-rs, I am saying that:

* arrow-rs does not have a stable API: it requires large incompatible
changes to even make it *safe*
* arrow2 does not have a stable API: it requires incompatible changes to
improve UX, performance, and functionality
* using arrow2 core API results in faster, safer, and less error-prone code

The main difference is that arrow-rs requires API changes to its core
modules (buffer, bytes, etc), while arrow2 requires changes to its
peripheral modules (compute and IO). This is why imo we can make arrow2
available: expected changes will only break a small surface of the public
API which, while incompatible, are easy to address.

Which is the gist of my proposal:

   - Arrow2 starts its release in crates.io as 0.1
   - A major release (e.g. 0.16.2 -> 1.0.0):
  - must be voted
  - may be backward incompatible
   - Minor releases (e.g. 0.16.1 -> 0.17.0):
  - must be voted
  - may be backward incompatible
  - may have new features
   - Patch releases (e.g. 0.16.1 -> 0.16.2):
  - may be voted
  - must be backward compatible
  - may have new features
   - Minor releases may have a maintenance period (e.g. 3+ months) over
   which we guarantee security patches and feature backports.
   - Major releases have a maintenance period over which we guarantee
   security patches and feature backports according to semver 2.0.

So that:

   - It aligns expectations wrt the current state of Rust's
   implementation
   - it offers support to downstream dependencies that require longer-term
   stability
   - it offers room for developers to improve its API, scrutinize security,
   etc.

If we do indeed have an expectation of stability over its whole public
surface, then I suggest that we keep arrow2 in the experimental repo as it
is today.

Btw, this is why some in the Rust community recommend using smaller crates:
so that versioning is not bound to a large public API surface and can thus
more easily be applied to smaller surfaces. There is of course a tradeoff
with maintenance of CI and releases.

Best,
Jorge

On Sat, Jul 17, 2021 at 1:59 PM Andrew Lamb  wrote:

> What if we released "beta" [1] versions of arrow on cargo at whatever pace
> was necessary? That way dependent crates could opt in to bleeding edge
> functionality / APIs.
>
> There is tension between full technical freedom to change APIs and the
> needs of downstream projects for a more stable API.
>
> Whatever its technical faults may be, projects that rely on arrow (such as
> anything based on DataFusion, like my own) need to be supported as they
> have made the bet on Rust Arrow. I don't think we can abandon maintenance
> on the existing codebase until we have a successor ready.
>
> Andrew
>
> p.s. I personally very much like Adam's suggestion for "Arrow 6.0 in Oct
> 2021 be based on arrow2" but that is predicated on wanting to have arrow2
> widely used by downstreams at that point.
>
> [1]
>
> https://stackoverflow.com/questions/46373028/how-to-release-a-beta-version-of-a-crate-for-limited-public-testing
>
>
> On Sat, Jul 17, 2021 at 5:56 AM Adam Lippai  wrote:
>
> > 5.0 is being released right now, which means from timing perspective this
> > is the worst moment for arrow2, indeed. You'd need to wait the full 3
> > months. On the other hand does releasing a 6.0 beta based on arrow2 on
> Aug
> > 1st, rc on Sept 1st and releasing the stable on Oct 1st sound like a bad
> > plan?
> >
> > I don't think a 6.0-beta release would be confusing and dedicating most
> of
> > the 5.0->6.0 cycle to this change doesn't sound excessive.
> >
> > I think this approach wouldn't result in extra work (backporting the
> > important changes to 5.1,5.2 release). It only shows the magnitude of
> this
> > change, the work would be done by you anyways, this would just make it
> > clear this is a huge effort.
> >
> > Best regards,
> > Adam Lippai
> >
> > On Sat, Jul 17, 2021, 11:31 Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com
> > >
> > wrote:
> >

[Discuss] [Rust] Arrow2/parquet2 going foward

2021-07-17 Thread Jorge Cardoso Leitão
Hi,

Arrow2 and parquet2 have passed the IP clearance vote and are ready to be
merged to apache/* repos.

My plan is to merge them and PR to both of them to the latest updates on my
own repo, so that I can temporarily (and hopefully permanently) archive the
versions of my account and move development to apache/*.

Most of the work happening in arrow-rs is backward compatible or simple to
deprecate. However, this situation is different in arrow2 and parquet2. A
release cadence of one major version every 3 months is prohibitive at the pace that I
am plowing through.

The core API (types, alloc, buffer, bitmap, array, mutable array) is imo
stable and not prone to change much, but the non-core API (namely IO and
compute) is prone to change. Examples:

* Add Scalar API to allow dynamic casting over the aggregate kernels and
parquet statistics
* move compute/ from the arrow crate into a separate crate
* move io/ from the arrow crate into a separate crate
* add option to select encoding based on DataType and field name when
writing to parquet

(I will create issues for them in the experimental repos for proper
visibility and discussion).
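
To make the first of those changes concrete, here is a minimal sketch of what a
dynamically castable scalar could look like; the names Scalar and
PrimitiveScalar are illustrative assumptions of mine, not the actual arrow2
design:

use std::any::Any;
use std::fmt::Debug;

// a dynamically typed scalar, so aggregate kernels and parquet statistics can
// return a single Box<dyn Scalar> that callers downcast to a concrete type
trait Scalar: Debug {
    fn as_any(&self) -> &dyn Any;
    fn is_valid(&self) -> bool;
}

#[derive(Debug)]
struct PrimitiveScalar<T: Debug + 'static> {
    value: Option<T>,
}

impl<T: Debug + 'static> Scalar for PrimitiveScalar<T> {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn is_valid(&self) -> bool {
        self.value.is_some()
    }
}

fn main() {
    let max: Box<dyn Scalar> = Box::new(PrimitiveScalar { value: Some(42i32) });
    // dynamic cast back to the concrete scalar type
    let concrete = max
        .as_any()
        .downcast_ref::<PrimitiveScalar<i32>>()
        .expect("expected an i32 scalar");
    assert!(concrete.is_valid());
    assert_eq!(concrete.value, Some(42));
}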

This situation is usually addressed via the 0.X model in semver 2 (in
Python, FastAPI is a prominent example that uses it, and almost all
Rust crates also use it). However, there are a
couple of blockers in this context:

1. We do not allow releases of experimental repos to avoid confusion over
which is *the* official package.
2. arrow-rs is at version 5, and some dependents like IOx/Influx seem to
prefer a slower release cadence for breaking changes.

On the other hand, other parts of the community do not care about this
aspect. Polars for example, the fastest DataFrame in H2O benchmarks,
currently maintains an arrow2 branch that is faster and safer than master
[1], and will be releasing the Python binaries from the arrow2 branch. We
would like to release the Rust API also based on arrow2, which requires it
to be in Cargo.

The best “hack” that I can come up with given the constraints above is to
release arrow2 and parquet2 on crates.io from my personal account so that
dependents can release to crates.io while still making it obvious that they are
not the official release. However, this is obviously not ideal.

Any suggestions?

[1] https://github.com/pola-rs/polars/pull/922

Best,
Jorge


Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

2021-07-13 Thread Jorge Cardoso Leitão
Awesome. Thanks Wes.

I have now initiated the vote for both projects.

Best,
Jorge


On Sat, Jul 10, 2021 at 1:26 PM Wes McKinney  wrote:

> The process for updating the website is described on
>
> https://incubator.apache.org/guides/website.html
>
> It looks like you need to add the new entries to the index.xml file
> and then trigger a website build (which should be triggered by changes
> to SVN, but if not you can trigger one manually through Jenkins).
>
> After the new IP clearance pages are visible you should send an IP
> clearance lazy consensus vote to gene...@incubator.apache.org like
>
>
> https://lists.apache.org/thread.html/r319b85f0f24f9b0529865387ccfe1b2a00a16f394a48144ba25c3225%40%3Cgeneral.incubator.apache.org%3E
>
> On Sat, Jul 10, 2021 at 7:48 AM Jorge Cardoso Leitão
>
>  wrote:
> >
> > Thanks a lot Wes,
> >
> > I am not sure how to proceed from here:
> >
> > 1. how do we generate the html from the xml? I.e.
> > https://incubator.apache.org/ip-clearance/arrow-rust-ballista.html
> > 2. how do I trigger the process to start? Can I just email the
> > incubator with the proposal?
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Mon, Jul 5, 2021 at 10:38 AM Wes McKinney 
> wrote:
> >
> > > Great, thanks for the update and pushing this forward. Let us know if
> > > you need help with anything.
> > >
> > > On Sun, Jul 4, 2021 at 8:26 PM Jorge Cardoso Leitão
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > Wes and Neils,
> > > >
> > > > Thank you for your feedback and offer. I have created the two .xml
> > > reports:
> > > >
> > > >
> > >
> http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-arrow.xml
> > > >
> > >
> http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-parquet.xml
> > > >
> > > > I based them on the report for Ballista. I also requested, on the PRs
> > > > [1,2], clarification wrt to every contributors' contributions to
> each.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > > [1] https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
> > > > [2] https://github.com/apache/arrow-experimental-rs-parquet2/pull/1
> > > >
> > > >
> > > >
> > > > On Mon, Jun 7, 2021 at 11:55 PM Wes McKinney 
> > > wrote:
> > > >
> > > > > On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão
> > > > >  wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Thanks a lot for your feedback. I agree with all the arguments
> put
> > > > > forward,
> > > > > > including Andrew's point about the large change.
> > > > > >
> > > > > > I tried a gradual 4 months ago, but it was really difficult and I
> > > gave
> > > > > up.
> > > > > > I estimate that the work involved is half the work of writing
> > > parquet2
> > > > > and
> > > > > > arrow2 in the first place. The internal dependency on ArrayData
> (the
> > > main
> > > > > > culprit of the unsafe) on arrow-rs is so prevalent that all core
> > > > > components
> > > > > > need to be re-written from scratch (IPC, FFI, IO,
> array/transform/*,
> > > > > > compute, SIMD). I personally do not have the motivation to do it,
> > > though.
> > > > > >
> > > > > > Jed, the public API changes are small for end users. A typical
> > > migration
> > > > > is
> > > > > > [1]. I agree that we can further reduce the change-set by keeping
> > > legacy
> > > > > > interfaces available.
> > > > > >
> > > > > > Andy, on my machine, the current benchmarks on query 1 yield:
> > > > > >
> > > > > > type, master (ms), PR [2] for arrow2+parquet2 (ms)
> > > > > > memory (-m): 332.9, 239.6
> > > > > > load (the initial time in -m with --format parquet): 5286.0,
> 3043.0
> > > > > > parquet format: 1316.1, 930.7
> > > > > > tbl format: 5297.3, 5383.1
> > > > > >
> > > > > > i.e. I am observing som

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

2021-07-09 Thread Jorge Cardoso Leitão
Thanks a lot Wes,

I am not sure how to proceed from here:

1. how do we generate the html from the xml? I.e.
https://incubator.apache.org/ip-clearance/arrow-rust-ballista.html
2. how do I trigger the process to start? Can I just email the
incubator with the proposal?

Best,
Jorge



On Mon, Jul 5, 2021 at 10:38 AM Wes McKinney  wrote:

> Great, thanks for the update and pushing this forward. Let us know if
> you need help with anything.
>
> On Sun, Jul 4, 2021 at 8:26 PM Jorge Cardoso Leitão
>  wrote:
> >
> > Hi,
> >
> > Wes and Neils,
> >
> > Thank you for your feedback and offer. I have created the two .xml
> reports:
> >
> >
> http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-arrow.xml
> >
> http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-parquet.xml
> >
> > I based them on the report for Ballista. I also requested, on the PRs
> > [1,2], clarification wrt to every contributors' contributions to each.
> >
> > Best,
> > Jorge
> >
> > [1] https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
> > [2] https://github.com/apache/arrow-experimental-rs-parquet2/pull/1
> >
> >
> >
> > On Mon, Jun 7, 2021 at 11:55 PM Wes McKinney 
> wrote:
> >
> > > On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thanks a lot for your feedback. I agree with all the arguments put
> > > forward,
> > > > including Andrew's point about the large change.
> > > >
> > > > I tried a gradual 4 months ago, but it was really difficult and I
> gave
> > > up.
> > > > I estimate that the work involved is half the work of writing
> parquet2
> > > and
> > > > arrow2 in the first place. The internal dependency on ArrayData (the
> main
> > > > culprit of the unsafe) on arrow-rs is so prevalent that all core
> > > components
> > > > need to be re-written from scratch (IPC, FFI, IO, array/transform/*,
> > > > compute, SIMD). I personally do not have the motivation to do it,
> though.
> > > >
> > > > Jed, the public API changes are small for end users. A typical
> migration
> > > is
> > > > [1]. I agree that we can further reduce the change-set by keeping
> legacy
> > > > interfaces available.
> > > >
> > > > Andy, on my machine, the current benchmarks on query 1 yield:
> > > >
> > > > type, master (ms), PR [2] for arrow2+parquet2 (ms)
> > > > memory (-m): 332.9, 239.6
> > > > load (the initial time in -m with --format parquet): 5286.0, 3043.0
> > > > parquet format: 1316.1, 930.7
> > > > tbl format: 5297.3, 5383.1
> > > >
> > > > i.e. I am observing some improvements. Queries with joins are still
> > > slower.
> > > > The pruning of parquet groups and pages based on stats are not yet
> > > there; I
> > > > am working on them.
> > > >
> > > > I agree that this should go through IP clearance. I will start this
> > > > process. My thinking would be to create two empty repos on apache/*,
> and
> > > > create 2 PRs from the main branches of each of my repos to those
> repos,
> > > and
> > > > only merge them once IP is cleared. Would that be a reasonable
> process,
> > > Wes?
> > >
> > > This sounds plenty fine to me — I'm happy to assist with the IP
> > > clearance process having done it several times in the past. I don't
> > > have an opinion about the names, but having experimental- in the name
> > > sounds in line with the previous discussion we had about this.
> > >
> > > > Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or?
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > > [1]
> > > >
> > >
> https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
> > > > [2] https://github.com/apache/arrow-datafusion/pull/68
> > > >
> > > >
> > > > On Fri, May 28, 2021 at 5:22 AM Josh Taylor  >
> > > wrote:
> > > >
> > > > > I played around with it, for my use case I really like the new way
> of
> > > > > writing CSVs, it's much more obvious. I love the

Re: [Rust] Eliminate Timezone field from Timestamp types?

2021-07-07 Thread Jorge Cardoso Leitão
Hi,

AFAIK timezone is part of the spec. In Python, that would be [1]

import pyarrow as pa
dt1 = pa.timestamp("ms", "+00:10")
dt2 = pa.timestamp("ms")
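
For comparison, a minimal sketch of the Rust side, assuming the arrow crate's
DataType::Timestamp(TimeUnit, Option<String>) variant as of the 5.x releases,
where the timezone is the optional second field:

use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    // with an explicit timezone offset (mirrors dt1 above)
    let dt1 = DataType::Timestamp(TimeUnit::Millisecond, Some("+00:10".to_string()));
    // without a timezone (mirrors dt2 above)
    let dt2 = DataType::Timestamp(TimeUnit::Millisecond, None);
    println!("{:?} {:?}", dt1, dt2);
}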

arrow-rs is not very consistent with how it handles it. Imo that is an
artifact of it currently being difficult (API wise) to create an array with a
timezone, which has caused people to not use it much (and thus not
implement kernels with it / test it properly).

I do not see how removing it would be compatible with the Arrow spec,
though.

Best,
Jorge

[1] https://arrow.apache.org/docs/python/generated/pyarrow.timestamp.html



On Wed, Jul 7, 2021 at 6:37 PM Evan Chan  wrote:

> Hi folks,
>
> Some of us are having a discussion about a direction change for Rust Arrow
> timestamp types, which current support both a resolution field (Ns, Micros,
> Ms, Seconds) similar to the other language implementations, but also
> optionally a timezone string field.   I believe the timezone field is
> unique to the Rust implementation, as I don’t find it in the C/C++ and
> Python docs.   At the same time, in reality if the timezone field is non
> null, this is not well supported at all in the current code.  Functions
> returning timestamps pretty much all return a null timezone, for example,
> and don’t allow the timezone to be specified.
>
> The proposal would be to eliminate the timezone field and bring the Rust
> Arrow timestamp type in line with that of the other language
> implementations, also simplifying implementation.   It seems this is in
> line with direction of other projects (Parquet, Spark, and most DBs have
> timestamp types which do not have explicit timezones or are implicitly UTC).
>
> Please feel free to see
> https://github.com/apache/arrow-datafusion/issues/686 <
> https://github.com/apache/arrow-datafusion/issues/686>
> (Or would it be better to discuss here in mailing list?)
>
> Cheers!
> Evan


Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

2021-07-04 Thread Jorge Cardoso Leitão
Hi,

Wes and Neils,

Thank you for your feedback and offer. I have created the two .xml reports:

http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-arrow.xml
http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-parquet.xml

I based them on the report for Ballista. I also requested, on the PRs
[1,2], clarification wrt to every contributors' contributions to each.

Best,
Jorge

[1] https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
[2] https://github.com/apache/arrow-experimental-rs-parquet2/pull/1



On Mon, Jun 7, 2021 at 11:55 PM Wes McKinney  wrote:

> On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão
>  wrote:
> >
> > Hi,
> >
> > Thanks a lot for your feedback. I agree with all the arguments put
> forward,
> > including Andrew's point about the large change.
> >
> > I tried a gradual 4 months ago, but it was really difficult and I gave
> up.
> > I estimate that the work involved is half the work of writing parquet2
> and
> > arrow2 in the first place. The internal dependency on ArrayData (the main
> > culprit of the unsafe) on arrow-rs is so prevalent that all core
> components
> > need to be re-written from scratch (IPC, FFI, IO, array/transform/*,
> > compute, SIMD). I personally do not have the motivation to do it, though.
> >
> > Jed, the public API changes are small for end users. A typical migration
> is
> > [1]. I agree that we can further reduce the change-set by keeping legacy
> > interfaces available.
> >
> > Andy, on my machine, the current benchmarks on query 1 yield:
> >
> > type, master (ms), PR [2] for arrow2+parquet2 (ms)
> > memory (-m): 332.9, 239.6
> > load (the initial time in -m with --format parquet): 5286.0, 3043.0
> > parquet format: 1316.1, 930.7
> > tbl format: 5297.3, 5383.1
> >
> > i.e. I am observing some improvements. Queries with joins are still
> slower.
> > The pruning of parquet groups and pages based on stats are not yet
> there; I
> > am working on them.
> >
> > I agree that this should go through IP clearance. I will start this
> > process. My thinking would be to create two empty repos on apache/*, and
> > create 2 PRs from the main branches of each of my repos to those repos,
> and
> > only merge them once IP is cleared. Would that be a reasonable process,
> Wes?
>
> This sounds plenty fine to me — I'm happy to assist with the IP
> clearance process having done it several times in the past. I don't
> have an opinion about the names, but having experimental- in the name
> sounds in line with the previous discussion we had about this.
>
> > Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or?
> >
> > Best,
> > Jorge
> >
> > [1]
> >
> https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
> > [2] https://github.com/apache/arrow-datafusion/pull/68
> >
> >
> > On Fri, May 28, 2021 at 5:22 AM Josh Taylor 
> wrote:
> >
> > > I played around with it, for my use case I really like the new way of
> > > writing CSVs, it's much more obvious. I love the `read_stream_metadata`
> > > function as well.
> > >
> > > I'm seeing a very slight speed (~8ms) improvement on my end, but I
> read a
> > > bunch of files in a directory and spit out a CSV, the bottleneck is the
> > > parsing of lots of files, but it's pretty quick per file.
> > >
> > > old:
> > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_0
> 120224
> > > bytes took 1ms
> > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_1
> 123144
> > > bytes took 1ms
> > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_10
> > > 17127928 bytes took 159ms
> > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_11
> > > 17127144 bytes took 160ms
> > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_12
> > > 17130352 bytes took 158ms
> > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_13
> > > 17128544 bytes took 158ms
> > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_14
> > > 17128664 bytes took 158ms
> > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_15
> > > 17128328 bytes took 158ms
> > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_16
> > > 17129288 bytes took 158ms

[RESULT] [VOTE] Donation of rust arrow2 and parquet2

2021-07-02 Thread Jorge Cardoso Leitão
With 10 binding +1 votes, 3 non-binding +1 votes, and no 0 or -1 votes, the vote passed.

Thank you all for your participation and for this clarification. I will
start work with the incubator for the IP clearance.

Best,
Jorge


Re: [Discuss] Consider renaming "Arrow" in H2O benchmarks?

2021-07-01 Thread Jorge Cardoso Leitão
Hi,

I was not sure what to change there to rename the benchmark, as the benchmark
name seems to be used in several places. I thus started with an issue:
https://github.com/h2oai/db-benchmark/issues/229.

Best,
Jorge


On Fri, Jun 25, 2021 at 1:04 PM Wes McKinney  wrote:

> I recommend sending a PR to the benchmark repo that clarifies that
> it's executing the query using the arrow R/C++ library, when in fact
> the query is actually primarily handled by dplyr and not Arrow at all.
> The benchmark is very misleading in its current form.
>
> On Fri, Jun 25, 2021 at 11:55 AM Jorge Cardoso Leitão
>  wrote:
> >
> > Hi,
> >
> > H2O has a set of benchmarks comparing different query engines [1].
> >
> > There is currently an implementation named "Arrow", backed by the Arrow R
> > implementation [2].
> >
> > This is one of the least performant implementations evaluated. I sense that
> > this may negatively affect the Arrow format, as people will (even if
> > unfairly) associate "Arrow" with "poor performance". In fact, polars and
> > cuDF, the top performers, also use Arrow as their backing in-memory format.
> >
> > Would it make sense to avoid naming specific query engines as "Arrow" (e.g.
> > like we do with DataFusion, Gandiva, etc), so that these misunderstandings
> > are avoided?
> >
> > Best,
> > Jorge
> >
> > [1] https://h2oai.github.io/db-benchmark/
> > [2] https://github.com/h2oai/db-benchmark/tree/master/arrow
>


Re: Improving PR workload management for Arrow maintainers

2021-06-29 Thread Jorge Cardoso Leitão
I just had a quick chat over the ASF's slack with Daniel Gruno from the
infra team and they are rolling out the "triage role" [1] for
non-committers, which AFAIK offers useful tools in this context:

* add/remove labels
* request reviewers
* mark duplicates
* close, reopen, and assign issues and PRs

One does not preclude the other; I just thought it could be useful
information for this topic, as it may cover some of this ground.

Best,
Jorge

[1]
https://docs.github.com/en/organizations/managing-access-to-your-organizations-repositories/repository-permission-levels-for-an-organization


On Tue, Jun 29, 2021 at 3:10 PM Andrew Lamb  wrote:

> The thing that would make me more efficient reviewing PRs is figuring out
> which one of the open reviews are ready for additional feedback.
>
> I think the idea of a webapp or something that shows active reviews would
> be helpful (though I get most of that from appropriate email filters).
>
> What about a system involving labels (for which there is already a basic
> GUI in github)? Something low tech like
>
> (Waiting for Review)
> (Addressing Feedback)
> (Approved, waiting for Merge)
>
> With maybe some automation prompting people to add the "Waiting on Review"
> label when they want feedback
>
> Andrew
>
> On Tue, Jun 29, 2021 at 4:28 AM Wes McKinney  wrote:
>
> > hi folks,
> >
> > I've noted that the volume of PRs for Arrow has been steadily
> > increasing (and will likely continue to increase), and while I've
> > personally had less time for development / maintenance / code reviews
> > over the last year, I would like to have a discussion about what we
> > could do to improve our tooling for maintainers to optimize the
> > efficiency of time spent tending to the PR queue. In my own
> > experience, I have felt that I have wasted a lot of time digging
> > around the queue looking for PRs that are awaiting feedback or need to
> > be merged.
> >
> > I note first of all that around 70 out of 173 open PRs have been
> > updated in the last 7 days, so while there is some PR staleness, to
> > have nearly half of the PRs active is pretty good. That said, ~70
> > active PRs is a lot of PRs to tend to.
> >
> > I scraped the project's code review comment history, and here are the
> > individuals who have left the most comments on PRs since genesis
> >
> > pitrou                6802
> > wesm                  5023
> > emkornfield           3032
> > bkietz                2834
> > kou                   1489
> > nealrichardson        1439
> > fsaintjacques         1356
> > kszucs                1250
> > alamb                 1133
> > jorisvandenbossche    1094
> > liyafan82              831
> > lidavidm               816
> > westonpace             794
> > xhochy                 770
> > nevi-me                643
> > BryanCutler            639
> > jorgecarleitao         635
> > cpcloud                551
> > sunchao                536
> > ianmcook               499
> >
> > Since we're probably stuck using GitHub to receive code contributions
> > (as opposed to systems — Gerrit is one I'm familiar with — that
> > provide more structure for reviewers to track the patches they "own"
> > as well as the outgoing/incoming state of reviews), I am wondering
> > what kinds of tools we could create to make it easier for maintainers
> > to keep track of PRs they are shepherding through the contribution
> > process. Ideally this wouldn't involve maintainers having to engage in
> > some explicit action like assigning themselves as a PR reviewer.
> >
> > Here's one idea: a web application that displays "your reviews", a
> > table of PRs that you have interacted with in any way (commented, left
> > code review, assigned as reviewer, someone mentioned you, etc.) sorted
> > either by last commit or last comment to assess "freshness". So if you
> > comment on a PR or leave a code review, it will automatically show up
> > in "your reviews". It could also indicate whether there has been
> > activity on the PR since the last time you interacted with it.
> >
> > Having now used the GitHub API to pull comments from PRs for the above
> > analysis, there is certainly enough information available to help
> > create this kind of tool. I'd be willing to contribute to building the
> > backend of such a web application.
> >
> > This is just one idea, but I am curious to hear from others who are
> > spending a lot of time doing code review / PR merging to see what
> > might help them use their time more effectively.
> >
> > Thanks,
> > Wes
> >
>


[VOTE] Donation of rust arrow2 and parquet2

2021-06-26 Thread Jorge Cardoso Leitão
Hi,

I would like to bring to this mailing list a proposal to donate the source
code of arrow2 [1] and parquet2 [2] as experimental repositories [3] within
Apache Arrow, conditional on IP clearance.

The specific PRs are:

* https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
* https://github.com/apache/arrow-experimental-rs-parquet2/pull/1

The source code contains rewrites of the arrow and parquet crates with
safety and security in mind. In particular,

* no buffer transmutes
* no unsafe APIs marked as safe
* parquet's implementation is unsafe free

There are many other important features, such as big-endian support and IPC
2.0 support. There is one regression relative to the latest release: nested
types are not yet supported in parquet read and write. I observe no negative
impact on performance.

See a longer discussion in [4] of the reasons why the current Rust
implementation is susceptible to safety violations. In particular, many
core APIs of the crate are considered security vulnerabilities under
RustSec's [5] definitions, and are difficult to address under its current
design.

I validated that it is possible to migrate DataFusion [6] and Polars [7]
without further code changes.

The vote will be open for at least 72 hours.

[ ] +1 Accept the code donation as experimental repos.
[ ] +0
[ ] -1 Do not accept the code donation as experimental repos because...

[1] https://github.com/jorgecarleitao/arrow2
[2] https://github.com/jorgecarleitao/parquet2
[3]
https://github.com/apache/arrow/blob/master/docs/source/developers/experimental_repos.rst
[4] https://github.com/jorgecarleitao/arrow2#faq
[5] https://rustsec.org/
[6] https://github.com/apache/arrow-datafusion/pull/68
[7] https://github.com/pola-rs/polars


Re: [VOTE][RUST] Release Apache Arrow Rust 4.4.0 RC1

2021-06-25 Thread Jorge Cardoso Leitão
+1

Ran the verification script on an Intel-based Mac.


On Fri, Jun 25, 2021 at 12:16 AM Andrew Lamb  wrote:

> Hi,
>
> I would like to propose a release of Apache Arrow Rust Implementation,
> version 4.4.0.
>
> This release candidate is based on commit:
> 32b835e5bee228d8a52015190596f4c33765849a [1]
>
> The proposed release tarball and signatures are hosted at [2].
>
> The changelog is located at [3].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow Rust
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Rust  because...
>
> [1]:
>
> https://github.com/apache/arrow-rs/tree/32b835e5bee228d8a52015190596f4c33765849a
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-4.4.0-rc1
> [3]:
>
> https://github.com/apache/arrow-rs/blob/32b835e5bee228d8a52015190596f4c33765849a/CHANGELOG.md
>


Re: [VOTE] Clarify meaning of timestamp without time zone to equal the concept of "LocalDateTime"

2021-06-25 Thread Jorge Cardoso Leitão
+1

On Fri, Jun 25, 2021 at 7:47 PM Julian Hyde  wrote:

> +1
>
> > On Jun 25, 2021, at 10:36 AM, Antoine Pitrou  wrote:
> >
> >
> > Le 24/06/2021 à 21:16, Weston Pace a écrit :
> >> The discussion in [1] led to the following proposal which I would like
> >> to submit for a vote.
> >> ---
> >> Arrow allows a timestamp column to omit the time zone property.  This
> >> has caused confusion because some people have interpreted a timestamp
> >> without a time zone to be an Instant while others have interpreted it
> >> to be a LocalDateTime.
> >> This proposal is to clarify the Arrow schema (via comments) and assert
> >> that a timestamp without a time zone should be interpreted as
> >> LocalDateTime.
> >> Note: For definitions of Instant and LocalDateTime (and a discussion
> >> on the semantics) please refer to [3]
> >> ---
> >
> > +1
>
>


[Discuss] Consider renaming "Arrow" in H2O benchmarks?

2021-06-25 Thread Jorge Cardoso Leitão
Hi,

H2O has a set of benchmarks comparing different query engines [1].

There is currently an implementation named "Arrow", backed by the Arrow R
implementation [2].

This is one of the least performant implementations evaluated. I sense that
this may negatively affect the Arrow format, as people will (even if
unfairly) associate "Arrow" to "poor performance". In fact, polars and
cuDF, the top performers, also use Arrow as their backing in-memory format.

Would it make sense to avoid naming specific query engines as "Arrow" (e.g.
like we do with DataFusion, Gandiva, etc), so that these misunderstandings
are avoided?

Best,
Jorge

[1] https://h2oai.github.io/db-benchmark/
[2] https://github.com/h2oai/db-benchmark/tree/master/arrow


Re: [PAPER] Selection vectors and bitmaps for filter results

2021-06-22 Thread Jorge Cardoso Leitão
Thank you for sharing, Wes, an interesting paper indeed.

In Rust we currently use a different strategy.

We build an iterator over ranges [a_i, b_i[ to be selected from the
filter bitmap, and filter the array based on those ranges. For a
single filter, the ranges are iterated as they are being built; for
multiple filters, we persist all ranges ([a_0, b_0[, [a_1, b_1[, ...),
and then apply them to each array (potentially in parallel).

The main advantages are:
1. It allows memcopying the values buffers in slices instead of individual
items, thereby reducing the total number of memcopies performed.
2. When the bitmap has no offset, we efficiently skip all-zero chunks via a
simple comparison (byte == 0).
3. It lends itself well to clustered data (i.e. when there are
clusters of selections in the data).

The best-case scenarios are [0, 0, 0, 0] and [1, 1, 1, 1]; the worst
case is a selectivity pattern like [0, 1, 0, 1, 0, 1], which is
more expensive than a "take", as there are extra cycles spent deriving
the ranges.
I do not have the numbers at hand, but there was a performance improvement
that motivated us to switch strategies.

Besides the obvious memcopies, my understanding is that the cost depends
not so much on the selectivity as on the number of 0/1 transitions
throughout the bitmap.
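
For concreteness, here is a minimal sketch of the range-based approach (a
plain boolean slice stands in for the filter bitmap; names are illustrative
and this is not the actual arrow code):

// Collect the half-open ranges [start, end) of consecutive `true` values
// in a boolean filter mask (a stand-in for the filter bitmap).
fn selected_ranges(mask: &[bool]) -> Vec<std::ops::Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = None;
    for (i, &keep) in mask.iter().enumerate() {
        match (keep, start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                ranges.push(s..i);
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        ranges.push(s..mask.len());
    }
    ranges
}

// Apply the ranges to a values buffer by memcopying whole slices
// instead of individual items.
fn filter_by_ranges<T: Copy>(values: &[T], ranges: &[std::ops::Range<usize>]) -> Vec<T> {
    let mut out = Vec::new();
    for r in ranges {
        out.extend_from_slice(&values[r.clone()]);
    }
    out
}

fn main() {
    let values = [10, 11, 12, 13, 14, 15];
    let mask = [true, true, false, false, true, true];
    let ranges = selected_ranges(&mask);
    assert_eq!(ranges, vec![0..2, 4..6]);
    assert_eq!(filter_by_ranges(&values, &ranges), vec![10, 11, 14, 15]);
}

In the real implementation the ranges are derived from the packed bitmap
itself, which is what enables the byte == 0 fast path mentioned above.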

Best,
Jorge

On Wed, Jun 23, 2021 at 2:05 AM Wes McKinney  wrote:
>
> Some on this list might be interested in a new paper out of CMU/MIT
> about the use of selection vectors and bitmaps for handling the
> intermediate results of filters:
>
> https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf
>
> The research was done in the context of NoisePage which uses Arrow as
> its memory format. I found some of the observations related to AVX512
> to be interesting.


[ANNOUNCE] Apache Arrow 4.0.1 released

2021-06-21 Thread Jorge Cardoso Leitão
The Apache Arrow team is pleased to announce the 4.0.1 release. This
release covers general bug fixes across the different implementations, notably
C++, R, Python, and JavaScript.
The list of resolved issues is available [1], along with the list of contributors [2] and the changelog [3].

As usual, see the install page [4] for instructions on how to install it.

What is Apache Arrow?
-

Apache Arrow is a columnar in-memory analytics layer designed to accelerate big
data. It houses a set of canonical in-memory representations of flat and
hierarchical data along with multiple language-bindings for structure
manipulation. It also provides low-overhead streaming and batch messaging,
zero-copy interprocess communication (IPC), and vectorized in-memory analytics
libraries.

Please report any feedback to the mailing lists ([5])

Regards,
The Apache Arrow community


[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%204.0.1
[2]: https://arrow.apache.org/release/4.0.1.html#contributors
[3]: https://arrow.apache.org/release/4.0.1.html
[4]: https://arrow.apache.org/install/
[5]: https://lists.apache.org/list.html?dev@arrow.apache.org


[Rust] experimental parquet2 repo

2021-06-19 Thread Jorge Cardoso Leitão
Hi,

I have created a new experimental repo [1] to lay foundations for the
parquet2 work within Arrow.

The way I proceeded so far:

1. pushed arrow-rs master to it (from commit 9f56a)
2. removed all arrow-related code and committed
3. removed all (rm -rf *) and committed
4. PRed parquet2, rebased on top of the cleaned repo

The rationale for this sequence of events rests on the following assumptions:

* We wish to keep history and contributors; therefore, steps 1 and 4.
* We wish to only keep commits that are about parquet (like we did for
arrow-rs); therefore, the two-step process of 2 and 3.

This ensures lineage all the way up to the original changes in
apache/arrow for changes pertaining to the
parquet/parquet_derive/parquet_derive_test crates.

I have also PRed parquet2 to it [2]. This PR has 3 contributors and is +10k LOC.

The PR still needs ASF headers, RAT checks, and metadata brought into
compliance; I am working on that. That PR is the one I was planning to
propose for IP clearance.

Wes, does the process around IP clearance amount to creating a thread on the
IP clearance mailing list to initiate the process, like Andy did for
Ballista? I.e., can I go through the archive and follow those steps, or do
you suggest another route?

Best,
Jorge


[1] https://github.com/apache/arrow-experimental-rs-parquet2
[2] https://github.com/apache/arrow-experimental-rs-parquet2/pull/1


Re: post-release tasks (4.0.1)

2021-06-18 Thread Jorge Cardoso Leitão
Sutou was able to push them (we needed to log in via yarn, not npm).

I think that we are all good; the last item was done. I have created a
short post [1] for it.

Do we usually announce it anywhere else?

Best,
Jorge


[1] https://github.com/apache/arrow-site/pull/122




On Sat, Jun 12, 2021 at 5:57 AM Jorge Cardoso Leitão
 wrote:
>
> Thanks a lot, Krisztian.
>
> The JS packages are still missing. I already have access to npm (thanks 
> Sutou). As part of the npm-release.sh in 4.0.1, we require all tests to pass 
> [1]. However, there are tests failing on my computer [2], which blocks the 
> release.
>
> What is the procedure when this happens? We have voted on the release and we 
> have shipped other packages, so my feeling is that this should not block an 
> in-progress release; imo we should not run the tests prior to publication, as 
> there is nothing we can do at that point in time.
>
> One idea is to manually comment out the test-running step for this release
> and open an issue proposing that we not run the tests right before the "publish"
> script, since at that point it is too late to gate the release; that gating
> should happen before we reach this point.
>
> [1] https://github.com/apache/arrow/blob/release-4.0.1/js/npm-release.sh#L23
> [2] https://issues.apache.org/jira/browse/ARROW-13046
>
> Best,
> Jorge
>
>
> On Thu, Jun 10, 2021 at 1:15 PM Krisztián Szűcs  
> wrote:
>>
>> On Thu, Jun 10, 2021 at 6:57 AM Jorge Cardoso Leitão
>>  wrote:
>> >
>> > I have been unable to generate the docs from any of my two machines (my
>> > macbook and a VM on azure), and I do not think we should delay this
>> > further. Could someone kindly create a PR with the generated docs to the
>> > website?
>> Hi!
>>
>> I'm going to update the docs.
>> Are there any remaining post-release tasks?
>>
>> Thanks, Krisztian
>> >
>> > I think that the command amounts to "dev/release/post-09-docs.sh 4.0.1".
>> >
>> > Best,
>> > Jorge
>> >
>> >
>> >
>> >
>> >
>> > On Sun, Jun 6, 2021 at 6:39 AM Jorge Cardoso Leitão <
>> > jorgecarlei...@gmail.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > Sorry for the delay on this, but it is not being easy to build the docs
>> > > [1-5], which is why this is taking some time. It seems that our CI is
>> > > caching docker layers when testing, which causes it to miss new errors
>> > > happening during those layers that are only triggered when the image is
>> > > built from scratch.
>> > >
>> > > Best,
>> > > Jorge
>> > >
>> > > [1] https://issues.apache.org/jira/browse/ARROW-12971
>> > > [2] https://issues.apache.org/jira/browse/ARROW-12846
>> > > [3] https://issues.apache.org/jira/browse/ARROW-12909
>> > > [4] https://issues.apache.org/jira/browse/ARROW-12915
>> > > [5] https://issues.apache.org/jira/browse/ARROW-12954
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, May 31, 2021 at 1:03 PM Jorge Cardoso Leitão <
>> > > jorgecarlei...@gmail.com> wrote:
>> > >
>> > >> Thanks a lot, both.
>> > >>
>> > >> Accepted. Will upload this evening.
>> > >>
>> > >> Best,
>> > >> Jorge
>> > >>
>> > >>
>> > >> On Mon, May 31, 2021 at 12:55 PM Krisztián Szűcs <
>> > >> szucs.kriszt...@gmail.com> wrote:
>> > >>
>> > >>> On Sun, May 30, 2021 at 7:37 PM Jorge Cardoso Leitão
>> > >>>  wrote:
>> > >>> >
>> > >>> > Hi,
>> > >>> >
>> > >>> > Sorry for the delay here.
>> > >>> >
>> > >>> > Below is the list of post-release tasks:
>> > >>> >
>> > >>> 1.  [Krisztian] open a pull request to bump the version numbers in the
>> > >>> source code
>> > >>> 2.  [x] upload source
>> > >>> 3.  [x] upload binaries
>> > >>> 4.  [x] update website
>> > >>> 5.  [x] upload ruby gems
>> > >>> 6.  [Jorge] upload js packages
>> > >>> 8.  [x] upload C# packages
>> > >>> 9.  [Won't do] upload rust crates
>> > >>> 10. [x] update conda recipes
>> > >>> 11. [x] upload wheels/sdist to pypi
>> > >>> 12. [x] update homebrew packages
>> > >>> 13. [x] update maven artifacts
>> > >>> 14. [x] update msys2
>> > >>> 15. [x] update R packages
>> > >>> 16. [Jorge] update docs (in progress, Jorge)
>> > >>> >
>> > >>> > * Could someone add me to npmjs so that I can publish it there
>> > >>> > (jorgecarleitao [1])?
>> > >>> Sent you an invite, could you please check it?
>> > >>> >
>> > >>> > Thank you for your patience,
>> > >>> > Jorge
>> > >>> >
>> > >>> > [1] https://www.npmjs.com/~jorgecarleitao
>> > >>>
>> > >>


Re: Future of Rust sync call

2021-06-18 Thread Jorge Cardoso Leitão
Hi Wes,

Yes, on ASF Slack, #arrow-rust. Andy advertised it here some time ago.

Most relevant topics there end up either as a GitHub issue or on
this mailing list. On this note, hat tip to Andrew, who has been doing
a lot of the curation.

There are other informal discussions, more about Rust language
developments, projects using DataFusion, practical questions, etc.
Being on both Slack and Ursa Labs' Zulip, I would say they operate at the
same level: some initial discussion of an idea => move it to the issue
tracker / mailing list.

Best,
Jorge






On Fri, Jun 18, 2021 at 4:07 PM Wes McKinney  wrote:
>
> hi Jorge — there is a Rust Slack channel? On that, I would just say to
> be vigilant about what communication takes place there (since Slack is
> semi-private) versus on channels that are being archived / mirrored to
> mailing lists. It's useful for coordination and quick questions but
> not a place to make make decisions.
>
> Thanks,
> Wes
>
> On Fri, Jun 18, 2021 at 12:14 AM Jorge Cardoso Leitão
>  wrote:
> >
> > Hi,
> >
> > I agree that the communication improved a lot with moving the issues
> > to Github and slack, which made the sync call less relevant.
> >
> > Best,
> > Jorge
> >
> >
> > On Thu, Jun 17, 2021 at 11:55 PM Andrew Lamb  wrote:
> > >
> > > I think dropping back from the Rust sync call and using the regular Arrow
> > > Sync call should that be necessary is a good idea
> > >
> > > Andrew
> > >
> > >
> > > On Thu, Jun 17, 2021 at 12:27 PM Andy Grove  wrote:
> > >
> > > > I would like to propose canceling the bi-weekly Rust sync call, at 
> > > > least in
> > > > its current form.
> > > >
> > > > The call has not been very active since we moved to the new GitHub
> > > > repositories and implemented some changes to the development process. It
> > > > seems that the Rust community is communicating well without the need for
> > > > this Rust-specific sync call, and we can always join the regular Arrow 
> > > > sync
> > > > call if there are issues that need discussing.
> > > >
> > > > If there are no objections, I will create a PR soon to remove references
> > > > for this call from our documentation.
> > > >
> > > > Andy.
> > > >


[Question] Rationale for offsets instead of deltas

2021-06-17 Thread Jorge Cardoso Leitão
Hi,

(this has no direction; I am just genuinely curious)

I am wondering: what is the rationale for using "offsets" instead of
"lengths" to represent variable-sized arrays?

I.e. ["a", "", None, "ab"] is represented as

offsets: [0, 1, 1, 1, 3]
values: "aab"

what is the reasoning to use this over

lengths: [1, 0, 0, 2]
values: "aab"

I am asking this because I have seen people using the LargeUtf8 type,
or breaking record batches into chunks, to avoid hitting the i32 ceiling
on large string arrays.

Is it to ensure O(1) random access (instead of having to sum all
deltas up to the index)?
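
To make the trade-off concrete, here is a minimal sketch (illustrative only,
not Arrow code) of random access under the two layouts for the example above:

// Hypothetical illustration of the two layouts for ["a", "", None, "ab"].
fn value_with_offsets<'a>(values: &'a str, offsets: &[i32], i: usize) -> &'a str {
    // O(1): two lookups, independent of i.
    &values[offsets[i] as usize..offsets[i + 1] as usize]
}

fn value_with_lengths<'a>(values: &'a str, lengths: &[i32], i: usize) -> &'a str {
    // O(i): the start position must be rebuilt by summing all previous lengths.
    let start: i32 = lengths[..i].iter().sum();
    &values[start as usize..(start + lengths[i]) as usize]
}

fn main() {
    let values = "aab";
    let offsets = [0, 1, 1, 1, 3];
    let lengths = [1, 0, 0, 2];
    assert_eq!(value_with_offsets(values, &offsets, 3), "ab");
    assert_eq!(value_with_lengths(values, &lengths, 3), "ab");
}

With offsets, slicing an array is also just a matter of narrowing the offset
window, whereas with lengths the starting position would have to be recomputed.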

Best,
Jorge


Re: Future of Rust sync call

2021-06-17 Thread Jorge Cardoso Leitão
Hi,

I agree that the communication improved a lot with moving the issues
to Github and slack, which made the sync call less relevant.

Best,
Jorge


On Thu, Jun 17, 2021 at 11:55 PM Andrew Lamb  wrote:
>
> I think dropping back from the Rust sync call and using the regular Arrow
> Sync call should that be necessary is a good idea
>
> Andrew
>
>
> On Thu, Jun 17, 2021 at 12:27 PM Andy Grove  wrote:
>
> > I would like to propose canceling the bi-weekly Rust sync call, at least in
> > its current form.
> >
> > The call has not been very active since we moved to the new GitHub
> > repositories and implemented some changes to the development process. It
> > seems that the Rust community is communicating well without the need for
> > this Rust-specific sync call, and we can always join the regular Arrow sync
> > call if there are issues that need discussing.
> >
> > If there are no objections, I will create a PR soon to remove references
> > for this call from our documentation.
> >
> > Andy.
> >


Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-17 Thread Jorge Cardoso Leitão
Thank you everyone for participating so far; really important and
useful discussion.

I think of this discussion as a set of test cases over behavior:

parameterization:
* Timestamp(ms, None)
* Timestamp(ms, "00:00")
* Timestamp(ms, "01:00")

Cases:
* its string representation equals to
* add a duration equals to
* add an interval equals to
* subtract a Timestamp(ms, None) equals to
* subtract a Timestamp(ms, "01:00") equals to
* subtract a Date32 equals to
* subtract a Time32(ms) equals to
* extract the day equals to
* extract the timezone equals to
* cast to Timestamp(ms, None) equals to
* cast to Timestamp(ms, "01:00") equals to
* write to parquet v2 equals to (physical value and logical type)

In all cases, the result may either be valid or invalid. If valid, we
would need a datatype and an actual value.
I was hoping to be able to answer each of the above at the end of this
discussion.

I've suggested adding these to the Google doc.

Best,
Jorge

On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield  wrote:
>
> I've posted the examples above in
> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> because I think it would be better to collaborate there instead of linear
> e-mail history and then bring the consensus back to the list.
>
> On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield 
> wrote:
>
> > I feel like we might still be talking past each other here or at least I
> > don't understand the two sides of this.  I'll try to expand Weston's
> > example because I think it provides the best clarification.
> >
> > (1970, 1, 2, 14, 0) is stored as 0x0A4CB800 (17280, assuming
> > ms) for a timestamp column without timezone (always).   This represents an
> > offset from the unix epoch.  This interpretation should not change based on
> > the local system timezone.  Extracting the hour field always yields 14
> > (extraction is done relative to UTC).
> >
> > The alternative here seems to be that we can encode (1970, 1, 2, 14, 0) in
> > multiple different ways depending on what the current local system time
> > is.  As a note, I think ORC and Spark do this, and it leads to
> > confusion/misinterpretation when trying to transfer data.
> >
> > If we then convert this column to a timestamp with a timezone in "UTC"
> > timezone extracting the hour field still yields 14.  If the column is
> > converted to Timezone with timestamp PST.  Extracting an hour would yield 6
> > (assume PST = -8GMT).Through all of these changes the data bits do not
> > change.
> >
> > Display is not mentioned because I think the points about how a time
> > display is correct. Applications can choose what they feel makes sense to
> > them (as long as they don't start automatically tacking on timezones to
> > naive timestamps).  My interpretation of the specification has been display
> > was kind of shorthand for field extraction.
> >
> > Could others on the thread confirm this is the issue up for debate?  Are
> > there subtleties/operations we need to consider?
> >
> > I also agree that we should document recommended conversion practices from
> > other systems.
> >
> > -Micah
> >
> >
> >  So let's invent a third way.  I could use
> >> the first 16 bits for the year, the next 8 bits for the month, the
> >> next 8 bits for the day of month, the next 8 bits for the hour, the
> >> next 8 bits for the minute, and the remaining bits for the seconds.
> >> Using this method I would store (1970, 1, 2, 14, 0) as
> >> 0x07B201020E00.
> >
> > Aside, With some small variation this is what ZetaSql uses [2]
> >
> > [1]
> > https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> > [2]
> > https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
> >
> >
> >
> > On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper  wrote:
> >
> >> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney  wrote:
> >>
> >> >
> >> > The SQL standard (e.g. PostgresSQL) has two timestamp types:
> >> > with/without time zone — in some SQL implementations each slot can
> >> > have a different time zone
> >> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> >> > WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> >> > or given as timezone local time"
> >> >
> >>
> >> RDBMSs conflict (universally) with ANSI.
> >>
> >> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since the epoch.
> >> It has no timezone.
> >>
> >> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant
> >> since the epoch. It has no timezone.
> >>
> >> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive datetime" in
> >> *function*, but not in implementation:
> >>
> >>- MySQL DATETIME
> >><
> >> https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html
> >> >
> >>is weird: 1-bit sign, 17-bit month, 5-bit day, 
> >>- MSSQL
> >><
> >> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-s

Re: post-release tasks (4.0.1)

2021-06-11 Thread Jorge Cardoso Leitão
Thanks a lot, Krisztian.

The JS packages are still missing. I already have access to npm (thanks
Sutou). As part of the npm-release.sh in 4.0.1, we require all tests to
pass [1]. However, there are tests failing on my computer [2], which blocks
the release.

What is the procedure when this happens? We have voted on the release and
we have shipped other packages, so my feeling is that this should not block
an in-progress release; imo we should not run the tests prior to
publication, as there is nothing we can do at that point in time.

One idea is to manually comment out the test-running step for this release
and open an issue proposing that we not run the tests right before the "publish"
script, since at that point it is too late to gate the release; that gating
should happen before we reach this point.

[1] https://github.com/apache/arrow/blob/release-4.0.1/js/npm-release.sh#L23
[2] https://issues.apache.org/jira/browse/ARROW-13046

Best,
Jorge


On Thu, Jun 10, 2021 at 1:15 PM Krisztián Szűcs 
wrote:

> On Thu, Jun 10, 2021 at 6:57 AM Jorge Cardoso Leitão
>  wrote:
> >
> > I have been unable to generate the docs from any of my two machines (my
> > macbook and a VM on azure), and I do not think we should delay this
> > further. Could someone kindly create a PR with the generated docs to the
> > website?
> Hi!
>
> I'm going to update the docs.
> Are there any remaining post-release tasks?
>
> Thanks, Krisztian
> >
> > I think that the command amounts to "dev/release/post-09-docs.sh 4.0.1".
> >
> > Best,
> > Jorge
> >
> >
> >
> >
> >
> > On Sun, Jun 6, 2021 at 6:39 AM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Sorry for the delay on this, but it is not being easy to build the docs
> > > [1-5], which is why this is taking some time. It seems that our CI is
> > > caching docker layers when testing, which causes it to miss new errors
> > > happening during those layers that are only triggered when the image is
> > > built from scratch.
> > >
> > > Best,
> > > Jorge
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-12971
> > > [2] https://issues.apache.org/jira/browse/ARROW-12846
> > > [3] https://issues.apache.org/jira/browse/ARROW-12909
> > > [4] https://issues.apache.org/jira/browse/ARROW-12915
> > > [5] https://issues.apache.org/jira/browse/ARROW-12954
> > >
> > >
> > >
> > >
> > > On Mon, May 31, 2021 at 1:03 PM Jorge Cardoso Leitão <
> > > jorgecarlei...@gmail.com> wrote:
> > >
> > >> Thanks a lot, both.
> > >>
> > >> Accepted. Will upload this evening.
> > >>
> > >> Best,
> > >> Jorge
> > >>
> > >>
> > >> On Mon, May 31, 2021 at 12:55 PM Krisztián Szűcs <
> > >> szucs.kriszt...@gmail.com> wrote:
> > >>
> > >>> On Sun, May 30, 2021 at 7:37 PM Jorge Cardoso Leitão
> > >>>  wrote:
> > >>> >
> > >>> > Hi,
> > >>> >
> > >>> > Sorry for the delay here.
> > >>> >
> > >>> > Below is the list of post-release tasks:
> > >>> >
> > >>> 1.  [Krisztian] open a pull request to bump the version numbers in
> the
> > >>> source code
> > >>> 2.  [x] upload source
> > >>> 3.  [x] upload binaries
> > >>> 4.  [x] update website
> > >>> 5.  [x] upload ruby gems
> > >>> 6.  [Jorge] upload js packages
> > >>> 8.  [x] upload C# packages
> > >>> 9.  [Won't do] upload rust crates
> > >>> 10. [x] update conda recipes
> > >>> 11. [x] upload wheels/sdist to pypi
> > >>> 12. [x] update homebrew packages
> > >>> 13. [x] update maven artifacts
> > >>> 14. [x] update msys2
> > >>> 15. [x] update R packages
> > >>> 16. [Jorge] update docs (in progress, Jorge)
> > >>> >
> > >>> > * Could someone add me to npmjs so that I can publish it there
> > >>> > (jorgecarleitao [1])?
> > >>> Sent you an invite, could you please check it?
> > >>> >
> > >>> > Thank you for your patience,
> > >>> > Jorge
> > >>> >
> > >>> > [1] https://www.npmjs.com/~jorgecarleitao
> > >>>
> > >>
>


Re: Complex Number support in Arrow

2021-06-10 Thread Jorge Cardoso Leitão
Isn't an array of complex numbers already representable with what Arrow
supports? In particular, I see at least two valid in-memory representations,
whose suitability depends on what we are going to do with them:

* Struct[re, im]
* FixedSizeList[2]

In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, ...]; in
the second case, we have one buffer, [x0, y0, x1, y1, ...].

The first representation is useful for column-oriented operations (e.g. taking
the real part is trivial in case 1, but requires a copy in case 2);
the second representation is useful for row-oriented operations (e.g. "take"
and "filter" require a single pass over the single buffer). Case 2 does not
support Re and Im of different physical types (arguably an issue). Both cases
support nullability of the individual parts or of the value as a whole.
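
To illustrate the trade-off, here is a minimal sketch with plain Vecs standing
in for Arrow buffers (illustrative only; these are not Arrow types):

// Struct[re, im]: two separate buffers (column-oriented).
struct StructOfArrays {
    re: Vec<f64>,
    im: Vec<f64>,
}

// FixedSizeList[2]: one interleaved buffer (row-oriented).
struct Interleaved {
    values: Vec<f64>, // [x0, y0, x1, y1, ...]
}

fn main() {
    // The complex array [1 + 2i, 3 + 4i, 5 + 6i] in both layouts.
    let soa = StructOfArrays { re: vec![1.0, 3.0, 5.0], im: vec![2.0, 4.0, 6.0] };
    let il = Interleaved { values: vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0] };

    // Taking the real part: a zero-copy view in the struct layout...
    let _re_view: &[f64] = &soa.re;
    // ...but a strided copy in the interleaved layout.
    let re_copy: Vec<f64> = il.values.iter().step_by(2).copied().collect();
    assert_eq!(re_copy, soa.re);

    // "take"/"filter" of element 1 touches one contiguous slice in the
    // interleaved layout, but two separate buffers in the struct layout.
    assert_eq!(&il.values[2..4], &[3.0, 4.0]);
}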

What I conclude is that this does not seem to be a problem about a base
in-memory representation, but rather on whether we agree on a
representation that justifies adding associated metadata to the spec.

The case for the compound interval type recently proposed [1] is more
compelling to me because ops over intervals usually require all
parts of the interval (and thus the "FixedSizeList" representation fits
better), but each part has a different type. I.e. it is like a
"FixedTypedList[int32, int32, int64]", which we do not natively support.

[1] https://github.com/apache/arrow/pull/10177

Best,
Jorge



On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson 
wrote:

>  It might help this discussion and future discussions like it if we could
> define how it is determined whether a type should be part of the Arrow
> format, an extension type (and what does it mean to say there is a
> "canonical" extension type), or just something that a language
> implementation or downstream library builds for itself with metadata. I
> feel like this has come up before but I don't recall a resolution.
>
> Examples might also help: are there examples of "canonical extension
> types"?
>
> Neal
>
> On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield 
> wrote:
>
> > >
> > > My understanding is that it means having COMPLEX as an entry in the
> > > arrow/type_fwd.h Type enum. I agree this would make implementation
> > > work in the C++ library much more straightforward.
> >
> > One idea I proposed would be to do that, and implement the
> > > serialization of the complex metadata using Extension types.
> >
> >
> > If this is a maintainable strategy for Canonical types it sounds good to
> > me.
> >
> > On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney 
> wrote:
> >
> > > My understanding is that it means having COMPLEX as an entry in the
> > > arrow/type_fwd.h Type enum. I agree this would make implementation
> > > work in the C++ library much more straightforward.
> > >
> > > One idea I proposed would be to do that, and implement the
> > > serialization of the complex metadata using Extension types.
> > >
> > > On Thu, Jun 10, 2021 at 5:47 PM Weston Pace 
> > wrote:
> > > >
> > > > > While dedicated types are not strictly required, compute functions
> > > would
> > > > > be much easier to add for a first-class dedicated complex datatype
> > > > > rather than for an extension type.
> > > > @pitrou
> > > >
> > > > This is perhaps a naive question (and admittedly, I'm not up to speed
> > > > on my compute kernels) but why is this the case?  For example, if
> > > > adding a complex addition kernel it seems we would be talking
> about...
> > > >
> > > > dest_scalar.real = scalar1.real + scalar2.real;
> > > > dest_scalar.im = scalar1.im + scalar2.im;
> > > >
> > > > vs...
> > > >
> > > > dest_scalar[0] = scalar1[0] + scalar2[0];
> > > > dest_scalar[1] = scalar1[1] + scalar2[1];
> > > >
> > > > On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney 
> > > wrote:
> > > > >
> > > > > I'd be supportive of starting with this as a "canonical" extension
> > > > > type so that all implementations are not expected to support
> complex
> > > > > types — this would encourage us to build sufficient integration
> e.g.
> > > > > with NumPy to get things working end-to-end with the on-wire
> > > > > representation being an extension type. We could certainly choose
> to
> > > > > treat the type as "first class" in the C++ library without it being
> > > > > "top level" in the Type union in Flatbuffers.
> > > > >
> > > > > I agree that the use cases are more specialized, and the fact that
> we
> > > > > haven't needed it until now (or at least, its absence suggests
> this)
> > > > > shows that this is the case.
> > > > >
> > > > > On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <
> > emkornfi...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > >
> > > > > > > I'm convinced now that  first-class types seem to be the way to
> > go
> > > and I'm
> > > > > > > happy to take this approach.
> > > > > >
> > > > > > I agree from an implementation effort it is simpler, but I'm
> still
> > > not
> > > > > > convinced that we should be adding this as a first class type.
> As
> > > noted in
> > > > > > the surv

Re: Delta Lake support for DataFusion

2021-06-10 Thread Jorge Cardoso Leitão
Hi,

I agree with all of you. ^_^

I created https://github.com/apache/arrow-datafusion/issues/533 to track
this. I tried to encapsulate the three main use-cases for the SQL
extension. Feel free to edit at will.

Best,
Jorge




On Thu, Jun 10, 2021 at 8:37 AM QP Hou  wrote:

> Thanks Daniël for starting the discussion!
>
> Looks like we are on the same page to take this as an opportunity to
> make datafusion more extensible :)
>
> I think Neville and Daniël nailed the biggest missing piece at the
> moment: being able to extend SQL parser and planner with new syntaxes
> and map them to custom plan/expression nodes.
>
> Another thing that I think we should do is to come up with a way to
> better surface these datafusion extensions to help with discoveries.
> For example, pandas has a dedicated section [1] in their official doc
> for this. Perhaps we could start with adding a list of extensions in
> the readme.
>
> After thinking more on this, I feel like it's better to keep the
> extension within delta-rs for now. In the future, delta-rs will likely
> need to depend on ballista for processing delta table metadata using
> distributed compute. So if we move the extension code into
> arrow-datafusion, it might result in circular dependency. I don't see
> a lot of benefits in creating a dedicated datafusion-delta-rs repo at
> the moment. But I am happy to go that route if there are compelling
> reasons. My main goal is just to make sure we have a single officially
> maintained datafusion extension for delta lake.
>
> [1]: https://pandas.pydata.org/docs/ecosystem.html#io
>
> Thanks,
> QP Hou
>
> On Wed, Jun 9, 2021 at 11:30 AM Daniël Heres 
> wrote:
> >
> > Thanks all for the valuable input!
> >
> > I agree following the plugin / model makes a lot of sense for now (either
> > in arrow-datafusion repo or somewhere external, for example in delta-rs
> if
> > we're OK it not being part of Apache right now).
> >
> > In order to support certain Delta Lake features including SQL syntax we
> > probably need to do make DataFusion a bit more extensible besides what is
> > currently possible with the TableProvider, for example:
> >
> > * Allow registering a custom data format (for supporting things like
> *create
> > external table t stored as parquet*)
> > * Allow parsing and/or handling custom SQL syntax like *optimize*  /
> > *vacuum* / *select * from t version as of n* , etc.
> >
> > And probably some more I don't think of currently. I think this is useful
> > work as it also would enable other "extensions" to work in a similar way
> > (e.g. Apache Iceberg and other formats / readers / writers / syntax) and
> > make DataFusion a more flexible engine.
> >
> > Best, Daniël
> >
> > Op wo 9 jun. 2021 om 20:07 schreef Neville Dipale  >:
> >
> > > The correct approach might be to improve DataFusion support in
> > > delta-rs. TableProvider is already implemented here:
> > >
> https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_datafusion.rs
> > >
> > > I've pinged QP to ask for their advice.
> > >
> > > Neville
> > >
> > > On Wed, 9 Jun 2021 at 19:58, Andrew Lamb  wrote:
> > >
> > > > I think the idea of DataFusion + DeltaLake is quite compelling and
> likely
> > > > useful.
> > > >
> > > > However, I think DataFusion is ideally an  "embeddable query engine"
> > > rather
> > > > than a database system in itself, so in that mental model Delta Lake
> > > > integration belongs somewhere other than the core DataFusion crate.
> > > >
> > > > My ideal structure would be a new crate (maybe not even part of the
> > > Apache
> > > > Arrow Project), perhaps called `datafusion-delta-rs`, that contained
> the
> > > > TableProvider and whatever else was needed to integrate DataFusion
> with
> > > > DeltaLake
> > > >
> > > > This structure could also start a pattern of publishing plugins for
> > > > DataFusion separately from the core.
> > > >
> > > > Andrew
> > > > p.s. now that Arrow is publishing more incrementally (e.g. 4.1.0,
> 4.2.0,
> > > > etc), I think delta-rs[1] and datafusion both only specify `4.x` so
> they
> > > > should work together nicely
> > > >
> > > > https://github.com/delta-io/delta-rs/blame/main/rust/Cargo.toml
> > > >
> > > > On Wed, Jun 9, 2021 at 2:29 AM Daniël Heres 
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to receive some feedback about adding Delta Lake
> support
> > > to
> > > > > DataFusion (https://github.com/apache/arrow-datafusion/issues/525
> ).
> > > > > As you might know, Delta Lake  is a format
> adding
> > > > > features like ACID transactions, statistics, and storage
> optimization
> > > to
> > > > > Parquet and is getting quite some traction for managing data lakes.
> > > > > It seems a great feature to have in DataFusion as well.
> > > > >
> > > > > The delta-rs  project
> provides a
> > > > > native, Apache licensed, Rust implementation of Delta Lake, already
> > > > > supporting a large part of the 

Re: post-release tasks (4.0.1)

2021-06-09 Thread Jorge Cardoso Leitão
I have been unable to generate the docs from any of my two machines (my
macbook and a VM on azure), and I do not think we should delay this
further. Could someone kindly create a PR with the generated docs to the
website?

I think that the command amounts to "dev/release/post-09-docs.sh 4.0.1".

Best,
Jorge





On Sun, Jun 6, 2021 at 6:39 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> Sorry for the delay on this, but it is not being easy to build the docs
> [1-5], which is why this is taking some time. It seems that our CI is
> caching docker layers when testing, which causes it to miss new errors
> happening during those layers that are only triggered when the image is
> built from scratch.
>
> Best,
> Jorge
>
> [1] https://issues.apache.org/jira/browse/ARROW-12971
> [2] https://issues.apache.org/jira/browse/ARROW-12846
> [3] https://issues.apache.org/jira/browse/ARROW-12909
> [4] https://issues.apache.org/jira/browse/ARROW-12915
> [5] https://issues.apache.org/jira/browse/ARROW-12954
>
>
>
>
> On Mon, May 31, 2021 at 1:03 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
>> Thanks a lot, both.
>>
>> Accepted. Will upload this evening.
>>
>> Best,
>> Jorge
>>
>>
>> On Mon, May 31, 2021 at 12:55 PM Krisztián Szűcs <
>> szucs.kriszt...@gmail.com> wrote:
>>
>>> On Sun, May 30, 2021 at 7:37 PM Jorge Cardoso Leitão
>>>  wrote:
>>> >
>>> > Hi,
>>> >
>>> > Sorry for the delay here.
>>> >
>>> > Below is the list of post-release tasks:
>>> >
>>> 1.  [Krisztian] open a pull request to bump the version numbers in the
>>> source code
>>> 2.  [x] upload source
>>> 3.  [x] upload binaries
>>> 4.  [x] update website
>>> 5.  [x] upload ruby gems
>>> 6.  [Jorge] upload js packages
>>> 8.  [x] upload C# packages
>>> 9.  [Won't do] upload rust crates
>>> 10. [x] update conda recipes
>>> 11. [x] upload wheels/sdist to pypi
>>> 12. [x] update homebrew packages
>>> 13. [x] update maven artifacts
>>> 14. [x] update msys2
>>> 15. [x] update R packages
>>> 16. [Jorge] update docs (in progress, Jorge)
>>> >
>>> > * Could someone add me to npmjs so that I can publish it there
>>> > (jorgecarleitao [1])?
>>> Sent you an invite, could you please check it?
>>> >
>>> > Thank you for your patience,
>>> > Jorge
>>> >
>>> > [1] https://www.npmjs.com/~jorgecarleitao
>>>
>>


Re: Delta Lake support for DataFusion

2021-06-09 Thread Jorge Cardoso Leitão
Hi,

Some questions that come to mind:

1. If we add vendor X to DataFusion, will we be open to another vendor Y? How
do we compare vendors? How do we draw the line at "not sufficiently relevant"?
2. How do we ensure that we do not distort the level playing field that some
people expect from DataFusion?
3. How hard is it to build a binary that combines DataFusion with a Delta Lake
custom table provider outside of DataFusion itself?

I see DataFusion's plugin system,

* custom nodes
* custom table providers
* custom physical optimizers
* custom logical optimizers
* UDFs
* UDAFs

as our answer to not bundling vendor-specific implementations (e.g. S3,
Azure, Oracle, IOx, IBM, Google, Delta Lake), instead allowing users to
build applications on top of it with whatever vendor-specific requirements
they have. Rust lends itself really well to this, as dependencies are
maintained in Cargo.toml, and binaries compiled with DataFusion can be
built with plugins and deployed to production environments as a single binary.
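
As a minimal sketch of what that composition looks like (the trait and the
registry here are illustrative stand-ins, not DataFusion's actual
TableProvider API):

use std::collections::HashMap;

// Stand-in for a vendor-specific source such as a Delta Lake table.
trait TableProviderLike {
    fn scan(&self) -> Vec<String>;
}

struct DeltaTableStub;

impl TableProviderLike for DeltaTableStub {
    fn scan(&self) -> Vec<String> {
        vec!["row from a delta table".to_string()]
    }
}

// The engine only knows the trait; vendor crates provide implementations.
#[derive(Default)]
struct Catalog {
    tables: HashMap<String, Box<dyn TableProviderLike>>,
}

impl Catalog {
    fn register(&mut self, name: &str, provider: Box<dyn TableProviderLike>) {
        self.tables.insert(name.to_string(), provider);
    }
}

fn main() {
    let mut catalog = Catalog::default();
    // The downstream binary pulls in the vendor crate and wires it up here;
    // the engine itself stays vendor-agnostic.
    catalog.register("events", Box::new(DeltaTableStub));
    println!("{:?}", catalog.tables["events"].scan());
}

The engine crate only depends on the contract; the vendor-specific crate and
the wiring live in the downstream binary's own Cargo.toml and main().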

AFAIK Delta Lake itself is not bundled with Spark either; it is installed
separately (e.g. via the POM for Java, or pip install delta-spark for Python)
[1]. I think this is a sustainable model whereby we do not have to know
Delta Lake specifics to be able to maintain the code, and instead declare
contracts for extensions, which others maintain for their specific
formats/systems.

Best,
Jorge

[1] https://docs.delta.io/1.0.0/quick-start.html


Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

2021-06-08 Thread Jorge Cardoso Leitão
Semantically, a NaN is defined by IEEE 754 for floating-point numbers, while a
null represents a value that is undefined, unknown, missing, etc.

An important problem that Arrow solves is that it has a native
representation for null values (independent of NaNs): Arrow's in-memory
model is designed from the ground up to support nulls. Other in-memory
representations sometimes use NaN or similar sentinels to represent
nulls, which can break memory layouts that are useful for compute.

In Arrow, each slot of a floating-point array is either "non-null" or "null".
When non-null, it can hold any valid value for the corresponding type; for
floats, that means any valid floating-point number, including NaN, inf,
-0.0, 0.0, etc.
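
A minimal illustration of that separation (plain Vecs standing in for Arrow's
buffers; the real format packs validity into bits):

// Not the arrow crate's actual types; just the idea of values + validity.
struct Float64ArrayStub {
    values: Vec<f64>,    // physical values; slot content is arbitrary when null
    validity: Vec<bool>, // true = non-null
}

impl Float64ArrayStub {
    fn get(&self, i: usize) -> Option<f64> {
        if self.validity[i] { Some(self.values[i]) } else { None }
    }
}

fn main() {
    let array = Float64ArrayStub {
        values: vec![1.0, f64::NAN, 0.0],
        validity: vec![true, true, false],
    };
    assert_eq!(array.get(0), Some(1.0));
    assert!(array.get(1).unwrap().is_nan()); // a valid (non-null) NaN
    assert_eq!(array.get(2), None);          // a null, regardless of the stored value
}

So a NaN round-trips as a valid value and a null stays a null; which Python
object (None or np.NaN) a given library uses to display the null is a choice
made by that library, not by the Arrow format.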

Best,
Jorge



On Tue, Jun 8, 2021 at 9:59 PM Li Jin  wrote:

> Hello!
>
> Apologies if this has been brought before. I'd like to get devs' thoughts
> on this potential inconsistency of "what are the python objects for null
> values" between pandas and pyarrow.
>
> Demonstrated with the following example:
>
> (1)  pandas seems to use "np.NaN" to represent a missing value (with pandas
> 1.2.4):
>
> In [*32*]: df
>
> Out[*32*]:
>
>value
>
> key
>
> 1some_strign
>
>
> In [*33*]: df2
>
> Out[*33*]:
>
> value2
>
> key
>
> 2some_other_string
>
>
> In [*34*]: df.join(df2)
>
> Out[*34*]:
>
>value value2
>
> key
>
> 1some_strign*NaN*
>
>
>
> (2) pyarrow seems to use "None" to represent a missing value (4.0.1)
>
> >>> s = pd.Series(["some_string", np.NaN])
>
> >>> s
>
> 0some_string
>
> 1NaN
>
> dtype: object
>
> >>> pa.Array.from_pandas(s).to_pandas()
>
> 0some_string
>
> 1   None
>
> dtype: object
>
>
> I have looked around the pyarrow doc and didn't find an option to use
> np.NaN for null values with to_pandas so it's a bit hard to get around trip
> consistency.
>
>
> I appreciate any thoughts on this as to how to achieve consistency here.
>
>
> Thanks!
>
> Li
>

