Re: Any standard way for min/max values per record-batch?

Micah Kornfield Wed, 17 Feb 2021 19:37:53 -0800

There is key-value metadata available on Message, which might work in the
short term (with the statistics stored as some sort of encoded message).  I
think standardizing how we store statistics per batch does make sense.
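
A minimal sketch of that short-term workaround, assuming the statistics are
int64 min/max pairs and that they travel as a single string value (KeyValue
values are strings, so the raw bytes are base64-encoded here).  The key name
"example.batch_stats" and both helper functions are hypothetical, not an
Arrow convention or API:

```python
import base64
import json
import struct

# Hypothetical key name; Arrow does not define a key for batch statistics.
STATS_KEY = "example.batch_stats"


def encode_stats(stats):
    """Turn {column: (min, max)} of int64 values into one string value."""
    encoded = {}
    for name, (lo, hi) in stats.items():
        raw = struct.pack("<qq", lo, hi)  # two little-endian int64 values
        encoded[name] = base64.b64encode(raw).decode("ascii")
    return json.dumps(encoded, sort_keys=True)


def decode_stats(value):
    """Inverse of encode_stats."""
    decoded = {}
    for name, b64 in json.loads(value).items():
        lo, hi = struct.unpack("<qq", base64.b64decode(b64))
        decoded[name] = (lo, hi)
    return decoded


# A plain dict stands in for the key-value metadata attached to one message.
metadata = {STATS_KEY: encode_stats({"ts": (1613600000, 1613603600)})}
```

The base64 step is pure overhead that exists only because the value has to
be a string.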


We unfortunately can't add anything to FieldNode without breaking
compatibility.  But another option would be to add a new structure as a
parallel list on RecordBatch itself.

If we do add a new structure or arbitrary key-value pair, we should not use
KeyValue, whose values are strings, but should have something where the
values can be bytes.
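
As a rough illustration of the parallel-list idea and of the batch skipping
the original question asks about, the sketch below keeps one statistics
entry per field for every record batch and uses it to decide which batches
a range query has to read.  FieldStats, may_contain and batches_to_read are
hypothetical names, and plain integers stand in for the byte-valued
statistics to keep the example short:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldStats:
    """Hypothetical per-field statistics; None means "unknown"."""
    min_value: Optional[int]
    max_value: Optional[int]


def may_contain(stats: FieldStats, lo: int, hi: int) -> bool:
    """Return False only if [lo, hi] cannot overlap the field's range."""
    if stats.min_value is None or stats.max_value is None:
        return True  # no statistics, so the batch must be read
    return stats.max_value >= lo and stats.min_value <= hi


def batches_to_read(per_batch_stats: List[List[FieldStats]],
                    field_index: int, lo: int, hi: int) -> List[int]:
    """Indices of record batches that may hold values in [lo, hi]."""
    return [i for i, fields in enumerate(per_batch_stats)
            if may_contain(fields[field_index], lo, hi)]


# Four batches with a single timestamp-like field; the last has no stats.
stats = [
    [FieldStats(0, 99)],
    [FieldStats(100, 199)],
    [FieldStats(200, 299)],
    [FieldStats(None, None)],
]
selected = batches_to_read(stats, field_index=0, lo=150, hi=160)
```

Treating missing statistics as "may contain" keeps the skip conservative: a
batch is dropped only when its recorded range rules it out.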

On Wed, Feb 17, 2021 at 7:17 PM Kohei KaiGai <kai...@heterodb.com> wrote:

> Hello,
>
> Does Apache Arrow have any standard way to embed min/max values of the
> fields on a per-record-batch basis?
> It looks like FieldNode supports neither a dedicated min/max attribute
> nor custom-metadata.
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
>
> If we embed an array of min/max values into the custom-metadata of the
> Field-node, we may be able to implement it.
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344
>
> What I would like to implement is something like the BRIN index in
> PostgreSQL.
> http://heterodb.github.io/pg-strom/brin/
>
> This index contains only min/max values for particular block ranges, and
> the query executor can skip blocks that obviously don't contain the
> target data.
> If we can skip 9990 of 10000 record batches by checking metadata on a
> query that tries to fetch items in a very narrow timestamp range, it is
> a great acceleration over full file scans.
>
> Best regards,
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <kai...@heterodb.com>
>
