Re: Any standard way for min/max values per record-batch?

Kohei KaiGai Wed, 17 Feb 2021 20:34:21 -0800

Thanks for the clarification.

> There is key-value metadata available on Message which might be able to
> work in the short term (some sort of encoded message).  I think
> standardizing how we store statistics per batch does make sense.
>
For example, a JSON array of min/max values stored as key-value metadata
in Footer->Schema->Fields[]->custom_metadata?
Even though the size of the metadata field must be less than INT_MAX, I
think it is a sufficiently portable and non-invasive way.
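
A rough sketch of this idea, in plain Python with no Arrow library involved.
The key name `min_max_stats` and both helper functions are hypothetical and
not part of any Arrow specification; each record batch is assumed to be
non-empty. The per-batch min/max values of one field are serialized as a JSON
array, kept as a string value in key-value metadata, and then used by a reader
to decide which batches to scan:

```python
import json

# Hypothetical metadata key; nothing in the Arrow format defines it.
STATS_KEY = "min_max_stats"


def build_stats_metadata(batches):
    """Encode one [min, max] pair per record batch as a JSON string.

    `batches` is a list of lists of values for a single field, standing in
    for that field's column in each (non-empty) record batch.
    """
    stats = [[min(values), max(values)] for values in batches]
    return {STATS_KEY: json.dumps(stats)}


def batches_to_scan(metadata, lo, hi):
    """Return indexes of batches whose [min, max] range overlaps [lo, hi]."""
    stats = json.loads(metadata[STATS_KEY])
    return [i for i, (bmin, bmax) in enumerate(stats) if bmax >= lo and bmin <= hi]


# Three batches of integer timestamps for one field.
batches = [[100, 105, 110], [200, 210, 220], [300, 310, 320]]
metadata = build_stats_metadata(batches)
selected = batches_to_scan(metadata, 205, 215)
print(metadata[STATS_KEY])
print(selected)
```

Only the batches listed in `selected` would need to be read; the others can be
skipped without touching their buffers.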


> We unfortunately can't add anything to field-node without breaking
> compatibility.  But  another option would be to add a new structure as a
> parallel list on RecordBatch itself.
>
> If we do add a new structure or arbitrary key-value pair we should not use
> KeyValue but should have something where the values can be bytes.
>
What does the parallel list mean?
If we had a standardized binary structure, like DictionaryBatch,
to store the statistics including min/max values, it would of course make
more sense than text-encoded key-value metadata.
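
To illustrate the difference, here is a minimal sketch assuming a hypothetical
fixed-width layout of two little-endian int64 values (min, max) per record
batch. This layout is an assumption for illustration only, not an existing
Arrow structure. It packs the statistics into bytes and compares the size
with the JSON text form:

```python
import json
import struct

# Hypothetical layout: one (min, max) pair of little-endian int64 per batch.
PAIR = struct.Struct("<qq")


def pack_stats(stats):
    """Pack a list of (min, max) int64 pairs into a single bytes value."""
    return b"".join(PAIR.pack(lo, hi) for lo, hi in stats)


def unpack_stats(blob):
    """Recover the list of (min, max) pairs from the packed bytes."""
    return list(PAIR.iter_unpack(blob))


# Microsecond timestamps for two record batches.
stats = [
    (1613600000000000, 1613600059999999),
    (1613600060000000, 1613600119999999),
]
blob = pack_stats(stats)
text = json.dumps(stats)
print(len(blob), len(text))
```

The binary form costs a fixed 16 bytes per batch and needs no text parsing,
but it can only be carried where the values are allowed to be bytes.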

Best regards,

On Thu, Feb 18, 2021 at 12:37, Micah Kornfield <[email protected]> wrote:
>
> There is key-value metadata available on Message which might be able to
> work in the short term (some sort of encoded message).  I think
> standardizing how we store statistics per batch does make sense.
>
> We unfortunately can't add anything to field-node without breaking
> compatibility.  But  another option would be to add a new structure as a
> parallel list on RecordBatch itself.
>
> If we do add a new structure or arbitrary key-value pair we should not use
> KeyValue but should have something where the values can be bytes.
>
> On Wed, Feb 17, 2021 at 7:17 PM Kohei KaiGai <[email protected]> wrote:
>
> > Hello,
> >
> > Does Apache Arrow have any standard way to embed min/max values of the
> > fields on a per-record-batch basis?
> > It looks like FieldNode supports neither a dedicated min/max attribute
> > nor custom metadata.
> > https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
> >
> > If we embed an array of min/max values into the custom-metadata of the
> > Field-node, we may be able to implement it.
> > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344
> >
> > What I would like to implement is something like the BRIN index in
> > PostgreSQL.
> > http://heterodb.github.io/pg-strom/brin/
> >
> > This index contains only min/max values for particular block ranges, and
> > the query executor can skip blocks that obviously don't contain the
> > target data.
> > If we can skip 9990 of 10000 record batches by checking metadata on a
> > query that tries to fetch items in a very narrow timestamp range, it is
> > a great acceleration compared to full file scans.
> >
> > Best regards,
> > --
> > HeteroDB, Inc / The PG-Strom Project
> > KaiGai Kohei <[email protected]>
> >



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <[email protected]>
