Thanks for the clarification.

> There is key-value metadata available on Message which might be able to
> work in the short term (some sort of encoded message). I think
> standardizing how we store statistics per batch does make sense.
>
For example, a JSON array of min/max values stored as key-value metadata in
Footer->Schema->Fields[]->custom_metadata?
Even though the metadata field must be less than INT_MAX in size, I think it
is a sufficiently portable and non-invasive approach.
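
As a rough illustration of the idea (a sketch only: the key name
"min_max_stats" and the helper functions are hypothetical, and plain Python
lists and a dict stand in for record batches and custom_metadata rather than
any actual Arrow API), the per-batch min/max values could be encoded and used
for skipping like this:

```python
import json

# Hypothetical metadata key; this is NOT an Arrow standard.
STATS_KEY = "min_max_stats"


def encode_stats(batches):
    """Return a JSON array holding one [min, max] pair per record batch."""
    return json.dumps([[min(b), max(b)] for b in batches])


def batches_to_scan(custom_metadata, lo, hi):
    """Return indexes of batches whose [min, max] range overlaps [lo, hi]."""
    stats = json.loads(custom_metadata[STATS_KEY])
    return [
        i
        for i, (bmin, bmax) in enumerate(stats)
        if bmax >= lo and bmin <= hi
    ]


# Three record batches of a single integer (e.g. timestamp) column.
batches = [[1, 5, 9], [10, 14, 19], [20, 25, 29]]

# A str -> str mapping, mimicking the key-value custom_metadata of a field.
custom_metadata = {STATS_KEY: encode_stats(batches)}

# Only batches that may contain values in [12, 21] need to be read.
selected = batches_to_scan(custom_metadata, 12, 21)
```

Here a reader would only need to load the batches listed in `selected` and
could skip the rest without touching their data buffers.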

> We unfortunately can't add anything to field-node without breaking
> compatibility. But another option would be to add a new structure as a
> parallel list on RecordBatch itself.
>
> If we do add a new structure or arbitrary key-value pair we should not use
> KeyValue but should have something where the values can be bytes.
>
What does the parallel list mean?
If we had a standardized binary structure, like DictionaryBatch, to store
the statistics including min/max values, that would of course make more sense
than text-encoded key-value metadata.

Best regards,

On Thu, Feb 18, 2021 at 12:37, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> There is key-value metadata available on Message which might be able to
> work in the short term (some sort of encoded message). I think
> standardizing how we store statistics per batch does make sense.
>
> We unfortunately can't add anything to field-node without breaking
> compatibility. But another option would be to add a new structure as a
> parallel list on RecordBatch itself.
>
> If we do add a new structure or arbitrary key-value pair we should not use
> KeyValue but should have something where the values can be bytes.
>
> On Wed, Feb 17, 2021 at 7:17 PM Kohei KaiGai <kai...@heterodb.com> wrote:
>
> > Hello,
> >
> > Does Apache Arrow have any standard way to embed min/max values of the
> > fields on a per-record-batch basis?
> > It looks like FieldNode supports neither a dedicated min/max attribute
> > nor custom metadata.
> > https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
> >
> > If we embed an array of min/max values into the custom-metadata of the
> > Field-node, we may be able to implement this.
> > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344
> >
> > What I would like to implement is something like the BRIN index in
> > PostgreSQL.
> > http://heterodb.github.io/pg-strom/brin/
> >
> > This index contains only min/max values for particular block ranges, and
> > the query executor can skip blocks that obviously don't contain the
> > target data.
> > If we can skip 9990 of 10000 record batches by checking metadata on a
> > query that tries to fetch items in a very narrow timestamp range, it is
> > a great acceleration compared to full file scans.
> >
> > Best regards,
> > --
> > HeteroDB, Inc / The PG-Strom Project
> > KaiGai Kohei <kai...@heterodb.com>
> >

--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kai...@heterodb.com>