Re: feather file and arrow internals

Weston Pace Wed, 03 Nov 2021 13:55:09 -0700

Great questions.

> is this because internally there is no metadata as to what a RecordBatch 
> contains and it has to iterate through all batches or it is just something 
> unsupported by api?

The former.  Row filtering in parquet relies on row group statistics
(min & max values), and someday we may also support bloom filters
(more than just min/max) and data page statistics (still min/max but
at a finer resolution).  The feather-v2 format (a.k.a. Arrow IPC) does
not have any defined standard for storing equivalent statistics for
its record batches.  However, there is a spot for them (record batch
metadata) and there has been discussion in the past of adding similar
capabilities someday.  I think all the necessary parts are ready, so
it is mainly waiting for someone with the motivation and engineering
time.
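
To make the role of min/max statistics concrete, here is a toy
sketch in plain Python.  It is not how pyarrow or parquet implement
this; the Batch class, make_batch and filter_equal are hypothetical
names made up for illustration.  The point is only that a reader with
per-chunk min/max values can skip chunks that cannot match, while a
reader without them has to scan every chunk.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Batch:
    """A chunk of rows plus min/max for one column (hypothetical layout)."""
    values: List[int]
    min_value: int
    max_value: int


def make_batch(values) -> Batch:
    # Compute the statistics once, at "write" time.
    values = list(values)
    return Batch(values=values, min_value=min(values), max_value=max(values))


def filter_equal(batches: List[Batch], target: int,
                 use_stats: bool) -> Tuple[List[int], int]:
    """Return (matching values, number of batches actually scanned)."""
    matches: List[int] = []
    scanned = 0
    for batch in batches:
        if use_stats and not (batch.min_value <= target <= batch.max_value):
            # The statistics prove no row here can match, so skip the batch.
            continue
        scanned += 1
        matches.extend(v for v in batch.values if v == target)
    return matches, scanned


batches = [
    make_batch(range(0, 100)),
    make_batch(range(100, 200)),
    make_batch(range(200, 300)),
]

# With statistics only one batch is scanned; without them, all three are.
with_stats = filter_equal(batches, 150, use_stats=True)
without_stats = filter_equal(batches, 150, use_stats=False)
print(with_stats)
print(without_stats)
```

Both calls return the same rows; only the number of batches touched
differs, which is the saving that statistics buy you.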

> should I use featherv2 in production if I'm ok with "drawbacks" (larger file, 
> less adoption, other stuff I'm not aware of...) or is feather just a poc?

Feather-v1 is something of a proof of concept (although we are
maintaining backwards compatibility with it).  Feather-v2, which is
sometimes just called the Arrow IPC format, is definitely intended to
be maintained and not just a proof of concept.

> most references to feather/storing arrow on disk have historically had a 
> disclaimer saying it's not meant to replace parquet.

Feather and parquet have different use cases and it's difficult to
describe which is more appropriate as it can depend on a lot of
details.  As a general rule of thumb parquet is more space-efficient
and should be used when you are limited by I/O bandwidth.  Feather is
more CPU-efficient and should be used when you are limited by CPU
time.  However, this is only a rule of thumb and there are plenty of
exceptions.

On Wed, Nov 3, 2021 at 9:01 AM gordon chung <[email protected]> wrote:
>
> hi,
>
> apologies if this in the doc or mailing list somewhere and I missed it but I 
> was hoping to understand the arrow file format a bit more.
>
> I noticed that when reading a feather file, the API, at least for Python, 
> doesn't support filtering. is this because internally there is no metadata as 
> to what a RecordBatch contains and it has to iterate through all batches or 
> it is just something unsupported by api? there are references that it 
> supports slicing but I'm thinking more like filtering to only get rows 
> fitting a specific condition (get rows where col1 == 'a' vs get rows 
> 1,3,5...).
>
> also, most references to feather/storing arrow on disk have historically had 
> a disclaimer saying it's not meant to replace parquet. that said, the 
> featherv2 post does have comparison against parquet and my limited testing 
> does show featherv2 performing favourably against it. i guess the question 
> is, should I use featherv2 in production if I'm ok with "drawbacks" (larger 
> file, less adoption, other stuff I'm not aware of...) or is feather just a 
> poc?
>
> thanks,
>
> gord
