Great questions.

> is this because internally there is no metadata as to what a RecordBatch
> contains and it has to iterate through all batches or it is just something
> unsupported by api?

The former. Row filtering in parquet relies on row group statistics (min &
max values) and someday we may also support using bloom filters (more than
just min/max) and data page statistics (still min/max but at a finer
resolution). The feather-v2 format (a.k.a Arrow-IPC) does not have any
defined standard for storing row group statistics. However, there is a spot
for it (record batch metadata) and there has been discussion in the past of
adding similar capabilities someday. If someone had enough motivation I
think all the necessary parts are ready so it is mainly just waiting for
someone with motivation and engineering time.

> should I use featherv2 in production if I'm ok with "drawbacks" (larger file,
> less adoption, other stuff I'm not aware of...) or is feather just a poc?

Feather-v1 is something of a proof of concept (although we are maintaining
backwards compatibility with it). Feather-v2, which is sometimes just called
the Arrow IPC format, is definitely intended to be maintained and not just a
proof of concept.

> most references to feather/storing arrow on disk have historically had a
> disclaimer saying it's not meant to replace parquet.

Feather and parquet have different use cases and it's difficult to describe
which is more appropriate as it can depend on a lot of details. As a general
rule of thumb parquet is more space-efficient and should be used when you
are limited by I/O bandwidth. Feather is more CPU-efficient and should be
used when you are limited by CPU bandwidth. However, this is only a rule of
thumb and there are plenty of exceptions.

On Wed, Nov 3, 2021 at 9:01 AM gordon chung <[email protected]> wrote:
>
> hi,
>
> apologies if this in the doc or mailing list somewhere and I missed it but I
> was hoping to understand the arrow file format a bit more.
>
> I noticed that when reading a feather file, the API, at least for Python,
> doesn't support filtering. is this because internally there is no metadata as
> to what a RecordBatch contains and it has to iterate through all batches or
> it is just something unsupported by api? there are references that it
> supports slicing but I'm thinking more like filtering to only get rows
> fitting a specific condition (get rows where col1 == 'a' vs get rows
> 1,3,5...).
>
> also, most references to feather/storing arrow on disk have historically had
> a disclaimer saying it's not meant to replace parquet. that said, the
> featherv2 post does have comparison against parquet and my limited testing
> does show featherv2 performing favourably against it. i guess the question
> is, should I use featherv2 in production if I'm ok with "drawbacks" (larger
> file, less adoption, other stuff I'm not aware of...) or is feather just a
> poc?
>
> thanks,
>
> gord
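
As a footnote to the row filtering answer above, here is a toy sketch in
plain Python of why min/max statistics matter. It is only an illustration of
the idea, not how parquet or pyarrow is actually implemented. The names
RowGroup, write_row_groups and filter_equals are hypothetical and made up
for this example, and the sample column is assumed to be sorted so that the
per-group ranges do not overlap.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of rows plus the min/max statistics a writer would record."""
    rows: List[str]
    min_value: str
    max_value: str


def write_row_groups(values: List[str], group_size: int) -> List[RowGroup]:
    """Split values into fixed-size groups and record min/max for each one."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def filter_equals(groups: List[RowGroup], target: str) -> Tuple[List[str], int]:
    """Return the matching rows and the number of groups that were scanned."""
    matches: List[str] = []
    scanned = 0
    for group in groups:
        # The statistics prove the target cannot be in this group,
        # so it is skipped without looking at any of its rows.
        if target < group.min_value or target > group.max_value:
            continue
        scanned += 1
        matches.extend(value for value in group.rows if value == target)
    return matches, scanned


# Hypothetical sample data for col1, sorted so group ranges do not overlap.
col1 = ["a", "a", "b", "b", "c", "c", "d", "d"]
groups = write_row_groups(col1, group_size=2)
matches, scanned = filter_equals(groups, "a")
print(matches, scanned, len(groups))  # ['a', 'a'] 1 4
```

Only one of the four groups is scanned for col1 == 'a'. Without per-batch
statistics, which is the situation for Feather files today, a reader has no
basis for the skip and has to look at every batch. The benefit also depends
on the data layout: if the values were shuffled, most groups would have
overlapping ranges and few could be skipped.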
