The Arrow IPC file format is great when you want in-memory representation
and direct computation. It supports compression and dictionary encoding,
and an uncompressed file can be deserialized zero-copy into the in-memory
Arrow format.

Parquet provides some strong functionality, such as statistics, which help
prune unnecessary data during scanning and avoid CPU and I/O cost. It also
has highly efficient encodings, which can make a Parquet file smaller than
an Arrow IPC file holding the same data. However, some Arrow data types
cannot currently be converted to a corresponding Parquet type in the
arrow-cpp implementation. See the Arrow documentation for details.

Adam Lippai <a...@rigo.sk> wrote on Wed, Oct 18, 2023 at 10:50:

> Also there is
> https://github.com/lancedb/lance between the two formats. Depending on the
> use case it can be a great choice.
>
> Best regards
> Adam Lippai
>
> On Tue, Oct 17, 2023 at 22:44 Matt Topol <zotthewiz...@gmail.com> wrote:
>
> > One benefit of the feather format (i.e. Arrow IPC file format) is the
> > ability to mmap the file to easily handle reading sections of a larger
> > than memory file of data. Since, as Felipe mentioned, the format is
> > focused on in-memory representation, you can easily and simply mmap the
> > file and use the raw bytes directly. For a large file that you only want
> > to read sections of, this can be beneficial for IO and memory usage.
> >
> > Unfortunately, you are correct that it doesn't allow for easy column
> > projecting (you're going to read all the columns for a record batch in
> > the file, no matter what). So it's going to be a trade off based on your
> > needs as to whether it makes sense, or if you should use a file format
> > like Parquet instead.
> >
> > -Matt
> >
> >
> > On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho
> > <felipe...@gmail.com> wrote:
> >
> > > It’s not the best since the format is really focused on in-memory
> > > representation and direct computation, but you can do it:
> > >
> > > https://arrow.apache.org/docs/python/feather.html
> > >
> > > —
> > > Felipe
> > >
> > > On Tue, 17 Oct 2023 at 23:26 Nara <narayanan.arunacha...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Is it a good idea to use Apache Arrow as a file format? Looks like
> > > > projecting columns isn't available by default.
> > > >
> > > > One of the benefits of Parquet file format is column projection,
> > > > where the IO is limited to just the columns projected.
> > > >
> > > > Regards,
> > > > Nara
> > > >
> > >
> >
>
