The biggest problem with memory-mapped Arrow data is that it's only possible with uncompressed Feather files. Is there any possibility that compressed files could become mappable? (I know that you'd have to decompress a given RecordBatch to actually work with it, but Feather files should be composed of many RecordBatches, right?)
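(For context, a minimal sketch of the uncompressed case I mean, using pyarrow; the file name here is just a placeholder:)

    import pyarrow as pa
    import pyarrow.feather as feather

    # Write an uncompressed Feather (Arrow IPC) file; only uncompressed files
    # can be memory-mapped without copying/decompressing buffers into RAM.
    table = pa.table({"x": list(range(1_000_000))})
    feather.write_feather(table, "data.feather", compression="uncompressed")

    # Memory-map and read zero-copy: the column buffers reference the mapped
    # pages instead of freshly allocated memory.
    with pa.memory_map("data.feather", "r") as source:
        reader = pa.ipc.open_file(source)
        print(reader.num_record_batches)   # a Feather V2 file can hold many batches
        batch = reader.get_batch(0)        # still backed by the memory map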
-Dan Nugent

On Thu, Oct 22, 2020 at 4:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
> I'm not sure where the conflict in what's written online is, but by virtue of being designed such that data structures do not require memory buffers to be RAM resident (i.e. they can reference memory maps), we are set up well to process larger-than-memory datasets. In C++ at least we are putting the pieces in place to be able to do efficient query execution on on-disk datasets, and it may already be possible in Rust with DataFusion.
>
> On Thu, Oct 22, 2020 at 2:11 PM Chris Nuernberger <ch...@techascent.com> wrote:
> >
> > There are ways to handle datasets larger than memory. mmap'ing one or more Arrow files and going from there is a pathway forward here:
> >
> > https://techascent.com/blog/memory-mapping-arrow.html
> >
> > How this maps to other software ecosystems I don't know, but many have mmap support.
> >
> > On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka <jacek.plis...@gmail.com> wrote:
> >>
> >> I believe it would be good if you defined your use case.
> >>
> >> I do handle larger-than-memory datasets with pyarrow using dataset.scan, but my use case is very specific, as I am repartitioning and cleaning somewhat large datasets.
> >>
> >> BR,
> >>
> >> Jacek
> >>
> >> On Thu, 22 Oct 2020 at 20:39, Jacob Zelko <jacobsze...@gmail.com> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > Very basic question, as I have seen conflicting sources. I come from the Julia community and was wondering if Arrow can handle larger-than-memory datasets. I saw this post by Wes McKinney here discussing that the tooling is being laid down:
> >> >
> >> > Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation. By designing up front for streaming, chunked tables, appending to existing in-memory tables is computationally inexpensive relative to pandas now. Designing for chunked or streaming data is also essential for implementing out-of-core algorithms, so we are also laying the foundation for processing larger-than-memory datasets.
> >> >
> >> > ~ Apache Arrow and the “10 Things I Hate About pandas”
> >> >
> >> > And then in the docs I saw this:
> >> >
> >> > The pyarrow.dataset module provides functionality to efficiently work with tabular, potentially larger than memory and multi-file datasets:
> >> >
> >> > A unified interface for different sources: supporting different sources and file formats (Parquet, Feather files) and different file systems (local, cloud).
> >> > Discovery of sources (crawling directories, handle directory-based partitioned datasets, basic schema normalization, ..)
> >> > Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading or fine-grained managing of tasks.
> >> >
> >> > Currently, only Parquet and Feather / Arrow IPC files are supported. The goal is to expand this in the future to other file formats and data sources (e.g. database connections).
> >> >
> >> > ~ Tabular Datasets
> >> >
> >> > The article from Wes was from 2017 and the snippet on Tabular Datasets is from the current documentation for pyarrow.
> >> >
> >> > Could anyone answer this question or at least clear up my confusion for me? Thank you!
> >> >
> >> > --
> >> > Jacob Zelko
> >> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
> >> > Corning Community College - Engineering Science A.S. '17
> >> > Cell Number: (607) 846-8947
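P.S. For anyone skimming the archives later, a rough sketch of the pyarrow.dataset scanning approach Jacek and the docs describe might look like the following (the directory path, format, column, and filter values are made up for illustration):

    import pyarrow.dataset as ds

    # Discover a (possibly partitioned, multi-file) dataset on disk.
    dataset = ds.dataset("path/to/parquet_dir", format="parquet")

    # Stream record batches with column projection and predicate pushdown,
    # so the full dataset never has to be resident in memory at once.
    for batch in dataset.to_batches(columns=["value"],
                                    filter=ds.field("year") == 2020):
        do_something_with(batch)  # placeholder for per-batch processing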