The biggest problem with memory-mapped Arrow data is that it's only possible with uncompressed Feather files. Is there any possibility that compressed files could become mappable? (I know that you'd have to decompress a given RecordBatch to actually work with it, but Feather files should be composed of many RecordBatches, right?)
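(For context, a minimal sketch of the uncompressed case I mean, using pyarrow; the file name here is just a placeholder:)

    import pyarrow as pa
    import pyarrow.feather as feather

    # Write an uncompressed Feather (Arrow IPC) file; only uncompressed files
    # can be memory-mapped without copying/decompressing buffers into RAM.
    table = pa.table({"x": list(range(1_000_000))})
    feather.write_feather(table, "data.feather", compression="uncompressed")

    # Memory-map and read zero-copy: the column buffers reference the mapped
    # pages instead of freshly allocated memory.
    with pa.memory_map("data.feather", "r") as source:
        reader = pa.ipc.open_file(source)
        print(reader.num_record_batches)   # a Feather V2 file can hold many batches
        batch = reader.get_batch(0)        # still backed by the memory map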
-Dan Nugent

On Thu, Oct 22, 2020 at 4:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
> I'm not sure where the conflict in what's written online is, but by virtue of being designed such that data structures do not require memory buffers to be RAM resident (i.e. they can reference memory maps), we are set up well to process larger-than-memory datasets. In C++ at least we are putting the pieces in place to be able to do efficient query execution on on-disk datasets, and it may already be possible in Rust with DataFusion.
>
> On Thu, Oct 22, 2020 at 2:11 PM Chris Nuernberger <ch...@techascent.com> wrote:
> >
> > There are ways to handle datasets larger than memory. mmap'ing one or more Arrow files and going from there is a pathway forward here:
> >
> > https://techascent.com/blog/memory-mapping-arrow.html
> >
> > How this maps to other software ecosystems I don't know, but many have mmap support.
> >
> > On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka <jacek.plis...@gmail.com> wrote:
> >>
> >> I believe it would be good if you defined your use case.
> >>
> >> I do handle larger-than-memory datasets with pyarrow using dataset.scan, but my use case is very specific, as I am repartitioning and cleaning somewhat large datasets.
> >>
> >> BR,
> >>
> >> Jacek
> >>
> >> On Thu, 22 Oct 2020 at 20:39, Jacob Zelko <jacobsze...@gmail.com> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > Very basic question, as I have seen conflicting sources. I come from the Julia community and was wondering if Arrow can handle larger-than-memory datasets. I saw this post by Wes McKinney here discussing that the tooling is being laid down:
> >> >
> >> > Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation. By designing up front for streaming, chunked tables, appending to existing in-memory tables is computationally inexpensive relative to pandas now. Designing for chunked or streaming data is also essential for implementing out-of-core algorithms, so we are also laying the foundation for processing larger-than-memory datasets.
> >> >
> >> > ~ Apache Arrow and the “10 Things I Hate About pandas”
> >> >
> >> > And then in the docs I saw this:
> >> >
> >> > The pyarrow.dataset module provides functionality to efficiently work with tabular, potentially larger than memory and multi-file datasets:
> >> >
> >> > A unified interface for different sources: supporting different sources and file formats (Parquet, Feather files) and different file systems (local, cloud).
> >> > Discovery of sources (crawling directories, handle directory-based partitioned datasets, basic schema normalization, ..)
> >> > Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading or fine-grained managing of tasks.
> >> >
> >> > Currently, only Parquet and Feather / Arrow IPC files are supported. The goal is to expand this in the future to other file formats and data sources (e.g. database connections).
> >> >
> >> > ~ Tabular Datasets
> >> >
> >> > The article from Wes was from 2017 and the snippet on Tabular Datasets is from the current documentation for pyarrow.
> >> >
> >> > Could anyone answer this question or at least clear up my confusion for me? Thank you!
> >> >
> >> > --
> >> > Jacob Zelko
> >> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
> >> > Corning Community College - Engineering Science A.S. '17
> >> > Cell Number: (607) 846-8947
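P.S. For anyone skimming the archives later, a rough sketch of the pyarrow.dataset scanning approach Jacek and the docs describe might look like the following (the directory path, format, column, and filter values are made up for illustration):

    import pyarrow.dataset as ds

    # Discover a (possibly partitioned, multi-file) dataset on disk.
    dataset = ds.dataset("path/to/parquet_dir", format="parquet")

    # Stream record batches with column projection and predicate pushdown,
    # so the full dataset never has to be resident in memory at once.
    for batch in dataset.to_batches(columns=["value"],
                                    filter=ds.field("year") == 2020):
        do_something_with(batch)  # placeholder for per-batch processing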