There are ways to handle datasets larger than memory. Memory-mapping
(mmap'ing) one or more Arrow files and going from there is one pathway
forward: https://techascent.com/blog/memory-mapping-arrow.html

How this maps to other software ecosystems I don't know, but many have
mmap support.
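In pyarrow, that route looks roughly like the sketch below (a minimal,
illustrative sketch only; "data.arrow" is a placeholder for a file
written in the Arrow IPC file format):

    import pyarrow as pa

    # Memory-map the file rather than reading it into RAM; record
    # batches are then read zero-copy out of the OS page cache.
    with pa.memory_map("data.arrow", "r") as source:
        reader = pa.ipc.open_file(source)
        # Visit one batch at a time, so only the pages actually
        # touched need to be resident in memory.
        for i in range(reader.num_record_batches):
            batch = reader.get_batch(i)
            ...  # process the pyarrow.RecordBatch

Sketches of the chunked-table behavior and the pyarrow.dataset route
discussed in the quoted thread follow at the bottom of this mail.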
On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka <jacek.plis...@gmail.com> wrote:
> I believe it would be good if you defined your use case.
>
> I do handle larger-than-memory datasets with pyarrow using
> dataset.scan, but my use case is very specific, as I am repartitioning
> and cleaning fairly large datasets.
>
> BR,
>
> Jacek
>
> On Thu, 22 Oct 2020 at 20:39, Jacob Zelko <jacobsze...@gmail.com> wrote:
> >
> > Hi all,
> >
> > Very basic question, as I have seen conflicting sources. I come from the
> > Julia community and was wondering if Arrow can handle larger-than-memory
> > datasets. I saw this post by Wes McKinney discussing how the tooling is
> > being laid down:
> >
> > Table columns in Arrow C++ can be chunked, so that appending to a table
> > is a zero-copy operation, requiring no non-trivial computation or memory
> > allocation. By designing up front for streaming, chunked tables,
> > appending to existing in-memory tables is computationally inexpensive
> > relative to pandas now. Designing for chunked or streaming data is also
> > essential for implementing out-of-core algorithms, so we are also laying
> > the foundation for processing larger-than-memory datasets.
> >
> > ~ Apache Arrow and the "10 Things I Hate About pandas"
> >
> > And then in the docs I saw this:
> >
> > The pyarrow.dataset module provides functionality to efficiently work
> > with tabular, potentially larger than memory and multi-file datasets:
> >
> > - A unified interface for different sources: supporting different
> >   sources and file formats (Parquet, Feather files) and different file
> >   systems (local, cloud).
> > - Discovery of sources (crawling directories, handling directory-based
> >   partitioned datasets, basic schema normalization, ...)
> > - Optimized reading with predicate pushdown (filtering rows), projection
> >   (selecting columns), parallel reading, or fine-grained managing of
> >   tasks.
> >
> > Currently, only Parquet and Feather / Arrow IPC files are supported. The
> > goal is to expand this in the future to other file formats and data
> > sources (e.g. database connections).
> >
> > ~ Tabular Datasets
> >
> > The article from Wes is from 2017, and the snippet on Tabular Datasets
> > is from the current documentation for pyarrow.
> >
> > Could anyone answer this question or at least clear up my confusion for
> > me? Thank you!
> >
> > --
> > Jacob Zelko
> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
> > Corning Community College - Engineering Science A.S. '17
> > Cell Number: (607) 846-8947
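P.S. To illustrate the chunked-table point from Wes's article quoted
above, a minimal sketch: table columns are chunked arrays, so
concatenating tables stitches the existing chunks together without
copying the underlying buffers.

    import pyarrow as pa

    batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])
    table = pa.Table.from_batches([batch, batch])
    # "Appending" by concatenation is zero-copy: no buffers move,
    # the result just references the chunks of both inputs.
    bigger = pa.concat_tables([table, table])
    print(bigger.column("x").num_chunks)  # -> 4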
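And for the pyarrow.dataset route Jacek describes: scan a directory of
files batch by batch, with column projection and a pushed-down row
filter, so the full dataset never has to fit in memory. The directory,
format, and column names below are made up for the example.

    import pyarrow.dataset as ds

    # Point at a directory of Parquet files; nothing is read yet.
    dataset = ds.dataset("data/", format="parquet")

    # Stream record batches, reading only the requested columns and
    # the rows that pass the filter.
    for batch in dataset.to_batches(columns=["id", "value"],
                                    filter=ds.field("value") > 0):
        ...  # process each pyarrow.RecordBatch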