Sure, anything is possible if you want to write the code to do it. You could create a CompressedRecordBatch class where you only decompress a field/column when you need it.
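A minimal sketch of that idea, using only the Python standard library: the class name and the zlib/pickle encoding are illustrative stand-ins (not Arrow APIs), but they show the shape of a batch whose columns stay compressed until first access.

```python
import pickle
import zlib


class LazyCompressedBatch:
    """Hypothetical sketch: hold each column as a compressed blob and
    decompress per column, on demand, caching the result."""

    def __init__(self, columns):
        # columns: dict of name -> list of values
        self._compressed = {
            name: zlib.compress(pickle.dumps(values))
            for name, values in columns.items()
        }
        self._cache = {}

    def column(self, name):
        # Only the requested column is decompressed, and only once.
        if name not in self._cache:
            self._cache[name] = pickle.loads(
                zlib.decompress(self._compressed[name])
            )
        return self._cache[name]


batch = LazyCompressedBatch({"a": list(range(1000)), "b": ["x"] * 1000})
print(batch.column("a")[:3])  # column "b" remains compressed
```

A real implementation on top of Arrow would decompress the column's IPC buffers rather than pickled lists, but the access pattern is the same.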
On Thu, Oct 22, 2020 at 4:05 PM Daniel Nugent <nug...@gmail.com> wrote:
>
> The biggest problem with mapped Arrow data is that it's only possible with
> uncompressed Feather files. Is there any possibility that compressed files
> could become mappable? (I know that you'd have to decompress a given
> RecordBatch to actually work with it, but Feather files should be composed
> of many RecordBatches, right?)
>
> -Dan Nugent
>
> On Thu, Oct 22, 2020 at 4:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> I'm not sure where the conflict in what's written online is, but by
>> virtue of being designed so that data structures do not require their
>> memory buffers to be RAM-resident (i.e., they can reference memory
>> maps), we are set up well to process larger-than-memory datasets. In
>> C++, at least, we are putting the pieces in place to be able to do
>> efficient query execution on on-disk datasets, and it may already be
>> possible in Rust with DataFusion.
>>
>> On Thu, Oct 22, 2020 at 2:11 PM Chris Nuernberger <ch...@techascent.com> wrote:
>>>
>>> There are ways to handle datasets larger than memory. mmap'ing one or
>>> more Arrow files and going from there is a pathway forward here:
>>>
>>> https://techascent.com/blog/memory-mapping-arrow.html
>>>
>>> How this maps to other software ecosystems I don't know, but many
>>> have mmap support.
>>>
>>> On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka <jacek.plis...@gmail.com> wrote:
>>>>
>>>> I believe it would be good if you defined your use case.
>>>>
>>>> I do handle larger-than-memory datasets with pyarrow using
>>>> dataset.scan, but my use case is very specific, as I am
>>>> repartitioning and cleaning somewhat large datasets.
>>>>
>>>> BR,
>>>>
>>>> Jacek
>>>>
>>>> On Thu, Oct 22, 2020 at 8:39 PM Jacob Zelko <jacobsze...@gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Very basic question, as I have seen conflicting sources.
>>>>> I come from the Julia community and was wondering whether Arrow can
>>>>> handle larger-than-memory datasets. I saw this post by Wes McKinney
>>>>> discussing that the tooling is being laid down:
>>>>>
>>>>> "Table columns in Arrow C++ can be chunked, so that appending to a
>>>>> table is a zero-copy operation, requiring no non-trivial computation
>>>>> or memory allocation. By designing up front for streaming, chunked
>>>>> tables, appending to existing in-memory tables is computationally
>>>>> inexpensive relative to pandas now. Designing for chunked or
>>>>> streaming data is also essential for implementing out-of-core
>>>>> algorithms, so we are also laying the foundation for processing
>>>>> larger-than-memory datasets."
>>>>>
>>>>> ~ Apache Arrow and the “10 Things I Hate About pandas”
>>>>>
>>>>> And then in the docs I saw this:
>>>>>
>>>>> "The pyarrow.dataset module provides functionality to efficiently
>>>>> work with tabular, potentially larger-than-memory, and multi-file
>>>>> datasets:
>>>>>
>>>>> - A unified interface for different sources: supporting different
>>>>>   sources and file formats (Parquet, Feather files) and different
>>>>>   file systems (local, cloud).
>>>>> - Discovery of sources (crawling directories, handling
>>>>>   directory-based partitioned datasets, basic schema
>>>>>   normalization, ...).
>>>>> - Optimized reading with predicate pushdown (filtering rows),
>>>>>   projection (selecting columns), and parallel reading or
>>>>>   fine-grained managing of tasks.
>>>>>
>>>>> Currently, only Parquet and Feather / Arrow IPC files are supported.
>>>>> The goal is to expand this in the future to other file formats and
>>>>> data sources (e.g. database connections)."
>>>>>
>>>>> ~ Tabular Datasets
>>>>>
>>>>> The article from Wes was from 2017, and the snippet on Tabular
>>>>> Datasets is from the current documentation for pyarrow.
>>>>>
>>>>> Could anyone answer this question or at least clear up my confusion
>>>>> for me? Thank you!
>>>>>
>>>>> --
>>>>> Jacob Zelko
>>>>> Georgia Institute of Technology - Biomedical Engineering B.S. '20
>>>>> Corning Community College - Engineering Science A.S. '17
>>>>> Cell Number: (607) 846-8947