Sure, anything is possible if you want to write the code to do it. You could create a CompressedRecordBatch class where you only decompress a field/column when you need it.
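A minimal sketch of that idea, using only the Python standard library: the class name and the zlib/pickle encoding are illustrative stand-ins (not Arrow APIs), but they show the shape of a batch whose columns stay compressed until first access.

```python
import pickle
import zlib


class LazyCompressedBatch:
    """Hypothetical sketch: hold each column as a compressed blob and
    decompress per column, on demand, caching the result."""

    def __init__(self, columns):
        # columns: dict of name -> list of values
        self._compressed = {
            name: zlib.compress(pickle.dumps(values))
            for name, values in columns.items()
        }
        self._cache = {}

    def column(self, name):
        # Only the requested column is decompressed, and only once.
        if name not in self._cache:
            self._cache[name] = pickle.loads(
                zlib.decompress(self._compressed[name])
            )
        return self._cache[name]


batch = LazyCompressedBatch({"a": list(range(1000)), "b": ["x"] * 1000})
print(batch.column("a")[:3])  # column "b" remains compressed
```

A real implementation on top of Arrow would decompress the column's IPC buffers rather than pickled lists, but the access pattern is the same.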
On Thu, Oct 22, 2020 at 4:05 PM Daniel Nugent <nug...@gmail.com> wrote:
>
> The biggest problem with mapped Arrow data is that it's only possible with
> uncompressed Feather files. Is there any possibility that compressed files
> could become mappable? (I know that you'd have to decompress a given
> RecordBatch to actually work with it, but Feather files should be composed
> of many RecordBatches, right?)
>
> -Dan Nugent
>
> On Thu, Oct 22, 2020 at 4:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> I'm not sure where the conflict in what's written online is, but by
>> virtue of being designed so that data structures do not require their
>> memory buffers to be RAM-resident (i.e., they can reference memory
>> maps), we are set up well to process larger-than-memory datasets. In
>> C++, at least, we are putting the pieces in place to be able to do
>> efficient query execution on on-disk datasets, and it may already be
>> possible in Rust with DataFusion.
>>
>> On Thu, Oct 22, 2020 at 2:11 PM Chris Nuernberger <ch...@techascent.com> wrote:
>>>
>>> There are ways to handle datasets larger than memory. mmap'ing one or
>>> more Arrow files and going from there is a pathway forward here:
>>>
>>> https://techascent.com/blog/memory-mapping-arrow.html
>>>
>>> How this maps to other software ecosystems I don't know, but many
>>> have mmap support.
>>>
>>> On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka <jacek.plis...@gmail.com> wrote:
>>>>
>>>> I believe it would be good if you defined your use case.
>>>>
>>>> I do handle larger-than-memory datasets with pyarrow using
>>>> dataset.scan, but my use case is very specific, as I am
>>>> repartitioning and cleaning somewhat large datasets.
>>>>
>>>> BR,
>>>>
>>>> Jacek
>>>>
>>>> On Thu, Oct 22, 2020 at 8:39 PM Jacob Zelko <jacobsze...@gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Very basic question, as I have seen conflicting sources.
>>>>> I come from the Julia community and was wondering whether Arrow can
>>>>> handle larger-than-memory datasets. I saw this post by Wes McKinney
>>>>> discussing that the tooling is being laid down:
>>>>>
>>>>> "Table columns in Arrow C++ can be chunked, so that appending to a
>>>>> table is a zero-copy operation, requiring no non-trivial computation
>>>>> or memory allocation. By designing up front for streaming, chunked
>>>>> tables, appending to existing in-memory tables is computationally
>>>>> inexpensive relative to pandas now. Designing for chunked or
>>>>> streaming data is also essential for implementing out-of-core
>>>>> algorithms, so we are also laying the foundation for processing
>>>>> larger-than-memory datasets."
>>>>>
>>>>> ~ Apache Arrow and the “10 Things I Hate About pandas”
>>>>>
>>>>> And then in the docs I saw this:
>>>>>
>>>>> "The pyarrow.dataset module provides functionality to efficiently
>>>>> work with tabular, potentially larger-than-memory, and multi-file
>>>>> datasets:
>>>>>
>>>>> - A unified interface for different sources: supporting different
>>>>>   sources and file formats (Parquet, Feather files) and different
>>>>>   file systems (local, cloud).
>>>>> - Discovery of sources (crawling directories, handling
>>>>>   directory-based partitioned datasets, basic schema
>>>>>   normalization, ...).
>>>>> - Optimized reading with predicate pushdown (filtering rows),
>>>>>   projection (selecting columns), and parallel reading or
>>>>>   fine-grained managing of tasks.
>>>>>
>>>>> Currently, only Parquet and Feather / Arrow IPC files are supported.
>>>>> The goal is to expand this in the future to other file formats and
>>>>> data sources (e.g. database connections)."
>>>>>
>>>>> ~ Tabular Datasets
>>>>>
>>>>> The article from Wes was from 2017, and the snippet on Tabular
>>>>> Datasets is from the current documentation for pyarrow.
>>>>>
>>>>> Could anyone answer this question or at least clear up my confusion
>>>>> for me? Thank you!
>>>>>
>>>>> --
>>>>> Jacob Zelko
>>>>> Georgia Institute of Technology - Biomedical Engineering B.S. '20
>>>>> Corning Community College - Engineering Science A.S. '17
>>>>> Cell Number: (607) 846-8947