Re: Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Jacob Quinn
Hi Jacob, Yes, the arrow format allows for larger-than-memory datasets. I can describe a little what this looks like on the Julia side of things, which should be pretty similar in other languages. When you write a dataset to the arrow format, either on disk or in memory, you're laying the data +

Re: Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Wes McKinney
Sure, anything is possible if you want to write the code to do it. You could create a CompressedRecordBatch class where you only decompress a field/column when you need it. On Thu, Oct 22, 2020 at 4:05 PM Daniel Nugent wrote: > > The biggest problem with mapped arrow data is that it's only
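
The idea of decompressing a field/column only when it is needed can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not an Arrow API: the class name CompressedRecordBatch is taken from the message above, while its methods, the use of zlib, and the int64 column layout are assumptions made for the example.

```python
import zlib
from array import array


class CompressedRecordBatch:
    """Hypothetical container (not part of Arrow): every column is kept
    zlib-compressed and is decompressed only on first access."""

    def __init__(self, columns):
        # columns: dict mapping a column name to an array('q') of int64 values
        self._compressed = {
            name: zlib.compress(values.tobytes())
            for name, values in columns.items()
        }
        self._cache = {}  # decompressed columns, filled on demand

    def column(self, name):
        # Decompress the requested column once, then serve it from the cache
        if name not in self._cache:
            values = array("q")
            values.frombytes(zlib.decompress(self._compressed[name]))
            self._cache[name] = values
        return self._cache[name]

    def decompressed_columns(self):
        # Names of the columns that have been decompressed so far
        return sorted(self._cache)


batch = CompressedRecordBatch(
    {"id": array("q", range(1000)), "value": array("q", [7] * 1000)}
)

# Only "value" is touched, so "id" stays compressed
total = sum(batch.column("value"))
```

In this sketch the untouched column never leaves its compressed form, which is the property the message describes; a real implementation would also need to handle eviction, types other than int64, and the codecs Arrow actually uses.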

Re: Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Daniel Nugent
The biggest problem with mapped arrow data is that it's only possible with uncompressed Feather files. Is there ever a possibility that compressed files could be mappable (I know that you'd have to decompress a given RecordBatch to actually work with it, but Feather files should be comprised of

Re: Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Wes McKinney
I'm not sure where the conflict in what's written online is, but by virtue of being designed such that data structures do not require memory buffers to be RAM resident (i.e. can reference memory maps), we are set up well to process larger-than-memory datasets. In C++ at least we are putting the

Re: Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Chris Nuernberger
There are ways to handle datasets larger than memory. mmap'ing one or more arrow files and going from there is a pathway forward here: https://techascent.com/blog/memory-mapping-arrow.html How this maps to other software ecosystems I don't know but many have mmap support. On Thu, Oct 22, 2020
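
As a rough illustration of why memory mapping helps with data larger than RAM, the sketch below maps a plain binary file of int64 values and reads a small window from the middle of it. It uses only the Python standard library; the file layout (raw little-endian int64 values) and the helper name read_slice are assumptions for the example and are not the Arrow file format or any Arrow API.

```python
import mmap
import os
import struct
import tempfile

VALUE_SIZE = 8  # bytes per little-endian int64

# Write 1,000,000 int64 values to a temporary file, 100,000 at a time
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    for start in range(0, 1_000_000, 100_000):
        f.write(struct.pack("<100000q", *range(start, start + 100_000)))


def read_slice(path, start, stop):
    """Return values[start:stop] by mapping the file instead of reading it all.

    Only the pages backing the requested byte range need to be brought
    into memory by the operating system.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw = mm[start * VALUE_SIZE : stop * VALUE_SIZE]
            return list(struct.unpack(f"<{stop - start}q", raw))


window = read_slice(path, 500_000, 500_005)
```

The same principle applies to a mapped Arrow file: because the buffers can reference the mapping directly, a reader can work on one region of the data without the whole file being resident in memory.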

Re: Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Jacek Pliszka
I believe it would be good if you defined your use case. I do handle larger-than-memory datasets with pyarrow using dataset.scan, but my use case is very specific: I am repartitioning large datasets and cleaning them a bit. BR, Jacek On Thu, 22 Oct 2020 at 20:39, Jacob Zelko wrote: > >

Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Jacob Zelko
Hi all, Very basic question as I have seen conflicting sources. I come from the Julia community and was wondering if Arrow can handle larger-than-memory datasets? I saw this post by Wes McKinney here discussing that the tooling is being laid down: Table columns in Arrow C++ can be chunked, so