I think it would help if you defined your use case a bit more precisely.

I do handle larger-than-memory datasets with pyarrow using dataset.scan, but
my use case is fairly specific: I am repartitioning and cleaning some rather
large datasets.
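
For reference, this is roughly the pattern I use (a minimal sketch against a
recent pyarrow release; the paths, column names, and partition key are made
up, and older releases expose a slightly different scan API, e.g.
dataset.scan() returning scan tasks, and may not ship write_dataset yet):

import pyarrow.dataset as ds

# Point at a (possibly multi-file, partitioned) Parquet dataset on disk.
dataset = ds.dataset("data/events/", format="parquet")

# Stream record batches lazily: only one batch needs to be materialized at a
# time, so the full dataset never has to fit in memory. The column projection
# and the filter are pushed down into the Parquet reader where possible.
total_rows = 0
for batch in dataset.to_batches(columns=["category", "value"],
                                filter=ds.field("value") > 0):
    total_rows += batch.num_rows

print(f"rows with value > 0: {total_rows}")

# Repartition by streaming the same dataset into a new directory layout;
# write_dataset also consumes its input incrementally rather than loading
# everything at once.
ds.write_dataset(dataset, "data/events_by_category/",
                 format="parquet", partitioning=["category"])

The important part is that only one batch at a time is held in memory, which
is what makes the larger-than-memory case workable.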

BR,

Jacek

On Thu, Oct 22, 2020 at 8:39 PM Jacob Zelko <jacobsze...@gmail.com> wrote:
>
> Hi all,
>
> Very basic question as I have seen conflicting sources. I come from the Julia 
> community and was wondering if Arrow can handle larger-than-memory datasets? 
> I saw this post by Wes McKinney here discussing that the tooling is being 
> laid down:
>
> Table columns in Arrow C++ can be chunked, so that appending to a table is a 
> zero copy operation, requiring no non-trivial computation or memory 
> allocation. By designing up front for streaming, chunked tables, appending to 
> existing in-memory tables is computationally inexpensive relative to pandas 
> now. Designing for chunked or streaming data is also essential for 
> implementing out-of-core algorithms, so we are also laying the foundation for 
> processing larger-than-memory datasets.
>
> ~ Apache Arrow and the “10 Things I Hate About pandas”
>
> And then in the docs I saw this:
>
> The pyarrow.dataset module provides functionality to efficiently work with 
> tabular, potentially larger than memory and multi-file datasets:
>
> * A unified interface for different sources: supporting different sources and 
>   file formats (Parquet, Feather files) and different file systems (local, 
>   cloud).
> * Discovery of sources (crawling directories, handle directory-based 
>   partitioned datasets, basic schema normalization, ..)
> * Optimized reading with predicate pushdown (filtering rows), projection 
>   (selecting columns), parallel reading or fine-grained managing of tasks.
>
> Currently, only Parquet and Feather / Arrow IPC files are supported. The goal 
> is to expand this in the future to other file formats and data sources (e.g. 
> database connections).
>
> ~ Tabular Datasets
>
> The article from Wes was from 2017 and the snippet on Tabular Datasets is 
> from the current documentation for pyarrow.
>
> Could anyone answer this question or at least clear up my confusion for me? 
> Thank you!
>
> --
> Jacob Zelko
> Georgia Institute of Technology - Biomedical Engineering B.S. '20
> Corning Community College - Engineering Science A.S. '17
> Cell Number: (607) 846-8947
