Hi Richard,

I tried to reproduce [1] something akin to what you describe and I
also see worse-than-expected performance. I did find a GitHub issue
[2] describing performance problems with wide record batches, which
might be relevant here, though I'm not sure.
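
For what it's worth, my test was along these lines (the data here is
entirely synthetic, so the sizes, paths, and column names are just
placeholders; the gist has the exact details):

    library(arrow)
    library(dplyr)

    # Write a few wide Feather files (smaller than the real dataset)
    dir.create("wide_dataset", showWarnings = FALSE)
    for (i in 1:5) {
      df <- as.data.frame(matrix(rnorm(500 * 32000), nrow = 500))
      df$ID1 <- i
      df$ID2 <- seq_len(nrow(df))
      write_feather(df, file.path("wide_dataset", sprintf("part-%02d.feather", i)))
    }

    # Open as a dataset and time a narrow column-subset read
    ds <- open_dataset("wide_dataset", format = "feather")
    system.time(
      chunk <- ds %>% select(ID1, ID2, V1, V2) %>% collect()
    )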

Have you tried the same kind of workflow but with Parquet as your
on-disk format instead of Feather?
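
If you want to try that, one low-effort check (just a sketch; the
paths below are placeholders, and you may also want to experiment
with partitioning) is to rewrite the existing Feather dataset as
Parquet and repeat the same narrow read:

    library(arrow)
    library(dplyr)

    # Re-encode the existing Feather dataset as Parquet
    feather_ds <- open_dataset("path/to/feather_dataset", format = "feather")
    write_dataset(feather_ds, "path/to/parquet_dataset", format = "parquet")

    # Repeat the same column-subset read against the Parquet copy
    parquet_ds <- open_dataset("path/to/parquet_dataset", format = "parquet")
    system.time(
      chunk <- parquet_ds %>% select(ID1, ID2, V1, V2) %>% collect()
    )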

[1] https://gist.github.com/amoeba/38591e99bd8682b60779021ac57f146b
[2] https://github.com/apache/arrow/issues/16270

On Wed, Sep 27, 2023 at 10:14 PM Richard Beare <richard.be...@gmail.com> wrote:
>
> Hi,
> I have created (from R) an Arrow dataset consisting of 86 Feather files. 
> Each file is 895 MB, with about 500 rows and 32000 columns. The natural 
> structure of the complete dataframe is an 86*500-row dataframe.
>
> My aim is to load a chunk consisting of all rows and a subset of columns 
> (two ID columns plus 100 other columns), do some manipulation and modelling 
> on that chunk, then move on to the next chunk and repeat.
>
> Each row in the dataframe corresponds to a flattened image, with two ID 
> columns. Each feather file contains the set of images corresponding to a 
> single measure.
>
> I want to run a series of collect(arrowdataset[, c("ID1", "ID2", "V1", "V2")])
>
> However, the load time seems very slow (10+ minutes), and I'm wondering 
> what I've done wrong. I've tested on hosts with SSDs.
>
> I can see a saving from making ID1 part of the partitioning instead of 
> storing it with the data, but that sounds like a minor change.
>
> Any thoughts on what I've missed?
