Hi Richard, I tried to reproduce [1] something akin to what you describe, and I also see worse-than-expected performance. I did find a GitHub issue [2] describing performance problems with wide record batches, which might be relevant here, though I'm not sure.
Have you tried the same kind of workflow but with Parquet as your on-disk format instead of Feather? A rough sketch of what I mean is below the quoted message.

[1] https://gist.github.com/amoeba/38591e99bd8682b60779021ac57f146b
[2] https://github.com/apache/arrow/issues/16270

On Wed, Sep 27, 2023 at 10:14 PM Richard Beare <richard.be...@gmail.com> wrote:
>
> Hi,
> I have created (from R) an arrow dataset consisting of 86 files (Feather).
> Each of them is 895M, with about 500 rows and 32000 columns. The natural
> structure of the complete dataframe is an 86*500-row dataframe.
>
> My aim is to load a chunk consisting of all rows and a subset of columns (two
> ID columns + 100 other columns), do some manipulation and modelling on that
> chunk, then move to the next chunk and repeat.
>
> Each row in the dataframe corresponds to a flattened image, with two ID
> columns. Each Feather file contains the set of images corresponding to a
> single measure.
>
> I want to run a series of calls like
> collect(arrowdataset[, c("ID1", "ID2", "V1", "V2")])
>
> However the load time seems very slow (10+ minutes), and I'm wondering what
> I've done wrong. I've tested on hosts with SSD.
>
> I can see a saving in which ID1 becomes part of the partitioning instead of
> being stored with the data, but that sounds like a minor change.
>
> Any thoughts on what I've missed?
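
Here is the rough, untested sketch I mentioned above, assuming the Feather files all live in one directory and that the column names follow the ID1/ID2/V1/V2... pattern from your example (the paths are placeholders, adjust everything to match your data):

library(arrow)
library(dplyr)

# Open the existing Feather files as a dataset and rewrite them as Parquet.
# This is a one-time conversion; Parquet stores data in column chunks, which
# can make narrow column projections over very wide tables cheaper to read.
feather_ds <- open_dataset("path/to/feather_dir", format = "feather")
write_dataset(feather_ds, "path/to/parquet_dir", format = "parquet")

# Query the Parquet copy, pulling only the ID columns plus a block of value
# columns; collect() materializes just that projection in memory.
parquet_ds <- open_dataset("path/to/parquet_dir")
chunk <- parquet_ds |>
  select(all_of(c("ID1", "ID2", paste0("V", 1:100)))) |>
  collect()

I haven't benchmarked this on data as wide as yours, so treat it as an experiment rather than a fix.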