Thanks for all the suggestions, I'm working through them.

One problem I've discovered relates to creating the dataset in the
first place. For the parquet version I'm using the approach described here:

 https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf

in response to an earlier question. However, I'm finding that the R session's
RAM usage grows steadily through the loop, which it shouldn't. I suspect I
should be forcing a flush to disk at regular intervals, but I can't see how
to achieve that from R. There's no obvious reason why the RAM use needs to
increase with iterations.
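
For concreteness, here is a minimal sketch of the loop pattern I mean,
assuming each iteration's data frame is written to its own Parquet file in
the dataset directory (the chunk-building function and paths are
placeholders, not the actual code from the gist):

library(arrow)

out_dir <- "dataset_parquet"            # placeholder output directory
dir.create(out_dir, showWarnings = FALSE)

for (i in seq_len(86)) {
  chunk <- make_chunk(i)                # placeholder: builds one measure's data frame
  write_parquet(chunk, file.path(out_dir, sprintf("part-%03d.parquet", i)))
  rm(chunk)
  gc()                                  # ask R to release the chunk before the next iteration
}

Since each chunk is written to its own file and then removed, I'd expect
memory use to stay roughly flat across iterations; the fact that it doesn't
suggests something is being retained elsewhere.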

On Sat, Sep 30, 2023 at 1:23 AM Aldrin <octalene....@pm.me> wrote:

> How many rows are you including in a batch? You might want to try
> smaller row batches, since your columns are so wide.
>
> The other thing you can try, instead of switching to Parquet, is testing
> files with progressively more columns. If the width of your tables is the
> problem, then you'll be able to see at what point it becomes a problem and
> what the peak load time is, so you know how much you may be missing out on.
>
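> For example, a rough sketch of that column-width test in R (the sizes and
> the two selected columns are just illustrative):
>
> library(arrow)
>
> widths <- c(1000, 2000, 4000, 8000, 16000, 32000)
> timings <- sapply(widths, function(nc) {
>   # 500 rows to mirror the real files, progressively more columns
>   df <- as.data.frame(matrix(rnorm(500 * nc), nrow = 500))
>   f <- tempfile(fileext = ".feather")
>   write_feather(df, f)
>   elapsed <- system.time(read_feather(f, col_select = c("V1", "V2")))[["elapsed"]]
>   unlink(f)
>   elapsed
> })
> print(data.frame(width = widths, seconds = timings))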
>
>
> On Thu, Sep 28, 2023 at 22:34, Bryce Mecum <bryceme...@gmail.com> wrote:
>
> Hi Richard,
>
> I tried to reproduce [1] something akin to what you describe and I
> also see worse-than-expected performance. I did find a GitHub Issue
> [2] describing performance issues with wide record batches which might
> be relevant here, though I'm not sure.
>
> Have you tried the same kind of workflow but with Parquet as your
> on-disk format instead of Feather?
>
> [1] https://gist.github.com/amoeba/38591e99bd8682b60779021ac57f146b
> [2] https://github.com/apache/arrow/issues/16270
>
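> In case it's useful, a rough sketch of one way to do that conversion with
> the R arrow package (the directory paths are placeholders):
>
> library(arrow)
>
> # Read the existing Feather files as a dataset and rewrite them as Parquet.
> feather_ds <- open_dataset("feather_dir", format = "feather")
> write_dataset(feather_ds, "parquet_dir", format = "parquet")
>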
> On Wed, Sep 27, 2023 at 10:14 PM Richard Beare <richard.be...@gmail.com>
> wrote:
> >
> > Hi,
> > I have created (from R) an Arrow dataset consisting of 86 Feather
> files. Each of them is 895M, with about 500 rows and 32000 columns. The
> natural structure of the complete dataframe is an 86*500-row dataframe.
> >
> > My aim is to load a chunk consisting of all rows and a subset of columns
> (the two ID columns + 100 other columns), do some manipulation and
> modelling on that chunk, then move to the next chunk and repeat.
> >
> > Each row in the dataframe corresponds to a flattened image, with two ID
> columns. Each feather file contains the set of images corresponding to a
> single measure.
> >
> > I want to run a series of calls like collect(arrowdataset[, c("ID1", "ID2",
> "V1", "V2")]).
> >
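> > Roughly, something like the following with the dataset API (the path and
> > chunk size are just examples):
> >
> > library(arrow)
> > library(dplyr)
> >
> > ds <- open_dataset("feather_dir", format = "feather")  # placeholder path
> > value_cols <- setdiff(names(ds), c("ID1", "ID2"))
> > col_chunks <- split(value_cols, ceiling(seq_along(value_cols) / 100))
> >
> > for (cols in col_chunks) {
> >   chunk <- ds %>% select(all_of(c("ID1", "ID2", cols))) %>% collect()
> >   # ... manipulation and modelling on `chunk` ...
> > }
> >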
> > However, the load time seems very slow (10+ minutes), and I'm wondering
> what I've done wrong. I've tested on hosts with SSDs.
> >
> > I can see a possible saving if ID1 becomes part of the partitioning instead
> of being stored with the data, but that seems like a minor change.
> >
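> > If it matters, that change would presumably be made when writing the
> > dataset, along the lines of (the path here is a placeholder):
> >
> > write_dataset(ds, "partitioned_dir", format = "feather", partitioning = "ID1")
> >
> > rather than any change on the query side.
> >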
> > Any thoughts on what I've missed?
>
>
