I've worked through my problem a bit further - as mentioned earlier in the
thread, I think there are memory leaks in the R interface around parquet
files, so I'm sticking with feather.

I've moved to a long structure instead - previously I was hoping to load a
column set, then pivot_longer inside R and do the necessary analysis. Now
I'm saving only 4 columns: 2 ID columns, an Index column (integer), and the
Value column. My aim now is to load batches of indexes, say 1000 at a time.
I have tested this using semi_join - I create an arrow_table containing the
indexes I want to select:

library(arrow)
library(dplyr)

# Open the feather (IPC) dataset and build a one-column table of the
# target indexes
arrH <- arrow::open_dataset(TargetDir, format = "arrow")
these <- as_arrow_table(these, schema = schema(field("Index", int32())))

# Keep only the rows whose Index appears in "these", then pull into R
block <- semi_join(arrH, these, by = "Index")
block <- collect(block)

Elapsed time for the collect operation is 490 seconds, and does not vary
much whether "these" contains 5 or 100 entries. The target directory is on
a spinning disk. During this operation the host load climbs to about 6,
while the rsession load sits at a low level (~10%). Using an NFS-mounted
volume appears to kill the machine (at least it becomes unresponsive).
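
For comparison, the equivalent filter()-based version of the same batch
selection would look something like the sketch below. I haven't timed it, and
"these_ids" is assumed to be a plain integer vector of the batch indexes
rather than the arrow_table above. My understanding is that a filter() on the
scan can be pushed down to the file reads, whereas semi_join runs as a join:

# Same batch selection expressed as a filter (untested sketch)
block2 <- filter(arrH, Index %in% these_ids)
block2 <- collect(block2)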

There are two levels of ID in a crossed structure: 85 in the first level
and about 500 in the second. There is one file per first-level ID.

At the moment it isn't fast enough to be useful, so what am I missing? Is
there a way to add an index (SQL style) to critical columns? Am I thinking
about this wrong?
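
In case partitioning can stand in for an index, the rewrite I have in mind
looks roughly like the sketch below - untested, and the bucket size (1000
indexes per partition), the IndexBucket column and the output directory name
are all placeholders:

library(arrow)
library(dplyr)

# Rewrite the long data with a bucketed copy of Index as a hive-style
# partition, so that a batch of indexes only needs to touch a few files.
# (Assumes integer division works in mutate() on a dataset query.)
ds <- open_dataset(TargetDir, format = "arrow")
ds <- mutate(ds, IndexBucket = Index %/% 1000L)
write_dataset(ds, "TargetDirPartitioned", format = "feather",
              partitioning = "IndexBucket")

# Reads would then filter on the partition column as well as Index, e.g.
# prt <- open_dataset("TargetDirPartitioned", format = "arrow")
# prt <- filter(prt, IndexBucket %in% (these_ids %/% 1000L),
#               Index %in% these_ids)
# block <- collect(prt)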




On Tue, Oct 3, 2023 at 9:27 PM Richard Beare <richard.be...@gmail.com>
wrote:

> I'm collating the problems I observe while testing this.
>
> 1) Probably memory leak in parquet writer workflow.
>
> My workflow that creates my dataset is pretty simple - it loops through a
> set of 500 images, loads each one, and extracts a subset based on a mask.
> The resulting vector becomes a row in the table. When using the approach at
> https://gist.github.com/thisisnic/f73196d490a9f661269b5403292dddc3, but
> including a gc() call in my loop, memory usage reported by top sits at 3.2%
> on my test machine. However, if I use the version for parquet files (
> https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf),
> there is clearly a leak of some sort - after 160 of 500 images the RAM use
> grows to 17% (5.3 GB), and continues to grow.
>
> 2) Probably excess RAM use loading the dataset - I can open individual
> parquet files, but opening the dataset and then doing
> collect(select(parquetdataset, ID, V1, V2, V3, V4)) crashes the R session,
> most likely due to running out of RAM (unconfirmed).
>
> 3) I returned to feather format and only saved the first 100 data columns +
> ID columns - the number of rows remains the same as before. Now
> collect(select(arrowdataset, ID1, ID2, V1, V2)) takes 3.6 seconds,
> while collect(select(arrowdataset, ID1, ID2, starts_with("V"))) takes 7.3
> seconds - i.e. much faster than extracting the same columns from the much
> wider dataset.
>
> It looks like I should try pivoting the data to a long format before
> saving and see if I can analyse it that way. I was previously fetching a
> set of columns and pivoting/nesting anyway, so perhaps it isn't a bad thing.
>
>
> On Sun, Oct 1, 2023 at 9:03 PM Richard Beare <richard.be...@gmail.com>
> wrote:
>
>> Thanks for all the suggestions, I'm working through them.
>>
>> One problem I've discovered relates to creating the dataset in the
>> first place. For the parquet version I'm using the approach described here:
>>
>>  https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf
>>
>> in response to an earlier question. However, I'm finding that the R RAM
>> usage steadily grows through the loop, which it shouldn't. I suspect I
>> should be forcing a flush to disk at regular intervals, but I can't see how
>> to achieve that from R. There's no obvious reason why the RAM use needs to
>> increase with iterations.
>>
>> On Sat, Sep 30, 2023 at 1:23 AM Aldrin <octalene....@pm.me> wrote:
>>
>>> How many rows are you including in a batch? You might want to try with
>>> smaller row batches since your columns are so wide.
>>>
>>> The other thing you can try, instead of Parquet, is testing files with
>>> progressively more columns. If the width of your tables is the problem,
>>> then you'll be able to see when it becomes a problem and what the peak load
>>> time is, to know how much you may be missing out on.
>>>
>>>
>>> On Thu, Sep 28, 2023 at 22:34, Bryce Mecum <bryceme...@gmail.com> wrote:
>>>
>>> Hi Richard,
>>>
>>> I tried to reproduce [1] something akin to what you describe and I
>>> also see worse-than-expected performance. I did find a GitHub Issue
>>> [2] describing performance issues with wide record batches which might
>>> be relevant here, though I'm not sure.
>>>
>>> Have you tried the same kind of workflow but with Parquet as your
>>> on-disk format instead of Feather?
>>>
>>> [1] https://gist.github.com/amoeba/38591e99bd8682b60779021ac57f146b
>>> [2] https://github.com/apache/arrow/issues/16270
>>>
>>> On Wed, Sep 27, 2023 at 10:14 PM Richard Beare <richard.be...@gmail.com>
>>> wrote:
>>> >
>>> > Hi,
>>> > I have created (from R) an arrow dataset consisting of 86 files
>>> (feather). Each of them is 895 MB, with about 500 rows and 32000 columns.
>>> The natural structure of the complete dataframe is an 86*500 row dataframe.
>>> >
>>> > My aim is to load a chunk consisting of all rows and a subset of
>>> columns (two ID columns + 100 other columns), I'll do some manipulation and
>>> modelling on that chunk, then move to the next and repeat.
>>> >
>>> > Each row in the dataframe corresponds to a flattened image, with two
>>> ID columns. Each feather file contains the set of images corresponding to a
>>> single measure.
>>> >
>>> > I want to run a series of collect(arrowdataset[, c("ID1", "ID2", "V1",
>>> "V2")])
>>> >
>>> > However the load time seems very slow (10 minutes+), and I'm wondering
>>> what I've done wrong. I've tested on hosts with SSD.
>>> >
>>> > I can see a saving in which ID1 becomes part of the partitioning
>>> instead of storing it with the data, but that sounds like a minor change.
>>> >
>>> > Any thoughts on what I've missed?
>>>
>>>
