Try writing smaller chunks. I usually try to size my parquet files at 128 MB to match our Hadoop filesystem block size. Within each 128 MB parquet file I usually have around 6 to 10 row groups, which are basically 6 to 10 mini parquet files of 12 to 20 MB each.
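One way to get there with PyArrow is to cap the row count per row group on each write. A rough, untested sketch (the schema, generator and ROWS_PER_GROUP below are made-up placeholders; tune the row count so each row group lands around 12 to 20 MB for your own schema):

import pyarrow as pa
import pyarrow.parquet as pq

# Toy schema and generator, just to keep the sketch self-contained.
arrow_schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

def batches_of_tables():
    # Placeholder: in practice, yield pyarrow.Table chunks from your real source.
    for start in range(0, 2_000_000, 500_000):
        ids = list(range(start, start + 500_000))
        yield pa.table({"id": ids, "value": [float(i) for i in ids]}, schema=arrow_schema)

ROWS_PER_GROUP = 500_000  # placeholder: pick a value that gives ~12-20 MB row groups

with pq.ParquetWriter("out.parquet", arrow_schema, compression="snappy") as writer:
    for table in batches_of_tables():
        # Cap each row group so a single write never buffers the whole dataset.
        writer.write_table(table, row_group_size=ROWS_PER_GROUP)

Note that each write_table call flushes at least one row group, so the size of the chunks you feed the writer also bounds how much is held in memory at once.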
Parquet files are compressed and dictionary encoded, so writing a 3 GB parquet file as a single row group will take up 10 GB of memory.

-----Original Message-----
From: Niklas B <niklas.biv...@enplore.com>
Sent: Monday, September 21, 2020 4:59 AM
To: dev@arrow.apache.org; emkornfi...@gmail.com
Subject: Re: PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger than memory parquet files)

Hi,

I've tried both with little success. I made a JIRA: https://issues.apache.org/jira/browse/ARROW-10052

Looking at it now that I've made a minimal example, I see something I didn't realize before: while the memory usage is increasing, it doesn't appear to be linear in the amount of data written. This possibly indicates (I guess) that it isn't actually storing the written dataset, but something else. I'll keep digging; sorry it isn't as clear as I would have wanted. In the real world we see writing a 3 GB parquet file exhaust 10 GB of memory when writing incrementally.

Regards,
Niklas

> On 20 Sep 2020, at 06:07, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Niklas,
> Two suggestions:
> * Try adjusting row_group_size on write_table [1] to a smaller than default
>   value. If I read the code correctly this is currently 64 million rows [2],
>   which seems potentially too high as a default (I'll open a JIRA about this).
> * If this is on Linux/Mac, try setting the jemalloc decay, which can return
>   memory to the OS more quickly [3].
>
> Just to confirm: is this a local disk (not a blob store?) that you are
> writing to?
>
> If you can produce a minimal example that still seems to hold onto all
> memory after trying these two items, please open a JIRA, as there could
> be a bug or some unexpected buffering happening.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter.write_table
> [2] https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156
> [3] https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156
>
> On Tue, Sep 15, 2020 at 8:46 AM Niklas B <niklas.biv...@enplore.com> wrote:
>
>> First of all: Thank you so much for all the hard work on Arrow, it's an
>> awesome project.
>>
>> Hi,
>>
>> I'm trying to write a large parquet file onto disk (larger than memory)
>> using PyArrow's ParquetWriter and write_table, but even though the file
>> is written incrementally to disk, it still appears to keep the entire
>> dataset in memory (eventually getting OOM killed). Basically what I am
>> trying to do is:
>>
>> with pq.ParquetWriter(
>>     output_file,
>>     arrow_schema,
>>     compression='snappy',
>>     allow_truncated_timestamps=True,
>>     version='2.0',  # Highest available format version
>>     data_page_version='2.0',  # Highest available data page version
>> ) as writer:
>>     for rows_dataframe in function_that_yields_data():
>>         writer.write_table(
>>             pa.Table.from_pydict(
>>                 rows_dataframe,
>>                 arrow_schema
>>             )
>>         )
>>
>> where I have a function that yields data and then write it in chunks
>> using write_table.
>>
>> Is it possible to force the ParquetWriter to not keep the entire dataset
>> in memory, or is it simply not possible, for good reasons?
>>
>> I'm streaming data from a database and writing it to Parquet. The end
>> consumer has plenty of RAM, but the machine that does the conversion
>> doesn't.
>>
>> Regards,
>> Niklas
>>
>> PS: I've also created a Stack Overflow question, which I will update with
>> any answer I might get from the mailing list:
>> https://stackoverflow.com/questions/63891231/pyarrow-incrementally-using-parquetwriter-without-keeping-entire-dataset-in-mem
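For reference, a minimal sketch of Micah's two suggestions applied to the loop in the quoted code above. It is untested; output_file, arrow_schema and function_that_yields_data are the names from the quoted snippet, and the row_group_size value is a made-up starting point, not a recommendation:

import pyarrow as pa
import pyarrow.parquet as pq

# On Linux/macOS, ask jemalloc to release freed memory back to the OS
# immediately instead of holding it with the default decay.
pa.jemalloc_set_decay_ms(0)

with pq.ParquetWriter(output_file, arrow_schema, compression='snappy') as writer:
    for rows_dataframe in function_that_yields_data():
        table = pa.Table.from_pydict(rows_dataframe, arrow_schema)
        # Bound the row group size explicitly instead of relying on the very
        # large default, so each flush buffers less data in the writer.
        writer.write_table(table, row_group_size=100_000)  # placeholder value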