I'm on a system with 12TB of memory and attempting to use Pyarrow's Plasma
client to convert a series of CSV files (via Pandas) into a Parquet store.

I've got a little over 20k CSV files to process, each about 1-2 GB.
I'm loading 500 to 1000 files at a time.

In each iteration, I'm loading a series of files, partitioning them by a
time field into separate dataframes, then writing parquet files in
directories for each day.
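To make the per-iteration step concrete, here is a minimal sketch of what I mean, with Plasma left out entirely. The function names, the "timestamp" column name, and the "batch-N.parquet" file naming are placeholders for illustration, not my actual code, and the index=False argument assumes a reasonably recent pandas with pyarrow installed as the Parquet engine.

```python
import os

import pandas as pd


def partition_by_day(frames, time_column="timestamp"):
    """Concatenate the loaded frames and split them into one frame per calendar day.

    `time_column` is a placeholder name for the time field.
    """
    combined = pd.concat(frames, ignore_index=True)
    # Reduce each timestamp to its calendar date so rows group by day.
    days = pd.to_datetime(combined[time_column]).dt.date
    return {str(day): part for day, part in combined.groupby(days)}


def write_daily_parquet(partitions, root_dir, batch_id):
    """Write each day's frame to <root_dir>/<YYYY-MM-DD>/batch-<batch_id>.parquet."""
    for day, part in partitions.items():
        day_dir = os.path.join(root_dir, day)
        os.makedirs(day_dir, exist_ok=True)
        # Hypothetical file naming; one file per batch keeps iterations from
        # overwriting each other's output for the same day.
        part.to_parquet(os.path.join(day_dir, f"batch-{batch_id}.parquet"), index=False)
```

Each iteration would then call partition_by_day() on the frames read from that batch of CSV files and pass the result to write_daily_parquet().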

The problem I'm having is that the Plasma client and server appear to lock
up after about 2-3 iterations, to the point where I can't even CTRL+C the
server. I am able to stop the notebook, but re-running the code just locks
up again when interacting with Jupyter. There are no errors in my logs to
tell me something's wrong.

To make sure I wasn't just being impatient and needing to wait for some
background services to finish, I let the code run overnight, and it was
still in the same state when I came in to work this morning. I'm running
the Plasma server with a 4TB memory limit.

In an attempt to proactively free up some of the object IDs that I no
longer need, I also tried using the client.release() function, but I
cannot figure out how to make it work properly. It crashes my Jupyter
kernel each time I try.

I'm using Pyarrow 0.9.0.

Thanks in advance.
