Wes,

Unfortunately, my code is on a separate network. I'll try to explain what
I'm doing, and if you need further detail, I can certainly pseudocode the
specifics.

I am using multiprocessing.Pool() to fire up a bunch of worker processes
for different filenames. In each worker, I perform a pd.read_csv(), sort by
the timestamp field (rounded to the day), and chunk the DataFrame into
separate per-day DataFrames. I create a new Plasma ObjectID for each of the
chunked DataFrames, convert them to Arrow RecordBatch objects, stream the
bytes to Plasma, and seal the objects. Only the ObjectIDs are returned to
the orchestration process.
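The per-file chunking step above can be sketched with pandas alone (the Plasma create/seal calls are omitted here, since they need a running plasma_store; the function and column names are hypothetical, not from my actual code):

```python
import pandas as pd

def chunk_by_day(df, ts_col="timestamp"):
    """Split a DataFrame into one DataFrame per calendar day.

    `ts_col` is a placeholder column name; round the timestamp down
    to the day and group on it.
    """
    day = pd.to_datetime(df[ts_col]).dt.floor("D")
    return {d: g.reset_index(drop=True) for d, g in df.groupby(day)}

# Toy example: three rows spanning two days -> two chunks
df = pd.DataFrame({
    "timestamp": ["2018-07-09 03:00", "2018-07-09 18:30", "2018-07-10 01:15"],
    "value": [1, 2, 3],
})
chunks = chunk_by_day(df)
```

In the real pipeline, each value in `chunks` would then be serialized to a RecordBatch and written to Plasma under a fresh ObjectID.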

In follow-on processing, I combine the ObjectIDs for each unique day into
lists and pass those into a function in parallel, again using
multiprocessing.Pool(). In this function, I iterate through the lists of
ObjectIDs, load them back into DataFrames, and append them together until
their combined size exceeds a predefined threshold, then perform a
df.to_parquet().
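The append-until-threshold loop looks roughly like this (a minimal sketch: `write_in_batches` is a hypothetical name, the Plasma gets are assumed to have already produced the DataFrames, and the df.to_parquet() call is passed in as a callback so the sketch stands alone):

```python
import pandas as pd

def write_in_batches(frames, threshold_bytes, write):
    """Accumulate DataFrames until the running batch exceeds
    `threshold_bytes` of in-memory size, then hand the concatenated
    batch to `write` (e.g. a wrapper around df.to_parquet) and
    start a new batch."""
    batch, size = [], 0
    for df in frames:
        batch.append(df)
        size += df.memory_usage(deep=True).sum()
        if size > threshold_bytes:
            write(pd.concat(batch, ignore_index=True))
            batch, size = [], 0
    if batch:  # flush whatever is left over
        write(pd.concat(batch, ignore_index=True))

# Toy example: with a 1-byte threshold every frame flushes on its own
batches = []
frames = [pd.DataFrame({"x": range(100)}) for _ in range(5)]
write_in_batches(frames, threshold_bytes=1, write=batches.append)
```

In my case `write` would write a parquet file into the directory for that day.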

The steps in the two paragraphs above are performed in a loop, batching up
500-1,000 files per iteration.

After a few iterations, the Plasma client eventually locks up. Regarding
the release() fault: it doesn't seem to matter when or where I call it (in
the orchestration process or in the workers), it always crashes the Jupyter
kernel. I suspect I'm using it incorrectly; I'm just trying to figure out
where and how.

Thanks again!

On Tue, Jul 10, 2018 at 12:05 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Corey,
>
> Can you provide the code (or a simplified version thereof) that shows
> how you're using Plasma?
>
> - Wes
>
> On Tue, Jul 10, 2018 at 11:45 AM, Corey Nolet <cjno...@gmail.com> wrote:
> > I'm on a system with 12TB of memory and attempting to use Pyarrow's
> Plasma
> > client to convert a series of CSV files (via Pandas) into a Parquet
> store.
> >
> > I've got a little over 20k CSV files to process which are about 1-2gb
> each.
> > I'm loading 500 to 1000 files at a time.
> >
> > In each iteration, I'm loading a series of files, partitioning them by a
> > time field into separate dataframes, then writing parquet files in
> > directories for each day.
> >
> > The problem I'm having is that the Plasma client & server appear to lock
> up
> > after about 2-3 iterations. It locks up to the point where I can't even
> > CTRL+C the server. I am able to stop the notebook and re-trying the code
> > just continues to lock up when interacting with Jupyter. There are no
> > errors in my logs to tell me something's wrong.
> >
> > Just to make sure I'm not just being impatient and possibly need to wait
> > for some background services to finish, I allowed the code to run
> overnight
> > and it was still in the same state when I came in to work this morning.
> I'm
> > running the Plasma server with 4TB max.
> >
> > In an attempt to pro-actively free up some of the object ids that I no
> > longer need, I also attempted to use the client.release() function but I
> > cannot seem to figure out how to make this work properly. It crashes my
> > Jupyter kernel each time I try.
> >
> > I'm using Pyarrow 0.9.0
> >
> > Thanks in advance.
>
