Robert,

Yes I am using separate Plasma clients in each different thread. I also
verified that I am not using up all the file descriptors or reaching the
overcommit limit.

I do see that the Plasma server is evicting objects every so often. I'm
assuming this eviction may be going on in the background? Is it possible
that the locking up may be the result of a massive eviction? I am
allocating over 8TB for the Plasma server.

Wes,

Best practices would be great. I did find that the @ray.remote scheduler
from the Ray project has drastically simplified my code.

I also attempted using single-node PySpark but the type conversion I need
for going from CSV->Dataframes was orders of magnitude slower than Pandas
and Python.



On Mon, Jul 16, 2018 at 8:17 PM Wes McKinney <wesmck...@gmail.com> wrote:

> Seems like we might want to write down some best practices for this
> level of large scale usage, essentially a supercomputer-like rig. I
> wouldn't even know where to come by a machine with a machine with >
> 2TB memory for scalability / concurrency load testing
>
> On Mon, Jul 16, 2018 at 2:59 PM, Robert Nishihara
> <robertnishih...@gmail.com> wrote:
> > Are you using the same plasma client from all of the different threads?
> If
> > so, that could cause race conditions as the client is not thread safe.
> >
> > Alternatively, if you have a separate plasma client for each thread, then
> > you may be running out of file descriptors somewhere (either the client
> > process or the store).
> >
> > Can you check if the object store evicting objects (it prints something
> to
> > stdout/stderr when this happens)? Could you be running out of memory but
> > failing to release the objects?
> >
> > On Tue, Jul 10, 2018 at 9:48 AM Corey Nolet <cjno...@gmail.com> wrote:
> >
> >> Update:
> >>
> >> I'm investigating the possibility that I've reached the overcommit
> limit in
> >> the kernel as a result of all the parallel processes.
> >>
> >> This still doesn't fix the client.release() problem but it might explain
> >> why the processing appears to halt, after some time, until I restart the
> >> Jupyter kernel.
> >>
> >> On Tue, Jul 10, 2018 at 12:27 PM Corey Nolet <cjno...@gmail.com> wrote:
> >>
> >> > Wes,
> >> >
> >> > Unfortunately, my code is on a separate network. I'll try to explain
> what
> >> > I'm doing and if you need further detail, I can certainly pseudocode
> >> > specifics.
> >> >
> >> > I am using multiprocessing.Pool() to fire up a bunch of threads for
> >> > different filenames. In each thread, I'm performing a pd.read_csv(),
> >> > sorting by the timestamp field (rounded to the day) and chunking the
> >> > Dataframe into separate Dataframes. I create a new Plasma ObjectID for
> >> each
> >> > of the chunked Dataframes, convert them to RecordBuffer objects,
> stream
> >> the
> >> > bytes to Plasma and seal the objects. Only the objectIDs are returned
> to
> >> > the orchestration thread.
> >> >
> >> > In follow-on processing, I'm combining the ObjectIDs for each of the
> >> > unique day timestamps into lists and I'm passing those into a
> function in
> >> > parallel using multiprocessing.Pool(). In this function, I'm iterating
> >> > through the lists of objectIds, loading them back into Dataframes,
> >> > appending them together until their size
> >> > is > some predefined threshold, and performing a df.to_parquet().
> >> >
> >> > The steps in the 2 paragraphs above are performing in a loop,
> batching up
> >> > 500-1k files at a time for each iteration.
> >> >
> >> > When I run this iteration a few times, it eventually locks up the
> Plasma
> >> > client. With regards to the release() fault, it doesn't seem to matter
> >> when
> >> > or where I run it (in the orchestration thread or in other threads),
> it
> >> > always seems to crash the Jupyter kernel. I'm thinking I might be
> using
> >> it
> >> > wrong, I'm just trying to figure out where and what I'm doing.
> >> >
> >> > Thanks again!
> >> >
> >> > On Tue, Jul 10, 2018 at 12:05 PM Wes McKinney <wesmck...@gmail.com>
> >> wrote:
> >> >
> >> >> hi Corey,
> >> >>
> >> >> Can you provide the code (or a simplified version thereof) that shows
> >> >> how you're using Plasma?
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Tue, Jul 10, 2018 at 11:45 AM, Corey Nolet <cjno...@gmail.com>
> >> wrote:
> >> >> > I'm on a system with 12TB of memory and attempting to use Pyarrow's
> >> >> Plasma
> >> >> > client to convert a series of CSV files (via Pandas) into a Parquet
> >> >> store.
> >> >> >
> >> >> > I've got a little over 20k CSV files to process which are about
> 1-2gb
> >> >> each.
> >> >> > I'm loading 500 to 1000 files at a time.
> >> >> >
> >> >> > In each iteration, I'm loading a series of files, partitioning them
> >> by a
> >> >> > time field into separate dataframes, then writing parquet files in
> >> >> > directories for each day.
> >> >> >
> >> >> > The problem I'm having is that the Plasma client & server appear to
> >> >> lock up
> >> >> > after about 2-3 iterations. It locks up to the point where I can't
> >> even
> >> >> > CTRL+C the server. I am able to stop the notebook and re-trying the
> >> code
> >> >> > just continues to lock up when interacting with Jupyter. There are
> no
> >> >> > errors in my logs to tell me something's wrong.
> >> >> >
> >> >> > Just to make sure I'm not just being impatient and possibly need to
> >> wait
> >> >> > for some background services to finish, I allowed the code to run
> >> >> overnight
> >> >> > and it was still in the same state when I came in to work this
> >> morning.
> >> >> I'm
> >> >> > running the Plasma server with 4TB max.
> >> >> >
> >> >> > In an attempt to pro-actively free up some of the object ids that
> I no
> >> >> > longer need, I also attempted to use the client.release() function
> >> but I
> >> >> > cannot seem to figure out how to make this work properly. It
> crashes
> >> my
> >> >> > Jupyter kernel each time I try.
> >> >> >
> >> >> > I'm using Pyarrow 0.9.0
> >> >> >
> >> >> > Thanks in advance.
> >> >>
> >> >
> >>
>

Reply via email to