Re: Pyarrow Plasma client.release() fault

2018-07-20 Thread Philipp Moritz
Also you should avoid calling release directly, because it will also be called automatically here: https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx#L222 Instead, you should call "del buffer" on the PlasmaBuffer. I'll submit a PR to make the release method private. The only
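
The release-on-deallocation behavior described above can be illustrated without Plasma itself. This is a minimal sketch using stand-in classes: `FakeClient` and `FakeBuffer` are hypothetical names for illustration only and are not pyarrow APIs. It assumes CPython, where dropping the last reference to an object runs its finalizer immediately.

```python
class FakeClient:
    """Hypothetical stand-in for a store client; records release calls."""

    def __init__(self):
        self.released = []

    def release(self, object_id):
        self.released.append(object_id)


class FakeBuffer:
    """Hypothetical stand-in for a buffer that releases itself when freed."""

    def __init__(self, client, object_id):
        self._client = client
        self._object_id = object_id

    def __del__(self):
        # Runs when the last reference goes away (immediately in CPython),
        # so the caller never needs to call release() directly.
        self._client.release(self._object_id)


client = FakeClient()
buf = FakeBuffer(client, b"object-1")
del buf  # triggers the release exactly once
print(client.released)
```

Calling `release()` by hand in addition to `del` would release the same object twice in this sketch, which is the kind of double release the message warns about.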

Re: Pyarrow Plasma client.release() fault

2018-07-20 Thread Robert Nishihara
Hi Corey, It is possible that the current eviction policy will evict a ton of objects at once. Since the plasma store is single threaded, this could cause the plasma store to be unresponsive while the eviction is happening (though it should not hang permanently, just temporarily). You could

Re: Pyarrow Plasma client.release() fault

2018-07-20 Thread Corey Nolet
Robert, Yes I am using separate Plasma clients in each different thread. I also verified that I am not using up all the file descriptors or reaching the overcommit limit. I do see that the Plasma server is evicting objects every so often. I'm assuming this eviction may be going on in the

Re: Pyarrow Plasma client.release() fault

2018-07-16 Thread Wes McKinney
Seems like we might want to write down some best practices for this level of large scale usage, essentially a supercomputer-like rig. I wouldn't even know where to come by a machine with > 2TB memory for scalability / concurrency load testing. On Mon, Jul 16, 2018 at 2:59 PM, Robert

Re: Pyarrow Plasma client.release() fault

2018-07-16 Thread Robert Nishihara
Are you using the same plasma client from all of the different threads? If so, that could cause race conditions as the client is not thread safe. Alternatively, if you have a separate plasma client for each thread, then you may be running out of file descriptors somewhere (either the client
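
To check the file descriptor possibility from inside a worker process, the current count of open descriptors can be compared with the process limit. This is a rough sketch using only the Python standard library; `fd_usage` is a hypothetical helper name, and the open-descriptor count assumes a Linux-style `/proc/self/fd` directory.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    open_fds is None when /proc/self/fd is unavailable (non-Linux systems).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    fd_dir = "/proc/self/fd"
    if os.path.isdir(fd_dir):
        # The listing itself briefly holds one descriptor, so the count
        # may be one higher than the steady-state value.
        open_fds = len(os.listdir(fd_dir))
    else:
        open_fds = None
    return open_fds, soft, hard


if __name__ == "__main__":
    print(fd_usage())
```

This only covers the calling process; the store process has its own limit and would need to be inspected separately.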

Re: Pyarrow Plasma client.release() fault

2018-07-10 Thread Corey Nolet
Update: I'm investigating the possibility that I've reached the overcommit limit in the kernel as a result of all the parallel processes. This still doesn't fix the client.release() problem but it might explain why the processing appears to halt, after some time, until I restart the Jupyter
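
On Linux, the kernel's commit accounting is exposed in `/proc/meminfo` as `CommitLimit` and `Committed_AS` (both in kB), so the overcommit hypothesis can be checked roughly as sketched below. `parse_commit_stats` and `read_commit_stats` are hypothetical helper names. Note that `CommitLimit` is only strictly enforced when `vm.overcommit_memory` is set to 2.

```python
def parse_commit_stats(meminfo_text):
    """Extract CommitLimit and Committed_AS (in kB) from /proc/meminfo text."""
    stats = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("CommitLimit", "Committed_AS"):
            # Lines look like "CommitLimit:    1234 kB".
            stats[key] = int(rest.split()[0])
    return stats


def read_commit_stats(path="/proc/meminfo"):
    """Read and parse the commit counters from the given file (Linux only)."""
    with open(path) as f:
        return parse_commit_stats(f.read())
```

If `Committed_AS` is close to `CommitLimit` while the worker processes are running, new allocations may start failing, which would be consistent with processing stalling.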

Re: Pyarrow Plasma client.release() fault

2018-07-10 Thread Corey Nolet
Wes, Unfortunately, my code is on a separate network. I'll try to explain what I'm doing and if you need further detail, I can certainly pseudocode specifics. I am using multiprocessing.Pool() to fire up a bunch of threads for different filenames. In each thread, I'm performing a pd.read_csv(),

Re: Pyarrow Plasma client.release() fault

2018-07-10 Thread Wes McKinney
hi Corey, Can you provide the code (or a simplified version thereof) that shows how you're using Plasma? - Wes On Tue, Jul 10, 2018 at 11:45 AM, Corey Nolet wrote: > I'm on a system with 12TB of memory and attempting to use Pyarrow's Plasma > client to convert a series of CSV files (via

Pyarrow Plasma client.release() fault

2018-07-10 Thread Corey Nolet
I'm on a system with 12TB of memory and attempting to use Pyarrow's Plasma client to convert a series of CSV files (via Pandas) into a Parquet store. I've got a little over 20k CSV files to process, which are about 1-2 GB each. I'm loading 500 to 1000 files at a time. In each iteration, I'm