Re: [Pytables-users] PyTables and Multiprocessing

2013-07-12 Thread Anthony Scopatz
On Fri, Jul 12, 2013 at 1:51 AM, Mathieu Dubois  wrote:

>  Hi Anthony,
>
> Thank you very much for your answer (it works). I will try to remodel my
> code around this trick, but I'm not sure it's possible because I use a
> framework that needs arrays.
>

I think that this method still works.  You can always pull a numpy array
out of the file in a subprocess and send it back to the main process.
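
For instance, a rough sketch (assuming the same test.hdf5 file as in the
example below), where each worker opens its own handle and returns a plain,
picklable numpy array:

import tables
import multiprocessing

def pull_column(args):
    filename, column = args
    # Open a fresh file handle inside the worker process ...
    h5file = tables.openFile(filename, mode='r')
    # ... and return an ordinary in-memory numpy array, which pickles fine.
    data = h5file.root.X[:, column]
    h5file.close()
    return data

p = multiprocessing.Pool(2)
columns = p.map(pull_column, [('test.hdf5', i) for i in range(3)])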


> Can somebody explain what is going on? I was thinking that PyTables keeps
> a weakref to the file for lazy loading, but I'm not sure.
>
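> For what it's worth, the pickling failure can be reproduced in isolation
> (a minimal sketch, assuming the test.hdf5 file from below):
>
> import pickle
> import tables
>
> h5file = tables.openFile('test.hdf5', mode='r')
> X = h5file.root.X
> # Pickling the node walks its attributes and hits an internal weakref:
> pickle.dumps(X)  # PicklingError: Can't pickle <type 'weakref'>
> h5file.close()
>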
> In any case, the PyTables community is very helpful.
>

Glad to help!

Be Well
Anthony


>
> Thanks,
> Mathieu
>
> On 12/07/2013 00:44, Anthony Scopatz wrote:
>
> Hi Mathieu,
>
>  I think you should try opening a new file handle per process.  The
> following works for me on v3.0:
>
> import tables
> import random
> import multiprocessing
>
> # Use multiprocessing to perform a simple computation (the average of a
> # random column), opening a new file handle inside each worker process.
>
> def f(filename):
>     h5file = tables.openFile(filename, mode='r')
>     name = multiprocessing.current_process().name
>     column = random.randint(0, 9)  # X has 10 columns, indices 0-9
>     print '%s uses column %i' % (name, column)
>     rtn = h5file.root.X[:, column].mean()
>     h5file.close()
>     return rtn
>
> p = multiprocessing.Pool(2)
> col_mean = p.map(f, ['test.hdf5', 'test.hdf5', 'test.hdf5'])
>
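> (One caveat: with multiprocessing start methods that spawn rather than
> fork -- on Windows, for example -- the Pool creation should sit under a
> main guard, roughly:)
>
> if __name__ == '__main__':
>     p = multiprocessing.Pool(2)
>     col_mean = p.map(f, ['test.hdf5'] * 3)
>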
>  Be well
> Anthony
>
>
> On Thu, Jul 11, 2013 at 3:43 PM, Mathieu Dubois <
> duboismathieu_g...@yahoo.fr> wrote:
>
>>  On 11/07/2013 21:56, Anthony Scopatz wrote:
>>
>>
>>
>>
>> On Thu, Jul 11, 2013 at 2:49 PM, Mathieu Dubois <
>> duboismathieu_g...@yahoo.fr> wrote:
>>
>>> Hello,
>>>
>>> I wanted to use PyTables in conjunction with multiprocessing for some
>>> embarrassingly parallel tasks.
>>>
>>> However, it seems that it is not possible. In the following (very
>>> stupid) example, X is a CArray of size (100, 10) stored in the file
>>> test.hdf5:
>>>
>>> import tables
>>> import random
>>> import multiprocessing
>>>
>>> # Reload the data
>>> h5file = tables.openFile('test.hdf5', mode='r')
>>> X = h5file.root.X
>>> n_features = X.shape[1]
>>>
>>> # Use multiprocessing to perform a simple computation (column average)
>>> def f(X):
>>>     name = multiprocessing.current_process().name
>>>     column = random.randint(0, n_features - 1)
>>>     print '%s uses column %i' % (name, column)
>>>     return X[:, column].mean()
>>>
>>> p = multiprocessing.Pool(2)
>>> col_mean = p.map(f, [X, X, X])
>>>
>>> When executing it, I get the following error:
>>>
>>> Exception in thread Thread-2:
>>> Traceback (most recent call last):
>>>   File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
>>>     self.run()
>>>   File "/usr/lib/python2.7/threading.py", line 504, in run
>>>     self.__target(*self.__args, **self.__kwargs)
>>>   File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in
>>> _handle_tasks
>>>     put(task)
>>> PicklingError: Can't pickle <type 'weakref'>: attribute lookup
>>> __builtin__.weakref failed
>>>
>>>
>>> I have googled for weakref and pickle but can't find a solution.
>>>
>>> Any help?
>>>
>>
>>  Hello Mathieu,
>>
>>  I have used multiprocessing and files opened in read mode many times, so
>> I am not sure what is going on here.
>>
>>  Thanks for your answer. Maybe you can point me to a working example?
>>
>>
>>   Could you provide the test.hdf5 file so that we can try to reproduce
>> this?
>>
>>  Here is the script that I have used to generate the data:
>>
>> import tables
>> import numpy
>>
>> # Create data & store it
>> n_features = 10
>> n_obs = 100
>> X = numpy.random.rand(n_obs, n_features)
>>
>> h5file = tables.openFile('test.hdf5', mode='w')
>> Xatom = tables.Atom.from_dtype(X.dtype)
>> Xhdf5 = h5file.createCArray(h5file.root, 'X', Xatom, X.shape)
>> Xhdf5[:] = X
>> h5file.close()
>>
>>
>> I hope it's not a stupid mistake. I am using PyTables 2.3.1 on Ubuntu
>> 12.04 (libhdf5 is 1.8.4patch1).
>>
>>
>>
>>
>>> By the way, I have noticed that by slicing a CArray, I get a numpy array
>>> (I created the HDF5 file with numpy). Therefore, everything I slice is
>>> copied to memory. Is there a way to avoid that?
>>>
>>
>>  Only the slice that you ask for is brought into memory, and it is
>> returned as a non-view numpy array.
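>>
>>  For example, a minimal sketch against your test.hdf5:
>>
>> import tables
>>
>> h5file = tables.openFile('test.hdf5', mode='r')
>> col = h5file.root.X[:, 3]  # only this slice is read from disk
>> print type(col)            # <type 'numpy.ndarray'> -- an in-memory copy
>> h5file.close()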
>>
>>  OK. I will be careful about that.
>>
>>
>>
>>  Be Well
>> Anthony
>>
>>
>>>
>>> Mathieu
>>>
>>>

Re: [Pytables-users] HDF5/PyTables/NumPy Question

2013-07-12 Thread Anthony Scopatz
Hi Robert,

Glad these materials can be helpful.  (Note: these questions really should
be asked on the pytables-users mailing list -- CC'd here -- so please join
that list: https://lists.sourceforge.net/lists/listinfo/pytables-users)

On Fri, Jul 12, 2013 at 12:48 PM, Robert Nelson <
rrnel...@atmos.colostate.edu> wrote:

> Dr. Scopatz,
>
> I came across your SciPy 2012 "HDF5 is for lovers" video and thought you
> might be able to help me.
>
> I'm trying to read large (>1 GB) HDF files and do multidimensional
> indexing (with repeated values) on them. I saw a post of yours from over a
> year ago saying that the best solution would be to convert it to a NumPy
> array, but this takes too long.
>

I think that the strategy is the same as before.  To the best of my
recollection, the original asker did not open an issue, and so no changes
have been made to PyTables to handle this.

Also, in this strategy you should only be loading the indices to start
with.  I doubt (though I could be wrong) that you have 1 GB worth of index
data alone.  The whole idea here is to do a unique (set) and a sort
operation on the much smaller index data AND THEN use fancy indexing to
pull the actual data back out.
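
A rough sketch of that strategy (the node names index and data, and the
file name big.hdf5, are hypothetical placeholders; fancy indexing an Array
node with an integer array assumes a recent enough PyTables -- otherwise a
row-by-row loop over the unique indices does the same job):

import numpy as np
import tables

f = tables.openFile('big.hdf5', mode='r')

# 1. Load only the (small) index data into memory.
idx = f.root.index[:]

# 2. Unique + sort in one step: np.unique returns sorted unique values.
uniq = np.unique(idx)

# 3. Fancy indexing pulls just the referenced rows off disk.
subset = f.root.data[uniq, :]

# 4. Expand back out to the original (repeated) index order in memory.
result = subset[np.searchsorted(uniq, idx)]

f.close()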

As always, some sample code and a sample file would be extremely helpful.
I don't think I can do much more for you without them.

Be Well
Anthony


> Have there been any updates in PyTables that would make this possible?
>
> Thank you!
>
> Robert Nelson
> Colorado State University
> rob.r.nel...@gmail.com
>  763-354-8411
>