On Tuesday 17 June 2008, Glenn wrote:
> Francesc Alted <faltet <at> pytables.com> writes:
> > On Monday 16 June 2008, Glenn wrote:
> > > Hello,
> > > I am storing 400000 rows to an EArray as follows:
> > > if grp.__contains__('normI'):
> > >   fh.removeNode(grp,'normI')
> > > fh.createEArray(grp,'normI',Float32Atom(), (0,512),
> > > expectedrows=800000)
> > >
> > > ... populate 400000 rows of normI array ...
> > >
> > > When I use it as follows:
> > > tmp = np.asarray(grp.normI[:,k])  # Grab the k'th column of the EArray
> > > tmp = SomeCalculation(tmp)  # this is very fast
> > > grp.SomeCArray[:,k] = tmp  # this is also very fast, but I am only
> > >                            # storing ~100 values, so I'm not sure if
> > >                            # it actually has good performance or not
> > >
> > >
> > >  it is horribly slow, the np.asarray call takes ~30 seconds,
> > > which is only 32Kbyte/s if only 400000*4 bytes are being read as
> > > it should be, but 16Mbyte/s if all 512*4*400000 are being read,
> > > and then sliced. When I check the disk read performance, I see
> > > that indeed it is reading continuously at around 16 Mbyte/s. Am I
> > > doing something wrong?
> >
> > Mmmm, I think your message above is missing some information.  Could
> > you please double-check exactly which statement is showing the slow
> > performance?
> >
> > Also, I'm not sure why you are using the:
> >
> > tmp = np.asarray(grp.normI[:,k])
> >
> > idiom.  Isn't:
> >
> > tmp = grp.normI[:,k]
> >
> > enough?  Not that the asarray() call would be slowing things down;
> > I'm just curious.
> >
> > Cheers,
>
> I tried both with and without asarray, same result.
> My loop looks like this:
>
> tic = time.time()
> for k in range(512):
>   print "start", k, time.time()-tic
>   tmp = np.asarray(grp.normI[:,k])
>   print "get asarray", k, time.time() - tic
>   tmp = MyCalculation(tmp)
>   print "calc", k, time.time() - tic
>   grp.tpavI[:,k] = tmp
>   print "store", k, time.time() - tic
>
> I clearly see that the only part that is taking up any time is
> reading the data in, because all of the other print statements do not
> show any increase in time. If this is the correct way of accessing
> the EArrays, and should be fast, then I will see if I can make a
> simple example to reproduce the effect.

Ah, I think I see now what is going on: you are reading complete
*columns* of your data on disk, normI[:,k].  As HDF5 is a C library, the
data is saved, by default, ordered by rows instead of columns.  So,
accessing it the 'wrong' way leads to this sort of effect.  For
example, in:

In [43]: ea = fh.createEArray('/','normI',tables.Float32Atom(), (0,512), 
expectedrows=800000)

the EArray has a chunkshape of:

In [45]: ea.chunkshape
Out[45]: (32, 512)

The HDF5 library treats chunks as atomic objects -- disk I/O is always
in terms of complete chunks -- which means that, if you are accessing,
say, element [5, 6] of ea, you will have to read the complete
[:32, :] chunk from disk.  So, if you want to read just 1 column, you
will end up reading the complete EArray (and this is what you are seeing).
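
To make the chunk arithmetic above concrete, here is a small sketch in
plain Python (no PyTables needed).  The helper names chunk_slices and
chunks_for_column are made up for this illustration and are not part of
PyTables or HDF5; the sketch also assumes that every chunk touched is
read from disk in full, with no help from a chunk cache.

```python
def chunk_slices(index, chunkshape):
    """Return the slices of the chunk that contains the element at `index`.

    Hypothetical helper: it only mimics how a regular chunk grid maps an
    element to its enclosing chunk.
    """
    return tuple(
        slice((i // c) * c, (i // c + 1) * c)
        for i, c in zip(index, chunkshape)
    )


def chunks_for_column(nrows, chunkshape):
    """Number of chunks touched when reading one complete column."""
    # Ceiling division: one chunk per block of chunkshape[0] rows.
    return -(-nrows // chunkshape[0])


# Element [5, 6] lives in the chunk covering rows 0:32 and columns 0:512.
print(chunk_slices((5, 6), (32, 512)))

# One column of a 400000-row array crosses every block of 32 rows.  Each
# chunk spans all 512 columns, so these are all the chunks in the array.
print(chunks_for_column(400000, (32, 512)))
```

With a (32, 512) chunkshape and 400000 rows, the column read touches
12500 chunks, which is every chunk the array has.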

If you want to keep accessing your EArrays by columns, you may want to
pass a user-defined chunkshape:

In [50]: ea2 = fh.createEArray('/','normI2',tables.Float32Atom(), 
(0,512), chunkshape=(512*32,1))

In [51]: ea2.chunkshape
Out[51]: (16384, 1)

In that case, you will read just the data you are interested in,
which saves I/O time.
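
As a rough estimate of the difference, the following sketch compares
how many bytes have to be read for one column under the two
chunkshapes.  The function bytes_read_for_column is a made-up helper
for this illustration, not a PyTables call, and it assumes 4-byte
Float32 items, 400000 rows, whole-chunk reads and no chunk cache hits.

```python
def bytes_read_for_column(nrows, chunkshape, itemsize=4):
    """Bytes read to fetch one full column, assuming whole-chunk I/O."""
    chunks = -(-nrows // chunkshape[0])  # ceiling division over the rows
    chunk_bytes = chunkshape[0] * chunkshape[1] * itemsize
    return chunks * chunk_bytes


default_layout = bytes_read_for_column(400000, (32, 512))
column_layout = bytes_read_for_column(400000, (16384, 1))

print(default_layout)                   # bytes read with the default chunkshape
print(column_layout)                    # bytes read with chunkshape (16384, 1)
print(default_layout // column_layout)  # ratio between the two
```

Under these assumptions the default layout reads 819200000 bytes per
column (the whole array) and the column-oriented layout reads 1638400
bytes, a factor of 500 less.  The flip side is that every row then
spans 512 different chunks, so it is probably best to append rows in
large blocks rather than one at a time.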

Hope that helps,

-- 
Francesc Alted
Freelance developer
Tel +34-964-282-249

_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users