On Tuesday 17 June 2008, Glenn wrote:
> Francesc Alted <faltet <at> pytables.com> writes:
> > > On Monday 16 June 2008, Glenn wrote:
> > > Hello,
> > > I am storing 400000 rows to an EArray as follows:
> > > if grp.__contains__('normI'):
> > >     fh.removeNode(grp, 'normI')
> > > fh.createEArray(grp, 'normI', Float32Atom(), (0, 512),
> > >                 expectedrows=800000)
> > >
> > > ... populate 400000 rows of normI array ...
> > >
> > > When I use it as follows:
> > > tmp = np.asarray(grp.normI[:, k])  # grab the k'th column of the EArray
> > > tmp = SomeCalculation(tmp)  # this is very fast
> > > grp.SomeCArray[:, k] = tmp  # this is also very fast, but I am only
> > >                             # storing ~100 values, so I'm not sure if it
> > >                             # actually has good performance or not
> > >
> > >
> > > it is horribly slow: the np.asarray call takes ~30 seconds. That
> > > is only 32 Kbyte/s if just the 400000*4 bytes are being read, as
> > > they should be, but 16 Mbyte/s if all 512*4*400000 bytes are being
> > > read and then sliced. When I check the disk read performance, I
> > > see that it is indeed reading continuously at around 16 Mbyte/s.
> > > Am I doing something wrong?
> >
> > Mmmm, I think your message above is missing some information. Could
> > you please double-check exactly which statement is showing the slow
> > performance?
> >
> > Also, I'm not sure why you are using the:
> >
> > tmp = np.asarray(grp.normI[:,k])
> >
> > idiom. Isn't:
> >
> > tmp = grp.normI[:,k]
> >
> > enough? Not that the asarray() call would be slowing things down;
> > I'm just curious.
> >
> > Cheers,
>
> I tried both with and without asarray, same result.
> My loop looks like this:
>
> tic = time.time()
> for k in range(512):
>     print "start", k, time.time() - tic
>     tmp = np.asarray(grp.normI[:, k])
>     print "get asarray", k, time.time() - tic
>     tmp = MyCalculation(tmp)
>     print "calc", k, time.time() - tic
>     grp.tpavI[:, k] = tmp
>     print "store", k, time.time() - tic
>
> I clearly see that the only part taking any time is reading the data
> in, because none of the other print statements show any increase in
> time. If this is the correct way of accessing the EArrays, and it
> should be fast, then I will see if I can make a simple example that
> reproduces the effect.
Ah, I think I see now what is going on: you are reading complete
*columns* of your on-disk data with normI[:,k]. As HDF5 is a C library,
the data is saved, by default, ordered by rows rather than by columns,
so accessing it the 'wrong' way leads to this sort of effect. For
example, in:
In [43]: ea = fh.createEArray('/','normI',tables.Float32Atom(), (0,512),
expectedrows=800000)
the EArray has a chunkshape of:
In [45]: ea.chunkshape
Out[45]: (32, 512)
The HDF5 library treats chunks as atomic objects -- disk I/O is always
done in terms of complete chunks -- which means that, if you access,
say, element [5, 6] of ea, the complete [:32, :] chunk has to be read
from disk. So, if you want to read just one column, you will end up
reading the complete EArray (and this is what you are seeing).
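To put numbers on that, here is a back-of-the-envelope sketch (plain
Python, no HDF5 needed) using the sizes from this thread -- 400000 rows
of 512 Float32 values and the default chunkshape of (32, 512):

```python
import math

nrows, ncols = 400000, 512
itemsize = 4                  # Float32 is 4 bytes
chunkshape = (32, 512)        # the default chunkshape shown above

# Each chunk spans all 512 columns, so every chunk in the file contains
# a piece of column k; a single column read must fetch them all.
nchunks = math.ceil(nrows / chunkshape[0])
bytes_read = nchunks * chunkshape[0] * chunkshape[1] * itemsize
useful_bytes = nrows * itemsize    # the one column actually wanted

print(bytes_read)     # prints 819200000 -> ~800 MB, the entire array
print(useful_bytes)   # prints 1600000   -> only ~1.6 MB was needed
```

That is a ~500x read amplification, which matches the 16 Mbyte/s of
sustained disk reads you observed for what should be a small request.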
If you want to keep accessing your EArrays by columns, you may want to
pass a user-defined chunkshape:
In [50]: ea2 = fh.createEArray('/','normI2',tables.Float32Atom(),
(0,512), chunkshape=(512*32,1))
In [51]: ea2.chunkshape
Out[51]: (16384, 1)
So, in that case, you will read just the information you are interested
in, saving I/O time.
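Repeating the same arithmetic with the column-friendly chunkshape shows
the difference -- each chunk now spans many rows of a single column, so
a column read only touches that column's own chunks:

```python
import math

nrows, itemsize = 400000, 4   # 400000 rows of Float32, as above
chunkshape = (16384, 1)       # the column-friendly chunkshape

# Only the chunks belonging to column k have to be read.
nchunks = math.ceil(nrows / chunkshape[0])
bytes_read = nchunks * chunkshape[0] * chunkshape[1] * itemsize

print(bytes_read)   # prints 1638400 -> ~1.6 MB per column, not ~800 MB
```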
Hope that helps,
--
Francesc Alted
Freelance developer
Tel +34-964-282-249
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users