Re: [Pytables-users] Writing to CArray

2013-03-11 Thread Anthony Scopatz
On Sun, Mar 10, 2013 at 8:47 PM, Tim Burgess timburg...@mac.com wrote:


 On 08/03/2013, at 2:51 AM, Anthony Scopatz wrote:

  Hey Tim,
 
  Awesome dataset! And neat image!
 
  As per your request, a couple of minor things I noticed: you probably don't
 need to do the sanity check each time (great for debugging, but not always
 needed); you are using masked arrays, which, while sometimes convenient, are
 generally slower than creating an array and a mask and applying the mask to
 the array; and you seem to be downcasting from float64 to float32 for some
 reason that I am not entirely clear on (size, speed?).
 
  To the more major question of write performance, one thing that you
 could try is compression.  You might want to do some timing studies to find
 the best compressor and level. Performance here can vary a lot based on how
 similar your data is (and how close similar data is to each other).  If you
 have got a bunch of zeros and only a few real data points, even zlib 1 is
 going to be blazing fast compared to writing all those zeros out explicitly.
 
  Another thing you could try doing is switching to EArray and using the
 append() method.  This might save PyTables, numpy, hdf5, etc. from having to
 check that the shape of sst_node[qual_indices] is actually the same as
 the data you are giving it.  Additionally, dumping a block of memory to the
 file directly (via append()) is generally faster than having to resolve
 fancy indexes (which are notoriously the slow part of even numpy).
 
  Lastly, as a general comment, you seem to be doing a lot of stuff in the
 innermost loop -- including writing to disk.  I would look at how you
 could restructure this to move as much as possible out of this loop.  Your
 data seems to be about 12 GB for a year, so this is probably too big to
 build up the full sst array completely in memory prior to writing.  That
 is, unless you have a computer much bigger than my laptop ;).  But issuing
 one fat write command is probably going to be faster than making 365 of
 them.
 
  Happy hacking!
  Be Well
  Anthony
 


 Thanks Anthony for being so responsive and touching on a number of points.

 The netCDF library gives me a masked array so I have to explicitly
 transform that into a regular numpy array.


Ahh interesting.  Depending on the netCDF version the file was made with,
you should be able to read the file directly from PyTables.  You could thus
directly get a normal numpy array.  This *should* be possible, but I have
never tried it ;)
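
For reference, a minimal sketch of that masked-array-to-plain-array step,
assuming python-netCDF4 (the file name and 'sea_surface_temperature' variable
name below are placeholders and may not match the real Pathfinder files):

    import numpy as np
    from netCDF4 import Dataset

    ds = Dataset('pathfinder_day.nc')  # placeholder daily file
    # python-netCDF4 returns a numpy.ma.MaskedArray by default
    masked = ds.variables['sea_surface_temperature'][0]
    # Fill masked cells with NaN and drop the mask, leaving a plain
    # float32 ndarray for everything downstream.
    sst = np.ma.filled(masked, np.nan).astype(np.float32)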


  I've looked under the covers and have seen that the numpy.ma masked array
 implementation is all pure Python, so there is a performance drawback.
 I'm not up to speed yet on where the numpy.na masking implementation is
 (I've just started a new job here).

 I tried to do an implementation in memory (except for the final write) and
 found that I have about 2GB of indices when I extract the quality indices.
 Simply using those indices, memory usage grows to over 64GB and I
 eventually run out of memory and start churning away in swap.

  For the moment, I have pulled down the latest git master and am using the
 new in-memory HDF feature. This seems to give me better performance and is
 code-wise pretty simple, so for now it's good enough.
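
For anyone following along, the in-memory driver Tim mentions looks roughly
like this (sketch only; it assumes the new open_file/driver keywords in git
master, so the exact names may still shift):

    import tables

    # Build the whole file in RAM via the HDF5 CORE driver; with the default
    # backing store enabled, the in-memory image is written out to disk in
    # one large write when the file is closed.
    h5f = tables.open_file('PF52-2011-inmem.h5', 'w', driver='H5FD_CORE')
    # ... create and fill arrays as usual ...
    h5f.close()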


Awesome! I am glad that this is working for you.


 Cheers and thanks again, Tim

 BTW I viewed your SciPy tutorial. Good stuff!


Thanks!








Re: [Pytables-users] Writing to CArray

2013-03-11 Thread Tim Burgess
  The netCDF library gives me a masked array so I have to explicitly
  transform that into a regular numpy array.

 Ahh interesting. Depending on the netCDF version the file was made with,
 you should be able to read the file directly from PyTables. You could thus
 directly get a normal numpy array. This *should* be possible, but I have
 never tried it ;)

I think the netCDF3 functionality has been taken out, or at least deprecated
(https://github.com/PyTables/PyTables/issues/68). Using the python-netCDF4
module allows me to pull from pretty much any netCDF file, and the inherent
masking is sometimes very useful where the dataset is smaller and I can live
with the lower performance of masks.

  I've looked under the covers and have seen that the numpy.ma masked array
  implementation is all pure Python, so there is a performance drawback. I'm
  not up to speed yet on where the numpy.na masking implementation is (I've
  just started a new job here).

  I tried to do an implementation in memory (except for the final write) and
  found that I have about 2GB of indices when I extract the quality indices.
  Simply using those indices, memory usage grows to over 64GB and I
  eventually run out of memory and start churning away in swap.

  For the moment, I have pulled down the latest git master and am using the
  new in-memory HDF feature. This seems to give me better performance and is
  code-wise pretty simple, so for now it's good enough.

 Awesome! I am glad that this is working for you.

Yes - appears to work great!


Re: [Pytables-users] Writing to CArray

2013-03-07 Thread Anthony Scopatz
Hey Tim,

Awesome dataset! And neat image!

As per your request, a couple of minor things I noticed: you probably don't
need to do the sanity check each time (great for debugging, but not always
needed); you are using masked arrays, which, while sometimes convenient, are
generally slower than creating an array and a mask and applying the mask to
the array; and you seem to be downcasting from float64 to float32 for some
reason that I am not entirely clear on (size, speed?).

To the more major question of write performance, one thing that you could
try is compression
(http://pytables.github.com/usersguide/optimization.html#compression-issues).
You might want to do some timing studies to find the best compressor and
level. Performance here can vary a lot based on how similar your data is
(and how close similar data is to each other).  If you have got a bunch of
zeros and only a few real data points, even zlib 1 is going to be blazing
fast compared to writing all those zeros out explicitly.
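
A quick (untested) timing loop along these lines would tell you a lot; the
compressor list, levels, and random stand-in data below are just placeholders:

    import time
    import numpy as np
    import tables

    day = np.random.rand(4320, 8640).astype(np.float32)  # stand-in for one day of SST

    for complib in ('zlib', 'blosc'):
        for complevel in (1, 5, 9):
            filters = tables.Filters(complevel=complevel, complib=complib)
            with tables.openFile('bench.h5', 'w') as h5f:
                atom = tables.Float32Atom(dflt=np.nan)
                sst = h5f.createCArray(h5f.root, 'sst', atom, (1, 4320, 8640),
                                       chunkshape=(1, 1080, 240), filters=filters)
                t0 = time.time()
                sst[0, :, :] = day
                print '%s level %d: %.2f s' % (complib, complevel, time.time() - t0)

Random data is close to incompressible, so swap in a real day slice to get
meaningful numbers.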

Another thing you could try doing is switching to EArray and using the
append() method.  This might save PyTables, numpy, hdf5, etc. from having to
check that the shape of sst_node[qual_indices] is actually the same as
the data you are giving it.  Additionally, dumping a block of memory to the
file directly (via append()) is generally faster than having to resolve
fancy indexes (which are notoriously the slow part of even numpy).
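
Roughly, the EArray version would look like this (sketch; read_day() is a
made-up helper standing in for your netCDF read, and h5f/files are as in
your script):

    import numpy as np
    import tables

    # Extendable first (time) dimension; append() lays each day down as a
    # contiguous block instead of resolving per-pixel fancy indices.
    atom = tables.Float32Atom(dflt=np.nan)
    sst_node = h5f.createEArray(h5f.root, 'sst', atom, (0, 4320, 8640),
                                chunkshape=(1, 1080, 240), expectedrows=365)

    for filename in files:
        day_sst = read_day(filename)               # made-up helper -> (4320, 8640) float32
        sst_node.append(day_sst[np.newaxis, ...])  # append one (1, 4320, 8640) block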

Lastly, as a general comment, you seem to be doing a lot of stuff in the
innermost loop -- including writing to disk.  I would look at how you
could restructure this to move as much as possible out of this loop.  Your
data seems to be about 12 GB for a year, so this is probably too big to
build up the full sst array completely in memory prior to writing.  That
is, unless you have a computer much bigger than my laptop ;).  But issuing
one fat write command is probably going to be faster than making 365 of
them.
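
A possible middle ground is to batch the writes, say a month at a time
(sketch; read_day() is again a made-up per-day reader, sst_node is your
CArray, and the batch size is arbitrary -- roughly 4.5 GB of buffer at
float32):

    import numpy as np

    BATCH = 30  # days held in RAM per write
    buf = np.empty((BATCH, 4320, 8640), dtype=np.float32)

    for start in range(0, 365, BATCH):
        stop = min(start + BATCH, 365)
        for day in range(start, stop):
            buf[day - start] = read_day(day)       # made-up per-day reader
        sst_node[start:stop] = buf[:stop - start]  # one fat write per batch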

Happy hacking!
Be Well
Anthony


On Wed, Mar 6, 2013 at 11:25 PM, Tim Burgess timburg...@mac.com wrote:

 I'm producing a large chunked HDF5 using CArray and want to clarify that
 the performance I'm getting is what would normally be expected.

 The source data is a large annual satellite dataset - 365 days x 4320
 latitude by 8640 longitude of 32-bit floats. I'm only interested in pixels
 of a certain quality so I am iterating over the source data (which is in
 daily files) and then determining the indices of all quality pixels in that
 day. There are usually about 2 million quality pixels in a day.

 I then set the equivalent CArray locations to the value of the quality
 pixels. As you can see in the code below, the source numpy array is a 1 x
 4320 x 8640. So for addressing the CArray, I simply take the first index
 and set it to the current day to map indices to the 365 x 4320 x 8640
 CArray.

 I've tried a couple of different chunkshapes. As I will be reading the HDF
 sequentially day by day and, as the data comes from a polar orbit, I'm using
 a 1 x 1080 x 240 chunk to try to optimize for chunks that will have no
 data (and therefore reduce the total filesize). You can see an image of an
 example day at


 http://data.nodc.noaa.gov/pathfinder/Version5.2/browse_images/2011/sea_surface_temperature/20110101001735-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA19_G_2011001_night-v02.0-fv01.0-sea_surface_temperature.png


 To produce a day takes about 2.5 minutes on a Linux (Ubuntu 12.04) machine
 with two SSDs in RAID 0. The system has 64GB of RAM but I don't think
 memory is a constraint here.
 Looking at a profile, most of that 2.5 minutes is spent in _g_writeCoords
 in tables.hdf5Extension.Array

 Here's the pertinent code:

 for year in range(2011, 2012):

     # create dataset and add global attrs
     annualfile_path = '%sPF4km/V5.2/hdf/annual/PF52-%d-c1080x240-test.h5' % (crwdir, year)
     print 'Creating ' + annualfile_path

     with tables.openFile(annualfile_path, 'w', title=('Pathfinder V5.2 %d' % year)) as h5f:

         # write lat lons
         lat_node = h5f.createArray('/', 'lat', lats, title='latitude')
         lon_node = h5f.createArray('/', 'lon', lons, title='longitude')

         # glob all the region summaries in a year
         files = [glob.glob('%sPF4km/V5.2/%d/*night*' % (crwdir, year))[0]]
         print 'Found %d days' % len(files)
         files.sort()

         # create a 365 x 4320 x 8640 array
         shape = (NUMDAYS, 4320, 8640)
         atom = tables.Float32Atom(dflt=np.nan)
         # we chunk into daily slices and then further chunk days
         sst_node = h5f.createCArray(h5f.root, 'sst', atom, shape,
                                     chunkshape=(1, 1080, 240))

         for filename in files:

             # get day
             day = int(filename[-25:-22])
             print 'Processing %d day %d' % (year, day)

             ds = Dataset(filename)
             kelvin64 =