Re: [Pytables-users] Writing to CArray
On Sun, Mar 10, 2013 at 8:47 PM, Tim Burgess timburg...@mac.com wrote:

> On 08/03/2013, at 2:51 AM, Anthony Scopatz wrote:
>
>> Hey Tim,
>>
>> Awesome dataset! And neat image!
>>
>> As per your request, a couple of minor things I noticed: you probably don't need to do the sanity check every time (great for debugging, but not always needed); you are using masked arrays, which, while sometimes convenient, are generally slower than creating an array and a mask and applying the mask to the array; and you seem to be downcasting from float64 to float32 for some reason that I am not entirely clear on (size? speed?).
>>
>> To the more major question of write performance, one thing you could try is compression. You might want to do some timing studies to find the best compressor and level. Performance here can vary a lot based on how similar your data is (and how close similar data is to each other). If you have a bunch of zeros and only a few real data points, even zlib level 1 is going to be blazing fast compared to writing all those zeros out explicitly.
>>
>> Another thing you could try is switching to EArray and using the append() method. This might save PyTables, numpy, HDF5, etc. from having to check that the shape of sst_node[qual_indices] actually matches the data you are giving it. Additionally, dumping a block of memory to the file directly (via append()) is generally faster than resolving fancy indexes (which are notoriously the slow part of even numpy).
>>
>> Lastly, as a general comment, you seem to be doing a lot of stuff in the innermost loop -- including writing to disk. I would look at how you could restructure this to move as much as possible out of this loop. Your data seems to be about 12 GB for a year, so it is probably too big to build up the full sst array completely in memory prior to writing -- unless you have a computer much bigger than my laptop ;). But issuing one fat write command is probably going to be faster than making 365 of them.
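A minimal sketch of the EArray-plus-compression approach described above, using the modern snake_case PyTables API (the grid dimensions here are toy stand-ins for the real 365 x 4320 x 8640 dataset, and the file path is made up):

```python
import os
import tempfile

import numpy as np
import tables

# Toy dimensions standing in for the real 4320 x 8640 grid.
NLAT, NLON = 4, 8

path = os.path.join(tempfile.mkdtemp(), 'sst_earray.h5')
filters = tables.Filters(complevel=1, complib='zlib')  # cheap, fast on sparse data

with tables.open_file(path, 'w') as h5f:
    atom = tables.Float32Atom(dflt=np.nan)
    # First dimension is extendable (0); each append() adds one day.
    sst = h5f.create_earray(h5f.root, 'sst', atom, shape=(0, NLAT, NLON),
                            filters=filters, chunkshape=(1, NLAT, NLON))
    for day in range(3):
        day_slab = np.full((1, NLAT, NLON), np.nan, dtype=np.float32)
        day_slab[0, 0, 0] = 290.0 + day  # pretend quality pixel
        sst.append(day_slab)  # one contiguous write, no fancy indexing

with tables.open_file(path, 'r') as h5f:
    written = h5f.root.sst[:]
```

Each append() hands HDF5 a whole day-sized slab at once, so there is no per-element index resolution, and the zlib filter collapses the mostly-NaN chunks.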
>> Happy hacking!
>>
>> Be Well
>> Anthony
>
> Thanks Anthony for being so responsive and touching on a number of points.
>
> The netCDF library gives me a masked array, so I have to explicitly transform that into a regular numpy array.

Ahh, interesting. Depending on the netCDF version the file was made with, you should be able to read the file directly from PyTables. You could thus directly get a normal numpy array. This *should* be possible, but I have never tried it ;)

> I've looked under the covers and have seen that the ma masked implementation is all pure Python, so there is a performance drawback. I'm not up to speed yet on where the numpy.na masking implementation is (started a new job here).
>
> I tried to do an implementation in memory (except for the final write) and found that I have about 2GB of indices when I extract the quality indices. Simply using those indexes, memory usage grows to over 64GB and I eventually run out of memory and start churning away in swap.
>
> For the moment, I have pulled down the latest git master and am using the new in-memory HDF feature. This seems to give me better performance and is code-wise pretty simple, so for the moment it's good enough.

Awesome! I am glad that this is working for you.

> Cheers and thanks again,
> Tim
>
> BTW I viewed your SciPy tutorial. Good stuff!

Thanks!

___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
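The in-memory HDF feature mentioned above landed in PyTables as the H5FD_CORE driver. A small sketch of how it can be used (parameter names per PyTables 3.x; the filename is nominal since nothing is written to disk):

```python
import numpy as np
import tables

# driver='H5FD_CORE' keeps the whole file in RAM;
# driver_core_backing_store=0 means nothing is flushed to disk on close.
h5f = tables.open_file('in_memory.h5', 'w',
                       driver='H5FD_CORE', driver_core_backing_store=0)

atom = tables.Float32Atom(dflt=np.nan)  # unwritten chunks read back as NaN
sst = h5f.create_carray(h5f.root, 'sst', atom, shape=(2, 4, 8),
                        chunkshape=(1, 4, 8))
sst[0, :, :] = 1.0  # write one "day" slice entirely in memory

first = h5f.root.sst[0, 0, 0]
second = h5f.root.sst[1, 0, 0]
h5f.close()
```

All chunk bookkeeping happens in RAM, which is why it sidesteps the disk-write cost discussed in the thread; the trade-off is that the whole file must fit in memory.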
Re: [Pytables-users] Writing to CArray
>> The netCDF library gives me a masked array, so I have to explicitly transform that into a regular numpy array.
>
> Ahh interesting. Depending on the netCDF version the file was made with, you should be able to read the file directly from PyTables. You could thus directly get a normal numpy array. This *should* be possible, but I have never tried it ;)

I think the netCDF3 functionality has been taken out or at least deprecated (https://github.com/PyTables/PyTables/issues/68). Using the python-netCDF4 module allows me to pull from pretty much any netcdf file, and the inherent masking is sometimes very useful where the dataset is smaller and I can live with the lower performance of masks.

>> I've looked under the covers and have seen that the ma masked implementation is all pure Python, so there is a performance drawback. I'm not up to speed yet on where the numpy.na masking implementation is (started a new job here).
>>
>> I tried to do an implementation in memory (except for the final write) and found that I have about 2GB of indices when I extract the quality indices. Simply using those indexes, memory usage grows to over 64GB and I eventually run out of memory and start churning away in swap.
>>
>> For the moment, I have pulled down the latest git master and am using the new in-memory HDF feature. This seems to give me better performance and is code-wise pretty simple, so for the moment it's good enough.
>
> Awesome! I am glad that this is working for you.

Yes - appears to work great!
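Turning the masked array that python-netCDF4 hands back into a plain ndarray plus explicit quality indices, as discussed above, is short with `filled()` (the values here are toy stand-ins, not real SST data):

```python
import numpy as np
import numpy.ma as ma

# Stand-in for what python-netCDF4 returns: values with a fill mask.
sst_masked = ma.masked_array(
    np.array([291.2, 0.0, 289.7, 0.0], dtype=np.float32),
    mask=[False, True, False, True])

# Plain ndarray; masked slots become NaN, no ma overhead afterwards.
plain = sst_masked.filled(np.nan)

# Indices of the unmasked ("quality") pixels.
qual_indices = np.nonzero(~ma.getmaskarray(sst_masked))
```

After this conversion all subsequent arithmetic and indexing runs through plain numpy rather than the pure-Python ma layer, which is the performance drawback Tim points out.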
Re: [Pytables-users] Writing to CArray
Hey Tim,

Awesome dataset! And neat image!

As per your request, a couple of minor things I noticed: you probably don't need to do the sanity check every time (great for debugging, but not always needed); you are using masked arrays, which, while sometimes convenient, are generally slower than creating an array and a mask and applying the mask to the array; and you seem to be downcasting from float64 to float32 for some reason that I am not entirely clear on (size? speed?).

To the more major question of write performance, one thing you could try is compression (http://pytables.github.com/usersguide/optimization.html#compression-issues). You might want to do some timing studies to find the best compressor and level. Performance here can vary a lot based on how similar your data is (and how close similar data is to each other). If you have a bunch of zeros and only a few real data points, even zlib level 1 is going to be blazing fast compared to writing all those zeros out explicitly.

Another thing you could try is switching to EArray and using the append() method. This might save PyTables, numpy, HDF5, etc. from having to check that the shape of sst_node[qual_indices] actually matches the data you are giving it. Additionally, dumping a block of memory to the file directly (via append()) is generally faster than resolving fancy indexes (which are notoriously the slow part of even numpy).

Lastly, as a general comment, you seem to be doing a lot of stuff in the innermost loop -- including writing to disk. I would look at how you could restructure this to move as much as possible out of this loop. Your data seems to be about 12 GB for a year, so it is probably too big to build up the full sst array completely in memory prior to writing -- unless you have a computer much bigger than my laptop ;). But issuing one fat write command is probably going to be faster than making 365 of them.

Happy hacking!
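A rough harness for the timing study suggested above, comparing compressors on mostly-zero data (modern PyTables syntax; the shape and values are toy assumptions, and blosc is skipped if not compiled into the local build):

```python
import os
import tempfile
import time

import numpy as np
import tables

# Mostly-zero data, like an SST grid with only a few quality pixels.
data = np.zeros((8, 256, 256), dtype=np.float32)
data[:, :8, :8] = 1.0

sizes = {}
for complib in ('zlib', 'blosc'):
    if tables.which_lib_version(complib) is None:
        continue  # compressor not available in this build
    path = os.path.join(tempfile.mkdtemp(), complib + '.h5')
    start = time.time()
    with tables.open_file(path, 'w') as h5f:
        filters = tables.Filters(complevel=1, complib=complib)
        ca = h5f.create_carray(h5f.root, 'sst', tables.Float32Atom(),
                               shape=data.shape, filters=filters,
                               chunkshape=(1, 256, 256))
        ca[:] = data  # one bulk write through the compression filter
    print('%s: %.3f s, %d bytes on disk' % (complib, time.time() - start,
                                            os.path.getsize(path)))
    sizes[complib] = os.path.getsize(path)
```

On data like this, the compressed file should come out far smaller than the raw 2 MB array, which is exactly the "writing zeros explicitly vs. compressing them" trade-off described above; swapping in real data and real chunk shapes gives the numbers that matter.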
Be Well
Anthony

On Wed, Mar 6, 2013 at 11:25 PM, Tim Burgess timburg...@mac.com wrote:

I'm producing a large chunked HDF5 file using CArray and want to clarify that the performance I'm getting is what would normally be expected.

The source data is a large annual satellite dataset - 365 days x 4320 latitude x 8640 longitude of 32-bit floats. I'm only interested in pixels of a certain quality, so I iterate over the source data (which is in daily files) and determine the indices of all quality pixels in that day. There are usually about 2 million quality pixels in a day. I then set the equivalent CArray locations to the values of the quality pixels.

As you can see in the code below, the source numpy array is 1 x 4320 x 8640. So for addressing the CArray, I simply take the first index and set it to the current day, to map indices into the 365 x 4320 x 8640 CArray.

I've tried a couple of different chunkshapes. As I will be reading the HDF file sequentially day by day, and as the data comes from a polar orbit, I'm using a 1 x 1080 x 240 chunk to try to optimize for chunks that will have no data (and therefore reduce the total file size).

You can see an image of an example day at http://data.nodc.noaa.gov/pathfinder/Version5.2/browse_images/2011/sea_surface_temperature/20110101001735-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA19_G_2011001_night-v02.0-fv01.0-sea_surface_temperature.png

To produce a day takes about 2.5 minutes on a Linux (Ubuntu 12.04) machine with two SSDs in RAID 0. The system has 64GB of RAM, but I don't think memory is a constraint here.
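The index mapping described above - taking quality-pixel indices from a 1 x lat x lon daily array and shifting the first axis to the day number - can be sketched with toy dimensions (the names and sizes here are illustrative, not the real grid):

```python
import numpy as np

NUMDAYS, NLAT, NLON = 5, 6, 8
sst = np.full((NUMDAYS, NLAT, NLON), np.nan, dtype=np.float32)

# One day of source data: 1 x NLAT x NLON, mostly missing.
day_data = np.full((1, NLAT, NLON), np.nan, dtype=np.float32)
day_data[0, 2, 3] = 290.5
day_data[0, 4, 1] = 288.0

qual = np.where(~np.isnan(day_data))  # indices of the quality pixels
day = 1
# Replace the leading all-zeros index with the current day number.
target = (np.full_like(qual[0], day),) + qual[1:]
sst[target] = day_data[qual]
```

This is the fancy-indexed assignment pattern the thread discusses; it works, but each such write makes HDF5 resolve scattered coordinates, which is why bulk slab writes (or EArray.append) tend to be faster.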
Looking at a profile, most of that 2.5 minutes is spent in _g_writeCoords in tables.hdf5Extension.Array.

Here's the pertinent code:

    for year in range(2011, 2012):

        # create dataset and add global attrs
        annualfile_path = '%sPF4km/V5.2/hdf/annual/PF52-%d-c1080x240-test.h5' % (crwdir, year)
        print 'Creating ' + annualfile_path

        with tables.openFile(annualfile_path, 'w', title=('Pathfinder V5.2 %d' % year)) as h5f:

            # write lat lons
            lat_node = h5f.createArray('/', 'lat', lats, title='latitude')
            lon_node = h5f.createArray('/', 'lon', lons, title='longitude')

            # glob all the region summaries in a year
            files = [glob.glob('%sPF4km/V5.2/%d/*night*' % (crwdir, year))[0]]
            print 'Found %d days' % len(files)
            files.sort()

            # create a 365 x 4320 x 8640 array
            shape = (NUMDAYS, 4320, 8640)
            atom = tables.Float32Atom(dflt=np.nan)

            # we chunk into daily slices and then further chunk days
            sst_node = h5f.createCArray(h5f.root, 'sst', atom, shape, chunkshape=(1, 1080, 240))

            for filename in files:
                # get day
                day = int(filename[-25:-22])
                print 'Processing %d day %d' % (year, day)

                ds = Dataset(filename)
                kelvin64 =