Hi,

Here are some patches Carabos might consider for PyTables 1.4. I've developed these against PyTables 1.4alpha (20060505) to assist in experimenting with different HDF chunk sizes and read buffer strategies for analysing very large datasets. My main data set comes in 270MB monthly files with 61 fields, 244 bytes/row and 4,135,482 rows, which is 962MB uncompressed and 272MB compressed with LZO. By trying different file storage strategies (uncompressed, zlib, LZO), different HDF chunk sizes (32, 64, 128, 256, 512 rows), different read buffer sizes (1024 up to 256k rows) and different numbers of HDF chunks to read at once (4, 16, 64), I've been able to increase row read speeds from 250k rows/sec up to 900k rows/sec, and can process data at sustained rates over 500k rows/sec on my year-old Thinkpad (i.e. it's not very fast, no Core Duo, etc.). The most successful approach seems to be optimising how PyTables' read buffers and HDF file chunking interact with WinXP's disk caching, then, once the data is in memory, processing it with numarray functions to minimise the number of intermediate variables Python has to create and destroy.
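
To give an idea of what I mean by the last point, here is a rough sketch of the processing style. The file name, node path, field names and the 65536-row block size are all made up for illustration:

------ example: column-wise processing with numarray ------
import numarray
import tables

fileh = tables.openFile('200604.h5', mode='r')   # hypothetical file
table = fileh.getNode('/trades')                 # hypothetical table

total = 0.0
step = 65536                                     # rows per read
for start in xrange(0, table.nrows, step):
    # Read a block of rows into a numarray RecArray...
    buf = table.read(start, min(start + step, table.nrows))
    # ...then work on whole columns at once rather than row by row
    total += numarray.sum(buf.field('price') * buf.field('size'))
fileh.close()
print total
----------------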

The changes described below involve both Python and Pyrex files. I'm no C programmer (the last time I'd compiled a line of C code was 10 years ago), and even though I've always been too scared to build Python source distributions on Windows, it took just 2 hours to download Pyrex, Mingw32 and the sources for HDF, LZO, ZLIB and PyTables, make my code changes to PyTables' files, recompile PyTables into a binary installer, reinstall PyTables, and have the test scripts run successfully. The whole process was ridiculously easy and, to my great surprise, everything worked first time (after correcting a 'redefine ssize_t' error in HDF). Now that I know what I'm doing, it takes less than 30 seconds to recompile PyTables and reinstall. So many thanks to Greg Ewing for Pyrex and to the Carabos team for PyTables, which together make such a powerful and simple set of tools!


Cheers

Stephen

============================================
Change #1: An iterator that returns a table's data into a numarray RecArray buffer, reusing the same buffer for each call

This eliminates the duplication of creating a buffer to read in data and then copying it somewhere else to analyse. By simply passing the destination numarray array to _read_records(), the data is read straight from disk into the numarray array. The patch adds a new iterator Table.iterbuffer() to Table.py which returns the one numarray array filled with a succession of slices from the disk file. A patch is also needed so that _read_records() in TableExtension.pyx checks for a buffer offset in the destination array.

------ new function Table.iterbuffer() in Table.py ------
def iterbuffer(table, buffer_rows=32768, read_rows=4096):
    """
    High-speed iteration over a table in blocks of buffer_rows at a time.
    This initially creates a numarray RecArray buffer of length buffer_rows
    and returns it filled with new data on each iteration.
    Reading from disk seems to be faster if the buffer is filled with
    smaller chunks (perhaps it makes the OS's caching more predictable),
    so rows are read in read_rows at a time until the buffer is full or
    the data is finished.
    # SS 2006-0506
    """
    # Work out sensible values for the buffer and the sub-buffer read size
    if not buffer_rows or buffer_rows <= 0 or not read_rows or read_rows <= 0:
        raise StopIteration
    if read_rows > buffer_rows:
        read_rows = buffer_rows
    buffer = numarray.records.array(None, shape=buffer_rows,
                            formats=table.description._v_nestedFormats,
                            names=table.description._v_nestedNames)
    total_rows = table.nrows
    tpos = 0        # Current position in disk table
    bpos = 0        # Current position in memory buffer
    while 1:
        if bpos + read_rows > buffer_rows:
            # No room for another chunk, so return buffer data
            yield buffer[:bpos]
            bpos = 0
        # Read another read_rows rows of data into the buffer
        # NB: Passing the buffer offset by bpos only works if
        #   TableExtension.pyx is patched to recognise the buffer's offset.
        #   Otherwise _read_records ignores the offset and reads all data
        #   into buffer[0:read_rows]
        num_read = table._read_records(tpos, read_rows, buffer[bpos:])
        tpos += num_read
        bpos += num_read
        if num_read < read_rows or tpos >= total_rows:
            # At the end of the file so return any data left in the buffer
            yield buffer[:bpos]
            raise StopIteration
----------------
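
For reference, here is roughly how I use it. This is only a sketch; the file name, node path and field name are made up:

------ example: using iterbuffer() ------
import numarray
import tables

fileh = tables.openFile('200604.h5', mode='r')   # hypothetical file
table = fileh.getNode('/trades')                 # hypothetical table

rows_seen = 0
biggest = 0.0
for buf in table.iterbuffer(buffer_rows=131072, read_rows=4096):
    # buf is the same underlying RecArray every time, sliced to
    # however many rows were actually read on this pass
    rows_seen += len(buf)
    biggest = max(biggest, numarray.maximum.reduce(buf.field('size')))
fileh.close()
print rows_seen, table.nrows, biggest
----------------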

iterbuffer() is supported by a patch to _read_records() so that it takes account of any offset in the read buffer. At present any offset is ignored, so trying to read data into buffer[100:200] will actually write it into buffer[0:100]. This is easily fixed by adding one line of code:

------ _read_records() in TableExtension.pyx ------
 def _read_records(self, hsize_t start, hsize_t nrecords, object recarr):
   cdef long buflen
   cdef void *rbuf
   cdef int ret

   # Correct the number of records to read, if needed
   if (start + nrecords) > self.totalrecords:
     nrecords = self.totalrecords - start

   # Get the pointer to the buffer data area
   buflen = NA_getBufferPtrAndSize(recarr._data, 1, &rbuf)
   # SS 2006-0506 - Correct the offset
   rbuf = <void *>(<char *>rbuf + recarr._byteoffset)

   # Read the records from disk
   Py_BEGIN_ALLOW_THREADS
   ret = H5TBOread_records(self.dataset_id, self.type_id, start,
                           nrecords, rbuf)
   Py_END_ALLOW_THREADS
   if ret < 0:
     raise HDF5ExtError("Problems reading records.")

   # Convert some HDF5 types to Numarray after reading.
   self._convertTypes(recarr, nrecords, 1)

   return nrecords
-----------------
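
To see why the offset matters: slicing a numarray RecArray returns a view into the same data buffer with a non-zero _byteoffset, and that offset is what the unpatched _read_records() throws away. A minimal sketch, building the buffer the same way iterbuffer() does (the formats and names are just examples):

------ example: why the offset matters ------
import numarray.records

# Build a small record buffer the same way iterbuffer() does
buf = numarray.records.array(None, shape=1000,
                             formats=['Float64', 'Int32'],
                             names=['price', 'size'])
print buf._byteoffset         # 0: the full buffer starts at the data pointer
print buf[100:]._byteoffset   # non-zero: the slice starts 100 rows into the
                              # same buffer, which is the offset the unpatched
                              # _read_records() ignores
----------------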



============================================
Change #2: Read the HDF chunksize parameter when opening a file

PyTables currently uses a rough-and-ready heuristic in calcBufferSize() in utils.py to work out reasonably efficient sizes for the read buffer and the HDF chunksize. In Table.py, both _g_create() and _g_open() have code like:

        # Compute some values for buffering and I/O parameters
        (self._v_maxTuples, self._v_chunksize) = \
            calcBufferSize(self.rowsize, self._v_expectedrows)

so _v_chunksize is not actually read from the file.

I wanted more control over these parameters, so I modified _g_open() in Table.py and _getInfo() in TableExtension.pyx to read self._v_chunksize from the HDF file rather than assuming the value returned by calcBufferSize() is appropriate.

------ _getInfo() in TableExtension.pyx ------
 def _getInfo(self):
   "Get info from a table on disk."
   cdef hid_t   space_id
   cdef size_t  type_size
   cdef hsize_t dims[1]
   cdef hid_t   plist
   cdef H5D_layout_t layout

   # Open the dataset
   self.dataset_id = H5Dopen(self.parent_id, self.name)

   <snip>

   # SS 2006-0506 Get chunksize in the file
   plist = H5Dget_create_plist(self.dataset_id)
   H5Pget_chunk(plist, 1, dims)
   self.chunksize = dims[0]
   self._v_chunksize = self.chunksize
   H5Pclose(plist)

<snip>

---------------------------------------------

------ _g_open() in Table.py ------
   def _g_open(self):
       """Opens a table from disk and read the metadata on it.

       Creates an user description on the fly to ease access to
       the actual data.

       """
       # Get table info
       # SS 2006-0506 _getInfo now fills in self._v_chunksize so take out
       # assignment from calcBufferSize() further down
       self._v_objectID, description = self._getInfo()
<snip>

        # Compute buffer size
        # SS 2006-0506 Took out assignment to self._v_chunksize
        # as this is now done in _getInfo()
        (self._v_maxTuples, dummy) = \
            calcBufferSize(self.rowsize, self.nrows)
        <snip>
------------------------------------
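
With both changes applied, the chunk size stored in the file shows up on the Table object when it is opened. A quick check (file and node names are made up):

------ example: checking the chunksize read from the file ------
import tables

fileh = tables.openFile('200604.h5', mode='r')   # hypothetical file
table = fileh.getNode('/trades')                 # hypothetical table

# _v_chunksize now comes from the dataset's creation property list in the
# HDF5 file; _v_maxTuples is still the calcBufferSize() heuristic
print table._v_chunksize, table._v_maxTuples
fileh.close()
----------------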


============================================
Change #3: Allow direct control over the HDF chunksize when creating a file

Finally I needed a way to create HDF files with chunksizes other than the one selected by calcBufferSize(). This involves:
(i) modifying createTable(), openTable() and copyFile() in File.py to accept keyword arguments for passing in a chunk_rows parameter;
(ii) modifying _g_create() and _g_copyWithStats() in Table.py to use a chunk_rows keyword argument in place of the value supplied by calcBufferSize();
(iii) adding **kwargs to the argument lists and existing function calls of __init__(), _g_copy() and _f_copy() in Group.py, Node.py and Leaf.py, so that the chunk_rows keyword from (i) can flow through to the table-creating functions in (ii).
I think I've added more of these **kwargs than are strictly necessary, but everything seems to work just fine.


------ createTable() in File.py ------
    def createTable(self, where, name, description, title="",
                    filters=None, expectedrows=10000,
                    buffer_rows=None, chunk_rows=None,  # SS 2006-0506
                    compress=None, complib=None):  # Deprecated
        """<snip>"""
        parentNode = self.getNode(where)  # Does the parent node exist?
        fprops = _checkFilters(filters, compress, complib)
        return Table(parentNode, name,
                     description=description, title=title,
                     filters=fprops, expectedrows=expectedrows,
                     buffer_rows=buffer_rows, chunk_rows=chunk_rows,  # SS 2006-0506
                     )
---------------------------------------

------ Table.__init__() in Table.py ------
    def __init__(self, parentNode, name,
                 description=None, title="", filters=None,
                 expectedrows=EXPECTED_ROWS_TABLE,
                 log=True,
                 **kwargs):  # SS 2006-0506
        """<snip>"""
        <snip>
        # SS 2006-0506 - Modified to read _v_maxTuples and _v_chunksize
        # from the buffer_rows and chunk_rows arguments to Table()
        self._v_maxTuples = kwargs.get('buffer_rows', None)
        """The number of rows that fit in the table buffer."""
        self._v_chunksize = kwargs.get('chunk_rows', None)
        """The HDF5 chunk size."""
        <snip>
-----------------------------------


------ _g_create() in Table.py ------
    def _g_create(self):
        """Create a new table on disk."""
        <snip>
        # Compute some values for buffering and I/O parameters
        # SS 2006-0506 Only uses calcBufferSize if default
        # parameters not supplied to File.createTable()
        (calc_mt, calc_cs) = calcBufferSize(self.rowsize, self._v_expectedrows)
        if self._v_maxTuples is None:
            self._v_maxTuples = calc_mt
        if self._v_chunksize is None:
            self._v_chunksize = calc_cs
        <snip>
-------------------------------------


------ _g_copyWithStats() in Table.py ------
   # SS 2006-0506 Added **kwargs
   def _g_copyWithStats(self, group, name, start, stop, step,
                        title, filters, log, **kwargs):
       "Private part of Leaf.copy() for each kind of leaf"
       # Build the new Table object
        <snip>

        object = Table(
            group, name, description, title=title, filters=filters,
            expectedrows=self.nrows, log=log,
            **kwargs  # SS 2006-0506
            )
----------------------------------

------ _g_copy() in Leaf.py ------
   # SS 2006-0506 - Added **kwargs to _g_copy()
   def _g_copy(self, newParent, newName, recursive, log, **kwargs):
       # Compute default arguments.
<snip>
       # Create a copy of the object.
        (newNode, bytes) = self._g_copyWithStats(newParent, newName,
            start, stop, step, title, filters, log,
            **kwargs  # SS 2006-0506
            )

<snip>
----------------------------------

------ _g_copy() in Group.py ------
    def _g_copy(self, newParent, newName, recursive, log, **kwargs):
       # Compute default arguments.
<snip>
       # Create a copy of the object.
       # SS 2006-0506 - Add kwargs for passing parameters
       newNode = Group(newParent, newName,
                       title, new=True, filters=filters, log=log, **kwargs)
<snip>
------------------------------------

------ __init__() in Group.py ------
   # SS 20060506 - Add **kwargs at start and end of __init__()
   def __init__(self, parentNode, name,
                title="", new=False, filters=None,
                log=True, **kwargs):
       """Create the basic structures to keep group information.
<snip>
       """
<snip>
       # Finally, set up this object as a node.
       super(Group, self).__init__(parentNode, name, log, **kwargs)
-----------------------------------

------ __init__() in Leaf.py ------
class Leaf(Node):
<snip>
    # SS 20060506 - Add **kwargs at start and end of __init__()
    def __init__(self, parentNode, name,
                 new=False, filters=None,
                 log=True,
                 **kwargs):  # SS 2006-0506
<snip>
       super(Leaf, self).__init__(parentNode, name, log, **kwargs)

-----------------------------------

__init__() in Node.py complained about extra keyword arguments, so I added **kwargs there as well. It doesn't actually do anything useful there, so perhaps some PyTables expert could work out where else I've got too many **kwargs.

------ __init__() in Node.py ------
class Node(object):
<snip>
    # SS 20060506 - Add **kwargs at start and end of __init__()
    def __init__(self, parentNode, name, log=True, **kwargs):
        <snip>
-----------------------------------
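
Putting Change #3 together, this is the sort of call I can now make. The description class, file name and parameter values below are just examples:

------ example: creating a table with an explicit chunksize ------
import tables

class Trade(tables.IsDescription):   # hypothetical row description
    price = tables.Float64Col()
    size  = tables.Int32Col()

fileh = tables.openFile('chunktest.h5', mode='w')
# chunk_rows sets the HDF5 chunk shape directly and buffer_rows sets
# _v_maxTuples; leaving either of them out (None) falls back to the
# values from calcBufferSize()
table = fileh.createTable('/', 'trades', Trade, "Trades",
                          expectedrows=4000000,
                          chunk_rows=256, buffer_rows=65536)
fileh.close()
----------------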


