On Wednesday 02 July 2008, Hans Fangohr wrote:
> Dear pytables developers,
>
> I have come across an oddity when I read pytables files written with
> pytables 1.0 and read with pytables 2.0, and would like your opinion
> on this.
>
> What I consider a problem for my application is outlined below, and I
> wanted to make sure that you are aware of this behaviour (it could be
> a bug).
It could be, but unfortunately this is the intended behaviour. It is
not easy to explain, but I'll try below.
>
> I create two h5 files using the following piece of code (once run
> with pytables 1.3.2 and once with pytables 2.0.3)::
>
>
> import numpy
> import tables
>
> print "Running with tables version %s" % tables.__version__
>
> master_version = int(tables.__version__[0])
>
> if master_version == 1:
>     fileName = 'tbl1.h5'
> elif master_version == 2:
>     fileName = 'tbl2.h5'
> else:
>     raise "Impossible"
>
> h5f = tables.openFile(fileName, 'w')
>
> if master_version == 1:
>     class Particle(tables.IsDescription):
>         name = tables.StringAtom(length=16)
>         idnumber = tables.Int64Atom()
> else:
>     class Particle(tables.IsDescription):
>         name = tables.StringCol(16)
>         idnumber = tables.Int64Col()
>
> my_table = h5f.createTable(h5f.root, 'test', Particle,
>                            "watch the white space")
>
> particle = my_table.row
> particle['name']="A"
> particle['idnumber']=01234
> particle.append()
>
> particle['name']="BCD"
> particle['idnumber']=56789
> particle.append()
>
> my_table.flush
>
> h5f.close()
>
>
> Subsequently, I run the following piece of code with pytables 2.0::
>
> for fileName,version in [('tbl1.h5','tables1.3.2'),
>                          ('tbl2.h5','tables2.0.3')]:
>     print "Reading the file written with %s" % version
>     h5f = tables.openFile(fileName)
>     for row in h5f.root.test.iterrows():
>         print row
>     h5f.close()
>
>
> I get the following output::
>
>
> Running with tables version 2.0.3
> Reading the file written with tables1.3.2
> (668, 'A               ')
> (56789, 'BCD             ')
> Reading the file written with tables2.0.3
> (668, 'A')
> (56789, 'BCD')
>
>
> Note that the strings 'A' and 'BCD' are returned with whitespace
> (filling up to the maximum length of the string) when reading the
> file written with pytables 1.0 but not when reading the file written
> with pytables 2.0.
>
> I believe that the desired behaviour is not to return the white
> space.
[snip]
Yeah, but this is not easy to do. The root of the problem is that
numarray (the array package at the core of the PyTables 1.x series) and
NumPy (the one at the core of the PyTables 2.x series) use different
padding conventions.
As you may already know, both numarray and NumPy implement string arrays
as *fixed* length datatypes (mainly for performance reasons). So, if
you have defined an array of strings with a length of up to, say, 4
chars per element, and you want to save, say, an empty string, you have
to decide how to fill the 4 unused chars. And this is where numarray
and NumPy critically diverged: numarray chose the Fortran convention
and fills the unused space with *white spaces*, while NumPy chose the C
convention and fills the unused space with NULL chars.
Here is an example of the above:
In [43]: nas = numarray.strings.array("", itemsize=4)
In [44]: str(nas._data)
Out[44]: '    '
In [45]: nps = numpy.array("", dtype="S4")
In [46]: str(nps.data)
Out[46]: '\x00\x00\x00\x00'
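For readers without numarray at hand, the two conventions can be imitated in plain Python. This is only a rough sketch: the helper names `pad_field` and `read_field` are hypothetical, and `read_field` merely mimics the way NumPy drops trailing NULs (and nothing else) when an element of a fixed-width string array is read.

```python
def pad_field(value, itemsize, padc):
    """Store `value` in a fixed-width field of `itemsize` bytes, padded with `padc`."""
    if len(value) > itemsize:
        raise ValueError("value does not fit in the field")
    return value.ljust(itemsize, padc)


def read_field(raw):
    """Imitate NumPy-style reading: only trailing NUL bytes are dropped."""
    return raw.rstrip(b"\x00")


numarray_style = pad_field(b"A", 4, b" ")    # Fortran convention: pad with spaces
numpy_style = pad_field(b"A", 4, b"\x00")    # C convention: pad with NULs

print(repr(read_field(numarray_style)))  # the space padding survives the read
print(repr(read_field(numpy_style)))     # the NUL padding is removed
```

Space padding looks like ordinary data to a reader that follows the C convention, which is why the strings from the 1.x file come back with trailing white space.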
After realizing that this could lead to problems when files created
with PyTables 1.x are read with PyTables 2.x, I tried to convince the
numarray crew to change their default padding to the NULL char, but I
had no success (they had to keep the Fortran convention for
compatibility with a lot of code they had already written).
However, they allowed me to introduce a new parameter, called 'padc',
in the string array factory so that you can choose the padding char.
Here is how it works:
In [47]: padded_nas = numarray.strings.array("", itemsize=4, padc="\x00")
In [48]: str(padded_nas._data)
Out[48]: '\x00\x00\x00\x00'
[See the thread:
http://projects.scipy.org/pipermail/numpy-discussion/2005-January/003781.html
for more info about the patch]
So, perhaps, if you need to use PyTables 1.x in some places to create
files that are to be read by PyTables 2.x in other places, instead of:
particle['name'] = "A"
you can write:
particle['name'] = numarray.strings.array("A", itemsize=16, padc="\x00")
This way, files written with PyTables 1.x will be read correctly by
PyTables 2.x (i.e. without the padding issue).
However, if what you want is to read existing PyTables 1.x files with
PyTables 2.x without padding issues, then I'm afraid you are out of
luck. In that case, your best bet would be to write a conversion tool
yourself. Incidentally, the 'correct' way to do this would be to hack
the ptrepack utility, which comes with PyTables, so that it does this
automatically.
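The heart of such a conversion is small. As a rough, pure-Python sketch (the helper name `space_to_nul_padding` is hypothetical and is not part of ptrepack or PyTables), each fixed-width string field would have its trailing spaces replaced by NULs. It assumes that trailing spaces are never meaningful data, which cannot be checked from the file itself.

```python
def space_to_nul_padding(raw):
    """Return `raw` with its trailing spaces replaced by NUL bytes.

    The field keeps its original width, so it can be written back into
    a fixed-length string column of the same size.
    """
    stripped = raw.rstrip(b" ")
    return stripped.ljust(len(raw), b"\x00")


# A 16-byte field as a 1.x file would hold it: 'BCD' plus 13 spaces.
old_field = b"BCD" + b" " * 13
new_field = space_to_nul_padding(old_field)

# A reader that drops trailing NULs now sees just the payload.
print(repr(new_field.rstrip(b"\x00")))
```

A real tool would have to apply this to every string column of every table and array in the file, which is the part ptrepack's tree-walking machinery could take care of.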
Hope that helps,
--
Francesc Alted
Freelance developer
Tel +34-964-282-249