Re: [Pytables-users] Row.append() performance

2013-04-15 Thread Anthony Scopatz
Hello Shyam,

Can you please post the full traceback?  In any event, I am fairly certain
that this error is coming from the np.fromiter step.  The problem here is
that you are trying to read your entire SQL query result into a single numpy
array in memory.  This is impossible because you don't have enough RAM.
Therefore, you are going to need to read and write in chunks.  Something
like the following:

def getDataAndWriteHDF5(table):
    databaseConn = pyodbc.connect(, )
    cursor = databaseConn.cursor()
    cursor.execute("SQL Query")
    dt = np.dtype([('name', np.str_, 180), ('address', np.str_, 4200),
                   ('email', np.str_, 180), ('phone', np.str_, 256)])
    citer = iter(cursor)
    chunksize = 4096  # this is just a guess, other values might work better
    crange = range(chunksize)
    while True:
        # zip() stops after at most chunksize rows, so each pass pulls
        # one chunk off the cursor
        resultSet = np.fromiter((tuple(row) for i, row in zip(crange, citer)),
                                dtype=dt)
        table.append(resultSet)
        if len(resultSet) < chunksize:
            break

You may want to tweak some things, but that is the basic strategy.
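
For completeness, a usage sketch (the file name, the node path, and the
final flush() are my additions, not part of the snippet above):

h5File = tables.openFile('contacts.h5', mode='a')  # hypothetical file
contacts = h5File.getNode('/Contact')              # hypothetical table node
getDataAndWriteHDF5(contacts)
contacts.flush()  # push any buffered rows to disk once the loop ends
h5File.close()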

Be Well
Anthony


On Mon, Apr 15, 2013 at 10:16 PM, Shyam Parimal Katti wrote:

> Hello Anthony,
>
>
> Thank you for your suggestions. When I mentioned that I am reading the data
> from a database, I meant a DB2 database, not an HDF5 database/file.
>
>
> I followed your suggestions, so the code looks as follows:
>
>
> def createHDF5File():
>     h5File = tables.openFile(, mode="a")
>     h5File.createTable(h5File.root, "Contact", Contact, "Contact",
>                        expectedrows=700)
>     .
>
> def getDataAndWriteHDF5(table):
>     databaseConn = pyodbc.connect(, )
>     cursor = databaseConn.cursor()
>     cursor.execute("SQL Query")
>     resultSet = np.fromiter((tuple(row) for row in cursor),
>                             dtype=[('name', numpy.str_, 180),
>                                    ('address', numpy.str_, 4200),
>                                    ('email', numpy.str_, 180),
>                                    ('phone', numpy.str_, 256)])
>     table.append(resultSet)
>
>
>
> Error message: MemoryError: cannot allocate array memory.
>
>
>
> I am setting the `expectedrows` parameter when creating the table in the
> HDF5 file, and yet I encounter the error above. Looking forward to suggestions.
>
>
>
>
>
> > Hello Anthony,
> >
> > Thank you for replying back with suggestions.
> >
> > In response to your suggestions, I am *not reading the data from a file
> > in the first step, but instead a database*.
> >
>
> Hello Shyam,
>
> Not to put too fine a point on it: HDF5 databases are files, and reading
> from any kind of file incurs the same disk-read overhead.
>
>
> >  I did try out your 1st suggestion of doing a table.append(list of
> > tuples), which took a little more than the execution time I got with the
> > original code. Can you please guide me in how to chunk the data (that I got
> > from the database and stored as a list of tuples in Python)?
> >
>
> Ahh, so you should not be using a list of tuples.  These are Pythonic
> types, and conversion between HDF5 types and Python types is what is
> slowing you down.  You should be passing a numpy structured array into
> append().  Numpy types are very similar to (and often exactly the same as)
> HDF5 types.  For large, contiguous, structured data you want to avoid the
> Python interpreter as much as possible.  Use Python here as the glue code
> to compose a series of fast operations using the APIs exposed by numpy,
> pytables, etc.
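>
> A minimal sketch of the difference (the field names and sizes here are
> made up for illustration):
>
> import numpy as np
>
> # slow path: a Python list of tuples, converted element by element
> rows = [('Alice', 'a...@example.com'), ('Bob', 'b...@example.com')]
>
> # fast path: one contiguous structured array, appended in a single call
> arr = np.array(rows, dtype=[('name', 'S180'), ('email', 'S180')])
> table.append(arr)  # 'table' is an open tables.Table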
>
> Be Well
> Anthony
>
>
>
> On Thu, Apr 11, 2013 at 6:16 PM, Shyam Parimal Katti wrote:
>
>> Hello Anthony,
>>
>> Thank you for replying back with suggestions.
>>
>> In response to your suggestions, I am *not reading the data from a file
>> in the first step, but instead a database*. I did try out your 1st
>> suggestion of doing a table.append(list of tuples), which took a little
>> more than the execution time I got with the original code. Can you please
>> guide me in how to chunk the data (that I got from the database and stored
>> as a list of tuples in Python)?
>>
>>
>> Thanks,
>> Shyam
>>
>>
>> Hi Shyam,
>>
>> The pattern that you are using to write to a table is basically one for
>> writing Python data to HDF5.  However, your data is already in a machine /
>> HDF5 native format.  Thus what you are doing here is an excessive amount of
>> work:  read data from file -> convert to Python data structures -> convert
>> back to HDF5 data structures -> write to file.
>>
>> When reading from a table you get back a numpy structured array (look them
>> up on the numpy website).
>>
>> Then instead of using rows to write back the data, just use Table.append()
>> [1] which lets you pass in a bunch of rows simultaneously.  (Note that your
>> data in this case is too large to fit into memory, so you may have to split
>> it up into chunks or use the new iterators which are in the development
>> branch.)
>>
>> Additionally, if all you are doing is copying a table wholesale, you should
>> use Table.copy() [2].  Or if you only want to copy some subset based on
>> a conditional you provide, use whereAppend() [3].

Re: [Pytables-users] Row.append() performance

2013-04-15 Thread Shyam Parimal Katti
Hello Anthony,


Thank you for your suggestions. When I mentioned that I am reading the
data from a database, I meant a DB2 database, not an HDF5 database/file.


I followed your suggestions, so the code looks as follows:


def createHDF5File():
    h5File = tables.openFile(, mode="a")
    h5File.createTable(h5File.root, "Contact", Contact, "Contact",
                       expectedrows=700)
    .

def getDataAndWriteHDF5(table):
    databaseConn = pyodbc.connect(, )
    cursor = databaseConn.cursor()
    cursor.execute("SQL Query")
    resultSet = np.fromiter((tuple(row) for row in cursor),
                            dtype=[('name', numpy.str_, 180),
                                   ('address', numpy.str_, 4200),
                                   ('email', numpy.str_, 180),
                                   ('phone', numpy.str_, 256)])
    table.append(resultSet)



Error message: MemoryError: cannot allocate array memory.


I am setting the `expectedrows` parameter when creating the table in the
HDF5 file, and yet I encounter the error above. Looking forward to
suggestions.




> Hello Anthony,
>
> Thank you for replying back with suggestions.
>
> In response to your suggestions, I am *not reading the data from a file
> in the first step, but instead a database*.
>

Hello Shyam,

Not to put too fine a point on it: HDF5 databases are files, and reading
from any kind of file incurs the same disk-read overhead.


>  I did try out your 1st suggestion of doing a table.append(list of
> tuples), which took a little more than the execution time I got with the
> original code. Can you please guide me in how to chunk the data (that I got
> from the database and stored as a list of tuples in Python)?
>

Ahh, so you should not be using a list of tuples.  These are Pythonic
types, and conversion between HDF5 types and Python types is what is
slowing you down.  You should be passing a numpy structured array into
append().  Numpy types are very similar to (and often exactly the same as)
HDF5 types.  For large, contiguous, structured data you want to avoid the
Python interpreter as much as possible.  Use Python here as the glue code to
compose a series of fast operations using the APIs exposed by numpy,
pytables, etc.

Be Well
Anthony



On Thu, Apr 11, 2013 at 6:16 PM, Shyam Parimal Katti  wrote:

> Hello Anthony,
>
> Thank you for replying back with suggestions.
>
> In response to your suggestions, I am *not reading the data from a file
> in the first step, but instead a database*. I did try out your 1st
> suggestion of doing a table.append(list of tuples), which took a little
> more than the execution time I got with the original code. Can you please
> guide me in how to chunk the data (that I got from the database and stored
> as a list of tuples in Python)?
>
>
> Thanks,
> Shyam
>
> Hi Shyam,
>
> The pattern that you are using to write to a table is basically one for
> writing Python data to HDF5.  However, your data is already in a machine /
> HDF5 native format.  Thus what you are doing here is an excessive amount of
> work:  read data from file -> convert to Python data structures -> convert
> back to HDF5 data structures -> write to file.
>
> When reading from a table you get back a numpy structured array (look them
> up on the numpy website).
>
> Then instead of using rows to write back the data, just use Table.append()
> [1] which lets you pass in a bunch of rows simultaneously.  (Note that your
> data in this case is too large to fit into memory, so you may have to split
> it up into chunks or use the new iterators which are in the development
> branch.)
>
> Additionally, if all you are doing is copying a table wholesale, you should
> use Table.copy() [2].  Or if you only want to copy some subset based on
> a conditional you provide, use whereAppend() [3] (both sketched after the
> links below).
>
> Finally, if you want to do math or evaluate expressions on one table to
> create a new table, use the Expr class [4].
>
> All of these will be way faster than what you are doing right now.
>
> Be Well
> Anthony
>
> 1. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append
> 2. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy
> 3. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend
> 4. http://pytables.github.io/usersguide/libref/expr_class.html
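>
> As a rough illustration of the two copy patterns above (the file handle
> and node names are hypothetical, so treat this as a sketch rather than
> tested code):
>
> # f is an open tables.File, tbl an existing Table
> copied = tbl.copy(f.root, 'contacts_copy')     # wholesale copy
> subset = f.createTable(f.root, 'subset', tbl.description)
> tbl.whereAppend(subset, 'name == "Alice"')     # conditional copy
> # for building a new table from expressions over this one, see Expr [4]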
>
>
>
>
> On Thu, Apr 11, 2013 at 12:23 PM, Shyam Parimal Katti wrote:
>
>> Hello,
>>
>> I am writing a lot of data (close to 122 GB) to an HDF5 file using
>> PyTables. The execution time for writing the query result to the file is
>> close to 10 hours, which includes querying the database and then writing to
>> the file. When I timed the entire execution, I found that it takes as much
>> time to get the data from the database as it takes to write to the HDF5
>> file. Here is a small snippet (P.S.: the execution time noted below is not
>> for the 122 GB of data, but a small subset close to 10 GB):
>>
>> class ContactClass(tb.IsDescription):
>>     name = tb.StringCol(4200)
>>     address = tb.StringCol(4200)
>>     emailAddr = tb.String

Re: [Pytables-users] Some method like a "table.readWhereSorted"

2013-04-15 Thread Anthony Scopatz
On Mon, Apr 15, 2013 at 9:40 AM, Julio Trevisan wrote:

> Hi Anthony
>
> Thanks for adding this issue.
>
> Is there a way to use CSI indexes to get the row coordinates satisfying a
> simple condition sorted by the column in the condition? I would like to
> avoid using numpy.sort() since the sorting order is probably already
> available within the index information.
>
> My condition is simply "(timestamp >= %d) & (timestamp <= %d)" % (ts1, ts2)
>
> If you could please give me some guidelines so that I could put together
> such a method, that would be great.
>
> Julio
>
>
> Is there a way to get the coordinates (like using getWhereList()), but using
> an index so that the coordinate list comes back ordered by the column? I
> noticed that getWhereList() has a *sort* parameter that uses numpy.sort() to
> do the work, and that readSorted() uses a full index to get the sorted
> sequence. I couldn't make complete sense yet of "chunkmaps" being passed to
> the numexpr evaluator inside _where().
>

Hi Julio,

Thanks for taking this on!  You probably want to read [1] to figure out how
numexpr works and what chunkmaps is doing, if you haven't already.

However, probably the simplest implementation of this method would be
basically the body of read_where() followed by the body of read_sorted().
It would look something like this, but you'll have to try it:


def read_where_sorted(self, condition, condvars=None, field=None,
                      sortby=None, checkCSI=False,
                      start=None, stop=None, step=None):
    self._g_check_open()
    # row numbers that satisfy the condition
    condcoords = set(p.nrow for p in
                     self._where(condition, condvars, start, stop, step))
    self._where_condition = None  # reset the condition
    # row numbers in sorted order, straight from the (CSI) index
    index = self._check_sortby_csi(sortby, checkCSI)
    sortcoords = index[start:stop:step]
    # keep only the sorted coordinates that also satisfy the condition
    coords = [c for c in sortcoords if c in condcoords]
    return self.read_coordinates(coords, field)

There may be faster, more elegant solutions, but I think that something
like this would work.
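
If the per-element membership test becomes a bottleneck, a vectorized
variant (my suggestion, assuming sortcoords comes back as a numpy integer
array) would be:

import numpy as np

# boolean mask over the sorted coordinates: True where the row also
# satisfies the condition
mask = np.in1d(sortcoords, np.fromiter(condcoords, dtype=np.int64))
coords = sortcoords[mask]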

Be Well
Anthony

1. http://code.google.com/p/numexpr/


>
>
> On Thu, Apr 11, 2013 at 1:14 PM, Anthony Scopatz wrote:
>
>> Thanks for bringing this up, Julio.
>>
>> Hmm I don't think that this exists currently, but since there are
>> readWhere() and readSorted() it shouldn't be too hard to implement.  I have
>> opened issue #225 to this effect.  Pull requests welcome!
>>
>> https://github.com/PyTables/PyTables/issues/225
>>
>> Be Well
>> Anthony
>>
>>
>> On Wed, Apr 10, 2013 at 1:02 PM, Dr. Louis Wicker 
>> wrote:
>>
>>> I am also interested in this capability, if it exists in some way...
>>>
>>> Lou
>>>
>>> On Apr 10, 2013, at 12:35 PM, Julio Trevisan 
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > Is there a way that I could have the ability of readWhere (i.e.,
>>> specify condition, and fast result) but also using a CSIndex so that the
>>> rows come sorted in a particular order?
>>> >
>>> > I checked readSorted() but it is iterative and does not allow to
>>> specify a condition.
>>> >
>>> > Julio
>>> >
>>>
>>>
>>> 
>>> | Dr. Louis J. Wicker
>>> | NSSL/WRDD  Rm 4366
>>> | National Weather Center
>>> | 120 David L. Boren Boulevard, Norman, OK 73072
>>> |
>>> | E-mail:   louis.wic...@noaa.gov
>>> | HTTP:http://www.nssl.noaa.gov/~lwicker
>>> | Phone:(405) 325-6340
>>> | Fax:(405) 325-6780
>>> |
>>> |
>>> |  For every complex problem, there is a solution that is simple,
>>> |  neat, and wrong.
>>> |
>>> |   -- H. L. Mencken
>>> |
>>>
>>> 
>>> | "The contents  of this message are mine personally and
>>> | do not reflect any position of  the Government or NOAA."
>>>
>>> 
>>>
>>>
>>>
>>>
>>
>>
>>

Re: [Pytables-users] PyTables in-kernel query using Time64Col returns wrong results

2013-04-15 Thread Anthony Scopatz
And here is the issue: https://github.com/PyTables/PyTables/issues/230


On Mon, Apr 15, 2013 at 6:07 PM, Anthony Scopatz  wrote:

> Hi Charles,
>
> This is very likely a bug with respect to querying based off of Time64Cols
> not being converted to Float64s for the query itself.  Under the covers,
> HDF5 and PyTables represent Time64 as POSIX times, which are structs of
> two 4-byte ints [1].  These obviously have a very different memory layout
> than your standard float64.  This is why this comparison is failing.
>
> numexpr doesn't support the time64 datatype, nor does it support bit-shift
> operators.  This makes it difficult, if not impossible, to use time64
> columns properly from within a query right now.
>
> I'll open a ticket for this, but if you want something working right
> now using Float64Col is probably your best bet.  This is what I have always
> done, and it works just fine.  I think that the Time64 stuff is in there
> largely for C/HDF5 compliance.  Sorry about the confusion.
>
> Be Well
> Anthony
>
> 1. http://pubs.opengroup.org/onlinepubs/95399/basedefs/sys/time.h.html
>
>
> On Mon, Apr 15, 2013 at 2:20 PM, Charles de Villiers wrote:
>
>> Hi Anthony,
>>
>> Thanks for your response.
>>
>> I had come across that discussion, but I don't think the floating-point
>> precision thing really explains my results, because I'm querying for
>> intervals, not instants.
>> If I have a table containing, say, one-second samples between 500.0 and
>> 1500.0, and I use a where clause like this:
>> '(update_seconds >= 1000.0) & (update_seconds <= 1060.0)'
>> then I expect to get at least 58 samples, even with floating-point
>> 'fuzziness' - but in fact I get none.
>> However, I have now tried the approach of storing my epoch seconds in
>> Float64Cols and that seems to be working just fine.
>> The question I'm left with is - just what does a Time64Col represent?
>> Since there's no standard Python Time class with a float representation, I
>> just guessed I could assign it float seconds a la time.time(), but
>> Float64 works just as well for that (and as it turns out, better). How
>> could you use a Time64Col in practice?
>>
>> Thanks again,
>>
>> Charles de Villiers
>>
>> "They have computers, and they may have other weapons of mass
>> destruction."
>> (Janet Reno)
>>
>>   --
>>  *From:* Anthony Scopatz 
>> *To:* Charles de Villiers ; Discussion list for
>> PyTables 
>> *Sent:* Monday, April 15, 2013 5:13 PM
>> *Subject:* Re: [Pytables-users] PyTables in-kernel query using Time64Col
>> returns wrong results
>>
>> Hi Charles,
>>
>> We just discussed this last week and I am too lazy to retype it all so
>> here is a link to the archive post [1].
>>
>> Be Well
>> Anthony
>>
>> 1. http://sourceforge.net/mailarchive/message.php?msg_id=30708089
>>
>>
>> On Mon, Apr 15, 2013 at 9:20 AM, Charles de Villiers 
>> wrote:
>>
>>
>>  I'm using PyTables 2.4.0 and Python 2.7. I've got a database that
>> contains the following typical table:
>>
>> /anc/asc_wind_speed (Table(87591,), shuffle, blosc(3)) 'Wind speed'
>>   description := {
>>   "value_seconds": Time64Col(shape=(), dflt=0.0, pos=0),
>>   "update_seconds": Time64Col(shape=(),
>>  dflt=0.0, pos=1),
>>   "status": UInt8Col(shape=(), dflt=0, pos=2),
>>   "value": Float64Col(shape=(), dflt=0.0, pos=3)}
>>   byteorder := 'little'
>>   chunkshape := (2621,)
>>   autoIndex := True
>>   colindexes := {
>> "update_seconds": Index(9,
>>  full, shuffle, zlib(1)).is_CSI=True,
>> "value": Index(9,
>>  full, shuffle, zlib(1)).is_CSI=True}
>>
>> I populate the timestamp columns using float seconds.
>> The data looks OK in my IPython session:
>>
>> array([(1343779432.2160001, 1343779431.852, 0, 5.29750003),
>>(1343779433.2190001, 1343779432.9430001, 0, 5.74749996),
>>(1343779434.217, 1343779433.980, 0, 5.8603), ...,
>>(1343866301.934, 1343866301.513, 0, 3.84249998),
>>(1343866302.934, 1343866302.579, 0, 4.0596),
>>(1343866303.934, 1343866303.642, 0, 3.78250002)],
>>
>>   dtype=[('value_seconds', '<f8'), ('update_seconds', '<f8'),
>>  ('status', '|u1'), ('value', '<f8')])
>> .. but when I try to do an in-kernel search using the indexed column
>> 'update_seconds', everything goes pear-shaped:
>>
>> len(wstable.readWhere('(update_seconds <= 1343866303.642)'))
>> 0
>>
>> i.e. I get 0 rows returned when I was expecting all 87591 of them.
>> Occasionally I do manage to get some rows with a '>=' query, but the
>> timestamp columns are then returned as huge floats (~10^79). It seems that
>> there is some implicit type-conversion going on that causes the Time64Col
>> values to be misinterpreted. Can someone spot my mistake, or should I
>> forget about Time64Cols and convert them all to Float64 (and how do I do
>> this?)
>>
>>
>>

Re: [Pytables-users] PyTables in-kernel query using Time64Col returns wrong results

2013-04-15 Thread Anthony Scopatz
Hi Charles,

This is very likely a bug with respect to querying based off of Time64Cols
not being converted to Float64s for the query itself.  Under the covers,
HDF5 and PyTables represent Time64 as POSIX times, which are structs of
two 4-byte ints [1].  These obviously have a very different memory layout
than your standard float64.  This is why this comparison is failing.

numexpr doesn't support the time64 datatype, nor does it support bit-shift
operators.  This makes it difficult, if not impossible, to use time64
columns properly from within a query right now.

I'll open a ticket for this, but if you want something working right
now using Float64Col is probably your best bet.  This is what I have always
done, and it works just fine.  I think that the Time64 stuff is in there
largely for C/HDF5 compliance.  Sorry about the confusion.
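
For what it's worth, a minimal sketch of that workaround (the file, table,
and column names here are made up for illustration):

import time
import numpy as np
import tables as tb

class Sample(tb.IsDescription):
    update_seconds = tb.Float64Col(pos=0)  # plain float64 epoch seconds
    value = tb.Float64Col(pos=1)

h5 = tb.openFile('samples.h5', mode='w')
t = h5.createTable('/', 'wind', Sample)
t.append(np.array([(time.time(), 5.3)],
                  dtype=[('update_seconds', 'f8'), ('value', 'f8')]))
t.cols.update_seconds.createCSIndex()
hits = t.readWhere('update_seconds >= %f' % (time.time() - 60.0))
h5.close()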

Be Well
Anthony

1. http://pubs.opengroup.org/onlinepubs/95399/basedefs/sys/time.h.html


On Mon, Apr 15, 2013 at 2:20 PM, Charles de Villiers wrote:

> Hi Anthony,
>
> Thanks for your response.
>
> I had come across that discussion, but I don't think the floating-point
> precision thing really explains my results, because I'm querying for
> intervals, not instants.
> If I have a table containing, say, one-second samples between 500.0 and
> 1500.0, and I use a where clause like this:
> '(update_seconds >= 1000.0) & (update_seconds <= 1060.0)'
> then I expect to get at least 58 samples, even with floating-point
> 'fuzziness' - but in fact I get none.
> However, I have now tried the approach of storing my epoch seconds in
> Float64Cols and that seems to be working just fine.
> The question I'm left with is - just what does a Time64Col represent?
> Since there's no standard Python Time class with a float representation, I
> just guessed I could assign it float seconds a la time.time(), but
> Float64 works just as well for that (and as it turns out, better). How
> could you use a Time64Col in practice?
>
> Thanks again,
>
> Charles de Villiers
>
> "They have computers, and they may have other weapons of mass destruction."
> (Janet Reno)
>
>   --
>  *From:* Anthony Scopatz 
> *To:* Charles de Villiers ; Discussion list for
> PyTables 
> *Sent:* Monday, April 15, 2013 5:13 PM
> *Subject:* Re: [Pytables-users] PyTables in-kernel query using Time64Col
> returns wrong results
>
> Hi Charles,
>
> We just discussed this last week and I am too lazy to retype it all so
> here is a link to the archive post [1].
>
> Be Well
> Anthony
>
> 1. http://sourceforge.net/mailarchive/message.php?msg_id=30708089
>
>
> On Mon, Apr 15, 2013 at 9:20 AM, Charles de Villiers wrote:
>
>
>  I'm using PyTables 2.4.0 and Python 2.7. I've got a database that contains
> the following typical table:
>
> /anc/asc_wind_speed (Table(87591,), shuffle, blosc(3)) 'Wind speed'
>   description := {
>   "value_seconds": Time64Col(shape=(), dflt=0.0, pos=0),
>   "update_seconds": Time64Col(shape=(),
>  dflt=0.0, pos=1),
>   "status": UInt8Col(shape=(), dflt=0, pos=2),
>   "value": Float64Col(shape=(), dflt=0.0, pos=3)}
>   byteorder := 'little'
>   chunkshape := (2621,)
>   autoIndex := True
>   colindexes := {
> "update_seconds": Index(9,
>  full, shuffle, zlib(1)).is_CSI=True,
> "value": Index(9,
>  full, shuffle, zlib(1)).is_CSI=True}
>
> I populate the timestamp columns using float seconds.
> The data looks OK in my IPython session:
>
> array([(1343779432.2160001, 1343779431.852, 0, 5.29750003),
>(1343779433.2190001, 1343779432.9430001, 0, 5.74749996),
>(1343779434.217, 1343779433.980, 0, 5.8603), ...,
>(1343866301.934, 1343866301.513, 0, 3.84249998),
>(1343866302.934, 1343866302.579, 0, 4.0596),
>(1343866303.934, 1343866303.642, 0, 3.78250002)],
>
>   dtype=[('value_seconds', '<f8'), ('update_seconds', '<f8'),
>  ('status', '|u1'), ('value', '<f8')])
> .. but when I try to do an in-kernel search using the indexed column
> 'update_seconds', everything goes pear-shaped:
>
> len(wstable.readWhere('(update_seconds <= 1343866303.642)'))
> 0
>
> i.e. I get 0 rows returned when I was expecting all 87591 of them.
> Occasionally I do manage to get some rows with a '>=' query, but the
> timestamp columns are then returned as huge floats (~10^79). It seems that
> there is some implicit type-conversion going on that causes the Time64Col
> values to be misinterpreted. Can someone spot my mistake, or should I
> forget about Time64Cols and convert them all to Float64 (and how do I do
> this?)
>
>
>

Re: [Pytables-users] PyTables in-kernel query using Time64Col returns wrong results

2013-04-15 Thread Charles de Villiers
Hi Anthony,

Thanks for your response.

I had come across that discussion, but I don't think the floating-point 
precision thing really explains my results, because I'm querying for intervals, 
not instants.
If I have a table containing, say, one-second samples between 500.0 and 1500.0,
and I use a where clause like this:
'(update_seconds >= 1000.0) & (update_seconds <= 1060.0)'
then I expect to get at least 58 samples, even with floating-point 'fuzziness' 
- but in fact I get none. 
However, I have now tried the approach of storing my epoch seconds in 
Float64Cols and that seems to be working just fine. 
The question I'm left with is - just what does a Time64Col represent? Since 
there's no standard Python Time class with a float representation, I just 
guessed I could assign it float seconds a la time.time(), but Float64 works
just as well for that (and as it turns out, better). How could you use a 
Time64Col in practice?

Thanks again, 
 

Charles de Villiers

"They have computers, and they may have other weapons of mass destruction."
(Janet Reno) 




 From: Anthony Scopatz 
To: Charles de Villiers ; Discussion list for PyTables 
 
Sent: Monday, April 15, 2013 5:13 PM
Subject: Re: [Pytables-users] PyTables in-kernel query using Time64Col returns 
wrong results
 


Hi Charles, 

We just discussed this last week and I am too lazy to retype it all so here is 
a link to the archive post [1].

Be Well
Anthony

1. http://sourceforge.net/mailarchive/message.php?msg_id=30708089



On Mon, Apr 15, 2013 at 9:20 AM, Charles de Villiers  wrote:


>I'm using PyTables 2.4.0 and Python 2.7. I've got a database that
>contains the following typical table:
>
>/anc/asc_wind_speed (Table(87591,), shuffle, blosc(3)) 'Wind speed'
>  description := {
>  "value_seconds": Time64Col(shape=(), dflt=0.0, pos=0),
>  "update_seconds": Time64Col(shape=(), dflt=0.0, pos=1),
>  "status": UInt8Col(shape=(), dflt=0, pos=2),
>  "value": Float64Col(shape=(), dflt=0.0, pos=3)}
>  byteorder := 'little'
>  chunkshape := (2621,)
>  autoIndex := True
>  colindexes := {
>"update_seconds": Index(9, full, shuffle, zlib(1)).is_CSI=True,
>"value": Index(9, full, shuffle, zlib(1)).is_CSI=True}
>I populate the timestamp columns using float seconds.
>The data looks OK in my IPython session:
>array([(1343779432.2160001, 1343779431.852, 0, 5.29750003),
>   (1343779433.2190001, 1343779432.9430001, 0, 5.74749996),
>   (1343779434.217, 1343779433.980, 0, 5.8603), ...,
>   (1343866301.934, 1343866301.513, 0, 3.84249998),
>   (1343866302.934, 1343866302.579, 0, 4.0596),
>   (1343866303.934, 1343866303.642, 0, 3.78250002)],
>  dtype=[('value_seconds', '<f8'), ('update_seconds', '<f8'),
> ('status', '|u1'), ('value', '<f8')])
>
>.. but when I try to do an in-kernel search using the indexed column
>'update_seconds', everything goes pear-shaped:
>
>len(wstable.readWhere('(update_seconds <= 1343866303.642)'))
>0
>
>i.e. I get 0 rows returned when I was expecting all 87591 of them. Occasionally
>I do manage to get some rows with a '>=' query, but the timestamp columns are
>then returned as huge floats (~10^79). It seems that there is some implicit
>type-conversion going on that causes the Time64Col values to be
>misinterpreted. Can someone spot my mistake, or should I forget about
>Time64Cols and convert them all to Float64 (and how do I do this?)
>
>


Re: [Pytables-users] PyTables in-kernel query using Time64Col returns wrong results

2013-04-15 Thread Anthony Scopatz
Hi Charles,

We just discussed this last week and I am too lazy to retype it all so here
is a link to the archive post [1].

Be Well
Anthony

1. http://sourceforge.net/mailarchive/message.php?msg_id=30708089


On Mon, Apr 15, 2013 at 9:20 AM, Charles de Villiers wrote:

>
>  I'm using PyTables 2.4.0 and Python 2.7. I've got a database that contains
> the following typical table:
>
> /anc/asc_wind_speed (Table(87591,), shuffle, blosc(3)) 'Wind speed'
>   description := {
>   "value_seconds": Time64Col(shape=(), dflt=0.0, pos=0),
>   "update_seconds": Time64Col(shape=(), dflt=0.0, pos=1),
>   "status": UInt8Col(shape=(), dflt=0, pos=2),
>   "value": Float64Col(shape=(), dflt=0.0, pos=3)}
>   byteorder := 'little'
>   chunkshape := (2621,)
>   autoIndex := True
>   colindexes := {
> "update_seconds": Index(9, full, shuffle, zlib(1)).is_CSI=True,
> "value": Index(9, full, shuffle, zlib(1)).is_CSI=True}
>
> I populate the timestamp columns using float seconds.
> The data looks OK in my IPython session:
>
> array([(1343779432.2160001, 1343779431.852, 0, 5.29750003),
>(1343779433.2190001, 1343779432.9430001, 0, 5.74749996),
>(1343779434.217, 1343779433.980, 0, 5.8603), ...,
>(1343866301.934, 1343866301.513, 0, 3.84249998),
>(1343866302.934, 1343866302.579, 0, 4.0596),
>(1343866303.934, 1343866303.642, 0, 3.78250002)],
>
>   dtype=[('value_seconds', '<f8'), ('update_seconds', '<f8'),
>  ('status', '|u1'), ('value', '<f8')])
> .. but when I try to do an in-kernel search using the indexed column
> 'update_seconds', everything goes pear-shaped:
>
> len(wstable.readWhere('(update_seconds <= 1343866303.642)'))
> 0
>
> i.e. I get 0 rows returned when I was expecting all 87591 of them.
> Occasionally I do manage to get some rows with a '>=' query, but the
> timestamp columns are then returned as huge floats (~10^79). It seems that
> there is some implicit type-conversion going on that causes the Time64Col
> values to be misinterpreted. Can someone spot my mistake, or should I
> forget about Time64Cols and convert them all to Float64 (and how do I do
> this?)
>
>
>


Re: [Pytables-users] Some method like a "table.readWhereSorted"

2013-04-15 Thread Julio Trevisan
Hi Anthony

Thanks for adding this issue.

Is there a way to use CSI indexes to get the row coordinates satisfying a
simple condition sorted by the column in the condition? I would like to
avoid using numpy.sort() since the sorting order is probably already
available within the index information.

My condition is simply "(timestamp >= %d) & (timestamp <= %d)" % (ts1, ts2)

If you could please give me some guidelines so that I could put together
such a method, that would be great.

Julio


Is there a way to get the coordinates (like using getWhereList()), but using
an index so that the coordinate list comes back ordered by the column? I
noticed that getWhereList() has a *sort* parameter that uses numpy.sort() to
do the work, and that readSorted() uses a full index to get the sorted
sequence. I couldn't make complete sense yet of "chunkmaps" being passed to
the numexpr evaluator inside _where().


On Thu, Apr 11, 2013 at 1:14 PM, Anthony Scopatz  wrote:

> Thanks for bringing this up, Julio.
>
> Hmm I don't think that this exists currently, but since there are
> readWhere() and readSorted() it shouldn't be too hard to implement.  I have
> opened issue #225 to this effect.  Pull requests welcome!
>
> https://github.com/PyTables/PyTables/issues/225
>
> Be Well
> Anthony
>
>
> On Wed, Apr 10, 2013 at 1:02 PM, Dr. Louis Wicker 
> wrote:
>
>> I am also interested in this capability, if it exists in some way...
>>
>> Lou
>>
>> On Apr 10, 2013, at 12:35 PM, Julio Trevisan 
>> wrote:
>>
>> > Hi,
>> >
>> > Is there a way that I could have the ability of readWhere (i.e.,
>> specify condition, and fast result) but also using a CSIndex so that the
>> rows come sorted in a particular order?
>> >
>> > I checked readSorted() but it is iterative and does not allow to
>> specify a condition.
>> >
>> > Julio
>> >
>>
>>
>>
>>
>>
>>
>
>
>
>
>

[Pytables-users] PyTables in-kernel query using Time64Col returns wrong results

2013-04-15 Thread Charles de Villiers

I'm using PyTables 2.4.0 and Python 2.7. I've got a database that
contains the following typical table:
/anc/asc_wind_speed (Table(87591,), shuffle, blosc(3)) 'Wind speed'
  description := {
  "value_seconds": Time64Col(shape=(), dflt=0.0, pos=0),
  "update_seconds": Time64Col(shape=(), dflt=0.0, pos=1),
  "status": UInt8Col(shape=(), dflt=0, pos=2),
  "value": Float64Col(shape=(), dflt=0.0, pos=3)}
  byteorder := 'little'
  chunkshape := (2621,)
  autoIndex := True
  colindexes := {
"update_seconds": Index(9, full, shuffle, zlib(1)).is_CSI=True,
"value": Index(9, full, shuffle, zlib(1)).is_CSI=True}
I populate the timestamp columns using float seconds.
The data looks OK in my IPython session:
array([(1343779432.2160001, 1343779431.852, 0, 5.29750003),
   (1343779433.2190001, 1343779432.9430001, 0, 5.74749996),
   (1343779434.217, 1343779433.980, 0, 5.8603), ...,
   (1343866301.934, 1343866301.513, 0, 3.84249998),
   (1343866302.934, 1343866302.579, 0, 4.0596),
   (1343866303.934, 1343866303.642, 0, 3.78250002)],
  dtype=[('value_seconds', '<f8'), ('update_seconds', '<f8'),
 ('status', '|u1'), ('value', '<f8')])

.. but when I try to do an in-kernel search using the indexed column
'update_seconds', everything goes pear-shaped:

len(wstable.readWhere('(update_seconds <= 1343866303.642)'))
0

i.e. I get 0 rows returned when I was expecting all 87591 of them. Occasionally
I do manage to get some rows with a '>=' query, but the timestamp columns are
then returned as huge floats (~10^79). It seems that there is some implicit
type-conversion going on that causes the Time64Col values to be misinterpreted.
Can someone spot my mistake, or should I forget about Time64Cols and convert
them all to Float64 (and how do I do this?)