Re: [Tutor] is this use or abuse of __getitem__ ?

2012-09-15 Thread eryksun
On Fri, Sep 14, 2012 at 2:33 PM, Albert-Jan Roskam fo...@yahoo.com wrote:
 On 14/09/12 22:16, Albert-Jan Roskam wrote:

 Is it recommended to define the getitem() function inside the __getitem__() 
 method?
 I was thinking I could also define a _getitem() private method.

 def getitem(key):
     retcode1 = self.iomodule.SeekNextCase(self.fh,
                                           ctypes.c_long(int(key)))
 


I wouldn't do this since it incurs the cost of a repeated function
call. A slice could involve thousands of such calls. Maybe use a
boolean variable like is_slice. Then use a for loop to build the
records list (maybe only 1 item). If is_slice, return records, else
return records[0].
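
As for whether getitem() is redefined each time: it is. The def statement runs on every call of the enclosing method, so a new function object is created each time, although the compiled body is reused. A quick way to see this (outer and inner are just placeholder names for the example):

```python
def outer():
    def inner():          # this def statement runs on every call to outer()
        return 42
    return inner

first = outer()
second = outer()
print(first is second)                      # False: two distinct function objects
print(first.__code__ is second.__code__)    # True: the compiled body is shared
```

The cost of creating the function object is small; the per-record call overhead in a large slice is the bigger concern.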


 if isinstance(key, slice):
     records = [getitem(i) for i in range(*key.indices(self.nCases))]
     return records
 elif hasattr(key, "__int__"):  # isinstance(key, (int, float)):
     if abs(key) > (self.nCases - 1):
         raise IndexError
     else:
         key = self.nCases + key if key < 0 else key
         record = getitem(key)
         return record
 else:
     raise TypeError


I agree with Steven's reasoning that it doesn't make sense to support
floating point indexes. Python 2.6+ has the __index__ special method.
int and long have this method. float, Decimal,and Fraction do not have
it. It lets you support any user-defined class that can be used as an
index. For example:

>>> class MyInt(object):
...     def __index__(self):
...         return 5

>>> slice(MyInt(), MyInt(), MyInt()).indices(10)
(5, 5, 5)

operator.index() is the corresponding function. It raises TypeError if
__index__ isn't supported.
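
A small sketch of that behaviour, assuming nothing beyond the standard library (CaseNumber and as_index are names made up for the example, not part of your code):

```python
import operator

class CaseNumber(object):
    """Hypothetical int-like wrapper, used only for this example."""
    def __init__(self, value):
        self.value = value

    def __index__(self):
        return self.value

def as_index(key):
    """Return key as a plain int, or raise TypeError if it is not index-like."""
    return operator.index(key)

print(as_index(3))               # a plain int passes through
print(as_index(CaseNumber(7)))   # __index__ is consulted
try:
    as_index(1.0)                # float has no __index__
except TypeError as exc:
    print("rejected: %s" % exc)
```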

But watch out because you're using ctypes.c_long. It doesn't do any
range checking. It just silently wraps around modulo the size of a
long on your platform:

>>> c_long(2**32-1), c_long(2**32), c_long(2**32+1)
(c_long(-1), c_long(0), c_long(1))

Calling int(key) or index(key) is no help because it will silently
return a Python long (big int). You need to do range checking on the
upper bound and raise a ValueError.
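
To make the range check concrete, here is a rough sketch of a helper that refuses values a C long cannot hold instead of letting them wrap. The name to_c_long and the module-level constants are my own inventions, not something ctypes provides:

```python
import ctypes

# Number of bits in a C long on this platform (32 on Windows, often 64 elsewhere).
_LONG_BITS = ctypes.sizeof(ctypes.c_long) * 8
LONG_MIN = -2 ** (_LONG_BITS - 1)
LONG_MAX = 2 ** (_LONG_BITS - 1) - 1

def to_c_long(value):
    """Hypothetical helper: convert to c_long, raising instead of wrapping."""
    if not LONG_MIN <= value <= LONG_MAX:
        raise ValueError("%d does not fit in a C long (%d bits)"
                         % (value, _LONG_BITS))
    return ctypes.c_long(value)

print(to_c_long(LONG_MAX).value == LONG_MAX)   # True: largest value survives intact
print(ctypes.c_long(LONG_MAX + 1).value)       # wraps around to LONG_MIN
```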

For example:

from operator import index  # calls obj.__index__()

is_slice = isinstance(key, slice)

if is_slice:
    start, stop, step = key.indices(self.nCases)  # may raise TypeError
else:
    start = index(self.nCases + key if key < 0 else key)  # may
raise TypeError
    stop = start + 1
    step = 1

if stop > 2 ** (ctypes.sizeof(ctypes.c_long) * 8 - 1):
    raise ValueError('useful message')

records = []
for i in range(start, stop, step):
    retcode1 = self.iomodule.SeekNextCase(self.fh, ctypes.c_long(i))
    self.caseBuffer, self.caseBufferPtr = self.getCaseBuffer()
    retcode2 = self.iomodule.WholeCaseIn(self.fh, self.caseBufferPtr)
    record = struct.unpack(self.structFmt, self.caseBuffer.raw)
    if any([retcode1, retcode2]):
        raise RuntimeError("Error retrieving record %d [%s, %s]" %
                           (i, retcodes[retcode1], retcodes[retcode2]))
    records.append(record)

if not is_slice:
    records = records[0]
return records
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] is this use or abuse of __getitem__ ?

2012-09-15 Thread eryksun
On Sat, Sep 15, 2012 at 4:43 AM, eryksun eryk...@gmail.com wrote:

 else:
     start = index(self.nCases + key if key < 0 else key)  # may
 raise TypeError
     stop = start + 1
     step = 1


Gmail is such a pain sometimes. I should have called index first anyway:

key = index(key)  # may raise TypeError
start = key + self.nCases if key < 0 else key
stop = start + 1
step = 1


 records = []
 for i in range(start, stop, step):
     ...
     records.append(record)


You can boost the performance here a bit by caching the append method.
This avoids a LOAD_ATTR operation on each iteration:

records = []
append = records.append
for i in range(start, stop, step):
    ...
    append(record)
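
If you want to measure this rather than take it on faith, a rough timeit harness along these lines would do. The function names and loop sizes are arbitrary choices for the example, and the numbers depend heavily on the Python version, so no output is shown:

```python
import timeit

def build_plain(n):
    records = []
    for i in range(n):
        records.append(i)      # attribute lookup on every iteration
    return records

def build_cached(n):
    records = []
    append = records.append    # look the bound method up once
    for i in range(n):
        append(i)
    return records

if __name__ == "__main__":
    for func in (build_plain, build_cached):
        seconds = timeit.timeit(lambda: func(10000), number=200)
        print("%-12s %.4f s" % (func.__name__, seconds))
```

Both functions build the same list; only the lookup of append differs.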


Re: [Tutor] is this use or abuse of __getitem__ ?

2012-09-15 Thread Albert-Jan Roskam
On Sat, Sep 15, 2012 at 4:43 AM, eryksun eryk...@gmail.com wrote:

     else:
         start = index(self.nCases + key if key < 0 else key)  # may
 raise TypeError
         stop = start + 1
         step = 1


Gmail is such a pain sometimes. I should have called index first anyway:



        key = index(key)  # may raise TypeError
        start = key + self.nCases if key < 0 else key
        stop = start + 1
        step = 1


Thanks, I hadn't noticed this yet. I am refactoring some of the rest of my code 
and I hadn't run anything yet. My code has two methods that return record(s): 
an iterator (__getitem__) and a generator (readFile, which is also called by 
__enter__). Shouldn't I also take the possibility of a MemoryError into account 
when the caller does something like data[:10**8]? It may no longer fit into 
memory, esp. when the dataset is also wide.



     records = []
     for i in range(start, stop, step):
         ...
         records.append(record)


You can boost the performance here a bit by caching the append method.
This avoids a LOAD_ATTR operation on each iteration:

    records = []
    append = records.append
    for i in range(start, stop, step):
        ...
        append(record)


I knew that trick from 
http://wiki.python.org/moin/PythonSpeed/PerformanceTips#Avoiding_dots... but I 
didn't know about LOAD_ATTR. Is a list comprehension still faster than this? 
Does it also mean that e.g. from ctypes import * (--> c_long()) is faster 
than import ctypes (--> ctypes.c_long())? I am now putting as much as 
possible in __init__. I don't like the first way of importing at all.



Re: [Tutor] is this use or abuse of __getitem__ ?

2012-09-15 Thread eryksun
On Sat, Sep 15, 2012 at 10:18 AM, Albert-Jan Roskam fo...@yahoo.com wrote:

 Thanks, I hadn't noticed this yet. I am refactoring some of the rest of my 
 code
 and I hadn't run anything yet. My code has two methods that return record(s):
 an iterator (__getitem__) and a generator (readFile, which is also called by
 __enter__). Shouldn't I also take the possibility of a MemoryError into
 account when the caller does something like data[:10**8]? It may no longer fit
 into memory, esp. when the dataset is also wide.

The issue with c_long isn't a problem for a slice since
key.indices(self.nCases) limits the upper bound. For the individual
index you had it right the first time by raising IndexError before it
even gets to the c_long conversion. I'm sorry for wasting your time on
a non-problem. However, your test there is a bit off. A negative index
can be -nCases since counting from the end starts at -1. If you first
do the ternary check to add the offset to a negative index, afterward
you can raise an IndexError if not 0 <= value < nCases.
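
Spelled out as a standalone sketch (normalize_index is a hypothetical helper name; the last line shows how slice.indices() clamps an oversized stop on its own):

```python
import operator

def normalize_index(key, n_cases):
    """Hypothetical helper: map a possibly negative index onto 0..n_cases-1."""
    key = operator.index(key)              # TypeError for floats, strings, ...
    if key < 0:
        key += n_cases                     # -1 is the last record
    if not 0 <= key < n_cases:
        raise IndexError("record index out of range")
    return key

n = 10
print(normalize_index(-n, n))              # 0: -nCases is the first record
print(normalize_index(-1, n))              # 9
print(slice(None, 10**8).indices(n))       # (0, 10, 1): stop is clamped to n
```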

As to MemoryError, dealing with gigabytes of data in main memory is
not a problem I've come up against in practice. You might still want a
reasonable upper bound for slices. Often when the process runs out of
memory it won't even see a MemoryError. The OS simply kills it. On the
other hand, while bugs like a c_long wrapping around need to be caught
to prevent silent corruption of data, there's nothing at all silent
about crashing the process. It's up to you how much you want to
micromanage the situation. You might want to check out psutil as a
cross-platform way to monitor the process memory usage:

http://code.google.com/p/psutil
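
Independently of psutil, one way to impose an upper bound on slices is to count the selected records before reading any of them. This is only a sketch: MAX_SLICE_RECORDS and check_slice_size are names I made up, and the limit is an arbitrary example value:

```python
MAX_SLICE_RECORDS = 10 ** 6   # arbitrary example limit, not a recommendation

def check_slice_size(key, n_cases, limit=MAX_SLICE_RECORDS):
    """Hypothetical guard: return (start, stop, step), or raise if too large."""
    start, stop, step = key.indices(n_cases)
    # Count the indices the slice would produce without building a list.
    if step > 0:
        count = max(0, (stop - start + step - 1) // step)
    else:
        count = max(0, (start - stop - step - 1) // (-step))
    if count > limit:
        raise ValueError("slice selects %d records, limit is %d"
                         % (count, limit))
    return start, stop, step
```

Whether to raise, or to steer the caller toward iteration instead, is a design choice for your class.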

If you're also supporting the iterator protocol with the __iter__
method, then I think a helper _items(start, stop, step) generator
function would be a good idea.

Here's an updated example (not tested however; it's just a suggestion):


import ctypes
import operator
import struct

def _items(self, start=0, stop=None, step=1):
    if stop is None:
        stop = self.nCases

    for i in range(start, stop, step):
        retcode1 = self.iomodule.SeekNextCase(self.fh, ctypes.c_long(i))
        self.caseBuffer, self.caseBufferPtr = self.getCaseBuffer()
        retcode2 = self.iomodule.WholeCaseIn(self.fh, self.caseBufferPtr)
        record = struct.unpack(self.structFmt, self.caseBuffer.raw)
        if any([retcode1, retcode2]):
            raise RuntimeError("Error retrieving record %d [%s, %s]" %
                               (i, retcodes[retcode1], retcodes[retcode2]))
        yield record


def __iter__(self):
    return self._items()


def __getitem__(self, key):

    is_slice = isinstance(key, slice)

    if is_slice:
        start, stop, step = key.indices(self.nCases)
    else:
        key = operator.index(key)
        start = key + self.nCases if key < 0 else key
        if not 0 <= start < self.nCases:
            raise IndexError
        stop = start + 1
        step = 1

    records = self._items(start, stop, step)
    if is_slice:
        return list(records)
    return next(records)
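
Since the code above can't run without the C library, here is a self-contained toy version of the same layout on top of a plain list. ListBackedReader and its fake records are stand-ins invented for the example; only the _items/__iter__/__getitem__ structure mirrors the suggestion:

```python
import operator

class ListBackedReader(object):
    """Toy stand-in for the file reader: records live in a list, not a file."""

    def __init__(self, records):
        self._records = list(records)
        self.nCases = len(self._records)

    def _items(self, start=0, stop=None, step=1):
        if stop is None:
            stop = self.nCases
        for i in range(start, stop, step):
            yield self._records[i]       # the real class would seek and unpack here

    def __iter__(self):
        return self._items()

    def __getitem__(self, key):
        if isinstance(key, slice):
            return list(self._items(*key.indices(self.nCases)))
        key = operator.index(key)        # TypeError for floats
        start = key + self.nCases if key < 0 else key
        if not 0 <= start < self.nCases:
            raise IndexError("record index out of range")
        return next(self._items(start, start + 1))

data = ListBackedReader([("a", 1), ("b", 2), ("c", 3)])
print(data[0])        # ('a', 1)
print(data[-3])       # ('a', 1)
print(data[1:])       # [('b', 2), ('c', 3)]
print(data[::-1])     # [('c', 3), ('b', 2), ('a', 1)]
print(list(data))     # all three records
```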


 but I didn't know about LOAD_ATTR.

That's the bytecode operation to fetch an attribute. Whether or not
bypassing it will provide a significant speedup depends on what else
you're doing in the loop. If the single LOAD_ATTR is only a small
fraction of the total processing time, or you're not looping thousands
of times, then this little change is insignificant.


 Is a list comprehension still faster than this?

I think list comprehensions or generator expressions are best if the
evaluated expression isn't too complex and uses built-in types and
functions. I won't typically write a function just to use a list
comprehension for a single statement. Compared to a regular for loop
(especially if append is cached in a fast local), the function call
overhead makes it a wash or worse, even given the comprehension's
efficiency at building the list. If the main work of the loop is the
most significant factor, then the choice of for loop vs list
comprehension doesn't matter much with regard to performance, but I
still think it's simpler to just use a regular for loop. You can also
write a generator function if you need to reuse an iteration in
multiple statements.

 Does it also mean that e.g. from ctypes import * (--> c_long()) is
 faster than import ctypes (--> ctypes.c_long()). I am now putting as much as
 possible in __init__. I don't like the first way of importing at all.

It's not a good idea to pollute your namespace with import *
statements. In a function, you can cache an attribute locally if doing
so will provide a significant speedup. Or you can use a default
argument like this:

def f(x, c_long=ctypes.c_long):
    return c_long(x)

[Tutor] is this use or abuse of __getitem__ ?

2012-09-14 Thread Albert-Jan Roskam
Hi,

I defined a __getitem__ special method in a class that reads a binary data file 
using a C library. The docstring should clarify
the purpose of the method. This works exactly as I intended it, however, the 
key argument is actually used as an index
(it also raises an IndexError when key is greater than the number of records 
in the file). Am I abusing the __getitem__ method, or is this just a creative 
way of using it?


# Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32

    def __getitem__(self, key):
        """ This function reports the record of case number key.
        For example: firstRecord = FileReader(fileName)[0] """
        if not isinstance(key, (int, float)):
            raise TypeError
        if abs(key) > self.nCases:
            raise IndexError
        retcode1 = self.iomodule.SeekNextCase(self.fh, ctypes.c_long(int(key)))
        self.caseBuffer, self.caseBufferPtr = self.getCaseBuffer()
        retcode2 = self.iomodule.WholeCaseIn(self.fh, self.caseBufferPtr)
        record = struct.unpack(self.structFmt, self.caseBuffer.raw)
        if any([retcode1, retcode2]):
            raise RuntimeError, "Error retrieving record %d [%s, %s]" % \
                  (key, retcodes[retcode1], retcodes[retcode2])
        return record
 
Regards,
Albert-Jan


~~
All right, but apart from the sanitation, the medicine, education, wine, public 
order, irrigation, roads, a 
fresh water system, and public health, what have the Romans ever done for us?
~~ 


Re: [Tutor] is this use or abuse of __getitem__ ?

2012-09-14 Thread eryksun
On Fri, Sep 14, 2012 at 8:16 AM, Albert-Jan Roskam fo...@yahoo.com wrote:

 Am I abusing the __getitem__ method, or is this just a creative way of using 
 it?

No, you're using it the normal way. The item to get can be an index, a
key, or even a slice.

http://docs.python.org/reference/datamodel.html#object.__getitem__
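
A quick way to see what arrives in each case (Probe is a throwaway class made up for this illustration):

```python
class Probe(object):
    """Hypothetical class that just returns whatever it was subscripted with."""
    def __getitem__(self, key):
        return key

p = Probe()
print(p[3])          # 3
print(p["name"])     # name
print(p[1:10:2])     # slice(1, 10, 2)
```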

 if not isinstance(key, (int, float)):
     raise TypeError

Instead you could raise a TypeError if not hasattr(key, '__int__')
since later you call int(key).

 if abs(key) > self.nCases:
     raise IndexError

You might also want to support slicing. Here's an example:

http://stackoverflow.com/a/2936876/205580


Re: [Tutor] is this use or abuse of __getitem__ ?

2012-09-14 Thread Steven D'Aprano

On 14/09/12 22:16, Albert-Jan Roskam wrote:

Hi,

I defined a __getitem__ special method in a class that reads a binary data
file using a C library. The docstring should clarify the purpose of the
method. This works exactly as I intended it, however, the key argument is
actually used as an index (it also raises an IndexError when key is
greater than the number of records in the file). Am I abusing the __getitem__
method, or is this just a creative way of using it?


No, that's exactly what __getitem__ is for. It does double-duty for key-lookup
in mappings (dict[key]) and index-lookup in sequences (list[index]).

You can also support ranges of indexes by accepting a slice argument.

Another comment below:



# Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32

 def __getitem__(self, key):
     """ This function reports the record of case number key.
     For example: firstRecord = FileReader(fileName)[0] """
     if not isinstance(key, (int, float)):
         raise TypeError


Floats? Do you actually have case number (for example)
0.14285714285714285 ?

For this case, I think it is reasonable to insist on exactly an int,
and nothing else (except possibly a slice object, to support for
example obj[2:15]).



--
Steven


Re: [Tutor] is this use or abuse of __getitem__ ?

2012-09-14 Thread Albert-Jan Roskam
 On 14/09/12 22:16, Albert-Jan Roskam wrote:

  Hi,
 
  I defined a __getitem__ special method in a class that reads a binary data
  file using a C library. The docstring should clarify the purpose of the
  method. This works exactly as I intended it, however, the key argument is
  actually used as an index (it also raises an IndexError when key is
  greater than the number of records in the file). Am I abusing the
  __getitem__ method, or is this just a creative way of using it?
 
 No, that's exactly what __getitem__ is for. It does double-duty for
 key-lookup in mappings (dict[key]) and index-lookup in sequences
 (list[index]).
 
 You can also support ranges of indexes by accepting a slice argument.

 

COOL! I was already wondering how this could be implemented. Dive into Python 
is pretty exhaustive wrt special methods,
but I don't think they mentioned using the slice class. Below is how I did it. 
Is it recommended to define the getitem() function inside the __getitem__() 
method? I was thinking I could also define a _getitem() private method. Hmmm, 
maybe getitem() is redefined over and over again the way I did it now?


    def __getitem__(self, key):
        """ This function reports the record of case number key.
        For example: firstRecord = SavReader(savFileName)[0] """
        def getitem(key):
            retcode1 = self.iomodule.SeekNextCase(self.fh,
                                                  ctypes.c_long(int(key)))
            self.caseBuffer, self.caseBufferPtr = self.getCaseBuffer()
            retcode2 = self.iomodule.WholeCaseIn(self.fh, self.caseBufferPtr)
            record = struct.unpack(self.structFmt, self.caseBuffer.raw)
            if any([retcode1, retcode2]):
                raise RuntimeError, "Error retrieving record %d [%s, %s]" % \
                      (key, retcodes[retcode1], retcodes[retcode2])
            return record
        if isinstance(key, slice):
            records = [getitem(i) for i in range(*key.indices(self.nCases))]
            return records
        elif hasattr(key, "__int__"):  # isinstance(key, (int, float)):
            if abs(key) > (self.nCases - 1):
                raise IndexError
            else:
                key = self.nCases + key if key < 0 else key
                record = getitem(key)
                return record
        else:
            raise TypeError


 Another comment below:
 
 
  # Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit
  (Intel)] on win32
 
       def __getitem__(self, key):
           """ This function reports the record of case number key.
           For example: firstRecord = FileReader(fileName)[0] """
 
           if not isinstance(key, (int, float)):
               raise TypeError
 
 Floats? Do you actually have case number (for example)
 0.14285714285714285 ?
 
 For this case, I think it is reasonable to insist on exactly an int,
 and nothing else (except possibly a slice object, to support for
 example obj[2:15]).
 

I also accepted floats as a convenience. I had examples in mind like: record = 
data[1.0]. Kind of annoying when this raises a TypeError.
But your example makes it clear why raising such an exception makes sense.

Eryksun, Steven: Thanks!!!

Albert-Jan