[Numpy-discussion] recarray slow?

2010-07-21 Thread wheres pythonmonks
I have a recarray -- the first column is date.

I have the following function to compute the number of unique dates in
my data set:


def byName(): return(len(list(set(d['Date'])) ))

Question:  is the string 'Date' looked up at each iteration?  If so,
this is dumb, but explains my horrible performance.
Or, is there a better way to code the above?

Can I convert this to something indexed by column number and convert
'Date' to column number 0 upfront?  Would this help with speed?

W
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] recarray slow?

2010-07-21 Thread Robert Kern
On Wed, Jul 21, 2010 at 15:12, wheres pythonmonks
wherespythonmo...@gmail.com wrote:
 I have a recarray -- the first column is date.

 I have the following function to compute the number of unique dates in
 my data set:


 def byName(): return(len(list(set(d['Date'])) ))

 Question:  is the string 'Date' looked up at each iteration?  If so,
 this is dumb, but explains my horrible performance.
 Or, is there a better way to code the above?

len(np.unique(d['Date']))

If you can come up with a self-contained example that we can
benchmark, it would help. In my examples, I don't see any hideous
performance, but my examples may be missing some crucially important
detail about your data that is causing your performance problems.
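A self-contained example of the kind requested above might look like the following sketch. The data are an assumption: the field names, the integer-encoded dates, and the array size are made up, since the thread does not say what the real data look like.

```python
import timeit
import numpy as np

# Hypothetical stand-in for the poster's data: a record array whose first
# field is an integer-encoded date. The real dtype is not given in the thread.
rng = np.random.default_rng(0)
n = 100_000
d = np.zeros(n, dtype=[('Date', 'i8'), ('Value', 'f8')]).view(np.recarray)
d['Date'] = 20100000 + rng.integers(0, 365, size=n)
d['Value'] = rng.random(n)

def by_set():
    # Original approach: every element is boxed into a Python object and hashed.
    return len(set(d['Date']))

def by_unique():
    # Suggested approach: sorting and deduplication happen inside NumPy.
    return len(np.unique(d['Date']))

# Both approaches must agree on the count before timing means anything.
assert by_set() == by_unique()
print('set      :', timeit.timeit(by_set, number=5))
print('np.unique:', timeit.timeit(by_unique, number=5))
```

The absolute timings depend on the machine, the dtype of the date field, and the number of distinct values, so they are worth re-measuring on the real data.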

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco


Re: [Numpy-discussion] recarray slow?

2010-07-21 Thread Pauli Virtanen
Wed, 21 Jul 2010 15:12:14 -0400, wheres pythonmonks wrote:

 I have a recarray -- the first column is date.
 
 I have the following function to compute the number of unique dates in
 my data set:
 
 
 def byName(): return(len(list(set(d['Date'])) ))

What this code does is:

1. d['Date']

   Extract an array slice containing the dates. This is fast.

2. set(d['Date'])

   Make copies of each array item, and box them into Python objects. 
   This is slow.

   Insert each of the objects into the set. This is also somewhat slow.

3. list(set(d['Date']))

   Get each item in the set, and insert it into a new list.
   This is somewhat slow, and unnecessary if you only want to
   count.

4. len(list(set(d['Date'])))


So the slowness arises because the code is copying data around, and 
boxing it into Python objects.

You should try using Numpy functions (these don't re-box the data) to do 
this. http://docs.scipy.org/doc/numpy/reference/routines.set.html
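The steps above can be traced on a tiny array. This is only a sketch: the field names and values are invented for illustration.

```python
import numpy as np

# Hypothetical data; field names and dtypes are assumptions.
d = np.array([(20100721, 1.0), (20100721, 2.0), (20100722, 3.0)],
             dtype=[('Date', 'i8'), ('Value', 'f8')]).view(np.recarray)

# Step 1: field access returns a view on the existing buffer, not a copy.
dates = d['Date']
assert np.shares_memory(dates, d)

# Step 2: building a set iterates the array, creating one scalar object
# per element and hashing each of them.
boxed = set(dates)
print({type(x).__name__ for x in boxed})

# Steps 3-4: list() is not needed just to count; len() works on the set.
n_python = len(boxed)

# The same count computed inside NumPy, without per-element boxing.
n_numpy = len(np.unique(dates))
print(n_python, n_numpy)  # prints: 2 2
```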

-- 
Pauli Virtanen



Re: [Numpy-discussion] recarray slow?

2010-07-21 Thread wheres pythonmonks
Thank you very much. I'd better crack open a NumPy reference manual
instead of relying on my Python intuition.



Re: [Numpy-discussion] recarray slow?

2010-07-21 Thread wheres pythonmonks
However: is there an automatic way to convert a named index to a position?

What about looping over tuples of my recarray:

for t in d:
    date = t['Date']


I guess that the above does have to look up 'Date' each time.
But the following does not need the hash lookup for each tuple:

for t in d:
    date = t[0]


Should I create a map from dtype.names, and use that to look up the
index based on the name in advance?  (if I really really want to
factor out the lookup of 'Date')





Re: [Numpy-discussion] recarray slow?

2010-07-21 Thread Pierre GM

On Jul 21, 2010, at 4:22 PM, wheres pythonmonks wrote:

 However: is there an automatic way to convert a named index to a position?
 
 What about looping over tuples of my recarray:
 
 for t in d:
     date = t['Date']


Why don't you use zip?

 for (date, t) in zip(d['Date'], d):

That way, you save repeated calls to __getitem__.
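As a runnable sketch of the zip suggestion, assuming a small made-up record array (the field names and values are not from the thread):

```python
import numpy as np

# Hypothetical data; field names and dtypes are assumptions.
d = np.array([(20100721, 1.0), (20100722, 2.0), (20100723, 3.0)],
             dtype=[('Date', 'i8'), ('Value', 'f8')]).view(np.recarray)

# The field lookup by name happens once, outside the loop.
dates = d['Date']

count = 0
for date, t in zip(dates, d):
    # `date` is already the field value; `t` is the whole record,
    # available if other fields are needed in the same iteration.
    assert date == t['Date']
    count += 1
```

On Python 2, the built-in zip builds a full list first, so itertools.izip would be the lazy equivalent there; on Python 3, zip is already lazy.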

 Should I create a map from dtype.names, and use that to look up the
 index based on the name in advance?  (if I really really want to
 factor out the lookup of 'Date')


Meh. I have a feeling that it won't be very performant.



Re: [Numpy-discussion] recarray slow?

2010-07-21 Thread wheres pythonmonks
What about:

idx_by_name = dict(enumerate(d.dtype.names))

Then I can look up the index of the columns I want before the loop,
and then access by the index during the loop.

- W





Re: [Numpy-discussion] recarray slow?

2010-07-21 Thread Pierre GM

On Jul 21, 2010, at 4:35 PM, wheres pythonmonks wrote:

 What about:
 
 idx_by_name = dict(enumerate(d.dtype.names))
 
 Then I can look up the index of the columns I want before the loop,
 and then access by the index during the loop.

Sure. Why don't you try both approaches, time them, and document the results?
I still bet that manipulating tuples of numbers might be easier and more
performant than juggling with the fields of a numpy.void, but that's a gut
feeling only...


Re: [Numpy-discussion] recarray slow?

2010-07-21 Thread wheres pythonmonks
My code had a bug:

idx_by_name = dict((n,i) for i,n in enumerate(d.dtype.names))
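To make the difference between the two versions concrete, here is a small sketch on made-up data (the field names and values are assumptions):

```python
import numpy as np

# Hypothetical data; field names and dtypes are assumptions.
d = np.array([(20100721, 1.0), (20100722, 2.0)],
             dtype=[('Date', 'i8'), ('Value', 'f8')]).view(np.recarray)

# Earlier (buggy) version: maps position -> name, the wrong direction.
name_by_idx = dict(enumerate(d.dtype.names))

# Corrected version: maps name -> position.
idx_by_name = dict((n, i) for i, n in enumerate(d.dtype.names))

# Look up the position once, before the loop, then index by position.
date_idx = idx_by_name['Date']
for t in d:
    assert t[date_idx] == t['Date']
```

Whether positional access is actually faster than access by name is a separate question that only a measurement can answer.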



On Wed, Jul 21, 2010 at 4:49 PM, Pauli Virtanen p...@iki.fi wrote:
 Wed, 21 Jul 2010 16:22:37 -0400, wheres pythonmonks wrote:
 However: is there an automatic way to convert a named index to a
 position?

 It's not really a named index -- it's a field name. Since the fields of
 an array element can be of different size, they cannot be referred to
 with an array index (in the sense that Numpy understands the concept).

 What about looping over tuples of my recarray:

 for t in d:
     date = t['Date']
     

 I guess that the above does have to lookup 'Date' each time.

 As Pierre said, you can move the lookups outside the loop.

        for date in d['Date']:
            ...

 If you want to iterate over multiple fields, it may be best to use
 itertools.izip so that you unbox a single element at a time.

 However, I'd be quite surprised if the hash lookups would actually take a
 significant part of the run time:

 1) Python dictionaries are ubiquitous and the implementation appears
   heavily optimized to be fast with strings.

 2) The hash of a Python string is cached, and computed only once.

 3) String literals are interned, and represented by a single object only:

    >>> 'Date' is 'Date'
    True

   So when running the above Python code, the hash of 'Date' is computed
   exactly once.

 4) For small dictionaries containing strings, such as the fields
   dictionary, I'd expect 1-3) to be dwarfed by the overhead involved
   in making Python function calls (PyArg_*) and interpreting the
   bytecode.

 So the usual optimization mantra applies here: measure first :)

 Of course, if you measure and show that the expectations 1-4) are
 actually wrong, that's fine.
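One way to do that measurement is sketched below. The array contents, its size, and the function names are assumptions made for illustration; only the three access patterns come from the thread.

```python
import timeit
import numpy as np

# Hypothetical data; field names, dtypes, and size are assumptions.
n = 5_000
d = np.zeros(n, dtype=[('Date', 'i8'), ('Value', 'f8')]).view(np.recarray)
d['Date'] = np.arange(n)

def loop_by_name():
    # Field looked up by name on every record.
    return sum(1 for t in d if t['Date'] >= 0)

def loop_by_position():
    # Field accessed by position on every record.
    return sum(1 for t in d if t[0] >= 0)

def loop_over_field():
    # Field extracted once; the loop iterates over plain scalars.
    return sum(1 for date in d['Date'] if date >= 0)

timings = {}
for func in (loop_by_name, loop_by_position, loop_over_field):
    timings[func.__name__] = timeit.timeit(func, number=3)
    print(func.__name__, timings[func.__name__])
```

The numbers will vary by machine and NumPy version, which is the point of measuring rather than guessing.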

 --
 Pauli Virtanen
