Re: finding most common elements between thousands of multiple arrays.

Scott David Daniels Mon, 06 Jul 2009 06:56:26 -0700

Peter Otten wrote:

Scott David Daniels wrote:

Scott David Daniels wrote:

     t = timeit.Timer('sum(part[:-1]==part[1:])',
                      'from __main__ import part')


What happens if you calculate the sum in numpy? Try

t = timeit.Timer('(part[:-1]==part[1:]).sum()',
                 'from __main__ import part')


Good idea, I hadn't thought of adding numpy bools.
    (part[:-1]==part[1:]).sum()
is only a slight improvement over
    len(part[part[:-1]==part[1:]])
when there are few elements, but it is almost twice
as fast when there are a lot (reflecting the work
of allocating and copying).

>>> import numpy
>>> import timeit
>>> original = numpy.random.normal(0, 100, (1000, 1000)).astype(int)
>>> data = original.flatten()
>>> data.sort()
>>> t = timeit.Timer('sum(part[:-1]==part[1:])',
                     'from __main__ import part')
>>> u = timeit.Timer('len(part[part[:-1]==part[1:]])',
                     'from __main__ import part')
>>> v = timeit.Timer('(part[:-1]==part[1:]).sum()',
                     'from __main__ import part')

>>> part = data[::100]
>>> (part[:-1]==part[1:]).sum()
9390
>>> t.repeat(3, 10)
[0.56368281443587875, 0.55615057220961717, 0.55465764503594528]
>>> u.repeat(3, 1000)
[0.89576580263690175, 0.89276374511291579, 0.8937328626963108]
>>> v.repeat(3, 1000)
[0.24798598704592223, 0.24715431709898894, 0.24498979618920202]
>>>
>>> part = original.flatten()[::100]
>>> (part[:-1]==part[1:]).sum()
27
>>> t.repeat(3, 10)
[0.57576898739921489, 0.56410158274297828, 0.56988248506445416]
>>> u.repeat(3, 1000)
[0.27312186325366383, 0.27315007913011868, 0.27214492344683094]
>>> v.repeat(3, 1000)
[0.28410342655297427, 0.28374053126867693, 0.28318990262732768]
>>>

Net result: go back to former definition of candidates (a number,
not the actual entries), but calculate that number as matches.sum(),
not len(part[matches]).

Now the latest version of this (compressed) code:
>     ...
>     sampled = data[::stride]
>     matches = sampled[:-1] == sampled[1:]
>     candidates = sum(matches) # count identified matches
>     while candidates > N * 10: # 10 -- heuristic
>         stride *= 2 # # heuristic increase
>         sampled = data[::stride]
>         matches = sampled[:-1] == sampled[1:]
>         candidates = sum(matches)
>     while candidates < N * 3: # heuristic slop for long runs
>         stride //= 2 # heuristic decrease
>         sampled = data[::stride]
>         matches = sampled[:-1] == sampled[1:]
>         candidates = sum(matches)
>     former = None
>     past = 0
>     for value in sampled[matches]:
>         ...
is:
      ...
      sampled = data[::stride]
      matches = sampled[:-1] == sampled[1:]
      candidates = matches.sum() # count identified matches
      while candidates > N * 10: # 10 -- heuristic
          stride *= 2 # # heuristic increase
          sampled = data[::stride]
          matches = sampled[:-1] == sampled[1:]
          candidates = matches.sum()
      while candidates < N * 3: # heuristic slop for long runs
          stride //= 2 # heuristic decrease
          sampled = data[::stride]
          matches = sampled[:-1] == sampled[1:]
          candidates = matches.sum()
      former = None
      past = 0
      for value in sampled[matches]:
          ...

Now I think I can let this problem go, esp. since it was
mclovin's problem in the first place.

--Scott David Daniels
scott.dani...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list

Re: finding most common elements between thousands of multiple arrays.

Reply via email to