Re: help with making my code more efficient

Mitya Sirenef Thu, 20 Dec 2012 18:53:00 -0800

On 12/20/2012 09:39 PM, Mitya Sirenef wrote:

On 12/20/2012 08:46 PM, larry.mart...@gmail.com wrote:
On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
On 12/20/2012 07:19 PM, larry.mart...@gmail.com wrote:
I have a list of tuples that contains a tool_id, a time, and amessage. I want to select from this list all the elements where themessage matches some string, and all the other elements where thetime is within some diff of any matching message for that tool.
Here is how I am currently doing this:
No, it's not.  This is a fragment of code, without enough clues as to

what else is going.  We can guess, but that's likely to make a mess.
Of course it's a fragment - it's part of a large program and I wasjust showing the relevant parts.
First question is whether this code works exactly correctly?
Yes, the code works. I end up with just the rows I want.
Are you only concerned about speed, not fixing features?
Don't know what you mean by 'fixing features'. The code does what Iwant, it just takes too long.
As far as I can tell, the logic that includes the time comparison isbogus.
Not at all.
You don't do anything there to worry about the value of tup[2],just whether some
item has a nearby time.  Of course, I could misunderstand the spec.
The data comes from a database. tup[2] is a datetime column. tdiffcomes from a datetime.timedelta()
Are you making a global called 'self' ? That name is by convention only
used in methods to designate the instance object.  What's the attribute
self?
Yes, self is my instance object. self.message contains the string ofinterest that I need to look for.
Can cdata have duplicates, and are they significant?
No, it will not have duplicates.
Are you including the time building that as part of your 2 hourmeasurement?
No, the 2 hours is just the time to run the

cdata[:] = [tup for tup in cdata if determine(tup)]
Is the list sorted in any way?
Yes, the list is sorted by tool and datetime.
Chances are your performance bottleneck is the doubly-nested loop.  You
have a list comprehension at top-level code, and inside it calls a
function that also loops over the 600,000 items. So the inner loopgets
executed 360 billion times.  You can cut this down drastically by some
judicious sorting, as well as by having a map of lists, where themap is
keyed by the tool.
Thanks. I will try that.
# record time for each message matching the specified message foreach tool
messageTimes = {}
You're building a dictionary;  are you actually using the value (1), or
  is only the key relevant?
Only the keys.
A set is a dict without a value.
Yes, I could use a set, but I don't think that would make itmeasurably faster.
But more mportantly, you never look up anything in this dictionary.So why
isn't it a list?  For that matter, why don't you just use the
messageTimes list?
Yes, it could be a list too.
for row in cdata:   # tool, time, message
     if self.message in row[2]:
         messageTimes[row[0], row[1]] = 1
# now pull out each message that is within the time diff for eachmatched message
# as well as the matched messages themselves
def determine(tup):
     if self.message in tup[2]: return True      # matched message
     for (tool, date_time) in messageTimes:
         if tool == tup[0]:
             if abs(date_time-tup[1]) <= tdiff:
                return True
     return False
         cdata[:] = [tup for tup in cdata if determine(tup)]
As the code exists, there's no need to copy the list.  Just do a simple
bind.
This statement is to remove the items from cdata that I don't want. Idon't know what you mean by bind. I'm not familiar with that pythonfunction.
This code works, but it takes way too long to run - e.g. when cdatahas 600,000 elements (which is typical for my app) it takes 2 hoursfor this to run.
Can anyone give me some suggestions on speeding this up?
This code probably is not faster but it's simpler and may be easierfor you to work with
to experiment with speed-improving changes:


diffrng = 1

L = [
     # id, time, string
     (1, 5, "ok"),
     (1, 6, "ok"),
     (1, 7, "no"),
     (1, 8, "no"),
     ]

match_times = [t[1] for t in L if "ok" in t[2]]

def in_range(timeval):
    return bool( min([abs(timeval-v) for v in match_times]) <= diffrng )

print([t for t in L if in_range(t[1])])


But it really sounds like you could look into optimizing the db
query and db indexes, etc.


Actually, it might be slower.. this version of in_range should be better:

def in_range(timeval):
    return any( abs(timeval-v) <= diffrng for v in match_times )



--
Lark's Tongue Guide to Python: http://lightbird.net/larks/

--
http://mail.python.org/mailman/listinfo/python-list

Re: help with making my code more efficient

Reply via email to