On Friday, December 21, 2012 11:47:10 PM UTC-7, Dave Angel wrote: > On 12/21/2012 11:47 PM, larry.mart...@gmail.com wrote: > > On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote: > >> On 12/21/2012 03:36 PM, larry.mart...@gmail.com wrote: > >> > >>>> <snip > > I think you're misunderstanding what I need to do. I have a set of rows > > from the database with tool, time, and message. The user has specified a > > string and a time threshold. From that set of rows I need to return all the > > rows that contain the user's string and all the other rows that match the > > tool from the matched rows and have a time within the threshold. > > > > cdata has all the rows. messageTimes has the times of all the matched > > messages, keyed by tool. In determine() I don't look though cdata - it gets > > one element from cdata and I see if that should be selected because it > > either matches the user's message, or it's within the threshold of one that > > did match. > > > > Here's my original code: > > > > # record time for each message matching the specified message for each tool > > messageTimes = {} > > for row in cdata: # tool, time, message > > if self.message in row[2]: > > messageTimes[row[0], row[1]] = 1 > > > > # now pull out each message that is within the time diff for each matched > > message > > # as well as the matched messages themselves > > > > def determine(tup): > > if self.message in tup[2]: return True # matched message > > > > for (tool, date_time) in messageTimes: > > if tool == tup[0]: > > if abs(date_time-tup[1]) <= tdiff: > > return True > > > > return False > > > > cdata[:] = [tup for tup in cdata if determine(tup)] > > > > Here's the code now: > > > > > > # Parse data and make a list of the time for each message matching > > the specified message for each tool > > messageTimes = defaultdict(list) # a dict with sorted lists > > > > for row in cdata: # tool, time, message > > if self.message in row[2]: > > messageTimes[row[0]].append(row[1]) > > > > # now pull out each message that is within the time context for > > each matched message > > # as well as the matched messages themselves > > > > # return true is we should keep this message > > def determine(tup): > > if self.message in tup[2]: return True # matched message > > if seconds == 0: return False # no time context > > specified > > > > times = messageTimes[tup[0]] # get the list of > > matched messages for this tool > > > > le = bisect.bisect_right(times, tup[1]) # find time less than > > or equal to tup[1] > > ge = bisect.bisect_left(times, tup[1]) # find time greater > > then to equal to tup[1] > > return (le and tup[1]-times[le-1] <= tdiff) or (ge != > > len(times) and times[ge]-tup[1] <= tdiff) > > > > cdata = [tup for tup in cdata if determine(tup)] > > > > In my test case, cdata started with 600k rows, 30k matched the users > > string, and a total of 110k needed to be returned (which is what cdata > > ended up with) - the 30k that matched the string, and 80k that were within > > the time threshold. > > > > I think the point you may have missed is the tool - I only return a row if > > it's the same tool as a matched message and within the threshold. > > > > I hope I've explained this better. Thanks again. > > That is better, and the point I missed noticing before is that > messageTimes is substantially smaller than cdata; it's already been > filtered down by looking for self.message in its row[2]. The code was > there, but I didn't relate. Remember I was bothered that you didn't > look at tup[2] when you were comparing for time-similarity. Well, you > did that implicitly, since messageTimes was already filtered. Sorry > about that. > > That also lowers my expectations for improvement ratio, since instead of > 600,000 * 600,000, we're talking "only" 600,000 * 30,000, 5% as much. So > now my expectations are only 4:1 to 10:1. > > Still, there's room for improvement. (1) You should only need one > bisect in determine, and (2) if you remember the last result for each > tool, you could speed that one up some. > > (1) Instead of getting both and le and a ge, get just one, by searching > for tup[1]-tdiff. Then by comparing that row's value against > tup[1]+tdiff, you can return immediately the boolean, the expression > being about half of the one you've now got.
Dave, I cannot thank you enough. With this change it went from 20 minutes to 10. > (2) Make a dict of ints, keyed by the tool, and initialized to zero. > Call that dict "found." Each time you do a bisect, specify a range > starting at found[tool]. Similarly, store the result of the bisect in > found[tool]. This should gradually restrict the range searched by > bisect, which COULD save some time. It works because everything's sorted. And with this change it took 3 minutes. WOW! Merry Christmas and Happy New Year!!! > Naturally, make these changes independently, and time the changes. In > > particular, it's possible that #2 will degrade performance instead of > > improving it. But #1 should double performance, pretty close. > > > > I hope these were clear enough. I don't want to write the changes > > myself, since with no test data, I could easily make a mess of it. > > .... > > > > I can think of other changes which are less likely to make substantial > > improvement, and which might degrade things. For one drastic example, > > instead of any messageTimes storage, you could have ("flags") a list of > > bool, same size as ctimes, which tracked the state of each line. You > > loop through cdata, identifying and marking any line whose tup[2] > > matches. And for each match, you scan backwards and forwards till the > > time gets out of range (in one direction stopping at time-tdiff or 0 or > > tool change, and in the other direction at time+tdiff or len, or > > toolchange), marking each one within tdiff. > > > > Then after one pass through cdata, you can have a very simple list > > comprehension, something like: > > > > cdata = [tup for index, tup in enumerate(cdata) if flags[index]] > > > > Will it be faster? Depends on the number of expected matches (eg. > > 30,000 out of 300,000 is 10%), and how much data forward and backwards > > you need to scan. > > .... > > > > A very simple difference? Perhaps using implicit unpacking instead of > > using tup[0] etc. will be faster. It'll certainly be easier to read. > > > > for tool, time, text in cdata: > > if self.message in text: > > messageTimes[tool]. append(time) > > > > def determine(tool, time, text): > > > > and call it with determine(*tup) -- http://mail.python.org/mailman/listinfo/python-list