2009/1/9 Daniel Kersten <[email protected]>

> Rory, definitely! The reason I am using lists at all is because the
> messages are similar and I don't know ahead of time which of the
> fields I am interested in - so I either put them in multiple
> dictionaries (sorted by different criteria) or in a list which is
> then searched. I'm not sure if I can use sets or not - how are they
> matched (e.g. for the `in` operation)?

IIRC they've got to implement __hash__
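Something along these lines (an untested sketch - the Message class and
its fields are invented, since I haven't seen your code):

    # Sketch only: an invented Message class, hashable so it works in sets.
    class Message(object):
        def __init__(self, message_type, msg_id):
            self.message_type = message_type
            self.msg_id = msg_id

        def __eq__(self, other):
            if not isinstance(other, Message):
                return NotImplemented
            return (self.message_type, self.msg_id) == \
                   (other.message_type, other.msg_id)

        def __hash__(self):
            # Hash the same fields used for equality, as required.
            return hash((self.message_type, self.msg_id))

    expected = set([Message('ACK', 1), Message('DATA', 2)])
    print Message('ACK', 1) in expected   # True: `in` hashes, then checks ==

The `in` operator hashes the object to find its bucket and then falls
back on __eq__ to confirm the match, so the two methods must agree.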
> Pablo, yes, I imagine that dictionary entries cause loads of hash
> collisions after a certain size... Thanks for the links though - they
> may come in useful. From the third article (not yet read, just
> scanned through), some notes:
> 1) I already use a compiled regular expression, since it's always the
> same.
> 2) I do not need to skip unreadable lines, as I am guaranteed valid
> input.
> 3) I'm not sure if I can easily multithread my application, because
> of the dependencies on other messages, but it's something I'm looking
> at.
> 4) I don't actually believe my bottleneck is IO, so memory mapping
> the file may not help me much - though if need be, I'll look at that
> later.

Re-reading the article, he only got about 0.3 seconds faster by using
mmap - dunno why, but I thought the gain was bigger. Sorry for the
noise :P Anyway, it's an interesting optimization read: from the 8+
minutes the original Erlang version took down to 0.9s in Python (1M
rows).

Pablo

> Gotta read it in more detail though :-)
>
> 2009/1/9 Pablo Martí Gamboa <[email protected]>:
> >
> > 2009/1/9 Daniel Kersten <[email protected]>
> >>
> >> Hi Rory,
> >>
> >> The main slowdown is that messages depend on previous messages, so
> >> I need to match them together. This becomes slow if I have to
> >> search through too many messages.
> >>
> >> It is somewhat inefficient and I am working on optimising the
> >> algorithm as we speak, especially in these areas:
> >> 1) replacing lists with dictionaries, since lists get horribly
> >> slow as they grow.
> >
> > After a certain size, operations on a dict become horribly slow too
> > [0]. I'd go for bsddb if the dict gets too big; it offers the same
> > interface as a dict [1].
> >
> > [2] is an interesting read regarding optimizations applied to a
> > program relatively similar to yours, where a good chunk of the
> > speedup came from using mmap instead of the plain file interface.
> > Multiple processes will help too.
> >
> > [0] http://mail.python.org/pipermail/python-dev/2008-December/084530.html
> > [1] http://docs.python.org/library/bsddb.html
> > [2] http://effbot.org/zone/wide-finder.htm
> >
> > Best regards,
> > Pablo
> >
> >> 2) either processing messages concurrently or doing multiple
> >> checks on a single message concurrently.
> >> 3) Diarmuid suggested using a separate process to prefetch from
> >> the log files; this may work, since then I wouldn't waste time
> >> waiting on disk IO.
> >>
> >> There is one more area which will improve the speed drastically -
> >> I was lucky in that it has been decided to change the tests
> >> slightly; as a side effect, I can now make assumptions I couldn't
> >> before, which will (I think) remove one of the lists I wasn't able
> >> to dictionary-ify.
> >>
> >> Here is what currently happens, in pseudocode:
> >>
> >> for entry in log_file:
> >>     current_message = parse_using_regex(entry)
> >>     expected_messages_for_type =
> >>         list_of_expected_messages[current_message.message_type]
> >>     expected_messages =
> >>         expected_messages_for_type[current_message.id_dependant_on_message_type]
> >>     for message in expected_messages:
> >>         if message == current_message:
> >>             remove_from_expected_messages(message)
> >>             new_messages = get_next_expected_messages(current_message)
> >>             if new_messages is not None:
> >>                 add_to_expected_messages(new_messages)
> >>                 message_from_middle_operation()
> >>             else:
> >>                 operation_completed()
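Something like a dict of sets is what I'd try for that inner scan (a
rough sketch - all the helper names and the tuple layout are invented,
not your actual code):

    from collections import defaultdict

    # Sketch: messages as plain (message_type, msg_id, payload) tuples,
    # which are hashable for free, keyed by (message_type, msg_id) so a
    # match is a dict lookup plus a set test instead of a list scan.
    expected = defaultdict(set)

    def add_expected(msg):
        expected[(msg[0], msg[1])].add(msg)

    def pop_if_expected(msg):
        bucket = expected.get((msg[0], msg[1]))
        if bucket and msg in bucket:
            bucket.remove(msg)
            return True
        return False

    add_expected(('ACK', 7, 'payload'))
    print pop_if_expected(('ACK', 7, 'payload'))   # True
    print pop_if_expected(('ACK', 7, 'payload'))   # False, already matched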
> >>
> >> One thing to note is that the last line may add more than one
> >> message (and the remove should remove them all), because some
> >> messages can expect one of a selection of alternatives - this
> >> behaviour can now be removed, I believe.
> >>
> >> There's more going on than that. In fact, there's quite a lot of
> >> complexity added because messages from different "operations" are
> >> all almost the same, so determining the next expected message(s)
> >> requires some convoluted logic. Going to try to figure out a
> >> better way, especially as I don't think I'll need to deal with
> >> alternative messages in one operation anymore.
> >>
> >> Hope that explained it somewhat. If you have any obvious
> >> suggestions on speeding up the code further - please share!
> >> Unfortunately, I can't share the code. The code itself is
> >> shareable, really, but the data operated on (messages and
> >> operations) is not, and it could be inferred from the code.
> >>
> >> Still, the point of my email was to show that even a naive use of
> >> Cython (compiling Python code without modification) can give quite
> >> a good speed increase! (I would have simply written "hey, I used
> >> Cython and my code is faster", but I figured some background would
> >> be nice.)
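For anyone who wants to repeat the experiment, the usual recipe is a
tiny setup.py - "report" is a made-up module name here, just your
unmodified .py renamed to .pyx:

    # setup.py -- compile an unmodified Python module with Cython.
    # 'report.pyx' is an invented name: it is just report.py renamed.
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Distutils import build_ext

    setup(
        cmdclass={'build_ext': build_ext},
        ext_modules=[Extension('report', ['report.pyx'])],
    )

Build with `python setup.py build_ext --inplace`, and `import report`
then picks up the compiled extension instead of the .py.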
> >>
> >> 2009/1/9 Rory Geoghegan <[email protected]>:
> >> >
> >> > Not to pick at you, but an hour and a half sounds a bit long for
> >> > 171k entries. If you did not write the code for corporate
> >> > interests, would you mind sharing it, on something like GitHub
> >> > or Bitbucket?
> >> >
> >> > I mean, I know nothing of a) the problem you are trying to solve
> >> > or b) the constraints you are facing in programming a solution,
> >> > so I will obstinately hold to my view that there must be a
> >> > problem with either the algorithm you employ or the way you've
> >> > coded it, even in the face of precise and clear evidence.
> >> >
> >> > --Rory
> >> >
> >> > On Fri, Jan 9, 2009 at 2:01 PM, Daniel Kersten
> >> > <[email protected]> wrote:
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I wrote a Python program to generate a report from log files.
> >> >> The log files are generated by a test suite used to test a Java
> >> >> program. The report gives me details about how many messages
> >> >> were processed (it's a message-based program being tested), how
> >> >> long each operation took (an operation may consist of
> >> >> processing one or more messages), counts of which operations
> >> >> passed or didn't pass, messages processed per second, etc. The
> >> >> idea is that the main program can be run over a weekend or a
> >> >> week or whatever, and the log files from the test suite are
> >> >> checked by my Python program.
> >> >>
> >> >> The log files can be huge.
> >> >>
> >> >> Yesterday, I ran my program on a log file with 171K entries -
> >> >> it took an hour and a half! (This is why I'm interested in the
> >> >> speedup patch.) There are some algorithmic changes which would
> >> >> be beneficial, but they would require significant code
> >> >> restructuring which, right now, I don't have time for. So I'm
> >> >> looking for simpler ways.
> >> >>
> >> >> I decided to give Cython (cython.org) a shot, since it compiles
> >> >> Python code to C. It supports almost all of Python's
> >> >> constructs, the only major limitation (IMHO - that is, the only
> >> >> feature I really use which Cython does not support) being
> >> >> nested functions and lambdas. Removing them from my code slowed
> >> >> it down a small bit, because one of the functions accessed a
> >> >> variable from the enclosing scope, so I couldn't simply move it
> >> >> into the global scope - and I couldn't pass the variable as an
> >> >> argument, because I was storing the function as a callback.
> >> >> Besides that, I made NO other changes to my Python code.
> >> >>
> >> >> The code that took 1 hour and 32 minutes to execute with the
> >> >> pure Python version completed in 48 minutes!!
> >> >>
> >> >> This can be improved more still, by strategically declaring
> >> >> functions and variables as C types.
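To give a flavour of what those C declarations look like, here is an
invented example (not from Daniel's program) - a .pyx file can mix
typed and ordinary Python code:

    # sketch.pyx -- invented example of Cython's optional static typing.
    # The cdef-typed locals let the hot loop compile down to plain C.
    def count_over(list values, int threshold):
        cdef int i, n, count
        n = len(values)
        count = 0
        for i in range(n):
            if values[i] > threshold:
                count += 1
        return count

Only the hot spots need annotations; the rest of the module can stay
as ordinary Python.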
> >> >>
> >> >> Just thought I'd share, in case someone else needs more
> >> >> performance out of their Python and doesn't know where to turn.
> >> >>
> >> >> --
> >> >> Daniel Kersten.
> >> >> Leveraging dynamic paradigms since the synergies of 1985.
> >>
> >> --
> >> Daniel Kersten.
> >> Leveraging dynamic paradigms since the synergies of 1985.
> >
> > --
> > Pablo Martí
> > http://www.linkedin.com/in/pmarti || http://www.warp.es
> > python -c "print '706d6172746940776172702e6573'.decode('hex')"
>
> --
> Daniel Kersten.
> Leveraging dynamic paradigms since the synergies of 1985.

--
Pablo Martí
http://www.linkedin.com/in/pmarti || http://www.warp.es
python -c "print '706d6172746940776172702e6573'.decode('hex')"