Hi,

So, I decided I'd rewrite the program. It's much improved this time, though I now make assumptions that I did not before, which greatly reduces the complexity of the task and allows for a giant speed increase. I no longer need lists of any sort and can get away with just dictionaries. Also, because of those assumptions, I can be much more aggressive about evicting old entries.
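Since I still can't share the real code, here is roughly the shape of it now. The pattern, the keys and the eviction policy below are invented for illustration (the real code keys off the actual message fields), but the general idea is the same: matching a message is a single dictionary lookup instead of a scan, and eviction is a cheap periodic sweep.

    import re

    # Illustrative sketch only: the pattern, the keys and the eviction policy are
    # invented for the example; the real code keys off the actual log fields.
    ENTRY = re.compile(r"\w+ (?P<id>\d+)")

    MAX_AGE = 100000   # evict anything still unmatched after this many entries

    def process(lines):
        expected = {}   # message id -> line number we started waiting at
        completed = 0
        for line_no, raw in enumerate(lines):
            key = ENTRY.match(raw).group("id")

            if key in expected:          # plain dict lookup, no list scanning
                del expected[key]
                completed += 1
            else:
                expected[key] = line_no  # start waiting for the follow-up message

            if line_no % MAX_AGE == 0:   # aggressive eviction of stale entries
                for k, added_at in list(expected.items()):
                    if line_no - added_at > MAX_AGE:
                        del expected[k]
        return completed, expected

    done, pending = process(["REQ 1", "REQ 2", "RSP 1"])
    print "%d completed, %d still pending" % (done, len(pending))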
The rewritten, pure Python solution, running without psyco, completes the same dataset in 101 seconds. With Cython it completes in 98 seconds, and with psyco it actually takes longer than without: 106 seconds on average.

I guess in this case Cython isn't a very big improvement, but I imagine that with a 100-gig log file the seconds will accumulate to something worthwhile. My conclusion from all this is that Cython is very good at optimising crude and naive Python code, but not quite as good at optimising already-fast Python code. Though more experience is needed, I guess. I'm not sure why psyco was slower; time for some profiling to see what's eating the cycles - it must be something psyco cannot easily JIT.
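For anyone who wants to try the same comparison on their own code: the psyco side is just `import psyco; psyco.full()` at the top of the script, and the Cython side needs nothing fancy either. A throwaway setup.py along these lines does the job - this is the stock Cython distutils recipe, not necessarily exactly what I have, and "logreport" is an invented module name (the .pyx file is just the ordinary .py file renamed):

    # setup.py -- stock Cython/distutils recipe; module name invented for the example
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Distutils import build_ext

    setup(
        cmdclass={'build_ext': build_ext},
        ext_modules=[Extension("logreport", ["logreport.pyx"])],
    )

Build with `python setup.py build_ext --inplace` and then import the module as usual.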
2009/1/12 Pablo Martí Gamboa <[email protected]>:
>
> 2009/1/9 Daniel Kersten <[email protected]>
>>
>> Rory, definitely! The reason I am using lists at all is because the
>> messages are similar and I don't know ahead of time which of the
>> fields I am interested in - so I either put them in multiple
>> dictionaries (sorted by different criteria) or in a list which is then
>> searched. I'm not sure if I can use sets or not - how are they matched
>> (e.g. for the `in` operation)?
>
> IIRC they've got to implement __hash__
>
>> Pablo, yes, I imagine that dictionary entries cause loads of hash
>> collisions after a certain size... Thanks for the links though - they may
>> come in useful. From the third article (not yet read, just scanned
>> through), some notes:
>> 1) I already use a compiled regular expression, since it's always the same.
>> 2) I do not need to skip unreadable lines, as I am guaranteed valid input.
>> 3) I'm not sure if I can easily multithread my application, because of
>> the dependencies on other messages, but it's something I'm looking at.
>> 4) I don't actually believe my bottleneck is IO, so memory-mapping the
>> file may not help me much - though if need be, I'll look at that later.
>
> Re-reading the article, he only got about 0.3 seconds faster by using mmap;
> dunno why, but I believed it was bigger. Sorry for the noise :P
> Anyway, it's an interesting optimization read, going from the 8+ minutes the
> original version written in Erlang took down to 0.9s in Python (1M rows).
>
> Pablo
>
>> Gotta read it in more detail though :-)
>>
>> 2009/1/9 Pablo Martí Gamboa <[email protected]>:
>> >
>> > 2009/1/9 Daniel Kersten <[email protected]>
>> >>
>> >> Hi Rory,
>> >>
>> >> The main slowdown is that messages depend on previous messages, so I
>> >> need to match them together. This becomes slow if I have to search
>> >> through too many messages.
>> >>
>> >> It is somewhat inefficient and I am working on optimising the
>> >> algorithm as we speak, especially in these areas:
>> >> 1) replacing lists with dictionaries, since lists get horribly slow
>> >> as they grow.
>> >
>> > After a certain size, operations on a dict become horribly slow too [0].
>> > I'd go for bsddb if the dict gets too big; it offers the same interface
>> > as a dict [1].
>> >
>> > [2] is an interesting read regarding optimizations applied to a program
>> > relatively similar to yours, where a good chunk of the speedup came from
>> > using mmap instead of the plain file interface. Also, multiple processes
>> > will help too.
>> >
>> > [0] http://mail.python.org/pipermail/python-dev/2008-December/084530.html
>> > [1] http://docs.python.org/library/bsddb.html
>> > [2] http://effbot.org/zone/wide-finder.htm
>> >
>> > Best regards,
>> > Pablo
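Side note on the bsddb suggestion above: as far as I can tell it really does behave like a dictionary restricted to string keys and values, so something like this should slot in with minimal changes (the file name here is made up):

    import bsddb

    db = bsddb.hashopen("expected.db", "c")   # "expected.db" is an invented file name
    db["REQ-1234"] = "waiting"                # keys and values must be strings
    print db.has_key("REQ-1234")              # same membership idea as a dict
    del db["REQ-1234"]
    db.close()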
>> >> 2) either processing messages concurrently or doing multiple checks
>> >> on a single message concurrently.
>> >> 3) Diarmuid suggested perhaps using a separate process to prefetch from
>> >> the log files; this may work since I wouldn't waste time waiting on disk IO.
>> >>
>> >> There is one more area which will improve the speed drastically. I was
>> >> lucky in that it has been decided to change the tests slightly; as a side
>> >> effect, I can now make assumptions I couldn't before which will (I think)
>> >> remove one of the lists which I wasn't able to dictionary-ify.
>> >>
>> >> Here is what currently happens, in pseudocode:
>> >>
>> >> for entry in log_file:
>> >>     current_message = parse_using_regex(entry)
>> >>     expected_messages_for_type = list_of_expected_messages[current_message.message_type]
>> >>     expected_messages = expected_messages_for_type[current_message.id_dependant_on_message_type]
>> >>     for message in expected_messages:
>> >>         if message == current_message:
>> >>             remove_from_expected_messages(message)
>> >>             new_messages = get_next_expected_messages(current_message)
>> >>             if new_messages is not None:
>> >>                 add_to_expected_messages(new_messages)
>> >>                 message_from_middle_operation()
>> >>             else:
>> >>                 operation_completed()
>> >>
>> >> One thing to note is that the last line may add more than one message
>> >> (and the remove should remove them all), because some messages can expect
>> >> one of a selection of alternatives - this can now be removed, I believe.
>> >>
>> >> There's more going on than that. In fact, there's quite a lot of
>> >> complexity added because messages from different "operations" are all
>> >> almost the same, so determining the next expected message(s) requires
>> >> some convoluted logic. I'm going to try to figure out a better way,
>> >> especially as I don't think I'll need to deal with alternative messages
>> >> in one operation anymore.
>> >>
>> >> Hope that explained it somewhat. If you have any obvious suggestions
>> >> on speeding up the code further, please share! Unfortunately, I can't
>> >> share the code. The code itself is shareable really, but the data
>> >> operated on (messages and operations) is not, and it could be inferred
>> >> from the code.
>> >>
>> >> Still, the point of my email was to show that even a naive use of
>> >> Cython (compiling Python code without modification) can give quite a
>> >> good speed increase! (I would have simply written "hey, I used Cython
>> >> and my code is faster" but I figured some background would be nice.)
>> >>
>> >> 2009/1/9 Rory Geoghegan <[email protected]>:
>> >> >
>> >> > Not to pick at you, but an hour and a half sounds a bit long for 171k
>> >> > entries. If you did not write the code for corporate interests, would
>> >> > you mind sharing it, on something like github or bitbucket?
>> >> >
>> >> > I mean, I know nothing of a) the problem you are trying to solve or
>> >> > b) the constraints you are facing in programming a solution, so I will
>> >> > obstinately hold to my view that there must be a problem with either
>> >> > the algorithm you employ or the way you've coded it, even in the face
>> >> > of precise and clear evidence.
>> >> >
>> >> > --Rory
>> >> >
>> >> > On Fri, Jan 9, 2009 at 2:01 PM, Daniel Kersten <[email protected]> wrote:
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> I wrote a Python program to generate a report from log files. The log
>> >> >> files are generated by a test suite used to test a Java program. The
>> >> >> report gives me details about how many messages were processed (it's
>> >> >> a message-based program being tested), how long each operation (which
>> >> >> may consist of processing one or more messages) took, counts of which
>> >> >> operations passed or didn't pass, messages processed per second, etc.
>> >> >> The idea is that the main program can be run over a weekend or a week
>> >> >> or whatever, and the log files from the test suite are checked by my
>> >> >> Python program.
>> >> >>
>> >> >> The log files can be huge.
>> >> >>
>> >> >> Yesterday, I ran my program on a log file with 171K entries - it took
>> >> >> an hour and a half! (This is why I'm interested in the speedup patch.)
>> >> >> There are some algorithmic changes which would be beneficial, but that
>> >> >> would require significant code restructuring which, right now, I don't
>> >> >> have time for. So I'm looking for simpler ways.
>> >> >>
>> >> >> I decided to give Cython (cython.org) a shot, since it compiles Python
>> >> >> code to C. It supports almost all of Python's constructs, the only
>> >> >> major limitation (IMHO - that is, the only feature I really use which
>> >> >> Cython does not support) being nested functions and lambdas. Removing
>> >> >> them from my code slowed it down a small bit, because one of my
>> >> >> functions accessed a variable from the outer scope, so I couldn't
>> >> >> simply move it into the global scope, and I couldn't pass it as an
>> >> >> argument because I was storing the function as a callback.
>> >> >> Besides that, I made NO other changes to my Python code.
>> >> >>
>> >> >> The code that took 1 hour and 32 minutes to execute with the pure
>> >> >> Python version completed in 48 minutes!!
>> >> >>
>> >> >> This can be improved more still, by strategically declaring functions
>> >> >> and variables as C types.
>> >> >>
>> >> >> Just thought I'd share, in case someone else needs more performance
>> >> >> out of their Python and doesn't know where to turn.
>> >> >>
>> >> >> --
>> >> >> Daniel Kersten.
>> >> >> Leveraging dynamic paradigms since the synergies of 1985.

--
Daniel Kersten.
Leveraging dynamic paradigms since the synergies of 1985.
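P.S. For anyone wondering what "strategically declaring functions and variables as C types" looks like in practice, here is an invented example (not from the report code); Cython lets you annotate just the hot spots like this:

    # timings.pyx -- invented example, not from the report code
    def count_long_operations(durations, double threshold):
        # 'durations' stays an ordinary Python iterable; only the locals get C types
        cdef long count = 0
        cdef double d
        for d in durations:
            if d > threshold:
                count += 1
        return count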
