Stephen Nelson-Smith wrote:
Hi,

On Wed, Nov 11, 2009 at 10:05 AM, Alan Gauld <alan.ga...@btinternet.com> wrote:
"Stephen Nelson-Smith" <sanel...@gmail.com> wrote
I don't really want to admit defeat and have a cron job sort the logs
before entry.  Anyone got any other ideas?
Why would that be admitting defeat?

Well, it means admitting defeat on solving the problem in Python.  Yes
in practical terms, I should probably preprocess the data, but as a
programming exercise, learning how to sort a number of files into one
is something I'd like to crack.

Maybe the real lesson here is knowing which battles to fight, and a
good developer uses the right tools for the job.

S.

That's certainly a valuable lesson. On the other hand, I don't think you've run out of useful Python options yet.

The problem as I see it is that your input files are *mostly* sorted, with an occasional out-of-order item. Further, you're not sure how far out of order an item can get; you're not willing to pick a constant like 10 and say that no item will ever have to be moved more than 10 places. Would you feel safe with 100? Or 1000?

If you hand it off to a cron job, the sort will be completely general-purpose, and perhaps not very efficient on a list which is almost in order already. So perhaps you can do better. You might want to time it to be sure, or you might just decide that "good enough" doesn't need measuring.

Back to the fundamental problem. Apparently the max amount of jitter is one second. So let's say you're processing all the 10:01 events, and you encounter a 10:02 one. You don't know whether you've really got all the 10:02 ones till you see your first 10:03. So you need a priority queue combined with a couple of threshold values, in this case 10:01 and 10:03. Fill the queue till you hit the upper threshold, letting the queue grow as needed. So you have all the 10:01 events. Now you feed out your queue into the merge logic, and don't replace items as they are removed, until you have removed the last one at your lower threshold. At this point raise the thresholds by one, and fill it again as before.

Such a routine can be done as a generator function, with the two thresholds simply being local variables in the function. And the generator can then be used in a merge function, just as any other iterator producing a sorted list would be used.
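
Here's a rough sketch of how such a generator might look -- untested, and the name resorted and the key/window parameters are just my own choices; it assumes key() returns a numeric timestamp such as seconds since the epoch:

import heapq

def resorted(events, key, window=1):
    # Yield events from a nearly-sorted iterable in sorted order.
    # Assumes no event's key is ever more than 'window' below a key
    # already seen (one second of jitter in your case), and that the
    # events themselves compare sensibly (log lines are strings).
    heap = []      # the priority queue, holding (key, event) pairs
    lower = None   # the lower threshold: keys <= lower are complete
    for event in events:
        k = key(event)
        if lower is None:
            lower = k
        elif k > lower + window:
            # Upper threshold reached: everything at or below the
            # lower threshold is complete, so feed it out in order.
            while heap and heap[0][0] <= lower:
                yield heapq.heappop(heap)[1]
            lower = k - window            # raise the thresholds
        heapq.heappush(heap, (k, event))
    while heap:                           # input exhausted: drain the rest
        yield heapq.heappop(heap)[1]

Each log file wrapped this way then looks like an ordinary sorted iterator, so the merge itself can be whatever you already have -- heapq.merge() in 2.6+, or your own loop.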

And of course you would need to test such a generator carefully, both with worst-case data and against the live data. If our assumption about worst-case jitter is wrong, the thresholds might need changing.

Finally, consider the possibility that if you pick a fixed jitter size that statistically nearly always works (like 1000 items), then you could just continuously check the output data for ordering as it's being generated. If once every 30 days it comes out wrong, do a massive sort on the result file that day, and figure the speedup on the other 29 days, with simpler code, was worth it. In this case, the iterator is simply a priority queue of fixed size.
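
A sketch of that version too, again untested and with made-up names (window_merge and write_merged); size=1000 is just the statistically-safe guess from above:

import heapq

def window_merge(events, key, size=1000):
    # A priority queue of fixed size: buffer 'size' events, then emit
    # the smallest one each time a new event arrives.  The output is
    # sorted provided no event was more than 'size' places out of order.
    heap = []
    for event in events:
        item = (key(event), event)
        if len(heap) < size:
            heapq.heappush(heap, item)
        else:
            yield heapq.heappushpop(heap, item)[1]
    while heap:                   # end of input: drain the buffer in order
        yield heapq.heappop(heap)[1]

def write_merged(events, out, key, size=1000):
    # Continuously compare the output as it's generated; return False on
    # the rare day the window was too small, so the caller knows to fall
    # back to a full sort of the result file.
    in_order, last = True, None
    for event in window_merge(events, key, size):
        k = key(event)
        if last is not None and k < last:
            in_order = False
        last = k
        out.write(event)
    return in_order

If write_merged() ever returns False, that's the day you pay for the big sort.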

DaveA
