Sorry, I think I said something confusing. Repeatedly reading is inefficient in my case because of the cost of decryption and log line parsing. Compression is usually GOOD in these cases because you are effectively multiplying the disk read rate by the compression ratio (possibly 10 or 20x for log files) at the relatively moderate cost of some CPU cycles.
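To put rough numbers on it (the disk rate and ratio below are illustrative assumptions, not measurements from my cluster):

    # Back-of-envelope: effective read rate of compressed logs.
    disk_read_mb_per_sec = 80.0    # assumed raw sequential read throughput
    compression_ratio = 20.0       # assumed ratio for text log files
    effective = disk_read_mb_per_sec * compression_ratio
    print("effective log throughput: %.0f MB/s" % effective)
    # prints 1600 MB/s of uncompressed log text per disk, provided the CPU
    # can decompress and parse at that rate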
The reason for changing to a different compression type in my case is so that files can be sub-divided. This has two benefits.

The obvious benefit is higher potential parallelism while still keeping the file size large. This is less important if you are rolling your files often, as you say.

The second, less obvious benefit is that you get more efficient load balancing if you can divide your input into 3-5 times more pieces than you have task nodes, because faster nodes can munch on more pieces than the slower nodes. If you have absolutely uniform tasks and absolutely uniform nodes, then this won't help, but I can't help thinking that with log files rotated by time you will have at least 2x variation in size. That means a significant number of log files will be much shorter tasks than others, and many nodes will go idle in the last half of the map phase. Combine that with 2x variation in speed due to task location and machine calibre and you could have considerable slack over the last 75% of the map phase. Not good. There is a toy simulation of this effect below the quoted message.

On 9/30/07 5:50 PM, "Stu Hood" <[EMAIL PROTECTED]> wrote:

> So repeatedly reading the raw logs is out, due to their being compressed, but
> also because it is a very small number of events that aren't emitted on the
> first go round.
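Here is the toy simulation (pure illustration; the node count, speeds and piece sizes are all made up). It greedily hands the next piece to whichever node frees up first, which is roughly how map tasks get doled out, and compares one piece per node against 4x as many pieces covering the same total work:

    import heapq, random

    def makespan(piece_sizes, node_speeds):
        # Each node pulls the next piece as soon as it is free.
        free_at = [(0.0, s) for s in node_speeds]   # (time node frees up, speed)
        heapq.heapify(free_at)
        for size in piece_sizes:
            t, speed = heapq.heappop(free_at)
            heapq.heappush(free_at, (t + size / speed, speed))
        return max(t for t, _ in free_at)

    random.seed(0)
    nodes = [random.uniform(1.0, 2.0) for _ in range(20)]   # 2x spread in speed
    total_work = 1000.0

    def pieces(n):
        # n pieces with a 2x spread in size, scaled to the same total work
        raw = [random.uniform(1.0, 2.0) for _ in range(n)]
        return [total_work * w / sum(raw) for w in raw]

    print("one piece per node:", round(makespan(pieces(20), nodes), 1))
    print("4x as many pieces: ", round(makespan(pieces(80), nodes), 1))

With the fine-grained split the finish time lands much closer to the total work divided by the aggregate node speed; with one big piece per node you sit waiting on whichever slow node happened to draw a big file, which is exactly the idle tail I am worried about.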