Sorry, I think I said something confusing.

Repeatedly reading is inefficient in my case because of the cost of
decryption and log line parsing.  Compression is usually GOOD in these cases
because you are effectively multiplying the disk read rate by the
compression ratio (possibly 10 or 20x for log files) at the relatively
moderate cost of some CPU cycles.
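To put rough numbers on that (my numbers, purely for illustration):

// Back-of-the-envelope illustration, not a measurement: reading compressed
// input effectively multiplies the raw disk rate by the compression ratio,
// at the cost of some CPU for decompression.
public class EffectiveReadRate {
  public static void main(String[] args) {
    double rawDiskMBps = 60.0;       // assumed sequential disk read rate
    double compressionRatio = 15.0;  // log files often compress 10-20x
    System.out.printf("~%.0f MB/s of uncompressed log data%n",
        rawDiskMBps * compressionRatio);
  }
}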

The reason for changing to a different compression type in my case is so
that files can be sub-divided.  This has two benefits.  The obvious benefit
is higher potential parallelism while still keeping the file size large.
This is less important if you are rolling your files often, as you say.  The
second, less obvious benefit is more efficient load balancing if you can
divide your input into 3-5 times more pieces than you have task
nodes.  This happens because faster nodes can munch on more pieces than the
slower nodes.  If you have absolutely uniform tasks and absolutely uniform
nodes, then this won't help, but I can't help thinking that with log files
rotated by time, you will have at least 2x variation.  That means that a
significant number of log files will be much shorter tasks than others and
many nodes will go idle in the last half of the map phase.  If you combine
this with 2x variation in speed due to task location and machine calibre,
you could have considerable slack in the last 75% of the map phase.  Not
good.
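
For concreteness, here is a rough sketch of one way to get splittable
compressed input in Hadoop: write the logs as block-compressed
SequenceFiles.  The codec and key/value layout below are just illustrative
assumptions, not something from your setup:

// Rough sketch, not from this thread: pack log lines into a block-compressed
// SequenceFile.  Block compression compresses batches of records, so the
// framework can split one large file into many map tasks at block boundaries.
// The key/value layout and codec here are illustrative choices.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SplittableLogWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);    // e.g. one output file per rotation period

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec());

    long lineNo = 0;
    for (String line : new String[] {"first log line", "second log line"}) {
      writer.append(new LongWritable(lineNo++), new Text(line));
    }
    writer.close();
  }
}

Files written this way turn into many splits, so the faster nodes simply
pick up more pieces, while the file count stays small.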


On 9/30/07 5:50 PM, "Stu Hood" <[EMAIL PROTECTED]> wrote:

> So repeatedly reading the raw logs is out, due to their being compressed, but
> also because it is a very small number of events that aren't emitted on the
> first go round.
