I would agree with Ted.  You should easily be able to get 100 MB/s of write
throughput on a standard NetApp box (with read bandwidth left over, since
the peak write throughput rating is more than twice that).  Even at an
average write rate of 50 MB/s, the daily data volume would be
(drumroll ..) 4+ TB!
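
For what it's worth, the back-of-the-envelope math behind that number (a
rough sketch, assuming the 50 MB/s average is sustained around the clock):

    // Back-of-the-envelope: sustained write rate -> daily volume.
    public class DailyVolume {
        public static void main(String[] args) {
            double bytesPerSecond = 50e6;            // 50 MB/s sustained average
            double secondsPerDay  = 24 * 60 * 60;    // 86,400 s
            double tbPerDay = bytesPerSecond * secondsPerDay / 1e12;
            System.out.printf("~%.1f TB/day%n", tbPerDay);  // prints ~4.3 TB/day
        }
    }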

So buffer to a decent box and copy stuff over ..
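
The "copy stuff over" step can be something like hadoop dfs -put from the
buffer box, or a few lines against the FileSystem API.  A rough sketch
(class name, paths, and argument handling here are just placeholders):

    // Rough sketch: push a locally buffered/rolled log file into HDFS.
    // Class name and arguments are placeholders; error handling and retries omitted.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShipLogToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();        // picks up the usual Hadoop config
            FileSystem fs = FileSystem.get(conf);            // the DFS named by fs.default.name
            Path localLog = new Path(args[0]);               // e.g. a rolled log on the buffer box
            Path hdfsDir  = new Path(args[1]);               // e.g. a dated directory in HDFS
            fs.copyFromLocalFile(false, localLog, hdfsDir);  // false = keep the local copy
            fs.close();
        }
    }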

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 29, 2008 11:33 AM
To: core-user@hadoop.apache.org
Subject: Re: long write operations and data recovery


Unless your volume is MUCH higher than ours, I think you can get by with a
relatively small farm of log consolidators that collect and concatenate
files.

If each log line is 100 bytes after compression (that is huge, really) and
you have 10,000 events per second (also pretty danged high), then you are
only writing 1 MB/s.  If you need a day of buffering (~100,000 seconds),
then you need 100 GB of buffer storage.  These are very, very moderate
requirements for your ingestion point.
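
And the "collect and concatenate" part doesn't have to be fancy.  A rough
sketch of a consolidator pass that appends a batch of local log files into
one bigger file before it gets shipped (file names come in as arguments;
compression and cleanup are left out):

    // Rough sketch: concatenate a batch of local log files into one output file.
    // args[0] = output file, args[1..] = input log files, appended in order.
    import java.io.*;

    public class ConcatLogs {
        public static void main(String[] args) throws IOException {
            OutputStream out = new BufferedOutputStream(new FileOutputStream(args[0]));
            byte[] buf = new byte[64 * 1024];
            for (int i = 1; i < args.length; i++) {
                InputStream in = new BufferedInputStream(new FileInputStream(args[i]));
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                in.close();
            }
            out.close();
        }
    }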


On 2/29/08 11:18 AM, "Steve Sapovits" <[EMAIL PROTECTED]> wrote:

> Ted Dunning wrote:
> 
>> In our case, we looked at the problem and decided that Hadoop wasn't
>> feasible for our real-time needs in any case.  There were several
>> issues,
>> 
>> - first of all, map-reduce itself didn't seem very plausible for
>> real-time applications.  That left hbase and hdfs as the capabilities
>> offered by hadoop (for real-time stuff)
> 
> We'll be using map-reduce batch mode, so we're okay there.
> 
>> The upshot is that we use hadoop extensively for batch operations
>> where it really shines.  The other nice effect is that we don't have
>> to worry all that much about HA (at least not real-time HA) since we
>> don't do real-time with hadoop.
> 
> What I'm struggling with is the write side of things.  We'll have a huge
> amount of data to write that's essentially a log format.  It would seem
> that writing that outside of HDFS and then trying to batch import it
> would be a losing battle -- that you would need the distributed nature
> of HDFS to do very large volume writes directly, and wouldn't easily be
> able to take some other flat storage model and feed it in as a secondary
> step without having the HDFS side start to lag behind.
> 
> The realization is that the NameNode could go down, so we'll have to
> have a backup store that might be used during temporary outages, but
> most of the writes would be direct HDFS updates.
> 
> The alternative would seem to be to end up with a set of distributed
> files without a unifying distributed file system (e.g., lots of Apache
> web logs on many, many individual boxes) and then have to come up with
> some way to funnel those back into HDFS.
