Well, can't you load only the incremental data? The goal seems quite
unrealistic. The big guns have already spoken :P

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of 
God.” 

"Maybe other people will try to limit me but I don't limit myself"


From: alex.gauth...@teradata.com
To: user@hadoop.apache.org; mike.se...@thinkbiganalytics.com
Subject: RE: One petabyte of data loading into HDFS within 10 min.
Date: Mon, 10 Sep 2012 16:17:20 +0000

Well said Mike. Lots of “funny questions” around here lately…

From: Michael Segel [mailto:michael_se...@hotmail.com]
Sent: Monday, September 10, 2012 4:50 AM
To: user@hadoop.apache.org
Cc: Michael Segel
Subject: Re: One petabyte of data loading into HDFS within 10 min.

On Sep 10, 2012, at 2:40 AM, prabhu K <prabhu.had...@gmail.com> wrote:

Hi Users,

Thanks for the response.

We have loaded 100 GB of data into HDFS; it took 1 hour with the configuration below.
Each node (1 master machine, 2 slave machines):

1. 500 GB hard disk
2. 4 GB RAM
3. 3 quad-core CPUs
4. 1333 MHz speed

Now we are planning to load 1 petabyte of data (a single file) into Hadoop HDFS
and a Hive table within 10-20 minutes. For this we need the clarifications below.

Ok...

Some say that I am sometimes too harsh in my criticisms, so take what I say with a grain of salt...

You loaded 100 GB in an hour using woefully underperforming hardware, and now you're saying you want to load 1 PB in 10 minutes.
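
Just to put a number on that gap, here's a quick back-of-envelope sketch in Python (my assumptions: decimal units, a single sustained stream, and zero replication or protocol overhead):

    # Demonstrated vs. required ingest rate.
    # Assumes decimal units: 1 GB = 1e9 bytes, 1 PB = 1e15 bytes.
    demonstrated_bps = 100e9 / 3600   # 100 GB in 1 hour   -> ~28 MB/s
    required_bps = 1e15 / 600         # 1 PB in 10 minutes -> ~1.7 TB/s

    print(f"demonstrated: {demonstrated_bps / 1e6:.0f} MB/s")
    print(f"required:     {required_bps / 1e12:.2f} TB/s")
    print(f"gap:          {required_bps / demonstrated_bps:,.0f}x")

That's roughly a 60,000x gap between what you've demonstrated and what you're asking for.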

I would strongly suggest that you first learn more about Hadoop. No, really. Looking at your first machine, it's obvious that you don't really grok Hadoop and what it requires to achieve optimum performance. You couldn't even extrapolate any meaningful data from your current environment.

Secondly, I think you need to actually think about the problem. Did you mean PB 
or TB? Because your math seems to be off by a couple orders of magnitude. 

A single file measured in PBs? That is currently impossible with today's (2012) technology. In fact, a single file measured in PBs likely won't exist within the next 5 years, and most likely not within the next decade. [Moore's law is about CPU power, not disk density.]

Also, take a look at networking.

ToR switch designs differ, but with current technology the fabric tends to max out around 40 Gb/s. What's the widest fabric on a backplane?

That's your first bottleneck: even if you had 1 PB of data, you couldn't feed it to the cluster fast enough.
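
To put rough numbers on it, a sketch under the same assumptions (decimal units, zero protocol or replication overhead, and taking ~40 Gb/s as the per-switch fabric ceiling):

    # Aggregate feed rate needed vs. a ~40 Gb/s ToR fabric ceiling.
    # Assumes 1 PB = 8e15 bits.
    required_gbps = 8e15 / 600 / 1e9   # ~13,333 Gb/s aggregate ingest
    fabric_gbps = 40                   # assumed per-switch fabric ceiling

    print(f"aggregate ingest needed: {required_gbps:,.0f} Gb/s")
    print(f"40 Gb/s fabrics' worth:  {required_gbps / fabric_gbps:,.0f}")

Call it ~13 Tb/s of sustained ingest, or over 300 such fabrics running flat out.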

Forget disk; look at PCIe-based memory. (Money is no object, right?)

You still couldn't populate it fast enough.
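
Even at PCIe speeds the arithmetic is brutal. A sketch, assuming a generous ~16 GB/s theoretical ceiling per PCIe 3.0 x16 slot:

    # Slots' worth of raw PCIe bandwidth needed for 1 PB in 10 minutes.
    # 16 GB/s per x16 slot is a theoretical ceiling, ignoring all overhead.
    required_Bps = 1e15 / 600     # ~1.67 TB/s total ingest
    pcie_slot_Bps = 16e9          # assumed PCIe 3.0 x16 throughput

    print(f"slots of raw bandwidth needed: {required_Bps / pcie_slot_Bps:.0f}")

That's over a hundred x16 slots' worth of bandwidth, before any real-world overhead.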

I guess Steve hit the nail on the head when he suggested this was a homework assignment.

High school maybe? 

1. What system configuration is required for all 3 machines?
2. Hard disk size?
3. RAM size?
4. Motherboard?
5. Network cable?
6. How many Gbps of InfiniBand are required?

Do we need a cloud computing environment for the same setup as well?
Please suggest and help me on this.

Thanks,
Prabhu.


On Fri, Sep 7, 2012 at 7:30 PM, Michael Segel <michael_se...@hotmail.com> wrote:
Sorry, but you didn't account for the network saturation.

And why 1 GbE and not 10 GbE? Also, which version of Hadoop?

Here MapR works well bonding two 10 GbE ports, and with the right switch you could do OK.

Also 2 ToR switches per rack, etc...

How many machines? 150? 300? More?

Then you don't talk about how much memory, how many CPUs, or what type of storage...

Lots of factors.

I'm sorry to interrupt this mental masturbation about how to load 1 PB in 10 min.

There are a lot more questions that should have been asked but weren't.

Hey, but look. It's a Friday, so I suggest some pizza and beer, and then take it to a whiteboard.

But what do I know? In a different thread, I'm talking about how to tame HR and 
Accounting so they let me play with my team Ninja!

:-P

On Sep 5, 2012, at 9:56 AM, zGreenfelder <zgreenfel...@gmail.com> wrote:

> On Wed, Sep 5, 2012 at 10:43 AM, Cosmin Lehene <cleh...@adobe.com> wrote:
>> Here's an extremely naïve ballpark estimate: at theoretical hardware
>> speed, for 3 PB representing 1 PB with 3x replication.
>>
>> Over a single 1 Gbps connection (and I'm not sure you can actually reach
>> 1 Gbps):
>> (3 petabytes) / (1 Gbps) = 291.27 days
>>
>> So you'd need at least 40,000 1 Gbps network cards to get that in 10
>> minutes :) - (3 PB / 1 Gbps) / 40,000
>>
>> The actual number of nodes would depend a lot on the actual network
>> architecture, the type of storage you use (SSD, HDD), etc.
>>
>> Cosmin
>
> Ah, I went the other direction with the math and assumed no
> replication (completely unsafe and never reasonable for a real
> production environment, but since we're all theory and just looking
> for starting-point numbers):
>
> 1 PB in 10 min ==
> 1,000,000 GB in 10 min ==
> 8,000,000 Gb in 600 seconds ==
>
> 8,000,000 / 600 ~= 14k machines running at gigabit, or about 1.5k
> machines if you get 10 Gb connected machines.
>
> All assuming there's no network or cluster sync overhead
> (of course there would be).
>
> That seems like some pretty deep pockets to get to a < 10 minute load
> time for that much data.
>
> I could also be off; I just threw some stuff together somewhat
> quickly between conf calls.
>
> --
> Even the Magic 8 Ball has an opinion on email clients: Outlook not so good.
