What would cause the name node to have a GC issue?

- I open at most 5000 concurrent connections and write continuously through 
those 5000 connections to 5000 files at a time.  
      - The volume of data written through the 5000 connections cannot 
be controlled, as it depends on the upstream applications that publish the data.

Now if the heap memory nears its full size (say M GB) and a major GC 
cycle kicks in, the NameNode could stop responding for some time.
This "stop the world" pause should be roughly proportional to the heap size.
That may cause data to back up in the streaming application's memory.
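One common mitigation (a sketch only, assuming the CMS-era JVMs shipped with 
Hadoop at the time; the heap size and occupancy fraction below are illustrative 
assumptions, not tuned recommendations) is to switch the NameNode to a 
concurrent collector in hadoop-env.sh so that most of the collection work 
happens alongside the application and the stop-the-world phases shrink:

```sh
# hadoop-env.sh -- illustrative NameNode GC settings (values are assumptions)
export HADOOP_NAMENODE_OPTS="-Xmx8g -Xms8g \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  $HADOOP_NAMENODE_OPTS"
```

CMS starts collecting before the old generation is full (here at 70% occupancy), 
which trades some throughput for shorter pauses; it does not eliminate them.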

As for our architecture:

We have a cluster of JMS queues, and a multithreaded application that picks 
messages off the queues and streams them to the NameNode of a 20-node cluster 
using the exposed FileSystem API. 
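The consumer side of that pipeline can be sketched roughly as below. This is a 
minimal, self-contained illustration, not the actual application: the JMS 
consumer and the Hadoop FSDataOutputStream are stubbed with an in-memory queue 
and a StringBuilder sink, and all names here are hypothetical. The point it 
shows is the bounded buffer: when the NameNode stalls in a GC pause, producers 
block on `publish` instead of growing the streaming app's heap without limit.

```java
import java.util.concurrent.*;

// Sketch of a multithreaded queue consumer that streams messages out.
// The real pipeline would use javax.jms.MessageConsumer on the ingest side
// and org.apache.hadoop.fs.FSDataOutputStream on the write side; both are
// stubbed here so the example runs standalone.
public class StreamConsumer {
    // Bounded buffer: backpressure point when the downstream writer stalls.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
    private final StringBuilder sink = new StringBuilder(); // stand-in for FSDataOutputStream

    public void publish(String msg) throws InterruptedException {
        queue.put(msg); // blocks when the buffer is full (backpressure)
    }

    public void drain(int expected) throws InterruptedException {
        for (int i = 0; i < expected; i++) {
            sink.append(queue.take()).append('\n');
        }
    }

    public String contents() { return sink.toString(); }

    public static void main(String[] args) throws Exception {
        StreamConsumer c = new StreamConsumer();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            final int id = t;
            pool.submit(() -> {
                try {
                    c.publish("msg-" + id);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        c.drain(4);
        System.out.println(c.contents().split("\n").length); // prints 4
    }
}
```

Sizing the `ArrayBlockingQueue` is the knob that decides how long a NameNode 
pause the app can ride out before the upstream publishers themselves feel it.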

BTW, in the real world, if you have a fast car you can race and win against a slow 
train; it all depends on what reference frame you are in :)

Regards,
Jagaran 

________________________________
From: Michel Segel <michael_se...@hotmail.com>
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Cc: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>; jagaran 
das <jagaran_...@yahoo.co.in>
Sent: Wednesday, 10 August 2011 11:26 AM
Subject: Re: Namenode Scalability

So many questions, why stop there?

First question... What would cause the name node to have a GC issue?
Second question... You're streaming 1PB a day. Is this a single stream of data?
Are you writing this to one file before processing, or are you processing the 
data directly on the ingestion stream?

Are you also filtering the data so that you are not saving all of the data?

This sounds more like a homework assignment than a real world problem.

I guess people don't race cars against trains or have two trains traveling in 
different directions anymore... :-)


Sent from a remote device. Please excuse any typos...

Mike Segel

On Aug 10, 2011, at 12:07 PM, jagaran das <jagaran_...@yahoo.co.in> wrote:

> To be precise, the projected data is around 1 PB.
> But the publishing rate is also around 1 GBps.
> 
> Please suggest.
> 
> 
> ________________________________
> From: jagaran das <jagaran_...@yahoo.co.in>
> To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> Sent: Wednesday, 10 August 2011 12:58 AM
> Subject: Namenode Scalability
> 
> In my current project we are planning to stream data to the NameNode (20 
> Node Cluster).
> Data volume would be around 1 PB per day.
> But there are applications which can publish data at 1 GBps.
> 
> Few queries:
> 
> 1. Can a single NameNode handle such high-speed writes? Or does it become 
> unresponsive when a GC cycle kicks in?
> 2. Can we have multiple federated NameNodes sharing the same slaves, so 
> that we can distribute the writes accordingly?
> 3. Can multiple HBase region servers help us?
> 
> Please suggest how we can design the streaming part to handle such scale of 
> data. 
> 
> Regards,
> Jagaran Das 
