Hi Mohit,

> How many are too many for the name-node? We have around 100M files,
> and we add roughly 100M files every year.
The name-node stores file and block metadata in RAM. Here is an estimate of the memory used per file and block:

"Estimates show that the name-node uses fewer than 200 bytes to store a single metadata object (a file inode or a block). According to statistics on our clusters, a file on average consists of 1.5 blocks, which means that it takes 600 bytes (1 file object + 2 block objects) to store an average file in name-node’s RAM"

http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf

By that estimate, your current 100M files come to roughly 100M x 600 bytes = 60 GB of name-node heap, growing by about the same amount each year.

Next-generation Hadoop (Hadoop 0.23) brings HDFS Federation, which will improve the scalability of the name-node. You can read more about that here:

http://hortonworks.com/an-introduction-to-hdfs-federation/

Rohit Bakhshi
www.hortonworks.com (http://www.hortonworks.com/)

On Tuesday, February 14, 2012 at 10:56 AM, W.P. McNeill wrote:

> I'm not sure what you mean by "flat format" here.
>
> In my scenario, I have a file input.xml that looks like this:
>
> <myfile>
>   <section>
>     <value>1</value>
>   </section>
>   <section>
>     <value>2</value>
>   </section>
> </myfile>
>
> input.xml is a plain text file, not a sequence file. If I read it with
> the XMLInputFormat, my mapper gets called with (key, value) pairs that
> look like this:
>
> (nnnn, <section><value>1</value></section>)
> (nnnn, <section><value>2</value></section>)
>
> where the keys are numerical offsets into the file. I then use this
> information to write a sequence file with these (key, value) pairs. So my
> Hadoop job that uses XMLInputFormat takes a text file as input and
> produces a sequence file as output.
>
> I don't know a rule of thumb for how many small files is too many. Maybe
> someone else on the list can chime in. I just know that when your
> throughput gets slow, that's one possible cause to investigate.
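For anyone following along, here is a minimal sketch of the kind of job W.P. describes: a map-only job that reads the XML with an XmlInputFormat and writes a sequence file. It assumes an XmlInputFormat along the lines of Mahout's, which splits the input on configurable start/end tags set via the "xmlinput.start" / "xmlinput.end" properties and hands the mapper (byte offset, element text) pairs; the package name, property names, and paths below are assumptions and vary between implementations.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
// Assumed: an XmlInputFormat on the classpath, e.g. Mahout's
// org.apache.mahout.classifier.bayes.XmlInputFormat (location varies
// by Mahout version).
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class XmlToSequenceFile {

    // Pass-through mapper: the key is the element's byte offset in
    // input.xml, the value is the raw <section>...</section> text.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text element, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(offset, element);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tags the record reader splits on; these property names match
        // Mahout's XmlInputFormat and may differ in other implementations.
        conf.set("xmlinput.start", "<section>");
        conf.set("xmlinput.end", "</section>");

        Job job = new Job(conf, "xml to sequence file");
        job.setJarByClass(XmlToSequenceFile.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);                      // map-only job

        job.setInputFormatClass(XmlInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Invoked as "hadoop jar yourjob.jar XmlToSequenceFile input.xml outdir" (the jar and paths are placeholders), it writes a sequence file of (offset, section) pairs. Setting the reduce count to zero skips the sort/shuffle entirely, which is all you want when the job is just a format conversion.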