Date: Tue, 22 May 2012 21:56:31 -0700
Subject: Re: Storing millions of small files
From: mcsri...@gmail.com
To: hdfs-user@hadoop.apache.org
Brendan, since you are looking for a distributed file system that can store
many millions of files, try out MapR. A few customers have actually
crossed over 1 trillion files without hitting problems. Small files or
large files are handled equally well.
-Original Message-
From: Keith Wiley [mailto:kwi...@keithwiley.com]
Sent: Tuesday, May 22, 2012 9:57 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Storing millions of small files
In addition to the responses already provided, there is another downside to
using hadoop with numerous files: it takes much longer to run a hadoop job!
Starting a hadoop job consists of communicating between the driver (which runs
on a client machine outside the cluster) and the namenode to
Hi Brendan,
The number of files that can be stored in HDFS is limited by the size of
the NameNode's RAM. The downside of storing small files is that you would
saturate the NameNode's RAM with a small data set (the sum of the sizes of
all your small files). However, you can store around 100
Hi Brendan,
Every file, directory and block in HDFS is represented as an
object in the namenode's memory, each of which occupies 150 bytes. When
we store many small files in HDFS, these small files occupy a
large portion of the namespace (a large overhead on the namenode). As a
consequence, the
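The 150-bytes-per-object figure lends itself to a quick back-of-envelope estimate. The sketch below is illustrative only: the per-object cost and block counts are assumptions, and real namenode heap usage varies by Hadoop version.

```python
# Rough namenode heap estimate: every file and every block is an
# in-memory object, assumed here to cost ~150 bytes each (the figure
# cited above); directories are ignored for simplicity.

BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # one file object plus one object per block
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 100 million single-block small files: ~30 GB of heap just for metadata
print(namenode_heap_bytes(100_000_000) / 1e9)                 # 30.0

# The same data packed into 100,000 large files of 8 blocks each is tiny
print(namenode_heap_bytes(100_000, blocks_per_file=8) / 1e9)  # 0.135
```

The point of the comparison: metadata cost scales with object count, not data volume, which is why packing the same bytes into fewer, larger files relieves the namenode.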
Brendan,
The issue with using lots of small files is that your processing
overhead increases (repeated, avoidable file open-read(little)-close
calls). HDFS is also used by those who wish to heavily process
the data they've stored, and with a huge number of files such a process
is not gonna be
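That per-file overhead can be put in a toy cost model (the millisecond figures below are invented purely to show the shape of the problem): a fixed open/locate/close cost is paid once per file, so reading the same total volume through many small files multiplies the fixed cost.

```python
# Toy cost model: per-file fixed overhead vs. streaming read cost.
# Both millisecond figures are assumptions for illustration only.

FIXED_MS_PER_FILE = 10   # open, contact namenode, close
STREAM_MS_PER_MB = 5     # sequential read of the actual data

def read_cost_ms(num_files, mb_per_file):
    return num_files * (FIXED_MS_PER_FILE + mb_per_file * STREAM_MS_PER_MB)

# Same 10 GB of data, two layouts:
print(read_cost_ms(10_000, 1))   # 10,000 x 1 MB files -> 150000 ms
print(read_cost_ms(10, 1_000))   # 10 x 1 GB files     -> 50100 ms
```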
In addition to the responses already provided, there is another downside to
using hadoop with numerous files: it takes much longer to run a hadoop job!
Starting a hadoop job consists of communicating between the driver (which runs
on a client machine outside the cluster) and the namenode to
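One reason the file count shows up in job startup time: input files smaller than a block each become their own input split, so a job over N small files schedules on the order of N map tasks. A minimal sketch, assuming a 128 MB block size and a simplified split rule compared to Hadoop's actual InputFormat:

```python
import math

BLOCK_MB = 128  # assumed HDFS block size

def num_splits(file_sizes_mb):
    # each file yields at least one split; larger files get one per block
    return sum(max(1, math.ceil(size / BLOCK_MB)) for size in file_sizes_mb)

print(num_splits([1] * 1000))  # 1000 x 1 MB files -> 1000 splits (map tasks)
print(num_splits([1000]))      # one 1000 MB file  -> 8 splits
```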