Re: HDFS - millions of files in one directory?

2009-01-27 Thread Sagar Naik


Consider a system with 1 billion small files.
The namenode will need to maintain its in-memory data structures for all those files.
The system will have at least 1 block per file, and if you have the replication
factor set to 3, the system will store 3 billion block replicas.
Now, if you try to read all these files in a job, you will be making
as many as 1 billion socket connections to fetch these blocks. (Big
Brothers, correct me if I'm wrong.)


Datanodes routinely check for available disk space and collect block
reports. These operations are directly dependent on the number of blocks on
a datanode.


Getting all the data into one file avoids all this unnecessary IO and reduces
the memory occupied by the namenode.


The number of maps in a map-reduce job is based on the number of blocks. With
millions of separate files, we will end up with a very large number of map tasks.
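
As an illustration of that last point, here is a minimal sketch of a job that
reads one big SequenceFile of <filename, bytes> records instead of millions of
small files, so the number of map tasks tracks the blocks of that single file.
It assumes the classic org.apache.hadoop.mapred API; the paths and the trivial
mapper are placeholders, not anything from this thread.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

/** Skeleton job over one packed SequenceFile instead of millions of small files. */
public class PackedFilesJob {

  /** Hypothetical mapper: each record is one original small file (name + raw bytes). */
  public static class SizeMapper extends MapReduceBase
      implements Mapper<Text, BytesWritable, Text, LongWritable> {
    public void map(Text filename, BytesWritable contents,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      out.collect(filename, new LongWritable(contents.getLength()));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(PackedFilesJob.class);
    conf.setJobName("process-packed-files");

    // Splits (and therefore map tasks) follow the HDFS blocks of this one file,
    // not the count of the original small files.
    conf.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("/user/mark/packed.seq"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/mark/packed-sizes"));

    conf.setMapperClass(SizeMapper.class);
    conf.setNumReduceTasks(0);                // map-only, just for illustration
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    JobClient.runJob(conf);
  }
}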


-Sagar


Mark Kerzner wrote:

Carfield,

you might be right, and I may be able to combine them in one large file.
What would one use for a delimiter, so that it would never be encountered in
normal binary files? Performance does matter (rarely it doesn't). What are
the differences in performance between using multiple files and one large
file? I would guess that one file should in fact give better hardware/OS
performance, because it is more predictable and allows buffering.

thank you,
Mark

On Sun, Jan 25, 2009 at 9:50 PM, Carfield Yim wrote:

  

Really? I thought any file can be combines as long as you can figure
out an delimiter is ok, and you really cannot have some delimiters?
Like "X"? And in the worst case, or if performance is not
really a matter, may be just encode all binary to and from ascii?

On Mon, Jan 26, 2009 at 5:49 AM, Mark Kerzner wrote:

Yes, flip suggested such solution, but his files are text, so he could
combine them all in a large text file, with each lined representing initial
files. My files, however, are binary, so I do not see how I could combine
them.

However, since my numbers are limited by about 1 billion files total, I
should be OK to put them all in a few directories with under, say, 10,000
files each. Maybe a little balanced tree, but 3-4 four levels should
suffice.

Thank you,
Mark

On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim wrote:

Possible simple having a file large in size instead of having a lot of
small files?

On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner wrote:

Hi,

there is a performance penalty in Windows (pardon the expression) if you put
too many files in the same directory. The OS becomes very slow, stops seeing
them, and lies about their status to my Java requests. I do not know if this
is also a problem in Linux, but in HDFS - do I need to balance a directory
tree if I want to store millions of files, or can I put them all in the same
directory?

Thank you,
Mark


Re: HDFS - millions of files in one directory?

2009-01-27 Thread Philip (flip) Kromer
Tossing one more on this king of all threads:
Stuart Sierra of AltLaw wrote a nice little tool to serialize tar.bz2 files
into SequenceFile, with filename as key and its contents a BLOCK-compressed
blob.
  http://stuartsierra.com/2008/04/24/a-million-little-files
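
In the same spirit, here is a rough sketch of what such a tool does. This is
not Stuart's code; it assumes the Apache Commons Compress classes for reading
the tar.bz2, and it writes the entry contents as BLOCK-compressed values the
way he describes:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Copies each entry of a local tar.bz2 into a BLOCK-compressed SequenceFile. */
public class TarToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    TarArchiveInputStream tar = new TarArchiveInputStream(
        new BZip2CompressorInputStream(
            new BufferedInputStream(new FileInputStream(args[0]))));
    try {
      TarArchiveEntry entry;
      while ((entry = tar.getNextTarEntry()) != null) {
        if (!entry.isFile()) continue;                 // skip directories etc.
        byte[] buf = new byte[(int) entry.getSize()];  // small files assumed
        int off = 0;
        while (off < buf.length) {                     // the tar stream hands back the entry body
          int n = tar.read(buf, off, buf.length - off);
          if (n < 0) break;
          off += n;
        }
        // key = path inside the archive, value = raw binary contents
        writer.append(new Text(entry.getName()), new BytesWritable(buf));
      }
    } finally {
      tar.close();
      writer.close();
    }
  }
}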

flip


On Mon, Jan 26, 2009 at 3:20 PM, Mark Kerzner  wrote:

> Jason, this is awesome, thank you.
> By the way, is there a book or manual with "best practices?"
>
> On Mon, Jan 26, 2009 at 3:13 PM, jason hadoop  >wrote:
>
> > Sequence files rock, and you can use the
> > *
> > bin/hadoop dfs -text FILENAME* command line tool to get a toString level
> > unpacking of the sequence file key,value pairs.
> >
> > If you provide your own key or value classes, you will need to implement
> a
> > toString method to get some use out of this. Also, your class path will
> > need
> > to include the jars with your custom key/value classes.
> >
> > HADOOP_CLASSPATH="myjar1;myjar2..." *bin/hadoop dfs -text FILENAME*
> >
> >
> > On Mon, Jan 26, 2009 at 1:08 PM, Mark Kerzner 
> > wrote:
> >
> > > Thank you, Doug, then all is clear in my head.
> > > Mark
> > >
> > > On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting 
> > wrote:
> > >
> > > > Mark Kerzner wrote:
> > > >
> > > >> Okay, I am convinced. I only noticed that Doug, the originator, was
> > not
> > > >> happy about it - but in open source one has to give up control
> > > sometimes.
> > > >>
> > > >
> > > > I think perhaps you misunderstood my remarks.  My point was that, if
> > you
> > > > looked to Nutch's Content class for an example, it is, for historical
> > > > reasons, somewhat more complicated than it needs to be and is thus a
> > less
> > > > than perfect example.  But using SequenceFile to store web content is
> > > > certainly a best practice and I did not mean to imply otherwise.
> > > >
> > > > Doug
> > > >
> > >
> >
>



-- 
http://www.infochimps.org
Connected Open Free Data


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Jason, this is awesome, thank you.
By the way, is there a book or manual with "best practices?"

On Mon, Jan 26, 2009 at 3:13 PM, jason hadoop wrote:

> Sequence files rock, and you can use the
> *
> bin/hadoop dfs -text FILENAME* command line tool to get a toString level
> unpacking of the sequence file key,value pairs.
>
> If you provide your own key or value classes, you will need to implement a
> toString method to get some use out of this. Also, your class path will
> need
> to include the jars with your custom key/value classes.
>
> HADOOP_CLASSPATH="myjar1;myjar2..." *bin/hadoop dfs -text FILENAME*
>
>
> On Mon, Jan 26, 2009 at 1:08 PM, Mark Kerzner 
> wrote:
>
> > Thank you, Doug, then all is clear in my head.
> > Mark
> >
> > On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting 
> wrote:
> >
> > > Mark Kerzner wrote:
> > >
> > >> Okay, I am convinced. I only noticed that Doug, the originator, was
> not
> > >> happy about it - but in open source one has to give up control
> > sometimes.
> > >>
> > >
> > > I think perhaps you misunderstood my remarks.  My point was that, if
> you
> > > looked to Nutch's Content class for an example, it is, for historical
> > > reasons, somewhat more complicated than it needs to be and is thus a
> less
> > > than perfect example.  But using SequenceFile to store web content is
> > > certainly a best practice and I did not mean to imply otherwise.
> > >
> > > Doug
> > >
> >
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread jason hadoop
Sequence files rock, and you can use the *bin/hadoop dfs -text FILENAME*
command line tool to get a toString level unpacking of the sequence file
key,value pairs.

If you provide your own key or value classes, you will need to implement a
toString method to get some use out of this. Also, your class path will need
to include the jars with your custom key/value classes.

HADOOP_CLASSPATH="myjar1;myjar2..." *bin/hadoop dfs -text FILENAME*
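
For example, a custom value class along these lines (purely hypothetical, not
anything from this thread) shows up readably in the -text output because of
its toString:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

/** Hypothetical custom value: a document's MIME type plus its raw bytes. */
public class DocumentWritable implements Writable {
  private String mimeType = "";
  private byte[] body = new byte[0];

  public void set(String mimeType, byte[] body) {
    this.mimeType = mimeType;
    this.body = body;
  }

  public void write(DataOutput out) throws IOException {
    out.writeUTF(mimeType);
    out.writeInt(body.length);
    out.write(body);
  }

  public void readFields(DataInput in) throws IOException {
    mimeType = in.readUTF();
    body = new byte[in.readInt()];
    in.readFully(body);
  }

  /** What "bin/hadoop dfs -text" prints for each value. */
  public String toString() {
    return mimeType + " (" + body.length + " bytes)";
  }
}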


On Mon, Jan 26, 2009 at 1:08 PM, Mark Kerzner  wrote:

> Thank you, Doug, then all is clear in my head.
> Mark
>
> On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting  wrote:
>
> > Mark Kerzner wrote:
> >
> >> Okay, I am convinced. I only noticed that Doug, the originator, was not
> >> happy about it - but in open source one has to give up control
> sometimes.
> >>
> >
> > I think perhaps you misunderstood my remarks.  My point was that, if you
> > looked to Nutch's Content class for an example, it is, for historical
> > reasons, somewhat more complicated than it needs to be and is thus a less
> > than perfect example.  But using SequenceFile to store web content is
> > certainly a best practice and I did not mean to imply otherwise.
> >
> > Doug
> >
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Thank you, Doug, then all is clear in my head.
Mark

On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting  wrote:

> Mark Kerzner wrote:
>
>> Okay, I am convinced. I only noticed that Doug, the originator, was not
>> happy about it - but in open source one has to give up control sometimes.
>>
>
> I think perhaps you misunderstood my remarks.  My point was that, if you
> looked to Nutch's Content class for an example, it is, for historical
> reasons, somewhat more complicated than it needs to be and is thus a less
> than perfect example.  But using SequenceFile to store web content is
> certainly a best practice and I did not mean to imply otherwise.
>
> Doug
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Doug Cutting

Mark Kerzner wrote:

Okay, I am convinced. I only noticed that Doug, the originator, was not
happy about it - but in open source one has to give up control sometimes.


I think perhaps you misunderstood my remarks.  My point was that, if you 
looked to Nutch's Content class for an example, it is, for historical 
reasons, somewhat more complicated than it needs to be and is thus a 
less than perfect example.  But using SequenceFile to store web content 
is certainly a best practice and I did not mean to imply otherwise.


Doug


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Okay, I am convinced. I only noticed that Doug, the originator, was not
happy about it - but in open source one has to give up control sometimes.
Thank you,
Mark

On Mon, Jan 26, 2009 at 2:36 PM, Andy Liu  wrote:

> SequenceFile supports transparent block-level compression out of the box,
> so
> you don't have to compress data in your code.
>
> Most the time, compression not only saves disk space but improves
> performance because there's less data to write.
>
> Andy
>
> On Mon, Jan 26, 2009 at 12:35 PM, Mark Kerzner  >wrote:
>
> > Doug,
> > SequenceFile looks like a perfect candidate to use in my project, but are
> > you saying that I better use uncompressed data if I am not interested in
> > saving disk space?
> >
> > Thank you,
> > Mark
> >
> > On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting 
> wrote:
> >
> > > Philip (flip) Kromer wrote:
> > >
> > >> Heretrix ,
> > >> Nutch,
> > >> others use the ARC file format
> > >>  http://www.archive.org/web/researcher/ArcFileFormat.php
> > >>  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> > >>
> > >
> > > Nutch does not use ARC format but rather uses Hadoop's SequenceFile to
> > > store crawled pages.  The keys of crawl content files are URLs and the
> > > values are:
> > >
> > >
> > >
> >
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html
> > >
> > > I believe that the implementation of this class pre-dates
> SequenceFile's
> > > support for compressed values, so the values are decompressed on
> demand,
> > > which needlessly complicates its implementation and API.  It's
> basically
> > a
> > > Writable that stores binary content plus headers, typically an HTTP
> > > response.
> > >
> > > Doug
> > >
> >
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Andy Liu
SequenceFile supports transparent block-level compression out of the box, so
you don't have to compress data in your code.

Most of the time, compression not only saves disk space but improves
performance because there's less data to write.
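
A minimal sketch of turning that on when the writer is created, assuming the
default codec and with the path and record types as placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BlockCompressedWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // BLOCK compression batches many records together before compressing,
    // which usually compresses better than per-record compression.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[0]), Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    writer.append(new Text("example-key"), new BytesWritable("example".getBytes()));
    writer.close();
  }
}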

Andy

On Mon, Jan 26, 2009 at 12:35 PM, Mark Kerzner wrote:

> Doug,
> SequenceFile looks like a perfect candidate to use in my project, but are
> you saying that I better use uncompressed data if I am not interested in
> saving disk space?
>
> Thank you,
> Mark
>
> On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting  wrote:
>
> > Philip (flip) Kromer wrote:
> >
> >> Heretrix ,
> >> Nutch,
> >> others use the ARC file format
> >>  http://www.archive.org/web/researcher/ArcFileFormat.php
> >>  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> >>
> >
> > Nutch does not use ARC format but rather uses Hadoop's SequenceFile to
> > store crawled pages.  The keys of crawl content files are URLs and the
> > values are:
> >
> >
> >
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html
> >
> > I believe that the implementation of this class pre-dates SequenceFile's
> > support for compressed values, so the values are decompressed on demand,
> > which needlessly complicates its implementation and API.  It's basically
> a
> > Writable that stores binary content plus headers, typically an HTTP
> > response.
> >
> > Doug
> >
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Raghu Angadi

Mark Kerzner wrote:

Raghu,

if I write all files only once, is the cost the same in one directory or do I
need to find the optimal directory size and, when full, start another
"bucket"?


If you write only once, then writing won't be much of an issue. You can 
write them in lexical order to help with buffer copies. These are all 
implementation details that a user should not depend on.


That said, the rest of the discussion in this thread is going in the
right direction: to get you to use fewer files, each combining a lot of
these small files.


A large number of small files has overhead in many places in HDFS: strain
on DataNodes, NameNode memory, etc.


Raghu.



Re: HDFS - millions of files in one directory?

2009-01-26 Thread jason hadoop
We like compression if the data is readily compressible and large, as it
saves on IO time.


On Mon, Jan 26, 2009 at 9:35 AM, Mark Kerzner  wrote:

> Doug,
> SequenceFile looks like a perfect candidate to use in my project, but are
> you saying that I better use uncompressed data if I am not interested in
> saving disk space?
>
> Thank you,
> Mark
>
> On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting  wrote:
>
> > Philip (flip) Kromer wrote:
> >
> >> Heretrix ,
> >> Nutch,
> >> others use the ARC file format
> >>  http://www.archive.org/web/researcher/ArcFileFormat.php
> >>  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> >>
> >
> > Nutch does not use ARC format but rather uses Hadoop's SequenceFile to
> > store crawled pages.  The keys of crawl content files are URLs and the
> > values are:
> >
> >
> >
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html
> >
> > I believe that the implementation of this class pre-dates SequenceFile's
> > support for compressed values, so the values are decompressed on demand,
> > which needlessly complicates its implementation and API.  It's basically
> a
> > Writable that stores binary content plus headers, typically an HTTP
> > response.
> >
> > Doug
> >
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Doug,
SequenceFile looks like a perfect candidate to use in my project, but are
you saying that I better use uncompressed data if I am not interested in
saving disk space?

Thank you,
Mark

On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting  wrote:

> Philip (flip) Kromer wrote:
>
>> Heretrix ,
>> Nutch,
>> others use the ARC file format
>>  http://www.archive.org/web/researcher/ArcFileFormat.php
>>  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
>>
>
> Nutch does not use ARC format but rather uses Hadoop's SequenceFile to
> store crawled pages.  The keys of crawl content files are URLs and the
> values are:
>
>
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html
>
> I believe that the implementation of this class pre-dates SequenceFile's
> support for compressed values, so the values are decompressed on demand,
> which needlessly complicates its implementation and API.  It's basically a
> Writable that stores binary content plus headers, typically an HTTP
> response.
>
> Doug
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Doug Cutting

Philip (flip) Kromer wrote:

Heretrix ,
Nutch,
others use the ARC file format
  http://www.archive.org/web/researcher/ArcFileFormat.php
  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml


Nutch does not use ARC format but rather uses Hadoop's SequenceFile to 
store crawled pages.  The keys of crawl content files are URLs and the 
values are:


http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html

I believe that the implementation of this class pre-dates SequenceFile's 
support for compressed values, so the values are decompressed on demand, 
which needlessly complicates its implementation and API.  It's basically 
a Writable that stores binary content plus headers, typically an HTTP 
response.
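
As a rough picture of what such a class boils down to, here is a hypothetical
sketch (not Nutch's actual Content code) of a Writable holding raw bytes plus
a few headers; it stays simple because compression is left to SequenceFile:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.Writable;

/** Hypothetical sketch: binary content plus response headers, no built-in compression. */
public class WebContent implements Writable {
  private byte[] content = new byte[0];
  private Map<String, String> headers = new TreeMap<String, String>();

  public void set(byte[] content, Map<String, String> headers) {
    this.content = content;
    this.headers = new TreeMap<String, String>(headers);
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(content.length);
    out.write(content);
    out.writeInt(headers.size());
    for (Map.Entry<String, String> e : headers.entrySet()) {
      out.writeUTF(e.getKey());
      out.writeUTF(e.getValue());
    }
  }

  public void readFields(DataInput in) throws IOException {
    content = new byte[in.readInt()];
    in.readFully(content);
    headers.clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      headers.put(in.readUTF(), in.readUTF());
    }
  }
}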


Doug


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Steve Loughran

Philip (flip) Kromer wrote:

I ran in this problem, hard, and I can vouch that this is not a windows-only
problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly slow with more
than a few hundred thousand files in the same directory. (The operation to
correct this mistake took a week to run.)  That is one of several hard
lessons I learned about "don't write your scraper to replicate the path
structure of each document as a file on disk."


I've seen a fair few machines (one of the network store programs) top
out at 65K files/dir, which shows it is good to test your assumptions
before you go live.




Re: HDFS - millions of files in one directory?

2009-01-25 Thread Mark Kerzner
Brian,

all the replies point in the direction of combining all files into one. I
have a few stages of processing, but each one is sequential. So I can create
one large file after another, and the performance will be the best it can
be, no deterioration from my artificial limitations.

I planned to have a little descriptor file next to the actual one - but I
may just as easily write the descriptor right after the actual file.

Thank you,
Mark

On Sun, Jan 25, 2009 at 9:57 PM, Brian Bockelman wrote:

> Hey Mark,
>
> You'll want to watch your name node requirements -- tossing a wild-guess
> out there, a billion files could mean that you need on the order of
> terabytes of RAM in your namenode.
>
> Have you considered using:
> a) Using SequenceFile (appropriate for binary data, I believe -- but limits
> you to Sequential I/O)
> b) Looking into the ARC file format which someone referenced previously on
> this list
>
> ?
>
> Brian
>
>
> On Jan 25, 2009, at 8:29 PM, Mark Kerzner wrote:
>
> > Thank you, Jason, this is awesome information. I am going to use a balanced
> > directory tree structure, and I am going to make this independent of the
> > other parts of the system, so that I can change it later should practice
> > dictate me to do so.
> >
> > Mark
> >
> > On Sun, Jan 25, 2009 at 8:06 PM, jason hadoop wrote:
> >
> > > With large numbers of files you run the risk of the Datanodes timing out
> > > when they are performing their block report and or DU reports.
> > > Basically if a *find* in the dfs.data.dir takes more than 10 minutes you
> > > will have catastrophic problems with your hdfs.
> > > At attributor with 2million blocks on a datanode, under XFS centos (i686)
> > > 5.1 stock kernels would take 21 minutes with noatime, on a 6 disk raid 5
> > > array. 8way 2.5ghz xeons 8gig ram. Raid controller was a PERC and the
> > > machine basically served hdfs.
> > >
> > > On Sun, Jan 25, 2009 at 1:49 PM, Mark Kerzner 
> > > wrote:
> > >
> > > > Yes, flip suggested such solution, but his files are text, so he could
> > > > combine them all in a large text file, with each lined representing initial
> > > > files. My files, however, are binary, so I do not see how I could combine
> > > > them.
> > > >
> > > > However, since my numbers are limited by about 1 billion files total, I
> > > > should be OK to put them all in a few directories with under, say, 10,000
> > > > files each. Maybe a little balanced tree, but 3-4 four levels should
> > > > suffice.
> > > >
> > > > Thank you,
> > > > Mark
> > > >
> > > > On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim <carfi...@carfield.com.hk>
> > > > wrote:
> > > >
> > > > > Possible simple having a file large in size instead of having a lot of
> > > > > small files?
> > > > >
> > > > > On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner 
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > there is a performance penalty in Windows (pardon the expression) if you put
> > > > > > too many files in the same directory. The OS becomes very slow, stops seeing
> > > > > > them, and lies about their status to my Java requests. I do not know if this
> > > > > > is also a problem in Linux, but in HDFS - do I need to balance a directory
> > > > > > tree if I want to store millions of files, or can I put them all in the same
> > > > > > directory?
> > > > > >
> > > > > > Thank you,
> > > > > > Mark


Re: HDFS - millions of files in one directory?

2009-01-25 Thread Carfield Yim
You would need to test it, but I think I would go in that direction
instead of figuring out how to store a million files under one
directory; it is so easy to run into some OS / file system
limitation that causes bugs that are hard to track down, and the bug can be
as serious as a JDK core dump.

As I am not really using Hadoop in my work, I am not sure if Brian
is right (i.e. that you have to load the whole file into memory for Hadoop); if
that is the case, maybe I will go in another direction.

On Mon, Jan 26, 2009 at 12:04 PM, Mark Kerzner  wrote:
> Carfield,
>
> you might be right, and I may be able to combine them in one large file.
> What would one use for a delimiter, so that it would never be encountered in
> normal binary files? Performance does matter (rarely it doesn't). What are
> the differences in performance between using multiple files and one large
> file? I would guess that one file should in fact give better hardware/OS
> performance, because it is more predictable and allows buffering.
>
> thank you,
> Mark
>
> On Sun, Jan 25, 2009 at 9:50 PM, Carfield Yim wrote:
>
>> Really? I thought any file can be combines as long as you can figure
>> out an delimiter is ok, and you really cannot have some delimiters?
>> Like "X"? And in the worst case, or if performance is not
>> really a matter, may be just encode all binary to and from ascii?
>>
>> On Mon, Jan 26, 2009 at 5:49 AM, Mark Kerzner 
>> wrote:
>> > Yes, flip suggested such solution, but his files are text, so he could
>> > combine them all in a large text file, with each lined representing
>> initial
>> > files. My files, however, are binary, so I do not see how I could combine
>> > them.
>> >
>> > However, since my numbers are limited by about 1 billion files total, I
>> > should be OK to put them all in a few directories with under, say, 10,000
>> > files each. Maybe a little balanced tree, but 3-4 four levels should
>> > suffice.
>> >
>> > Thank you,
>> > Mark
>> >
>> > On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim > >wrote:
>> >
>> >> Possible simple having a file large in size instead of having a lot of
>> >> small files?
>> >>
>> >> On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner 
>> >> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > there is a performance penalty in Windows (pardon the expression) if
>> you
>> >> put
>> >> > too many files in the same directory. The OS becomes very slow, stops
>> >> seeing
>> >> > them, and lies about their status to my Java requests. I do not know
>> if
>> >> this
>> >> > is also a problem in Linux, but in HDFS - do I need to balance a
>> >> directory
>> >> > tree if I want to store millions of files, or can I put them all in
>> the
>> >> same
>> >> > directory?
>> >> >
>> >> > Thank you,
>> >> > Mark
>> >>
>> >
>>
>


Re: HDFS - millions of files in one directory?

2009-01-25 Thread Mark Kerzner
Carfield,

you might be right, and I may be able to combine them in one large file.
What would one use for a delimiter, so that it would never be encountered in
normal binary files? Performance does matter (rarely it doesn't). What are
the differences in performance between using multiple files and one large
file? I would guess that one file should in fact give better hardware/OS
performance, because it is more predictable and allows buffering.

thank you,
Mark

On Sun, Jan 25, 2009 at 9:50 PM, Carfield Yim wrote:

> Really? I thought any file can be combines as long as you can figure
> out an delimiter is ok, and you really cannot have some delimiters?
> Like "X"? And in the worst case, or if performance is not
> really a matter, may be just encode all binary to and from ascii?
>
> On Mon, Jan 26, 2009 at 5:49 AM, Mark Kerzner 
> wrote:
> > Yes, flip suggested such solution, but his files are text, so he could
> > combine them all in a large text file, with each lined representing
> initial
> > files. My files, however, are binary, so I do not see how I could combine
> > them.
> >
> > However, since my numbers are limited by about 1 billion files total, I
> > should be OK to put them all in a few directories with under, say, 10,000
> > files each. Maybe a little balanced tree, but 3-4 four levels should
> > suffice.
> >
> > Thank you,
> > Mark
> >
> > On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim  >wrote:
> >
> >> Possible simple having a file large in size instead of having a lot of
> >> small files?
> >>
> >> On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner 
> >> wrote:
> >> >
> >> > Hi,
> >> >
> >> > there is a performance penalty in Windows (pardon the expression) if
> you
> >> put
> >> > too many files in the same directory. The OS becomes very slow, stops
> >> seeing
> >> > them, and lies about their status to my Java requests. I do not know
> if
> >> this
> >> > is also a problem in Linux, but in HDFS - do I need to balance a
> >> directory
> >> > tree if I want to store millions of files, or can I put them all in
> the
> >> same
> >> > directory?
> >> >
> >> > Thank you,
> >> > Mark
> >>
> >
>


Re: HDFS - millions of files in one directory?

2009-01-25 Thread Brian Bockelman

Hey Mark,

You'll want to watch your name node requirements -- tossing a wild-guess
out there, a billion files could mean that you need on the order
of terabytes of RAM in your namenode.


Have you considered using:
a) Using SequenceFile (appropriate for binary data, I believe -- but  
limits you to Sequential I/O)
b) Looking into the ARC file format which someone referenced  
previously on this list


?
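
For (a), here is a rough sketch of what that sequential access looks like when
reading such a packed file back; the path and record types are just
placeholder assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackedFileReader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text filename = new Text();
    BytesWritable contents = new BytesWritable();
    // No random lookup by key here: you walk the records front to back.
    while (reader.next(filename, contents)) {
      System.out.println(filename + "\t" + contents.getLength() + " bytes");
    }
    reader.close();
  }
}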

Brian

On Jan 25, 2009, at 8:29 PM, Mark Kerzner wrote:

Thank you, Jason, this is awesome information. I am going to use a balanced
directory tree structure, and I am going to make this independent of the
other parts of the system, so that I can change it later should practice
dictate me to do so.

Mark

On Sun, Jan 25, 2009 at 8:06 PM, jason hadoop wrote:

With large numbers of files you run the risk of the Datanodes timing out
when they are performing their block report and or DU reports.
Basically if a *find* in the dfs.data.dir takes more than 10 minutes you
will have catastrophic problems with your hdfs.
At attributor with 2million blocks on a datanode, under XFS centos (i686)
5.1 stock kernels would take 21 minutes with noatime, on a 6 disk raid 5
array. 8way 2.5ghz xeons 8gig ram. Raid controller was a PERC and the
machine basically served hdfs.

On Sun, Jan 25, 2009 at 1:49 PM, Mark Kerzner wrote:

Yes, flip suggested such solution, but his files are text, so he could
combine them all in a large text file, with each lined representing initial
files. My files, however, are binary, so I do not see how I could combine
them.

However, since my numbers are limited by about 1 billion files total, I
should be OK to put them all in a few directories with under, say, 10,000
files each. Maybe a little balanced tree, but 3-4 four levels should
suffice.

Thank you,
Mark

On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim wrote:

Possible simple having a file large in size instead of having a lot of
small files?

On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner wrote:

Hi,

there is a performance penalty in Windows (pardon the expression) if you put
too many files in the same directory. The OS becomes very slow, stops seeing
them, and lies about their status to my Java requests. I do not know if this
is also a problem in Linux, but in HDFS - do I need to balance a directory
tree if I want to store millions of files, or can I put them all in the same
directory?

Thank you,
Mark

Re: HDFS - millions of files in one directory?

2009-01-25 Thread Carfield Yim
Really? I thought any files can be combined as long as you can figure
out a delimiter. Are you sure you really cannot have some delimiter,
like "X"? And in the worst case, or if performance is not
really a concern, maybe just encode all the binary to and from ASCII?

On Mon, Jan 26, 2009 at 5:49 AM, Mark Kerzner  wrote:
> Yes, flip suggested such solution, but his files are text, so he could
> combine them all in a large text file, with each lined representing initial
> files. My files, however, are binary, so I do not see how I could combine
> them.
>
> However, since my numbers are limited by about 1 billion files total, I
> should be OK to put them all in a few directories with under, say, 10,000
> files each. Maybe a little balanced tree, but 3-4 four levels should
> suffice.
>
> Thank you,
> Mark
>
> On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim 
> wrote:
>
>> Possible simple having a file large in size instead of having a lot of
>> small files?
>>
>> On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner 
>> wrote:
>> >
>> > Hi,
>> >
>> > there is a performance penalty in Windows (pardon the expression) if you
>> put
>> > too many files in the same directory. The OS becomes very slow, stops
>> seeing
>> > them, and lies about their status to my Java requests. I do not know if
>> this
>> > is also a problem in Linux, but in HDFS - do I need to balance a
>> directory
>> > tree if I want to store millions of files, or can I put them all in the
>> same
>> > directory?
>> >
>> > Thank you,
>> > Mark
>>
>


Re: HDFS - millions of files in one directory?

2009-01-25 Thread Mark Kerzner
Thank you, Jason, this is awesome information. I am going to use a balanced
directory tree structure, and I am going to make this independent of the
other parts of the system, so that I can change it later should practice
dictate me to do so.

Mark

On Sun, Jan 25, 2009 at 8:06 PM, jason hadoop wrote:

> With large numbers of files you run the risk of the Datanodes timing out
> when they are performing their block report and or DU reports.
> Basically if a *find* in the dfs.data.dir takes more than 10 minutes you
> will have catastrophic problems with your hdfs.
> At attributor with 2million blocks on a datanode, under XFS centos (i686)
> 5.1 stock kernels would take 21 minutes with noatime, on a 6 disk raid 5
> array. 8way 2.5ghz xeons 8gig ram. Raid controller was a PERC and the
> machine basically served hdfs.
>
>
> On Sun, Jan 25, 2009 at 1:49 PM, Mark Kerzner 
> wrote:
>
> > Yes, flip suggested such solution, but his files are text, so he could
> > combine them all in a large text file, with each lined representing
> initial
> > files. My files, however, are binary, so I do not see how I could combine
> > them.
> >
> > However, since my numbers are limited by about 1 billion files total, I
> > should be OK to put them all in a few directories with under, say, 10,000
> > files each. Maybe a little balanced tree, but 3-4 four levels should
> > suffice.
> >
> > Thank you,
> > Mark
> >
> > On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim  > >wrote:
> >
> > > Possible simple having a file large in size instead of having a lot of
> > > small files?
> > >
> > > On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner 
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > there is a performance penalty in Windows (pardon the expression) if
> > you
> > > put
> > > > too many files in the same directory. The OS becomes very slow, stops
> > > seeing
> > > > them, and lies about their status to my Java requests. I do not know
> if
> > > this
> > > > is also a problem in Linux, but in HDFS - do I need to balance a
> > > directory
> > > > tree if I want to store millions of files, or can I put them all in
> the
> > > same
> > > > directory?
> > > >
> > > > Thank you,
> > > > Mark
> > >
> >
>


Re: HDFS - millions of files in one directory?

2009-01-25 Thread jason hadoop
With large numbers of files you run the risk of the Datanodes timing out
when they are performing their block report and or DU reports.
Basically if a *find* in the dfs.data.dir takes more than 10 minutes you
will have catastrophic problems with your hdfs.
At attributor with 2million blocks on a datanode, under XFS centos (i686)
5.1 stock kernels would take 21 minutes with noatime, on a 6 disk raid 5
array. 8way 2.5ghz xeons 8gig ram. Raid controller was a PERC and the
machine basically served hdfs.


On Sun, Jan 25, 2009 at 1:49 PM, Mark Kerzner  wrote:

> Yes, flip suggested such solution, but his files are text, so he could
> combine them all in a large text file, with each lined representing initial
> files. My files, however, are binary, so I do not see how I could combine
> them.
>
> However, since my numbers are limited by about 1 billion files total, I
> should be OK to put them all in a few directories with under, say, 10,000
> files each. Maybe a little balanced tree, but 3-4 four levels should
> suffice.
>
> Thank you,
> Mark
>
> On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim  >wrote:
>
> > Possible simple having a file large in size instead of having a lot of
> > small files?
> >
> > On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner 
> > wrote:
> > >
> > > Hi,
> > >
> > > there is a performance penalty in Windows (pardon the expression) if
> you
> > put
> > > too many files in the same directory. The OS becomes very slow, stops
> > seeing
> > > them, and lies about their status to my Java requests. I do not know if
> > this
> > > is also a problem in Linux, but in HDFS - do I need to balance a
> > directory
> > > tree if I want to store millions of files, or can I put them all in the
> > same
> > > directory?
> > >
> > > Thank you,
> > > Mark
> >
>


Re: HDFS - millions of files in one directory?

2009-01-25 Thread Mark Kerzner
Yes, flip suggested such a solution, but his files are text, so he could
combine them all into one large text file, with each line representing one of the initial
files. My files, however, are binary, so I do not see how I could combine
them.

However, since my numbers are limited by about 1 billion files total, I
should be OK to put them all in a few directories with under, say, 10,000
files each. Maybe a little balanced tree, but 3-4 levels should
suffice.

Thank you,
Mark

On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim wrote:

> Possible simple having a file large in size instead of having a lot of
> small files?
>
> On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner 
> wrote:
> >
> > Hi,
> >
> > there is a performance penalty in Windows (pardon the expression) if you
> put
> > too many files in the same directory. The OS becomes very slow, stops
> seeing
> > them, and lies about their status to my Java requests. I do not know if
> this
> > is also a problem in Linux, but in HDFS - do I need to balance a
> directory
> > tree if I want to store millions of files, or can I put them all in the
> same
> > directory?
> >
> > Thank you,
> > Mark
>


Re: HDFS - millions of files in one directory?

2009-01-25 Thread Carfield Yim
Would it be possible to simply have one file large in size instead of having
a lot of small files?

On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner  wrote:
>
> Hi,
>
> there is a performance penalty in Windows (pardon the expression) if you put
> too many files in the same directory. The OS becomes very slow, stops seeing
> them, and lies about their status to my Java requests. I do not know if this
> is also a problem in Linux, but in HDFS - do I need to balance a directory
> tree if I want to store millions of files, or can I put them all in the same
> directory?
>
> Thank you,
> Mark


Re: HDFS - millions of files in one directory?

2009-01-24 Thread Philip (flip) Kromer
I think that Google developed BigTable to solve this; hadoop's HBase, or any
of the myriad other distributed/document databases should work depending on need:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
  http://www.mail-archive.com/core-user@hadoop.apache.org/msg07011.html

Heretrix, Nutch, others use the ARC file format
  http://www.archive.org/web/researcher/ArcFileFormat.php
  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
These of course are industrial strength tools (and many of their authors are
here in the room with us :) The only question with those tools is whether
their might exceeds your needs.

There's some oddball project out there that does peer-to-peer something
something scraping but I can't find it anywhere in my bookmarks. I don't
recall whether they're file-backed or DB-backed.

If you, like us, want something more modest and targeted there is the
recently-released python-toolkit
  http://lucasmanual.com/mywiki/DataHub
I haven't looked at it to see if they've used it at scale.

We infochimps are working right now to clean up and organize for initial
release our own Infinite Monkeywrench, a homely but effective toolkit for
gathering and munging datasets.  (Those stupid little one-off scripts you
write and throw away? A Ruby toolkit to make them slightly less annoying.)
We frequently use it for directed scraping of APIs and websites.  If you're
willing to deal with pre-release code that's never strayed far from the
machines of the guys what wrote it I can point you to what we have.

I think I was probably too tough on bundling into files. If things are
immutable, and only treated in bulk, and are easily and reversibly
serialized, bundling many documents into a file is probably good. As I said,
our toolkit uses flat text files, with the advantages of simplicity and the
downside of ad hoc-ness. Storing into the ARC format lets you use the tools
in the Big Scraper ecosystem, but obvs. you'd need to convert out to use
with other things, possibly returning you to this same question.

If you need to grab arbitrary subsets of the data, and the one set of
locality tradeoffs is better than the other set of locality tradeoffs, or
you need better metadata management than bundled-into-file gives you then I
think that's why those distributed/document-type databases got invented.

flip

On Sat, Jan 24, 2009 at 7:21 PM, Mark Kerzner  wrote:

> Philip,
>
> it seems like you went through the same problems as I did, and confirmed my
> feeling that this is not a trivial problem. My first idea was to balance
> the
> directory tree somehow and to store the remaining metadata elsewhere, but
> as
> you say, it has limitations. I could use some solution like your specific
> one, but I am only surprised that this problem does not have a well-known
> solution, or solutions. Again, how does Google or Yahoo store the files
> that
> they have crawled? MapReduce paper says that they store them all first,
> that
> is a few billion pages. How do they do it?
>
> Raghu,
>
> if I write all files only one, is the cost the same in one directory or do
> I
> need to find the optimal directory size and when full start another
> "bucket?"
>
> Thank you,
> Mark
>
> On Fri, Jan 23, 2009 at 11:01 PM, Philip (flip) Kromer
> wrote:
>
> > I ran in this problem, hard, and I can vouch that this is not a
> > windows-only
> > problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly slow with more
> > than a few hundred thousand files in the same directory. (The operation
> to
> > correct this mistake took a week to run.)  That is one of several hard
> > lessons I learned about "don't write your scraper to replicate the path
> > structure of each document as a file on disk."
> >
> > Cascading the directory structure works, but sucks in various other ways,
> > and itself stops scaling after a while.  What I eventually realized is
> that
> > I was using the filesystem as a particularly wrongheaded document
> database,
> > and that the metadata delivery of a filesystem just doesn't work for
> this.
> >
> > Since in our application the files are text and are immutable, our adhoc
> > solution is to encode and serialize each file with all its metadata, one
> > per
> > line, into a flat file.
> >
> > A distributed database is probably the correct answer, but this is
> working
> > quite well for now and even has some advantages. (No-cost replication
> from
> > work to home or offline by rsync or thumb drive, for example.)
> >
> > flip
> >
> > On Fri, Jan 23, 2009 at 5:49 PM, Raghu Angadi 
> > wrote:
> >
> > > Mark Kerzner wrote:
> > >
> > >> But it would seem then that making a balanced directory tree would not
> > >> help
> > >> either - because there would be another binary search, correct? I
> > assume,
> > >> either way it would be as fast as can be :)
> > >>
> > >
> > > But 

Re: HDFS - millions of files in one directory?

2009-01-24 Thread Mark Kerzner
Philip,

it seems like you went through the same problems as I did, and confirmed my
feeling that this is not a trivial problem. My first idea was to balance the
directory tree somehow and to store the remaining metadata elsewhere, but as
you say, it has limitations. I could use some solution like your specific
one, but I am only surprised that this problem does not have a well-known
solution, or solutions. Again, how does Google or Yahoo store the files that
they have crawled? MapReduce paper says that they store them all first, that
is a few billion pages. How do they do it?

Raghu,

if I write all files only once, is the cost the same in one directory or do I
need to find the optimal directory size and, when full, start another
"bucket"?

Thank you,
Mark

On Fri, Jan 23, 2009 at 11:01 PM, Philip (flip) Kromer
wrote:

> I ran in this problem, hard, and I can vouch that this is not a
> windows-only
> problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly slow with more
> than a few hundred thousand files in the same directory. (The operation to
> correct this mistake took a week to run.)  That is one of several hard
> lessons I learned about "don't write your scraper to replicate the path
> structure of each document as a file on disk."
>
> Cascading the directory structure works, but sucks in various other ways,
> and itself stops scaling after a while.  What I eventually realized is that
> I was using the filesystem as a particularly wrongheaded document database,
> and that the metadata delivery of a filesystem just doesn't work for this.
>
> Since in our application the files are text and are immutable, our adhoc
> solution is to encode and serialize each file with all its metadata, one
> per
> line, into a flat file.
>
> A distributed database is probably the correct answer, but this is working
> quite well for now and even has some advantages. (No-cost replication from
> work to home or offline by rsync or thumb drive, for example.)
>
> flip
>
> On Fri, Jan 23, 2009 at 5:49 PM, Raghu Angadi 
> wrote:
>
> > Mark Kerzner wrote:
> >
> >> But it would seem then that making a balanced directory tree would not
> >> help
> >> either - because there would be another binary search, correct? I
> assume,
> >> either way it would be as fast as can be :)
> >>
> >
> > But the cost of memory copies would be much less with a tree (when you
> add
> > and delete files).
> >
> > Raghu.
> >
> >
> >
> >>
> >> On Fri, Jan 23, 2009 at 5:08 PM, Raghu Angadi 
> >> wrote:
> >>
> >>> If you are adding and deleting files in the directory, you might notice CPU
> >>> penalty (for many loads, higher CPU on NN is not an issue). This is mainly
> >>> because HDFS does a binary search on files in a directory each time it
> >>> inserts a new file.
> >>>
> >>> If the directory is relatively idle, then there is no penalty.
> >>>
> >>> Raghu.
> >>>
> >>> Mark Kerzner wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> there is a performance penalty in Windows (pardon the expression) if you put
> >>>> too many files in the same directory. The OS becomes very slow, stops seeing
> >>>> them, and lies about their status to my Java requests. I do not know if this
> >>>> is also a problem in Linux, but in HDFS - do I need to balance a directory
> >>>> tree if I want to store millions of files, or can I put them all in the same
> >>>> directory?
> >>>>
> >>>> Thank you,
> >>>> Mark
> >>>
> >>
> >>
> >
>
>
> --
> http://www.infochimps.org
> Connected Open Free Data
>


Re: HDFS - millions of files in one directory?

2009-01-23 Thread Philip (flip) Kromer
I ran into this problem, hard, and I can vouch that this is not a windows-only
problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly slow with more
than a few hundred thousand files in the same directory. (The operation to
correct this mistake took a week to run.)  That is one of several hard
lessons I learned about "don't write your scraper to replicate the path
structure of each document as a file on disk."

Cascading the directory structure works, but sucks in various other ways,
and itself stops scaling after a while.  What I eventually realized is that
I was using the filesystem as a particularly wrongheaded document database,
and that the metadata delivery of a filesystem just doesn't work for this.

Since in our application the files are text and are immutable, our adhoc
solution is to encode and serialize each file with all its metadata, one per
line, into a flat file.

A distributed database is probably the correct answer, but this is working
quite well for now and even has some advantages. (No-cost replication from
work to home or offline by rsync or thumb drive, for example.)

flip

On Fri, Jan 23, 2009 at 5:49 PM, Raghu Angadi  wrote:

> Mark Kerzner wrote:
>
>> But it would seem then that making a balanced directory tree would not
>> help
>> either - because there would be another binary search, correct? I assume,
>> either way it would be as fast as can be :)
>>
>
> But the cost of memory copies would be much less with a tree (when you add
> and delete files).
>
> Raghu.
>
>
>
>>
>> On Fri, Jan 23, 2009 at 5:08 PM, Raghu Angadi 
>> wrote:
>>
>>> If you are adding and deleting files in the directory, you might notice CPU
>>> penalty (for many loads, higher CPU on NN is not an issue). This is mainly
>>> because HDFS does a binary search on files in a directory each time it
>>> inserts a new file.
>>>
>>> If the directory is relatively idle, then there is no penalty.
>>>
>>> Raghu.
>>>
>>> Mark Kerzner wrote:
>>>
>>>> Hi,
>>>>
>>>> there is a performance penalty in Windows (pardon the expression) if you put
>>>> too many files in the same directory. The OS becomes very slow, stops seeing
>>>> them, and lies about their status to my Java requests. I do not know if this
>>>> is also a problem in Linux, but in HDFS - do I need to balance a directory
>>>> tree if I want to store millions of files, or can I put them all in the same
>>>> directory?
>>>>
>>>> Thank you,
>>>> Mark



>>
>


-- 
http://www.infochimps.org
Connected Open Free Data


Re: HDFS - millions of files in one directory?

2009-01-23 Thread Raghu Angadi

Mark Kerzner wrote:

But it would seem then that making a balanced directory tree would not help
either - because there would be another binary search, correct? I assume,
either way it would be as fast as can be :)


But the cost of memory copies would be much less with a tree (when you 
add and delete files).


Raghu.




On Fri, Jan 23, 2009 at 5:08 PM, Raghu Angadi  wrote:


If you are adding and deleting files in the directory, you might notice CPU
penalty (for many loads, higher CPU on NN is not an issue). This is mainly
because HDFS does a binary search on files in a directory each time it
inserts a new file.

If the directory is relatively idle, then there is no penalty.

Raghu.


Mark Kerzner wrote:


Hi,

there is a performance penalty in Windows (pardon the expression) if you
put
too many files in the same directory. The OS becomes very slow, stops
seeing
them, and lies about their status to my Java requests. I do not know if
this
is also a problem in Linux, but in HDFS - do I need to balance a directory
tree if I want to store millions of files, or can I put them all in the
same
directory?

Thank you,
Mark








Re: HDFS - millions of files in one directory?

2009-01-23 Thread Mark Kerzner
But it would seem then that making a balanced directory tree would not help
either - because there would be another binary search, correct? I assume,
either way it would be as fast as can be :)



On Fri, Jan 23, 2009 at 5:08 PM, Raghu Angadi  wrote:

>
> If you are adding and deleting files in the directory, you might notice CPU
> penalty (for many loads, higher CPU on NN is not an issue). This is mainly
> because HDFS does a binary search on files in a directory each time it
> inserts a new file.
>
> If the directory is relatively idle, then there is no penalty.
>
> Raghu.
>
>
> Mark Kerzner wrote:
>
>> Hi,
>>
>> there is a performance penalty in Windows (pardon the expression) if you
>> put
>> too many files in the same directory. The OS becomes very slow, stops
>> seeing
>> them, and lies about their status to my Java requests. I do not know if
>> this
>> is also a problem in Linux, but in HDFS - do I need to balance a directory
>> tree if I want to store millions of files, or can I put them all in the
>> same
>> directory?
>>
>> Thank you,
>> Mark
>>
>>
>


Re: HDFS - millions of files in one directory?

2009-01-23 Thread Raghu Angadi

Raghu Angadi wrote:


If you are adding and deleting files in the directory, you might notice 
CPU penalty (for many loads, higher CPU on NN is not an issue). This is 
mainly because HDFS does a binary search on files in a directory each 
time it inserts a new file.


I should add that an equal or even bigger cost is the memmove that
ArrayList does when you add or delete entries.


An ArrayList, rather than a map, is used mainly to save memory, the most
precious resource for the NameNode.
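
A simplified illustration of that trade-off in plain Java (this is not the
actual NameNode code): keeping the children sorted in an ArrayList makes the
lookup a cheap binary search, but each insert shifts every later entry.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedChildrenDemo {
  public static void main(String[] args) {
    List<String> children = new ArrayList<String>();
    String[] newFiles = { "part-00002", "part-00000", "part-00001" };
    for (String name : newFiles) {
      // Binary search finds the insertion point in O(log n)...
      int pos = Collections.binarySearch(children, name);
      if (pos < 0) {
        pos = -(pos + 1);
        // ...but the insert itself shifts every later entry (the memmove cost).
        children.add(pos, name);
      }
    }
    System.out.println(children);   // [part-00000, part-00001, part-00002]
  }
}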


Raghu.


If the directory is relatively idle, then there is no penalty.

Raghu.

Mark Kerzner wrote:

Hi,

there is a performance penalty in Windows (pardon the expression) if 
you put
too many files in the same directory. The OS becomes very slow, stops 
seeing
them, and lies about their status to my Java requests. I do not know 
if this
is also a problem in Linux, but in HDFS - do I need to balance a 
directory
tree if I want to store millions of files, or can I put them all in 
the same

directory?

Thank you,
Mark







Re: HDFS - millions of files in one directory?

2009-01-23 Thread Mark V
On Sat, Jan 24, 2009 at 10:03 AM, Mark Kerzner  wrote:
> Hi,
>
> there is a performance penalty in Windows (pardon the expression) if you put
> too many files in the same directory. The OS becomes very slow, stops seeing
> them, and lies about their status to my Java requests. I do not know if this
> is also a problem in Linux, but in HDFS - do I need to balance a directory
> tree if I want to store millions of files, or can I put them all in the same
> directory?
>
From my old Windows days...
There is a registry setting to turn off a feature whereby Windows
keeps a mapping of 8.3 filenames to the full filenames - I can't recall
it exactly, but it is worth looking for.
Also, try naming your files so that the 'unique' part of the filename
comes first, e.g. 123_inventoryid.ext is 'better' than
inventoryid_123.ext

HTH
Mark

> Thank you,
> Mark
>


Re: HDFS - millions of files in one directory?

2009-01-23 Thread Raghu Angadi


If you are adding and deleting files in the directory, you might notice 
CPU penalty (for many loads, higher CPU on NN is not an issue). This is 
mainly because HDFS does a binary search on files in a directory each 
time it inserts a new file.


If the directory is relatively idle, then there is no penalty.

Raghu.

Mark Kerzner wrote:

Hi,

there is a performance penalty in Windows (pardon the expression) if you put
too many files in the same directory. The OS becomes very slow, stops seeing
them, and lies about their status to my Java requests. I do not know if this
is also a problem in Linux, but in HDFS - do I need to balance a directory
tree if I want to store millions of files, or can I put them all in the same
directory?

Thank you,
Mark





HDFS - millions of files in one directory?

2009-01-23 Thread Mark Kerzner
Hi,

there is a performance penalty in Windows (pardon the expression) if you put
too many files in the same directory. The OS becomes very slow, stops seeing
them, and lies about their status to my Java requests. I do not know if this
is also a problem in Linux, but in HDFS - do I need to balance a directory
tree if I want to store millions of files, or can I put them all in the same
directory?

Thank you,
Mark