The part nomenclature does not refer to splits. It refers to how many reduce processes were involved in actually writing the output; files are split at read time as necessary. You will get more part files if you run more reducers (a quick sketch of how to set that follows).
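Concretely, the knob is the reducer count on the job. A minimal sketch, assuming the old org.apache.hadoop.mapred API that was current at the time; the class name "CrawlJob", the job name, and the paths are made up for illustration, and mapper/reducer setup is elided (the identity defaults will run over text input):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class CrawlJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CrawlJob.class);
        conf.setJobName("crawl-catalog");   // hypothetical job name

        // One output file per reduce task: with 4 reducers the job
        // writes part-00000 through part-00003 under the output dir.
        conf.setNumReduceTasks(4);

        // Old-API path setters; these paths are placeholders.
        conf.setInputPath(new Path("/user/bear/input"));
        conf.setOutputPath(new Path("/user/bear/output"));

        JobClient.runJob(conf);
      }
    }

With that setting, an "ls" on the dfs would show part-00000 through part-00003 instead of a single part-00000.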
On 1/16/08 8:25 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:

> Thanks Ted. I just didn't ask it right. Here is a stupid 101
> question, which I am sure the answer lies in the documentation
> somewhere, just that I was having some difficulty finding it...
>
> When I do an "ls" on the dfs, I see this:
> /user/bear/output/part-00000 <r 4>
>
> I probably got confused about what part-##### means... I thought
> part-##### tells how many splits a file has... So far, I have only
> seen part-00000. When will it have part-00001, 00002, etc.?
>
> On Jan 16, 2008 11:04 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>> Parallelizing the processing of data occurs at two steps. The first is
>> during the map phase, where the input data file is (hopefully) split
>> across multiple tasks. This should happen transparently most of the
>> time, unless you have a perverse data format or use unsplittable
>> compression on your file.
>>
>> This parallelism can occur whether you have one input file or many.
>>
>> The second level of parallelism is at the reduce phase. You set this by
>> setting the number of reducers. This will also determine the number of
>> output files that you get.
>>
>> Depending on your algorithm, it may help or hurt to have one or many
>> reducers. The recent example of a program to find the 10 largest
>> elements pretty much requires a single reducer. Other programs, where
>> the mapper produces huge amounts of output, would be better served by
>> having many reducers.
>>
>> This is a general answer, since the question is kind of non-specific.
>>
>> On 1/16/08 7:59 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> How do I make Hadoop split its output? The program I am writing
>>> crawls a catalog tree from a single URL, so initially the input
>>> contains only one entry. After a few iterations, it will have tens of
>>> thousands of URLs. But what I noticed is that the file is always in
>>> one block (part-00000). What I would like is for the job to
>>> parallelize once the number of entries increases. Currently that
>>> doesn't seem to be the case.
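P.S. The flip side of the top-10 example Ted mentions above: when a single global result (and hence a single output file) is required, the same knob is pinned to one (same API assumption as the sketch above):

    conf.setNumReduceTasks(1);   // one reducer -> a single part-00000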