Re: one key per output part file
No. That is a limitation of streaming.

On 4/1/08 6:42 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> This seems like a reasonable solution - but I am using Hadoop streaming
> and my reducer is a perl script. Is it possible to handle side-effect
> files in streaming? I haven't found anything that indicates that you
> can...
>
> Ashish
Re: one key per output part file
This seems like a reasonable solution - but I am using Hadoop streaming and my reducer is a perl script. Is it possible to handle side-effect files in streaming? I haven't found anything that indicates that you can...

Ashish

On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Try opening the desired output file in the reduce method. Make sure that
> the output files are relative to the correct task-specific directory
> (look for side-effect files on the wiki).
Re: one key per output part file
Try opening the desired output file in the reduce method. Make sure that the output files are relative to the correct task-specific directory (look for side-effect files on the wiki).

On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> that will generate output where a single key is found in a single
> output part file.
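For Java jobs, a minimal sketch of this side-effect-file approach might look like the code below. Hedges: FileOutputFormat.getWorkOutputPath() only appeared in later releases of the old mapred API than the 0.15.x discussed in this thread (on older versions you would construct the task-specific path by hand, per the side-effect files wiki page), and the key-to-filename mapping is an illustrative assumption.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch: write one side-effect file per key from the reduce method,
// instead of emitting through the collector's single part-XXXXX file.
public class OneFilePerKeyReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // The work output path is the task attempt's temporary directory;
    // files written here are promoted to the job output only if the
    // task commits, so failed/speculative attempts leave no debris.
    Path dir = FileOutputFormat.getWorkOutputPath(conf);
    // Assumes keys are safe to use as file names (no '/' etc.).
    Path file = new Path(dir, "key-" + key.toString());
    FileSystem fs = file.getFileSystem(conf);
    FSDataOutputStream out = fs.create(file);
    try {
      while (values.hasNext()) {
        out.writeBytes(values.next().toString());
        out.writeBytes("\n");
      }
    } finally {
      out.close();
    }
  }
}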
one key per output part file
Hi, I am using Hadoop streaming and I am trying to create a MapReduce job that will generate output where a single key is found in a single output part file. Does anyone know how to ensure this condition? I want each reduce task (no matter how many are specified) to receive the key-value output for a single key at a time, process the key-value pairs for that key, write an output part-XXX file, and only then process the next key.

Here is the task that I am trying to accomplish:

Input: Corpus T (lines of text), Corpus V (each line has 1 word)
Output: Each part-XXX should contain the lines of T that contain the word from line XXX in V.

Any help/ideas are appreciated.

Ashish
Re: distcp fails: Input source not found
Here is what is happening, directly from the EC2 screen. The ID and Secret Key are the only things changed. I'm running hadoop 0.15.3 from the public AMI. I launched a 2 machine cluster using the ec2 scripts in src/contrib/ec2/bin. The file I try to copy is 9KB (I noticed previous discussion on empty files and files that are > 10MB).

First I make sure that we can copy the file from S3:

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -copyToLocal s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml /usr/InputFileFormat.xml

Now I see that the file is copied to the ec2 master (where I'm logged in):

[EMAIL PROTECTED] hadoop-0.15.3]# dir /usr/Input*
/usr/InputFileFormat.xml

Next I make sure I can access the HDFS and that the input directory is there:

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls /
Found 2 items
/input          2008-04-01 15:45
/mnt            2008-04-01 15:42
[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls /input/
Found 0 items

I make sure hadoop is running just fine by running an example:

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop jar hadoop-0.15.3-examples.jar pi 10 1000
Number of Maps = 10 Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
08/04/01 17:38:14 INFO mapred.FileInputFormat: Total input paths to process : 10
08/04/01 17:38:14 INFO mapred.JobClient: Running job: job_200804011542_0001
08/04/01 17:38:15 INFO mapred.JobClient: map 0% reduce 0%
08/04/01 17:38:22 INFO mapred.JobClient: map 20% reduce 0%
08/04/01 17:38:24 INFO mapred.JobClient: map 30% reduce 0%
08/04/01 17:38:25 INFO mapred.JobClient: map 40% reduce 0%
08/04/01 17:38:27 INFO mapred.JobClient: map 50% reduce 0%
08/04/01 17:38:28 INFO mapred.JobClient: map 60% reduce 0%
08/04/01 17:38:31 INFO mapred.JobClient: map 80% reduce 0%
08/04/01 17:38:33 INFO mapred.JobClient: map 90% reduce 0%
08/04/01 17:38:34 INFO mapred.JobClient: map 100% reduce 0%
08/04/01 17:38:43 INFO mapred.JobClient: map 100% reduce 20%
08/04/01 17:38:44 INFO mapred.JobClient: map 100% reduce 100%
08/04/01 17:38:45 INFO mapred.JobClient: Job complete: job_200804011542_0001
08/04/01 17:38:45 INFO mapred.JobClient: Counters: 9
08/04/01 17:38:45 INFO mapred.JobClient: Job Counters
08/04/01 17:38:45 INFO mapred.JobClient: Launched map tasks=10
08/04/01 17:38:45 INFO mapred.JobClient: Launched reduce tasks=1
08/04/01 17:38:45 INFO mapred.JobClient: Data-local map tasks=10
08/04/01 17:38:45 INFO mapred.JobClient: Map-Reduce Framework
08/04/01 17:38:45 INFO mapred.JobClient: Map input records=10
08/04/01 17:38:45 INFO mapred.JobClient: Map output records=20
08/04/01 17:38:45 INFO mapred.JobClient: Map input bytes=240
08/04/01 17:38:45 INFO mapred.JobClient: Map output bytes=320
08/04/01 17:38:45 INFO mapred.JobClient: Reduce input groups=2
08/04/01 17:38:45 INFO mapred.JobClient: Reduce input records=20
Job Finished in 31.028 seconds
Estimated value of PI is 3.1556

Finally, I try to copy the file over:

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop distcp s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml /input/InputFileFormat.xml
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml does not exist.
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:470)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:550)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:563)

[EMAIL PROTECTED] wrote:
> Your command looks right to me. Would you mind showing me the exact error
> message you see?
>
> Nicholas
What happens if a namenode fails?
What happens to your data if the namenode fails (hardware failure)? Assuming you replace it with a fresh box, can you restore all of your data from the slaves?

-Xavier
Re: Hadoop input path - can it have subdirectories
thanks Ted for that info [or hack :) ]

I had this directory structure -

monthData
|- week1
|- week2

If I give the monthData directory as the input path, I get an exception -

08/04/01 14:24:30 INFO mapred.FileInputFormat: Total input paths to process : 2
Exception in thread "main" java.io.IOException: Not a file: hdfs://master:54310/user/hadoop/monthData/week1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:170)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:515)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
at com.bdc.dod.dashboard.BDCQueryStatsViewer.run(BDCQueryStatsViewer.java:829)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.bdc.dod.dashboard.BDCQueryStatsViewer.main(BDCQueryStatsViewer.java:796)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

However, if I give monthData/* as the input path, the job runs :)

I feel the default behavior should be to recurse into the parent directory and get all the files in the subdirectories. It is not difficult to imagine a situation where one wants to partition the data into subdirectories under one base directory, so that a job can be run either on the base directory or on a single subdirectory.

-Taran

On Tue, Apr 1, 2008 at 12:04 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> But wildcards that match directories that contain files work well.
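For reference, the glob workaround in Java job-setup terms is just a wildcard in the input path. A minimal sketch using the old (pre-0.17) JobConf API and the directory names from above; the driver class and output path are hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver fragment: point the job at the week subdirectories
// via a glob, so the input paths resolve to directories of files rather
// than to monthData itself (which contains directories and triggers the
// "Not a file" exception above).
public class MonthJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MonthJobDriver.class);
    conf.setJobName("month-data");
    conf.setInputPath(new Path("monthData/*"));   // matches week1, week2, ...
    conf.setOutputPath(new Path("monthData-out"));
    // ... mapper/reducer classes set here as usual ...
    JobClient.runJob(conf);
  }
}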
Re: distcp fails: Input source not found
> That was a typo in my email. I do have s3:// in my command when it fails.
> Not sure what's wrong.

Your command looks right to me. Would you mind showing me the exact error message you see?

Nicholas
Re: distcp fails: Input source not found
Are you missing a colon in the first command? Probably just a transcription error when you composed your email (but I have made similar mistakes often enough and been unable to see them).

On 4/1/08 1:18 PM, "Prasan Ary" <[EMAIL PROTECTED]> wrote:
> Just to make sure that I am specifying the parameters correctly:
>
> bin/hadoop distcp s3//:@/fileone.txt /somefolder_on_hdfs/fileone.txt : Fails - Input source doesn't exist.
>
> bin/hadoop fs -copyToLocal s3://:@bucket/fileone.txt /somefolder_on_local/fileone.txt : succeeds
>
> Any correction?
>
> Thanks.
Re: distcp fails: Input source not found
That was a typo in my email. I do have s3:// in my command when it fails. Not sure what's wrong.

--- [EMAIL PROTECTED] wrote:
> Should "s3//..." be "s3://..."?
>
> Nicholas
Re: distcp fails: Input source not found
> bin/hadoop distcp s3//:@/fileone.txt /somefolder_on_hdfs/fileone.txt :
> Fails - Input source doesn't exist.

Should "s3//..." be "s3://..."?

Nicholas
distcp fails: Input source not found
Hi,

I am running hadoop 0.15.3 on 2 EC2 instances from a public AMI (ami-381df851). Our input files are on S3. When I try to do a distcp for an input file from S3 onto HDFS on EC2, the copy fails with an error that the file does not exist. However, if I run copyToLocal from S3 onto the local machine, the copy succeeds, confirming that the input file does exist in the S3 bucket. Furthermore, I can see the input file in the S3 bucket from the Mozilla S3 Firefox Organizer as well. I am left with no explanation as to why I am getting the File Not Found error on S3 when I have every confirmation that the file does exist over there.

Just to make sure that I am specifying the parameters correctly:

> bin/hadoop distcp s3//:@/fileone.txt /somefolder_on_hdfs/fileone.txt : Fails - Input source doesn't exist.
> bin/hadoop fs -copyToLocal s3://:@bucket/fileone.txt /somefolder_on_local/fileone.txt : succeeds

Any correction?

Thanks.
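For comparison, the working URI form for distcp matches the copyToLocal one; the credentials and bucket below are hypothetical placeholders:

bin/hadoop distcp s3://AWS_KEY_ID:AWS_SECRET_KEY@bucket/fileone.txt /somefolder_on_hdfs/fileone.txt

One known pitfall of this era: an AWS secret key containing a '/' character breaks the s3:// URI form, and the usual workaround was to URL-escape the slash as %2F or regenerate the key.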
Re: Hadoop input path - can it have subdirectories
But wildcards that match directories that contain files work well.

On 4/1/08 10:41 AM, "Peeyush Bishnoi" <[EMAIL PROTECTED]> wrote:
> Hello,
>
> No, Hadoop can't traverse recursively inside subdirectories with a Java
> Map-Reduce program. It has to be just a directory containing files (and
> no sub-directories).
Re: Hadoop input path - can it have subdirectories
Sorry, I was wrong. I just checked my installation, and by default, streaming appears to work as people have described -- it doesn't recurse subdirectories. If I pass a directory containing only directories as the -input parameter, I get the following error:

08/04/01 15:00:56 ERROR streaming.StreamJob: Error Launching job : Not a file: hdfs://10.188.239.122:9000/kpi2/kpi3
Streaming Job Failed!

It will, however, enumerate all files in the input directory, provided the input directory contains only files. I think that's what confused me.

On Tue, Apr 1, 2008 at 1:51 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> This is actually a characteristic of the InputFormat that you're using.
> Hadoop reads data using InputFormat-s, and standard implementations may
> indeed not support subdirectories - but if you need this functionality
> you can implement your own InputFormat.
Very basic question about sharing load across different machines
Hi, we are trying to make use of hadoop to distribute the load of our application to different hosts. However, as we have already identified that the bottleneck of the system is database access, just using map reduce to load-balance the application tier doesn't help much. Can we use Map/Reduce to also distribute the load on the DB layer?
Re: Hadoop input path - can it have subdirectories
Peeyush Bishnoi wrote:
> Hello,
>
> No, Hadoop can't traverse recursively inside subdirectories with a Java
> Map-Reduce program. It has to be just a directory containing files (and
> no sub-directories).

That's not the case.

This is actually a characteristic of the InputFormat that you're using. Hadoop reads data using InputFormat-s, and standard implementations may indeed not support subdirectories - but if you need this functionality you can implement your own InputFormat.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
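As a rough illustration of that point, a sketch of an InputFormat that recurses into subdirectories follows. It assumes a Hadoop release where FileInputFormat.listStatus(JobConf) is overridable - true for the old mapred API in 0.18-era and later releases, not for the 0.15/0.16 versions discussed in this thread, where the equivalent hook differs.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: a TextInputFormat that walks into subdirectories instead of
// failing with "Not a file" when an input path contains directories.
public class RecursiveTextInputFormat extends TextInputFormat {

  @Override
  protected FileStatus[] listStatus(JobConf job) throws IOException {
    List<FileStatus> files = new ArrayList<FileStatus>();
    for (FileStatus status : super.listStatus(job)) {
      FileSystem fs = status.getPath().getFileSystem(job);
      collect(fs, status, files);
    }
    return files.toArray(new FileStatus[files.size()]);
  }

  // Depth-first walk: keep plain files, descend into directories.
  private void collect(FileSystem fs, FileStatus status,
                       List<FileStatus> files) throws IOException {
    if (status.isDir()) {
      for (FileStatus child : fs.listStatus(status.getPath())) {
        collect(fs, child, files);
      }
    } else {
      files.add(status);
    }
  }
}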
RE: Hadoop input path - can it have subdirectories
Hello,

No, Hadoop can't traverse recursively inside subdirectories with a Java Map-Reduce program. It has to be just a directory containing files (and no sub-directories).

---
Peeyush

-----Original Message-----
From: Tarandeep Singh [mailto:[EMAIL PROTECTED]
Sent: Tue 4/1/2008 9:15 AM
To: [EMAIL PROTECTED]
Subject: Hadoop input path - can it have subdirectories

Hi,

Can I give a directory (having subdirectories) as the input path to a Hadoop Map-Reduce job? I tried, but got an error.

Can Hadoop recursively traverse the input directory and collect all the file names, or does the input path have to be just a directory containing files (and no sub-directories)?

-Taran
RE: Hadoop input path - can it have subdirectories
My experience running with the Java API is that subdirectories in the input path do cause an exception, so the streaming file input processing must be different.

Jeff Eastman

> -----Original Message-----
> From: Norbert Burger [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 01, 2008 9:46 AM
> To: core-user@hadoop.apache.org
> Cc: [EMAIL PROTECTED]
> Subject: Re: Hadoop input path - can it have subdirectories
>
> Yes, this is fine, at least for Hadoop Streaming. I specify the root of
> my logs directory as my -input parameter, and Hadoop correctly finds all
> of the child directories. What's the error you're seeing? Is a stack
> trace available?
>
> Norbert
Re: Hadoop input path - can it have subdirectories
Yes, this is fine, at least for Hadoop Streaming. I specify the root of my logs directory as my -input parameter, and Hadoop correctly finds all of the child directories. What's the error you're seeing? Is a stack trace available?

Norbert

On Tue, Apr 1, 2008 at 12:15 PM, Tarandeep Singh <[EMAIL PROTECTED]> wrote:
> Can I give a directory (having subdirectories) as the input path to a
> Hadoop Map-Reduce job? I tried, but got an error.
Hadoop input path - can it have subdirectories
Hi,

Can I give a directory (having subdirectories) as the input path to a Hadoop Map-Reduce job? I tried, but got an error.

Can Hadoop recursively traverse the input directory and collect all the file names, or does the input path have to be just a directory containing files (and no sub-directories)?

-Taran
Re: Nutch and Distributed Lucene
Hi,

Nutch builds Lucene indexes. But Nutch is much more than that. It is a web search application that crawls the web, inverts links and builds indexes. Each step is one or more Map/Reduce jobs. You can find more information at http://lucene.apache.org/nutch/

The Map/Reduce job that builds Lucene indexes in Nutch is customized to the data schema/structures used in Nutch. The index contrib package in Hadoop provides a general/configurable process to build Lucene indexes in parallel using a Map/Reduce job. That's the main difference. There is also the difference that the index build job in Nutch builds indexes only in reduce tasks, while the index contrib package builds indexes in both map and reduce tasks, and there are advantages in doing that...

Regards,
Ning

On 4/1/08, Naama Kraus <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'd like to know if Nutch is running on top of Lucene, or is it unrelated
> to Lucene. I.e. indexing, parsing, crawling, internal data structures ...
> - all written from scratch using MapReduce (my impression)?
>
> What is the relation between Nutch and the distributed Lucene patch that
> was added lately to Hadoop?
>
> Thanks for any enlightening,
> Naama
Re: Performance impact of underlying file system?
I would expect that most file systems can saturate the disk bandwidth for the large sequential reads that hadoop does. We use ext3 with good results.

On 4/1/08 8:08 AM, "Colin Freas" <[EMAIL PROTECTED]> wrote:
> Is the performance of Hadoop impacted by the underlying file system on
> the nodes at all?
>
> All my nodes are ext3. I'm wondering if using XFS, Reiser, or ZFS might
> improve performance.
>
> Does anyone have any offhand knowledge about this?
>
> -Colin
Hadoop Reduce Problem
Hi,

I am having a problem in the reduce phase when the firewall is enabled. I am using the default settings from hadoop-default.xml and hadoop-site.xml, and I only changed the port number in mapred.task.tracker.report.address. I created a firewall rule to allow the port range 5:50100 between the slaves and the master. But reduce on the slaves seems to use some other ports, so reduce always hangs with the firewall enabled. If I disable the firewall, it works fine.

Could you please let me know what I am missing, or where to control Hadoop's random port creation?

Thanks,
Senthil
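In case it helps later readers (no reply appears in this thread): in releases of this era the reducers fetch map output from each TaskTracker's embedded HTTP server (default port 50060), while mapred.task.tracker.report.address normally binds to the loopback interface and is only contacted by local child tasks, so changing it should not affect traffic between machines. The DataNode and NameNode/JobTracker IPC ports must be reachable between nodes as well. A sketch for hadoop-site.xml pinning the shuffle port; the property name is taken from 0.16-era hadoop-default.xml and should be verified against your release:

<!-- Sketch for hadoop-site.xml: pin the TaskTracker HTTP server that
     serves map output to reducers during the shuffle. -->
<property>
  <name>mapred.task.tracker.http.address</name>
  <value>0.0.0.0:50060</value>
</property>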
Performance impact of underlying file system?
Is the performance of Hadoop impacted by the underlying file system on the nodes at all?

All my nodes are ext3. I'm wondering if using XFS, Reiser, or ZFS might improve performance.

Does anyone have any offhand knowledge about this?

-Colin
Nutch and Distributed Lucene
Hi,

I'd like to know if Nutch is running on top of Lucene, or is it unrelated to Lucene. I.e. indexing, parsing, crawling, internal data structures ... - all written from scratch using MapReduce (my impression)?

What is the relation between Nutch and the distributed Lucene patch that was added lately to Hadoop?

Thanks for any enlightening,
Naama

--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales." (Albert Einstein)
Re: Performance / cluster scaling question
Hi Chris & Hadoopers,

we changed our system architecture so that most of the data is now streamed directly from the spider/crawler nodes instead of using/creating temporary files on the DFS - now it performs way better and the exceptions are gone :-) ...seems to be a good decision when having only a relatively small cluster (like ours w/ 8 data nodes), where the deletion of blocks seems not to catch up with the creation of new temp files (through the max 100 blocks/3 seconds deletion "restriction").

Cu on the 'net,
Bye - bye, André

Chris K Wensel wrote:

If it's any consolation, I'm seeing similar behaviors on 0.16.0 when running on EC2 when I push the number of nodes in the cluster past 40.

On Mar 24, 2008, at 6:31 AM, André Martin wrote:

Thanks for the clarification, dhruba :-) Anyway, what can cause those other exceptions, such as "Could not get block locations" and "DataXceiver: java.io.EOFException"? Can anyone give me a little more insight into those exceptions? And does anyone have a similar workload (frequent writes and deletion of small files), and what could cause the performance degradation (see first post)? I think HDFS should be able to handle two million and more files/blocks... Also, I observed that some of my datanodes do not "heartbeat" to the namenode for several seconds (up to 400 :-() from time to time - when I check those specific datanodes and do a "top", I see the "du" command running, which seems to have gotten stuck?!?

dhruba Borthakur wrote:

The namenode lazily instructs a Datanode to delete blocks. As a response to every heartbeat from a Datanode, the Namenode instructs it to delete a maximum of 100 blocks. Typically, the heartbeat periodicity is 3 seconds. The heartbeat thread in the Datanode deletes the block files synchronously before it can send the next heartbeat. That's the reason a small number (like 100) was chosen. If you have 8 datanodes, your system will probably delete about 800 blocks every 3 seconds.

Thanks,
dhruba

-----Original Message-----
From: André Martin [mailto:[EMAIL PROTECTED]
Sent: Friday, March 21, 2008 3:06 PM
To: core-user@hadoop.apache.org
Subject: Re: Performance / cluster scaling question

After waiting a few hours (without having any load), the block number and "DFS Used" space seem to go down... My question is: is the hardware simply too weak/slow to send the block deletion requests to the datanodes in a timely manner, or do those "crappy" HDDs simply cause the delay, since I noticed that it can take up to 40 minutes to delete ~400,000 files at once manually using "rm -r"... Actually - my main concern is why the performance, i.e. the throughput, goes down - any ideas?
Re: small sized files - how to use MultiFileInputFormat
Hi,

An example extracting one record per file would be:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileInputFormat;
import org.apache.hadoop.mapred.MultiFileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class FooInputFormat extends MultiFileInputFormat {

  @Override
  public RecordReader getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    return new FooRecordReader(job, (MultiFileSplit) split);
  }

  public static class FooRecordReader implements RecordReader {

    private MultiFileSplit split;
    private long offset;      // bytes consumed so far, for progress
    private long totLength;   // total bytes across all files in the split
    private FileSystem fs;
    private int count = 0;    // index of the next file to read
    private Path[] paths;

    public FooRecordReader(Configuration conf, MultiFileSplit split)
        throws IOException {
      this.split = split;
      this.fs = FileSystem.get(conf);
      this.paths = split.getPaths();
      this.totLength = split.getLength();
      this.offset = 0;
    }

    // The original sketch left the key/value types open; Text is one
    // reasonable choice and is used here.
    public WritableComparable createKey() { return new Text(); }
    public Writable createValue() { return new Text(); }

    public void close() throws IOException { }

    public long getPos() throws IOException { return offset; }

    public float getProgress() throws IOException {
      return ((float) offset) / totLength;
    }

    // One record per file; the original left the parsing open ("fill in
    // key and value") - here: key = file name, value = the first line.
    public boolean next(Writable key, Writable value) throws IOException {
      if (offset >= totLength) return false;
      if (count >= split.numPaths()) return false;
      Path file = paths[count];
      FSDataInputStream stream = fs.open(file);
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(stream));
      String line = reader.readLine();
      ((Text) key).set(file.toString());
      ((Text) value).set(line == null ? "" : line);
      reader.close();
      stream.close();
      offset += split.getLength(count);
      count++;
      return true;
    }
  }
}

I guess I should add example code to the mapred tutorial and the examples directory.

Jason Curtes wrote:
> Hello,
>
> I have been trying to run Hadoop on a set of small text files, not
> larger than 10k each. The total input size is 15MB. If I try to run the
> example word count application, it takes about 2000 seconds, more than
> half an hour, to complete. However, if I merge all the files into one
> large file, it takes much less than a minute. I think using
> MultiFileInputFormat can be helpful at this point. However, the API
> documentation is not really helpful. I wonder if MultiFileInputFormat
> can really solve my problem, and if so, can you suggest a reference on
> how to use it, or a few lines to be added to the word count example to
> make things more clear?
>
> Thanks in advance.
>
> Regards,
>
> Jason Curtes
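For context, a hypothetical driver wiring for the sketch above, using the old (pre-0.17) JobConf API; the class and path names are illustrative, not from the original mail:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class FooDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FooDriver.class);
    conf.setJobName("small-files-demo");
    conf.setInputFormat(FooInputFormat.class);  // one record per small file
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setInputPath(new Path("smallfiles"));       // directory of inputs
    conf.setOutputPath(new Path("smallfiles-out"));
    JobClient.runJob(conf);  // identity map/reduce by default
  }
}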
Re: small sized files - how to use MultiFileInputFormat
In principle I agree with you, Ted. However, in many cases we have multiple large jobs generating outputs that are not that big, and as a result the number of small files (significantly smaller than a Hadoop block) is large; the default splitting logic then triggers jobs with a large number of tasks that inefficiently clog the cluster.

The MultiFileInputFormat helps to avoid that, but it has a problem: if the file set is a mix of small and large files, the splits are uneven, and it does not split single large files.

To address this we've written our own InputFormat (for Text and SequenceFiles) that collapses small files into splits up to the block size and splits big files at the block size. It has a twist that lets you specify either the max number of MAPs you want or the BLOCK size you want to use for the splits. When a particular split contains multiple small files, the suggested hosts for the split are ordered based on the host that has most of the data for those files.

We still have to do some cleanup on the code, and then we'll submit it to Hadoop.

A

On Sat, Mar 29, 2008 at 10:20 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Small files are a bad idea for high throughput no matter what technology
> you use. The issue is that you need a larger file in order to avoid disk
> seeks.
>
> On 3/28/08 7:34 PM, "Jason Curtes" <[EMAIL PROTECTED]> wrote:
> > I have been trying to run Hadoop on a set of small text files, not
> > larger than 10k each. The total input size is 15MB.