Re: Yahoo's production webmap is now on Hadoop
Guys,

Thanks for the clarification and math explanations. Such a number would then likely be 100x my original estimate, given that the web may have doubled each year since that blog post and is growing exponentially. Index size was only a byproduct of trying to discern the significance of 1 trillion links in an inverted web graph. Hadoop has certainly arrived and become a valuable software asset likely to power next-generation Internet computing.

Thanks again,

Peter W.

On Feb 19, 2008, at 5:33 PM, Eric Baldeschwieler wrote:

Search engine index size comparison is actually a very inexact science. Various 3rd parties comparing the major search engines do not come to the same conclusions. But ours is certainly world class and well over the discussed sizes.

Here is an interesting bit of web history... A blog from August 8, 2005 discussing our index of over 19.2 billion web documents. It has only grown since then.

http://www.ysearchblog.com/archives/000172.html

On Feb 19, 2008, at 2:38 PM, Ted Dunning wrote:

Sorry to be picky about the math, but 1 trillion = 10^12 = a million million. At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9. At 100 links per page, this gives 10B pages.

On 2/19/08 2:25 PM, "Peter W." <[EMAIL PROTECTED]> wrote:

Amazing milestone,

Looks like Y! had approximately 1B documents in the WebMap:

one trillion links = (10k million links / 10 links per page) = 1000 million pages = one billion.

If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has achieved one-tenth of its scale?

Good stuff,

Peter W.

On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:

The link inversion and ranking algorithms for Yahoo Search are now being generated on Hadoop:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
Re: FileOutputFormat which does not write key value?
Re-reading the thread convinces me that this is a difference between TextOutputFormat and other output formats.

On 2/19/08 6:01 PM, "Andy Li" <[EMAIL PROTECTED]> wrote:

> Shouldn't the official way to do this be to implement your own
> RecordWriter and your own OutputFormat class?
>
>     conf.setOutputFormat(yourClass);
>
> Inside yourClass, you can return your own RecordWriter from the
> getRecordWriter method.
>
> I did this on the FileInputFormat side with my own RecordReader and it
> worked for me to take a KEY and a null VALUE into the Mapper. I believe
> it is the same thing vice versa.
>
> But there should be a formal way, instead of trial-and-error, to see
> what the system default is. I guess the system does not have a standard
> spec defining the default values? Maybe this is why Ted has such
> concerns about incompatibility in 0.16.*?
>
> -Andy
>
> On Feb 19, 2008 3:02 PM, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>
>> Hmmm...
>>
>> Maybe I should rather go to bed (it is just midnight in my part of the
>> world...) but I think I did what you are saying:
>>
>> Configuration:
>>
>>     conf.setOutputKeyClass(NullWritable.class);
>>     conf.setOutputValueClass(Text.class);
>>
>> And the reducer:
>>
>>     public class PermutationReduce extends MapReduceBase implements
>>         Reducer {
>>       public void reduce(Text key, Iterator values,
>>           OutputCollector output, Reporter reporter) throws IOException {
>>         while (values.hasNext()) {
>>           output.collect(NullWritable.get(), values.next());
>>         }
>>       }
>>     }
>>
>> Regards,
>> Lukas
>>
>> On 2/19/08, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>>>
>>> On Feb 19, 2008, at 1:52 PM, Lukas Vlcek wrote:
>>>
>>>> Hi,
>>>>
>>>> I don't care about the key value in the output file. Is there any way
>>>> I can suppress the key in the output? Is there a way to tell
>>>> (Text)OutputFormat not to write the key but only the value? Or can I
>>>> pass my own implementation of RecordWriter into FileOutputFormat?
>>>
>>> The easiest way is to put either null or a NullWritable in for the
>>> key coming out of the reduce. The TextOutputFormat will drop the tab
>>> character. You can also define your own OutputFormat and encode them
>>> as you wish.
>>>
>>> -- Owen
>>
>> --
>> http://blog.lukas-vlcek.com/
Re: FileOutputFormat which does not write key value?
Shouldn't the official way to do this be to implement your own RecordWriter and your own OutputFormat class?

    conf.setOutputFormat(yourClass);

Inside yourClass, you can return your own RecordWriter from the getRecordWriter method.

I did this on the FileInputFormat side with my own RecordReader and it worked for me to take a KEY and a null VALUE into the Mapper. I believe it is the same thing vice versa.

But there should be a formal way, instead of trial-and-error, to see what the system default is. I guess the system does not have a standard spec defining the default values? Maybe this is why Ted has such concerns about incompatibility in 0.16.*?

-Andy

On Feb 19, 2008 3:02 PM, Lukas Vlcek <[EMAIL PROTECTED]> wrote:

> Hmmm...
>
> Maybe I should rather go to bed (it is just midnight in my part of the
> world...) but I think I did what you are saying:
>
> Configuration:
>
>     conf.setOutputKeyClass(NullWritable.class);
>     conf.setOutputValueClass(Text.class);
>
> And the reducer:
>
>     public class PermutationReduce extends MapReduceBase implements
>         Reducer {
>       public void reduce(Text key, Iterator values,
>           OutputCollector output, Reporter reporter) throws IOException {
>         while (values.hasNext()) {
>           output.collect(NullWritable.get(), values.next());
>         }
>       }
>     }
>
> Regards,
> Lukas
>
> On 2/19/08, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> >
> > On Feb 19, 2008, at 1:52 PM, Lukas Vlcek wrote:
> >
> > > Hi,
> > >
> > > I don't care about the key value in the output file. Is there any way
> > > I can suppress the key in the output? Is there a way to tell
> > > (Text)OutputFormat not to write the key but only the value? Or can I
> > > pass my own implementation of RecordWriter into FileOutputFormat?
> >
> > The easiest way is to put either null or a NullWritable in for the
> > key coming out of the reduce. The TextOutputFormat will drop the tab
> > character. You can also define your own OutputFormat and encode them
> > as you wish.
> >
> > -- Owen
>
> --
> http://blog.lukas-vlcek.com/
Re: Yahoo's production webmap is now on Hadoop
Search engine index size comparison is actually a very inexact science. Various 3rd parties comparing the major search engines do not come to the same conclusions. But ours is certainly world class and well over the discussed sizes.

Here is an interesting bit of web history... A blog from August 8, 2005 discussing our index of over 19.2 billion web documents. It has only grown since then.

http://www.ysearchblog.com/archives/000172.html

On Feb 19, 2008, at 2:38 PM, Ted Dunning wrote:

Sorry to be picky about the math, but 1 trillion = 10^12 = a million million. At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9. At 100 links per page, this gives 10B pages.

On 2/19/08 2:25 PM, "Peter W." <[EMAIL PROTECTED]> wrote:

Amazing milestone,

Looks like Y! had approximately 1B documents in the WebMap:

one trillion links = (10k million links / 10 links per page) = 1000 million pages = one billion.

If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has achieved one-tenth of its scale?

Good stuff,

Peter W.

On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:

The link inversion and ranking algorithms for Yahoo Search are now being generated on Hadoop:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
RE: Questions about the MapReduce libraries and job schedulers inside JobTracker and JobClient running on Hadoop
Andy, it's great that you're taking a deeper look at the scheduling code. I don't think there is a complete document that describes what it does (the code is the documentation, for good or for bad). But there has been some concerted effort to improve the scheduler's performance and to make it take other things into consideration (rack awareness, for example).

Start with http://issues.apache.org/jira/browse/HADOOP-2119, and also look at some of the Jiras it references. This should give you an idea of what kinds of changes people are looking at. The Jiras, especially 2119, should also have enough discussion of how the scheduling currently works.

I would also recommend that you look at http://issues.apache.org/jira/browse/HADOOP-2491. This Jira is meant to capture a more generic discussion on how to do better scheduling within the MR framework. You could probably add some of your suggestions to it.

-----Original Message-----
From: Eric Zhang [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, February 19, 2008 11:50 AM
To: core-user@hadoop.apache.org
Subject: Re: Questions about the MapReduce libraries and job schedulers inside JobTracker and JobClient running on Hadoop

The class is defined with package-level access, so it is not displayed in the javadoc. Source code comes with the Hadoop installation under ${HADOOP_INSTALLATION_DIR}/src/java/org/apache/hadoop/mapred.

Eric

Andy Li wrote:
> Thanks for both inputs. My question actually focuses more on what Vivek
> has mentioned.
>
> I would like to work on the JobClient to see how it submits jobs to
> different file systems and slaves in the same Hadoop cluster.
>
> Not sure if there is a complete document explaining the scheduler
> underneath Hadoop; if not, I'll wrap up what I know and learn from the
> source code and submit it to the community once it is done. Review
> and comments are welcome.
>
> For the code, I couldn't find JobInProgress in the API index. Could
> anyone provide me a pointer to this? Thanks.
>
> On Fri, Feb 15, 2008 at 3:01 PM, Vivek Ratan <[EMAIL PROTECTED]> wrote:
>
>> I read Andy's question a little differently. For a given job, the
>> JobTracker decides which tasks go to which TaskTracker (the TTs ask
>> for a task to run and the JT decides which task is the most
>> appropriate). Currently, the JT favors a task whose input data is on
>> the same host as the TT (if there is more than one such task, it
>> picks the one with the largest input size). It also looks at failed
>> tasks and certain other criteria. This is very basic scheduling and
>> there is a lot of scope for improvement. There currently is a
>> proposal to support rack awareness, so that if the JT can't find a
>> task whose input data is on the same host as the TT, it looks for a
>> task whose data is on the same rack.
>>
>> You can clearly get more ambitious with your scheduling algorithm. As
>> you mention, you could use other criteria for scheduling a task:
>> available CPU or memory, for example. You could assign tasks to hosts
>> that are the most 'free', or aim to distribute tasks across racks, or
>> try some other load balancing techniques. I believe there are a few
>> discussions of these methods on Jira, but I don't think there's
>> anything concrete yet.
>>
>> BTW, the code that decides what task to run is primarily in
>> JobInProgress::findNewTask().
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:[EMAIL PROTECTED]]
>> Sent: Friday, February 15, 2008 1:54 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: Questions about the MapReduce libraries and job
>> schedulers inside JobTracker and JobClient running on Hadoop
>>
>> Core-user is the right place for this question.
>>
>> Your description is mostly correct. Jobs don't necessarily go to all
>> of your boxes in the cluster, but they may.
>>
>> Non-uniform machine specs are a bit of a problem that is being (has
>> been?) addressed by allowing each machine to have a slightly
>> different hadoop-site.xml file. That would allow different settings
>> for storage configuration and number of processes to run.
>>
>> Even without that, you can level the load a bit by simply running
>> more jobs on the weak machines than you would otherwise prefer. Most
>> map reduce programs are pretty light on memory usage, so all that
>> happens is that you get less throughput on the weak machines. Since
>> there are normally more map tasks than cores, this is no big deal;
>> slow machines get fewer tasks, and toward the end of the job their
>> tasks are even replicated on other machines in case they can be done
>> more quickly.
>>
>> On 2/15/08 1:25 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>>
>>> My first time posting this in the news group. My question sounds
>>> more like a MapReduce question instead of Hadoop HDFS itself.
>>>
>>> To my understanding, the JobClient will submit all Mapper an
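The scheduling policy Vivek describes above (prefer a data-local task, breaking ties by largest input size, otherwise fall back to any pending task) can be sketched in a few lines of plain Java. The class and method names below (LocalityScheduler, pickTask) are illustrative stand-ins, not Hadoop's actual JobInProgress API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the JT's task choice as described in this thread:
// prefer a task whose input lives on the requesting TaskTracker's host,
// breaking ties by largest input size; otherwise fall back to any pending
// task. All names here are hypothetical, not Hadoop's real API.
class LocalityScheduler {
    static class Task {
        final String inputHost;
        final long inputBytes;
        Task(String inputHost, long inputBytes) {
            this.inputHost = inputHost;
            this.inputBytes = inputBytes;
        }
    }

    static Task pickTask(List<Task> pending, String trackerHost) {
        Task best = null;
        // First pass: data-local tasks only, largest input wins.
        for (Task t : pending) {
            if (t.inputHost.equals(trackerHost)
                    && (best == null || t.inputBytes > best.inputBytes)) {
                best = t;
            }
        }
        if (best != null) {
            return best;
        }
        // No local task: fall back to the first pending task.
        return pending.isEmpty() ? null : pending.get(0);
    }
}
```

This ignores failed-task handling and rack awareness, which the thread notes the real scheduler also considers.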
Re: FileOutputFormat which does not write key value?
Hmmm...

Maybe I should rather go to bed (it is just midnight in my part of the world...) but I think I did what you are saying:

Configuration:

    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);

And the reducer:

    public class PermutationReduce extends MapReduceBase implements
        Reducer {
      public void reduce(Text key, Iterator values,
          OutputCollector output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
          output.collect(NullWritable.get(), values.next());
        }
      }
    }

Regards,
Lukas

On 2/19/08, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
> On Feb 19, 2008, at 1:52 PM, Lukas Vlcek wrote:
>
>> Hi,
>>
>> I don't care about the key value in the output file. Is there any way
>> I can suppress the key in the output? Is there a way to tell
>> (Text)OutputFormat not to write the key but only the value? Or can I
>> pass my own implementation of RecordWriter into FileOutputFormat?
>
> The easiest way is to put either null or a NullWritable in for the
> key coming out of the reduce. The TextOutputFormat will drop the tab
> character. You can also define your own OutputFormat and encode them
> as you wish.
>
> -- Owen

--
http://blog.lukas-vlcek.com/
Re: FileOutputFormat which does not write key value?
On Feb 19, 2008, at 1:52 PM, Lukas Vlcek wrote:

> Hi,
>
> I don't care about the key value in the output file. Is there any way
> I can suppress the key in the output? Is there a way to tell
> (Text)OutputFormat not to write the key but only the value? Or can I
> pass my own implementation of RecordWriter into FileOutputFormat?

The easiest way is to put either null or a NullWritable in for the key coming out of the reduce. The TextOutputFormat will drop the tab character. You can also define your own OutputFormat and encode them as you wish.

-- Owen
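The tab-dropping behavior Owen describes can be sketched without Hadoop on the classpath. The line method below is a hypothetical illustration of what a TextOutputFormat-style record writer emits for a null key, not Hadoop's actual implementation:

```java
// Sketch of TextOutputFormat-style line formatting: "key<TAB>value\n"
// normally, value-only when the key is null (the behavior Owen describes).
// This is an illustration, not Hadoop's real code.
class LineWriterSketch {
    static String line(Object key, Object value) {
        StringBuilder sb = new StringBuilder();
        if (key != null) {
            sb.append(key).append('\t');  // key present: write key + separator
        }
        if (value != null) {
            sb.append(value);             // value-only line when key is null
        }
        return sb.append('\n').toString();
    }
}
```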
Re: FileOutputFormat which does not write key value?
Actually, I DID mean for you to pass a null. And you have provided me a warning about what might break in 0.16.* when I get there.

On 2/19/08 2:52 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:

> I think you didn't mean that I should directly pass a null into a key
> (this is what I did in my example code). I have just found that there is
> a NullWritable class in the hadoop.io package but still I cannot make it
> work correctly.
Re: Yahoo's production webmap is now on Hadoop
>> In English, a trillion usually means 10^12, not 10^10.

Hmmm, the Empire Strikes Back? ;-)

----- Original Message -----
From: Doug Cutting <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, February 19, 2008 2:39:33 PM
Subject: Re: Yahoo's production webmap is now on Hadoop

Peter W. wrote:
> one trillion links=(10k million links/10 links per page)=1000 million
> pages=one billion.

In English, a trillion usually means 10^12, not 10^10.

http://en.wikipedia.org/wiki/Trillion

Doug
Re: FileOutputFormat which does not write key value?
Ted,

I think you didn't mean that I should directly pass a null into a key (this is what I did in my example code). I have just found that there is a NullWritable class in the hadoop.io package, but still I cannot make it work correctly. I am getting the following exception:

java.lang.RuntimeException: java.lang.IllegalAccessException: Class org.apache.hadoop.io.WritableComparator can not access a member of class org.apache.hadoop.io.NullWritable with modifiers "private"
    at org.apache.hadoop.io.WritableComparator.newKey(WritableComparator.java:77)
    at org.apache.hadoop.io.WritableComparator.<init>(WritableComparator.java:63)
    at org.apache.hadoop.io.WritableComparator.get(WritableComparator.java:42)
    at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:642)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:313)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:174)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
Caused by: java.lang.IllegalAccessException: Class org.apache.hadoop.io.WritableComparator can not access a member of class org.apache.hadoop.io.NullWritable with modifiers "private"
    at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:65)
    at java.lang.Class.newInstance0(Class.java:349)
    at java.lang.Class.newInstance(Class.java:308)
    at org.apache.hadoop.io.WritableComparator.newKey(WritableComparator.java:73)
    ... 6 more

Is there any test of NullWritable in the Hadoop unit test suite?

Lukas

On Feb 19, 2008 11:35 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> I use 15.1 and it does work there. Pity if we lost that capability.
> Having to take a structure apart and put together a new one just to
> move one field out is a real pain and significantly increases garbage
> allocations.
>
> On 2/19/08 2:08 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> Either I am doing something wrong or this does not work (I am using 0.16.0):
>>
>> My class:
>>
>>     public class PermutationReduce extends MapReduceBase implements
>>         Reducer {
>>       public void reduce(Text key, Iterator values,
>>           OutputCollector output, Reporter reporter) throws IOException {
>>         while (values.hasNext()) {
>>           output.collect(null, values.next());
>>         }
>>       }
>>     }
>>
>> the Exception:
>>
>> java.lang.NullPointerException
>>     at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:948)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$2.collect(MapTask.java:489)
>>     at org.permutation.PermutationReduce.reduce(PermutationReduce.java:16)
>>     at org.permutation.PermutationReduce.reduce(PermutationReduce.java:1)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:522)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:493)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:713)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
>> Exception in thread "main" java.io.IOException: Job failed!
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
>>     at org.permutation.Starter.main(Starter.java:37)
>>
>> Since all I need is just to output everything the mapper emits (every
>> value which enters the output collector in the Mapper) I thought I
>> could use IdentityReducer. But it seems that this will not give me any
>> option to suppress the key in the output.
>>
>> Regards,
>> Lukas
>>
>> On Feb 19, 2008 11:00 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>>> Give a key of null to the reducer's output collector.
>>>
>>> On 2/19/08 1:52 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I don't care about the key value in the output file. Is there any
>>>> way I can suppress the key in the output? Is there a way to tell
>>>> (Text)OutputFormat not to write the key but only the value? Or can
>>>> I pass my own implementation of RecordWriter into FileOutputFormat?
>>>>
>>>> Regards,
>>>> Lukas

--
http://blog.lukas-vlcek.com/
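The IllegalAccessException above arises because WritableComparator instantiates the key class reflectively, while NullWritable's no-arg constructor is private (instances are meant to come from NullWritable.get()). A toy example, using made-up class names unrelated to Hadoop, reproduces the same failure mode:

```java
// Toy reproduction of the failure mode in the stack trace above:
// reflective newInstance() on a class whose no-arg constructor is
// private throws IllegalAccessException, just as WritableComparator's
// newKey() does for NullWritable here. HiddenCtor is a made-up class,
// not part of Hadoop.
class HiddenCtor {
    private HiddenCtor() {}                  // private, like NullWritable's
    static HiddenCtor get() { return new HiddenCtor(); }
}

class PrivateCtorDemo {
    static boolean reflectiveCreateFails(Class<?> c) {
        try {
            c.newInstance();                 // needs an accessible no-arg ctor
            return false;
        } catch (IllegalAccessException e) {
            return true;                     // private ctor: access denied
        } catch (InstantiationException e) {
            return true;                     // no no-arg ctor at all
        }
    }
}
```

Accessing such a class through its factory method (HiddenCtor.get(), like NullWritable.get()) works fine; only the reflective instantiation path fails.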
Re: Yahoo's production webmap is now on Hadoop
Peter W. wrote:
> one trillion links=(10k million links/10 links per page)=1000 million
> pages=one billion.

In English, a trillion usually means 10^12, not 10^10.

http://en.wikipedia.org/wiki/Trillion

Doug
Re: Yahoo's production webmap is now on Hadoop
Sorry to be picky about the math, but 1 trillion = 10^12 = a million million. At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9. At 100 links per page, this gives 10B pages.

On 2/19/08 2:25 PM, "Peter W." <[EMAIL PROTECTED]> wrote:

> Amazing milestone,
>
> Looks like Y! had approximately 1B documents in the WebMap:
>
> one trillion links=(10k million links/10 links per page)=1000 million
> pages=one billion.
>
> If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has
> achieved one-tenth of its scale?
>
> Good stuff,
>
> Peter W.
>
> On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:
>
>> The link inversion and ranking algorithms for Yahoo Search are now
>> being generated on Hadoop:
>>
>> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>>
>> Some Webmap size data:
>>
>> * Number of links between pages in the index: roughly 1 trillion links
>> * Size of output: over 300 TB, compressed!
>> * Number of cores used to run a single Map-Reduce job: over 10,000
>> * Raw disk used in the production cluster: over 5 Petabytes
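Ted's arithmetic, as a quick sanity check:

```java
// Sanity check of the arithmetic in this thread: 10^12 links divided by
// an assumed links-per-page count gives the implied page count.
class LinkMath {
    static long pages(long links, long linksPerPage) {
        return links / linksPerPage;
    }
}
```

At 10 links per page, one trillion links implies 100 billion pages (not one billion); at 100 links per page, 10 billion.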
Re: FileOutputFormat which does not write key value?
I use 15.1 and it does work there. Pity if we lost that capability. Having to take a structure apart and put together a new one just to move one field out is a real pain and significantly increases garbage allocations.

On 2/19/08 2:08 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Either I am doing something wrong or this does not work (I am using 0.16.0):
>
> My class:
>
>     public class PermutationReduce extends MapReduceBase implements
>         Reducer {
>       public void reduce(Text key, Iterator values,
>           OutputCollector output, Reporter reporter) throws IOException {
>         while (values.hasNext()) {
>           output.collect(null, values.next());
>         }
>       }
>     }
>
> the Exception:
>
> java.lang.NullPointerException
>     at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:948)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$2.collect(MapTask.java:489)
>     at org.permutation.PermutationReduce.reduce(PermutationReduce.java:16)
>     at org.permutation.PermutationReduce.reduce(PermutationReduce.java:1)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:522)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:493)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:713)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
> Exception in thread "main" java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
>     at org.permutation.Starter.main(Starter.java:37)
>
> Since all I need is just to output everything the mapper emits (every
> value which enters the output collector in the Mapper) I thought I could
> use IdentityReducer. But it seems that this will not give me any option
> to suppress the key in the output.
>
> Regards,
> Lukas
>
> On Feb 19, 2008 11:00 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>> Give a key of null to the reducer's output collector.
>>
>> On 2/19/08 1:52 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I don't care about the key value in the output file. Is there any way
>>> I can suppress the key in the output? Is there a way to tell
>>> (Text)OutputFormat not to write the key but only the value? Or can I
>>> pass my own implementation of RecordWriter into FileOutputFormat?
>>>
>>> Regards,
>>> Lukas
Re: Yahoo's production webmap is now on Hadoop
Amazing milestone,

Looks like Y! had approximately 1B documents in the WebMap:

one trillion links=(10k million links/10 links per page)=1000 million pages=one billion.

If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has achieved one-tenth of its scale?

Good stuff,

Peter W.

On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:

The link inversion and ranking algorithms for Yahoo Search are now being generated on Hadoop:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
Re: FileOutputFormat which does not write key value?
Hi,

Either I am doing something wrong or this does not work (I am using 0.16.0):

My class:

    public class PermutationReduce extends MapReduceBase implements
        Reducer {
      public void reduce(Text key, Iterator values,
          OutputCollector output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
          output.collect(null, values.next());
        }
      }
    }

the Exception:

java.lang.NullPointerException
    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:948)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$2.collect(MapTask.java:489)
    at org.permutation.PermutationReduce.reduce(PermutationReduce.java:16)
    at org.permutation.PermutationReduce.reduce(PermutationReduce.java:1)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:522)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:493)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:713)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
    at org.permutation.Starter.main(Starter.java:37)

Since all I need is just to output everything the mapper emits (every value which enters the output collector in the Mapper) I thought I could use IdentityReducer. But it seems that this will not give me any option to suppress the key in the output.

Regards,
Lukas

On Feb 19, 2008 11:00 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Give a key of null to the reducer's output collector.
>
> On 2/19/08 1:52 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I don't care about the key value in the output file. Is there any way
>> I can suppress the key in the output? Is there a way to tell
>> (Text)OutputFormat not to write the key but only the value? Or can I
>> pass my own implementation of RecordWriter into FileOutputFormat?
>>
>> Regards,
>> Lukas

--
http://blog.lukas-vlcek.com/
Re: FileOutputFormat which does not write key value?
Give a key of null to the reducer's output collector.

On 2/19/08 1:52 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I don't care about the key value in the output file. Is there any way
> I can suppress the key in the output? Is there a way to tell
> (Text)OutputFormat not to write the key but only the value? Or can I
> pass my own implementation of RecordWriter into FileOutputFormat?
>
> Regards,
> Lukas
FileOutputFormat which does not write key value?
Hi,

I don't care about the key value in the output file. Is there any way I can suppress the key in the output? Is there a way to tell (Text)OutputFormat not to write the key but only the value? Or can I pass my own implementation of RecordWriter into FileOutputFormat?

Regards,
Lukas

--
http://blog.lukas-vlcek.com/
Re: Yahoo's production webmap is now on Hadoop
Hi Owen,

A very impressive feat. Definitely the shining star of Hadoop's scalability.

I'd be interested to know what other problems Yahoo! has solved in the process of scaling these jobs up to 10k cores that are not represented by parts of Hadoop and the other tools included in the distribution. I wonder if there are other cluster provisioning, management and monitoring tools that Yahoo! uses that have contributed to, and made possible, this great success.

Thank you,
Garth

On Feb 19, 2008 1:30 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

> On Feb 19, 2008, at 11:55 AM, Eric Zhang wrote:
>
>> This is very impressive. Congrats!
>>
>> Which version of Hadoop is this running on and what's the input
>> data size?
>
> They are running Hadoop-0.16.0...
>
> -- Owen
Re: Yahoo's production webmap is now on Hadoop
On Feb 19, 2008, at 11:55 AM, Eric Zhang wrote:

> This is very impressive. Congrats!
>
> Which version of Hadoop is this running on and what's the input data size?

They are running Hadoop-0.16.0...

-- Owen
Re: Yahoo's production webmap is now on Hadoop
This is awesome, Owen. Congratulations to the whole team!

On Feb 19, 2008 1:21 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Owen O'Malley wrote:
>> The link inversion and ranking algorithms for Yahoo Search are now
>> being generated on Hadoop:
>>
>> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>>
>> Some Webmap size data:
>>
>> * Number of links between pages in the index: roughly 1 trillion links
>> * Size of output: over 300 TB, compressed!
>> * Number of cores used to run a single Map-Reduce job: over 10,000
>> * Raw disk used in the production cluster: over 5 Petabytes
>
> Truly impressive. IMHO this is something the project should boast about,
> i.e. include this data point in the scalability / performance section.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web; Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
Re: Yahoo's production webmap is now on Hadoop
Owen O'Malley wrote:
> The link inversion and ranking algorithms for Yahoo Search are now
> being generated on Hadoop:
>
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>
> Some Webmap size data:
>
> * Number of links between pages in the index: roughly 1 trillion links
> * Size of output: over 300 TB, compressed!
> * Number of cores used to run a single Map-Reduce job: over 10,000
> * Raw disk used in the production cluster: over 5 Petabytes

Truly impressive. IMHO this is something the project should boast about, i.e. include this data point in the scalability / performance section.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Yahoo's production webmap is now on Hadoop
That 10k number is probably a large under-estimate; perhaps add an extra zero to get something closer. Still, impressive stuff.

Miles

On 19/02/2008, Toby DiPasquale <[EMAIL PROTECTED]> wrote:

> On Feb 19, 2008 12:58 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>> The link inversion and ranking algorithms for Yahoo Search are now
>> being generated on Hadoop:
>>
>> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>>
>> Some Webmap size data:
>>
>> * Number of links between pages in the index: roughly 1 trillion links
>> * Size of output: over 300 TB, compressed!
>> * Number of cores used to run a single Map-Reduce job: over 10,000
>
> I thought I had read on this list before that Yahoo! was using
> quad-core machines for their Hadoop clusters. Does this mean there are
> ~2,500 machines in the cluster referred to above?
>
> --
> Toby DiPasquale

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: Yahoo's production webmap is now on Hadoop
On Feb 19, 2008 12:58 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> The link inversion and ranking algorithms for Yahoo Search are now
> being generated on Hadoop:
>
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>
> Some Webmap size data:
>
>     * Number of links between pages in the index: roughly 1 trillion links
>     * Size of output: over 300 TB, compressed!
>     * Number of cores used to run a single Map-Reduce job: over 10,000

I thought I had read on this list before that Yahoo! was using
quad-core machines for their Hadoop clusters. Does this mean there are
~2,500 machines in the cluster referred to above?

--
Toby DiPasquale
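Toby's back-of-envelope division can be written out explicitly. Note the quad-core figure is an assumption recalled from earlier list traffic, not something the blog post states:

```java
// Rough machine-count estimate from a core count, under the (assumed)
// premise of a cluster built from uniform quad-core boxes.
public class ClusterEstimate {
    static int machines(int totalCores, int coresPerMachine) {
        return totalCores / coresPerMachine;
    }

    public static void main(String[] args) {
        // "over 10,000" cores at an assumed 4 cores per machine
        System.out.println(machines(10000, 4) + " machines"); // 2500 machines
    }
}
```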
Re: Yahoo's production webmap is now on Hadoop
Impressive! Considering that Hadoop is open-source software, still at an early stage of development and written in Java, could this be the *REAL* reason why Microsoft wants to buy Yahoo!? :-)

Lukas

On Feb 19, 2008 8:55 PM, Eric Zhang <[EMAIL PROTECTED]> wrote:
> This is very impressive. Congrats!
>
> Which version of Hadoop is this running on, and what's the input data size?
>
> Eric
>
> Owen O'Malley wrote:
> > The link inversion and ranking algorithms for Yahoo Search are now
> > being generated on Hadoop:
> >
> > http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
> >
> > Some Webmap size data:
> >
> >     * Number of links between pages in the index: roughly 1 trillion links
> >     * Size of output: over 300 TB, compressed!
> >     * Number of cores used to run a single Map-Reduce job: over 10,000
> >     * Raw disk used in the production cluster: over 5 Petabytes
Re: Yahoo's production webmap is now on Hadoop
This is very impressive. Congrats!

Which version of Hadoop is this running on, and what's the input data size?

Eric

Owen O'Malley wrote:
> The link inversion and ranking algorithms for Yahoo Search are now
> being generated on Hadoop:
>
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>
> Some Webmap size data:
>
>     * Number of links between pages in the index: roughly 1 trillion links
>     * Size of output: over 300 TB, compressed!
>     * Number of cores used to run a single Map-Reduce job: over 10,000
>     * Raw disk used in the production cluster: over 5 Petabytes
Re: Questions about the MapReduce libraries and job schedulers inside JobTracker and JobClient running on Hadoop
The class has package-level access, so it is not displayed in the javadoc. The source code comes with the Hadoop installation under ${HADOOP_INSTALLATION_DIR}/src/java/org/apache/hadoop/mapred.

Eric

Andy Li wrote:

Thanks for both inputs. My question actually focuses more on what Vivek has mentioned. I would like to work on the JobClient to see how it submits jobs to different file systems and slaves in the same Hadoop cluster. I am not sure whether there is a complete document explaining the scheduler underneath Hadoop; if not, I'll write up what I know and learn from the source code and submit it to the community once it is done. Review and comments are welcome.

For the code, I couldn't find JobInProgress in the API index. Could anyone provide me a pointer to it? Thanks.

On Fri, Feb 15, 2008 at 3:01 PM, Vivek Ratan <[EMAIL PROTECTED]> wrote:

I read Andy's question a little differently. For a given job, the JobTracker decides which tasks go to which TaskTracker (the TTs ask for a task to run and the JT decides which task is the most appropriate). Currently, the JT favors a task whose input data is on the same host as the TT (if there is more than one such task, it picks the one with the largest input size). It also looks at failed tasks and certain other criteria. This is very basic scheduling, and there is a lot of scope for improvement. There is currently a proposal to support rack awareness, so that if the JT can't find a task whose input data is on the same host as the TT, it looks for a task whose data is on the same rack.

You can clearly get more ambitious with your scheduling algorithm. As you mention, you could use other criteria for scheduling a task: available CPU or memory, for example. You could assign tasks to hosts that are the most 'free', or aim to distribute tasks across racks, or try some other load-balancing techniques. I believe there are a few discussions of these methods on Jira, but I don't think there's anything concrete yet.
BTW, the code that decides which task to run is primarily in JobInProgress::findNewTask().

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 15, 2008 1:54 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions about the MapReduce libraries and job schedulers inside JobTracker and JobClient running on Hadoop

Core-user is the right place for this question. Your description is mostly correct. Jobs don't necessarily go to all of the boxes in your cluster, but they may.

Non-uniform machine specs are a bit of a problem that is being (has been?) addressed by allowing each machine to have a slightly different hadoop-site.xml file. That would allow different settings for storage configuration and the number of processes to run. Even without that, you can level the load a bit by simply running more jobs on the weak machines than you would otherwise prefer. Most map-reduce programs are pretty light on memory usage, so all that happens is that you get less throughput on the weak machines. Since there are normally more map tasks than cores, this is no big deal; slow machines get fewer tasks, and toward the end of the job their tasks are even replicated on other machines in case they can be done more quickly.

On 2/15/08 1:25 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

Hello,

This is my first time posting to this newsgroup. My question sounds more like a MapReduce question than a Hadoop HDFS question. To my understanding, the JobClient will submit all Mapper and Reducer classes in a uniform way to the cluster? Can I assume this is more like a uniform scheduler for all the tasks? For example, suppose I have a 100-node cluster: 1 master (namenode) and 99 slaves (datanodes). When I call "JobClient.runJob(jconf)", the JobClient will uniformly distribute all Mapper and Reducer classes to all 99 nodes. The slaves will all have the same hadoop-site.xml and hadoop-default.xml.
Here comes the main concern: what if some of the nodes don't have the same hardware spec, such as memory or CPU speed? E.g., different batch purchases and repairs over time can cause this. Is there any way the JobClient can be aware of this and submit a different number of tasks to different slaves during start-up? For example, some slaves may have 16-core CPUs instead of 8-core ones. The problem I see here is that on the 16-core machines, only 8 cores are used.

P.S. I'm looking into the JobClient source code and JobProfile/JobTracker to see if this can be done, but I'm not sure I am on the right track. If this topic belongs on [EMAIL PROTECTED], please let me know and I'll send another message to that newsgroup.

Regards,
-Andy
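The selection policy Vivek describes upthread (prefer a data-local task, break ties by largest input) can be sketched as a toy. This is illustrative only: the class and field names below are invented for the example, and it is not the actual JobInProgress::findNewTask() code:

```java
import java.util.Arrays;
import java.util.List;

// Toy sketch of locality-first task selection: prefer a pending task
// whose input split lives on the requesting TaskTracker's host, and
// among those pick the one with the largest input.
public class LocalityFirst {
    static final class PendingTask {
        final String id;
        final String inputHost; // host holding this task's input split
        final long inputBytes;  // size of that input
        PendingTask(String id, String inputHost, long inputBytes) {
            this.id = id; this.inputHost = inputHost; this.inputBytes = inputBytes;
        }
    }

    static PendingTask pick(List<PendingTask> pending, String trackerHost) {
        PendingTask best = null;
        for (PendingTask t : pending) {
            if (t.inputHost.equals(trackerHost)
                    && (best == null || t.inputBytes > best.inputBytes)) {
                best = t; // data-local, largest input wins
            }
        }
        // No local work: fall back to any pending task. A rack-aware
        // scheduler would first try tasks whose data is on the same rack.
        return best != null ? best : (pending.isEmpty() ? null : pending.get(0));
    }

    public static void main(String[] args) {
        List<PendingTask> pending = Arrays.asList(
                new PendingTask("t1", "hostA", 64),
                new PendingTask("t2", "hostB", 128),
                new PendingTask("t3", "hostB", 512));
        System.out.println(pick(pending, "hostB").id); // t3: local and largest
    }
}
```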
Re: Yahoo's production webmap is now on Hadoop
Wow! Congrats!

On 19.02.2008, at 18:58, Owen O'Malley wrote:
> The link inversion and ranking algorithms for Yahoo Search are now
> being generated on Hadoop:
>
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>
> Some Webmap size data:
>
>     * Number of links between pages in the index: roughly 1 trillion links
>     * Size of output: over 300 TB, compressed!
>     * Number of cores used to run a single Map-Reduce job: over 10,000
>     * Raw disk used in the production cluster: over 5 Petabytes
Yahoo's production webmap is now on Hadoop
The link inversion and ranking algorithms for Yahoo Search are now being
generated on Hadoop:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

Some Webmap size data:

    * Number of links between pages in the index: roughly 1 trillion links
    * Size of output: over 300 TB, compressed!
    * Number of cores used to run a single Map-Reduce job: over 10,000
    * Raw disk used in the production cluster: over 5 Petabytes
Re: external jar using eclipse-plugin?
It is "hadoop-0.15.3-eclipse-plugin".

Tamer

On 2/19/08, Christophe Taton <[EMAIL PROTECTED]> wrote:
> Hi Tamer,
>
> Can you tell me which version of the plug-in you use? Unfortunately, I
> have not tried this kind of configuration yet, but I'll work on getting
> it working...
>
> Thanks,
> Christophe
>
> On Feb 18, 2008 10:39 PM, Tamer Elsayed <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > This is a question about using external jars when running a Hadoop
> > map-reduce job via the Eclipse plugin. In my situation I want to use the
> > Lucene jar file. The code compiles fine on my machine, since the jar file
> > is added to the project's external jars, but when I run it on the Hadoop
> > cluster, it gives me the following error:
> >
> > "Exception in thread "main" java.lang.NoClassDefFoundError:
> > org.apache.lucene.search.IndexSearcher"
> >
> > which means that the jar file is not seen. I have tried loading it into
> > HDFS and using DistributedCache.addArchiveToClassPath, but got the same
> > error. The code that needs Lucene is in both the controller and the
> > mapper classes.
> >
> > Any clue how to resolve this?
> >
> > Thanks in advance,
> > Tamer

--
Proud to be a follower of the "Best of Mankind"
"وَاذْكُرْ رَبَّكَ إِذَا نَسِيتَ وَقُلْ عَسَى أَنْ يَهْدِيَنِي رَبِّي لأقْرَبَ مِنْ هَذَا رَشَدًا"
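One workaround often suggested for this kind of NoClassDefFoundError (offered here as a suggestion, not something confirmed in the thread) is to bundle the dependency inside the job jar itself: Hadoop unpacks the submitted job jar on the task node and adds jars found under its lib/ directory to the task classpath. The snippet below only builds a toy job jar with that layout; the embedded entry is an empty placeholder standing in for the real Lucene jar, and in practice you would use the JDK's `jar` tool (e.g. `jar uf myjob.jar lib/lucene-core.jar`):

```java
import java.io.FileOutputStream;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

// Build a job jar containing a lib/ directory. This programmatic
// version only illustrates the expected layout; the embedded jar
// entry is an empty placeholder, not a real Lucene jar.
public class JobJarLayout {
    static void buildJobJar(String path) throws Exception {
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(path))) {
            out.putNextEntry(new JarEntry("lib/lucene-core.jar"));
            out.closeEntry();
        }
    }

    public static void main(String[] args) throws Exception {
        buildJobJar("myjob.jar");
        System.out.println("wrote myjob.jar");
    }
}
```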