Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Peter W.

Guys,

Thanks for the clarification and math explanations.

Such a number would then likely be 100x my original
estimate, given that the web may have doubled each
year since that blog post and is growing exponentially.

Index size was only a byproduct of trying to discern the
significance of 1 trillion links in an inverted web graph.

Hadoop has certainly arrived and become a valuable software
asset likely to power next-generation Internet computing.

Thanks again,

Peter W.


On Feb 19, 2008, at 5:33 PM, Eric Baldeschwieler wrote:

Search engine index size comparison is actually a very inexact  
science.  Various 3rd parties comparing the major search engines  
do not come to the same conclusions.  But ours is certainly world  
class and well over the discussed sizes.


Here is an interesting bit of web history...  A blog from AUGUST  
08, 2005 discussing our index of over 19.2 billion web documents.   
It has only grown since then.


http://www.ysearchblog.com/archives/000172.html


On Feb 19, 2008, at 2:38 PM, Ted Dunning wrote:




Sorry to be picky about the math, but 1 Trillion = 10^12 = million
million.  At 10 links per page, this gives 100 x 10^9 pages, not
1 x 10^9.  At 100 links per page, this gives 10B pages.


On 2/19/08 2:25 PM, "Peter W." <[EMAIL PROTECTED]> wrote:


Amazing milestone,

Looks like Y! had approximately 1B documents in the WebMap:

one trillion links=(10k million links/10 links per page)=1000 million
pages=one billion.

If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has
achieved one-tenth of its scale?

Good stuff,

Peter W.




On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:


The link inversion and ranking algorithms for Yahoo Search are now
being generated on Hadoop:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

Some Webmap size data:

* Number of links between pages in the index: roughly 1
trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over  
10,000

* Raw disk used in the production cluster: over 5 Petabytes











Re: FileOutputFormat which does not write key value?

2008-02-19 Thread Ted Dunning

Re-reading the thread convinces me that this is a difference between
TextOutputFormat and other output formats.


On 2/19/08 6:01 PM, "Andy Li" <[EMAIL PROTECTED]> wrote:

> Shouldn't the official way to do this be to implement your own
> RecordWriter and your own OutputFormat class?
> 
> conf.setOutputFormat(yourClass);
> 
> Inside the yourClass, you can return your own RecordWriter class in the
> getRecordWriter method.
> 
> I did it with FileInputFormat and my own RecordReader, and it worked for
> me to take a KEY and null VALUE into the Mapper.  I believe it is the same
> thing in reverse.
> 
> But there should be a formal way, instead of trial-and-error, to see what
> the system default is.  I guess the system does not have a standard spec
> defining what the default values are?  Maybe this is why Ted has such
> concern about incompatibility in 0.16.*?
> 
> -Andy
> 
> On Feb 19, 2008 3:02 PM, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> 
>> Hmmm...
>> 
>> Maybe I should rather go to bed (it is just midnight in my part of the
>> world...) but I think I did what you are saying:
>> 
>> Configuration:
>> conf.setOutputKeyClass(NullWritable.class);
>> conf.setOutputValueClass(Text.class);
>> 
>> And the reducer:
>> public class PermutationReduce extends MapReduceBase implements
>> Reducer<Text, Text, NullWritable, Text> {
>> 
>>    public void reduce(Text key, Iterator<Text> values,
>> OutputCollector<NullWritable, Text> output, Reporter reporter) throws
>> IOException {
>>while (values.hasNext()) {
>>output.collect(NullWritable.get(), values.next());
>>}
>> 
>>}
>> }
>> 
>> Regards,
>> Lukas
>> 
>> On 2/19/08, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>>> 
>>> 
>>> On Feb 19, 2008, at 1:52 PM, Lukas Vlcek wrote:
>>> 
 Hi,
 
 I don't care about the key value in the output file. Is there any way
 I can suppress the key in the output?
 Is there a way to tell (Text)OutputFormat to write only the value, not
 the key? Or can I pass my own implementation of RecordWriter into
 FileOutputFormat?
>>> 
>>> The easiest way is to put either null or a NullWritable in for the
>>> key coming out of the reduce. The TextOutputFormat will drop the tab
>>> character. You can also define your own OutputFormat and encode them
>>> as you wish.
>>> 
>>> -- Owen
>>> 
>> 
>> 
>> 
>> --
>> http://blog.lukas-vlcek.com/
>> 



Re: FileOutputFormat which does not write key value?

2008-02-19 Thread Andy Li
Shouldn't the official way to do this be to implement your own
RecordWriter and your own OutputFormat class?

conf.setOutputFormat(yourClass);

Inside the yourClass, you can return your own RecordWriter class in the
getRecordWriter method.

I did it with FileInputFormat and my own RecordReader, and it worked for
me to take a KEY and null VALUE into the Mapper.  I believe it is the same
thing in reverse.

But there should be a formal way, instead of trial-and-error, to see what
the system default is.  I guess the system does not have a standard spec
defining what the default values are?  Maybe this is why Ted has such
concern about incompatibility in 0.16.*?

-Andy

On Feb 19, 2008 3:02 PM, Lukas Vlcek <[EMAIL PROTECTED]> wrote:

> Hmmm...
>
> Maybe I should rather go to bed (it is just midnight in my part of the
> world...) but I think I did what you are saying:
>
> Configuration:
> conf.setOutputKeyClass(NullWritable.class);
> conf.setOutputValueClass(Text.class);
>
> And the reducer:
> public class PermutationReduce extends MapReduceBase implements
> Reducer<Text, Text, NullWritable, Text> {
>
>    public void reduce(Text key, Iterator<Text> values,
> OutputCollector<NullWritable, Text> output, Reporter reporter) throws
> IOException {
>while (values.hasNext()) {
>output.collect(NullWritable.get(), values.next());
>}
>
>}
> }
>
> Regards,
> Lukas
>
> On 2/19/08, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> >
> >
> > On Feb 19, 2008, at 1:52 PM, Lukas Vlcek wrote:
> >
> > > Hi,
> > >
> > > I don't care about the key value in the output file. Is there any
> > > way I can suppress the key in the output?
> > > Is there a way to tell (Text)OutputFormat to write only the value,
> > > not the key? Or can I pass my own implementation of RecordWriter
> > > into FileOutputFormat?
> >
> > The easiest way is to put either null or a NullWritable in for the
> > key coming out of the reduce. The TextOutputFormat will drop the tab
> > character. You can also define your own OutputFormat and encode them
> > as you wish.
> >
> > -- Owen
> >
>
>
>
> --
> http://blog.lukas-vlcek.com/
>


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Eric Baldeschwieler
Search engine index size comparison is actually a very inexact  
science.  Various 3rd parties comparing the major search engines do  
not come to the same conclusions.  But ours is certainly world class  
and well over the discussed sizes.


Here is an interesting bit of web history...  A blog from AUGUST 08,  
2005 discussing our index of over 19.2 billion web documents.  It has  
only grown since then.


http://www.ysearchblog.com/archives/000172.html


On Feb 19, 2008, at 2:38 PM, Ted Dunning wrote:




Sorry to be picky about the math, but 1 Trillion = 10^12 = million
million.  At 10 links per page, this gives 100 x 10^9 pages, not
1 x 10^9.  At 100 links per page, this gives 10B pages.


On 2/19/08 2:25 PM, "Peter W." <[EMAIL PROTECTED]> wrote:


Amazing milestone,

Looks like Y! had approximately 1B documents in the WebMap:

one trillion links=(10k million links/10 links per page)=1000 million
pages=one billion.

If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has
achieved one-tenth of its scale?

Good stuff,

Peter W.




On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:


The link inversion and ranking algorithms for Yahoo Search are now
being generated on Hadoop:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

Some Webmap size data:

* Number of links between pages in the index: roughly 1
trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over  
10,000

* Raw disk used in the production cluster: over 5 Petabytes









RE: Questions about the MapReduce libraries and job schedulers inside JobTracker and JobClient running on Hadoop

2008-02-19 Thread Vivek Ratan
Andy, it's great that you're taking a deeper look at the scheduling code. I
don't think there is a complete document that describes what it does (the
code is the documentation, for good or for bad). But there has been some
concerted effort to improve the scheduler's performance and to make it take
other things into consideration (rack awareness, for example). Start with
http://issues.apache.org/jira/browse/HADOOP-2119, and also look at some of
the Jiras it references. This should give you an idea of what kinds of
changes people are looking at. The Jiras, especially 2119, should also have
enough discussions on how the scheduling currently works. 

I would also recommend that you look at
http://issues.apache.org/jira/browse/HADOOP-2491. This Jira is meant to
capture a more generic discussion on how to do better scheduling within the
MR framework. You could probably add some of your suggestions to it. 
 

-Original Message-
From: Eric Zhang [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 19, 2008 11:50 AM
To: core-user@hadoop.apache.org
Subject: Re: Questions about the MapReduce libraries and job schedulers
inside JobTracker and JobClient running on Hadoop

The class has package-level access, so it is not displayed in the 
javadoc.  The source code comes with the Hadoop installation under 
${HADOOP_INSTALLATION_DIR}/src/java/org/apache/hadoop/mapred.

Eric
Andy Li wrote:
> Thanks for both inputs.  My question actually focuses more on what Vivek 
> has mentioned.
>
> I would like to work on the JobClient to see how it submits jobs to 
> different file system and slaves in the same Hadoop cluster.
>
> Not sure if there is a complete document to explain the scheduler 
> underneath Hadoop; if not, I'll wrap up what I know and study from the 
> source code and submit it to the community once it is done.  Review 
> and comments are welcome.
>
> For the code, I couldn't find JobInProgress from the API index.  Could 
> anyone provide me a pointer to this?  Thanks.
>
> On Fri, Feb 15, 2008 at 3:01 PM, Vivek Ratan <[EMAIL PROTECTED]> wrote:
>
>   
>> I read Andy's question a little differently. For a given job, the 
>> JobTracker decides which tasks go to which TaskTracker (the TTs ask 
>> for a task to run and the JT decides which task is the most 
>> appropriate). Currently, the JT favors a task whose input data is on 
>> the same host as the TT (if there are more than one such tasks, it 
>> picks the one with the largest input size).
>> It
>> also looks at failed tasks and certain other criteria. This is very 
>> basic scheduling and there is a lot of scope for improvement. There 
>> currently is a proposal to support rack awareness, so that if the JT 
>> can't find a task whose input data is on the same host as the TT, it 
>> looks for a task whose data is on the same rack.
>>
>> You can clearly get more ambitious with your scheduling algorithm. As 
>> you mention, you could use other criteria for scheduling a task: 
>> available CPU or memory, for example. You could assign tasks to hosts 
>> that are the most 'free', or aim to distribute tasks across racks, or 
>> try some other load balancing techniques. I believe there are a few 
>> discussions on these methods on Jira, but I don't think there's 
>> anything concrete yet.
>>
>> BTW, the code that decides what task to run is primarily in 
>> JobInProgress::findNewTask().
>>
>>
>> -Original Message-
>> From: Ted Dunning [mailto:[EMAIL PROTECTED]
>> Sent: Friday, February 15, 2008 1:54 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: Questions about the MapReduce libraries and job 
>> schedulers inside JobTracker and JobClient running on Hadoop
>>
>>
>> Core-user is the right place for this question.
>>
>> Your description is mostly correct.  Jobs don't necessarily go to all 
>> of your boxes in the cluster, but they may.
>>
>> Non-uniform machine specs are a bit of a problem that is being (has 
>> been?) addressed by allowing each machine to have a slightly 
>> different hadoop-site.xml file.  That would allow different settings 
>> for storage configuration and number of processes to run.
>>
>> Even without that, you can level the load a bit by simply running 
>> more jobs on the weak machines than you would otherwise prefer.  Most 
>> map reduce programs are pretty light on memory usage so all that 
>> happens is that you get less throughput on the weak machines.  Since 
>> there are normally more map tasks than cores, this is no big deal; 
>> slow machines get fewer tasks and toward the end of the job, their 
>> tasks are even replicated on other machines in case they can be done 
>> more quickly.
>>
>>
>> On 2/15/08 1:25 PM, "[EMAIL PROTECTED]" 
>> <[EMAIL PROTECTED]
>> 
>> wrote:
>>
>> 
>>> Hello,
>>>
>>> My first time posting this in the news group.  My question sounds more
>>> like a MapReduce question instead of Hadoop HDFS itself.
>>>
>>> To my understanding, the JobClient will submit all Mapper and Reduce
>>> classes in a uniform way to the cluster?
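
A small illustration of the locality policy described in this thread. This
is not Hadoop's actual JobInProgress code; SimpleTask, LocalityScheduler,
and pickTask are hypothetical names used only to model the stated
behavior: prefer a data-local task, break ties by largest input, and
otherwise fall back to any pending task.

import java.util.List;

class SimpleTask {
  final String inputHost;  // host holding this task's input split
  final long inputBytes;   // size of that input split
  SimpleTask(String inputHost, long inputBytes) {
    this.inputHost = inputHost;
    this.inputBytes = inputBytes;
  }
}

class LocalityScheduler {
  // Pick a task for the tracker running on requestingHost: data-local
  // first, largest input breaking ties; otherwise any pending task.
  static SimpleTask pickTask(String requestingHost, List<SimpleTask> pending) {
    SimpleTask best = null;
    for (SimpleTask t : pending) {
      if (t.inputHost.equals(requestingHost)
          && (best == null || t.inputBytes > best.inputBytes)) {
        best = t;
      }
    }
    if (best == null && !pending.isEmpty()) {
      best = pending.get(0);  // nothing data-local: take anything runnable
    }
    return best;
  }
}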

Re: FileOutputFormat which does not write key value?

2008-02-19 Thread Lukas Vlcek
Hmmm...

Maybe I should rather go to bed (it is just midnight in my part of the
world...) but I think I did what you are saying:

Configuration:
 conf.setOutputKeyClass(NullWritable.class);
 conf.setOutputValueClass(Text.class);

And the reducer:
public class PermutationReduce extends MapReduceBase implements
Reducer<Text, Text, NullWritable, Text> {

public void reduce(Text key, Iterator<Text> values,
OutputCollector<NullWritable, Text> output, Reporter reporter) throws
IOException {
while (values.hasNext()) {
output.collect(NullWritable.get(), values.next());
}

}
}

Regards,
Lukas

On 2/19/08, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
>
> On Feb 19, 2008, at 1:52 PM, Lukas Vlcek wrote:
>
> > Hi,
> >
> > I don't care about the key value in the output file. Is there any way
> > I can suppress the key in the output?
> > Is there a way to tell (Text)OutputFormat to write only the value, not
> > the key? Or can I pass my own implementation of RecordWriter into
> > FileOutputFormat?
>
> The easiest way is to put either null or a NullWritable in for the
> key coming out of the reduce. The TextOutputFormat will drop the tab
> character. You can also define your own OutputFormat and encode them
> as you wish.
>
> -- Owen
>



-- 
http://blog.lukas-vlcek.com/


Re: FileOutputFormat which does not write key value?

2008-02-19 Thread Owen O'Malley


On Feb 19, 2008, at 1:52 PM, Lukas Vlcek wrote:


Hi,

I don't care about the key value in the output file. Is there any way
I can suppress the key in the output?
Is there a way to tell (Text)OutputFormat to write only the value, not
the key? Or can I pass my own implementation of RecordWriter into
FileOutputFormat?


The easiest way is to put either null or a NullWritable in for the  
key coming out of the reduce. The TextOutputFormat will drop the tab  
character. You can also define your own OutputFormat and encode them  
as you wish.


-- Owen
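
A minimal sketch of the second option Owen mentions: a file output format
whose RecordWriter ignores the key and writes only the value, one per
line. This is an illustration, not code from the thread; it is written
against the old org.apache.hadoop.mapred API, and the FileOutputFormat
base class and getTaskOutputPath helper assumed here belong to slightly
later 0.x releases (the 0.15/0.16 base class was OutputFormatBase), so
names may need adjusting for a given version.

import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class ValueOnlyOutputFormat<K, V> extends FileOutputFormat<K, V> {
  public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    Path file = FileOutputFormat.getTaskOutputPath(job, name);
    FileSystem fs = file.getFileSystem(job);
    final DataOutputStream out = fs.create(file, progress);
    return new RecordWriter<K, V>() {
      public void write(K key, V value) throws IOException {
        // Drop the key entirely; write only the value and a newline.
        out.write(value.toString().getBytes("UTF-8"));
        out.write('\n');
      }
      public void close(Reporter reporter) throws IOException {
        out.close();
      }
    };
  }
}

A job would select it with conf.setOutputFormat(ValueOnlyOutputFormat.class),
which is the registration step Andy describes elsewhere in this thread.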


Re: FileOutputFormat which does not write key value?

2008-02-19 Thread Ted Dunning

Actually, I DID mean for you to pass a null.

And you have provided me with a warning about what might break in 0.16.*
when I get there.


On 2/19/08 2:52 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:

> I think you didn't mean that I should directly pass a null into a key (this
> is what I did in my example code). I have just found that there is
> NullWritable class in hadoop.io package but still I can not make it work
> correctly.



Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Patrick McCormack

>> In English, a trillion usually means 10^12, not 10^10.

Hmmm, the Empire Strikes Back ?  ;-) 

- Original Message 
From: Doug Cutting <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, February 19, 2008 2:39:33 PM
Subject: Re: Yahoo's production webmap is now on Hadoop


Peter W. wrote:
> one trillion links=(10k million links/10 links per page)=1000 million
> pages=one billion.

In English, a trillion usually means 10^12, not 10^10.

http://en.wikipedia.org/wiki/Trillion

Doug

Re: FileOutputFormat which does not write key value?

2008-02-19 Thread Lukas Vlcek
Ted,

I think you didn't mean that I should directly pass a null into a key (this
is what I did in my example code). I have just found that there is a
NullWritable class in the hadoop.io package, but I still cannot make it
work correctly. I am getting the following exception:

java.lang.RuntimeException: java.lang.IllegalAccessException: Class
org.apache.hadoop.io.WritableComparator can not access a member of class
org.apache.hadoop.io.NullWritable with modifiers "private"
at org.apache.hadoop.io.WritableComparator.newKey(
WritableComparator.java:77)
at org.apache.hadoop.io.WritableComparator.<init>(
WritableComparator.java:63)
at org.apache.hadoop.io.WritableComparator.get(WritableComparator.java
:42)
at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java
:642)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java
:313)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:174)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java
:132)
Caused by: java.lang.IllegalAccessException: Class
org.apache.hadoop.io.WritableComparator can not access a member of class
org.apache.hadoop.io.NullWritable with modifiers "private"
at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:65)
at java.lang.Class.newInstance0(Class.java:349)
at java.lang.Class.newInstance(Class.java:308)
at org.apache.hadoop.io.WritableComparator.newKey(
WritableComparator.java:73)
... 6 more

Is there any test of NullWritable in Hadoop unit test suite?

Lukas

On Feb 19, 2008 11:35 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
>
> I use 15.1 and it does work there.  Pity if we lost that capability.
>  Having
> to take a structure apart and put together a new one just to move one
> field
> out is a real pain and significantly increases garbage allocations.
>
>
> On 2/19/08 2:08 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > Either I am doing something wrong or this does not work (I am using
> 0.16.0):
> >
> > My class:
> >
> > public class PermutationReduce extends MapReduceBase implements
> > Reducer<Text, Text, Text, Text> {
> >
> > public void reduce(Text key, Iterator<Text> values,
> > OutputCollector<Text, Text> output, Reporter reporter) throws
> IOException {
> > while (values.hasNext()) {
> > output.collect(null, values.next());
> > }
> > }
> > }
> >
> > the Exception:
> >
> > java.lang.NullPointerException
> > at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java
> > :948)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$2.collect(
> > MapTask.java:489)
> > at org.permutation.PermutationReduce.reduce(PermutationReduce.java
> :16)
> > at org.permutation.PermutationReduce.reduce(PermutationReduce.java
> :1)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(
> > MapTask.java:522)
> > at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(
> > MapTask.java:493)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(
> MapTask.java
> > :713)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)
> > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(
> LocalJobRunner.java
> > :132)
> > Exception in thread "main" java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
> > at org.permutation.Starter.main(Starter.java:37)
> >
> > Since all I need is just to output all mapper emits (every value which
> > enters output collector in Mapper) I thought I could use
> IdentityReducer.
> > But it seems that this will not give me any option to suppress key in
> > output.
> >
> > Regards,
> > Lukas
> >
> > On Feb 19, 2008 11:00 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> Give a key of null to the reducer's output collector.
> >>
> >>
> >> On 2/19/08 1:52 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Hi,
> >>>
> >>> I don't care about the key value in the output file. Is there any way
> >>> I can suppress the key in the output?
> >>> Is there a way to tell (Text)OutputFormat to write only the value, not
> >>> the key? Or can I pass my own implementation of RecordWriter into
> >>> FileOutputFormat?
> >>>
> >>> Regards,
> >>> Lukas
> >>
> >>
> >
>
>


-- 
http://blog.lukas-vlcek.com/
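
A hedged note on the stack traces above, not from the thread: both
failures occur on the map side (MapOutputBuffer), which suggests
NullWritable ends up as the map output key, where the framework must
instantiate and sort it; the combineAndSpill frame in the earlier trace
also suggests the same reducer is registered as a combiner, and combiner
output feeds back into the sort, so it must keep a real key. One sketch
of a way around both, using only standard JobConf calls (StarterSketch
stands in for Lukas's Starter driver):

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class StarterSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(StarterSketch.class);
    conf.setMapOutputKeyClass(Text.class);      // intermediate key stays sortable
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyClass(NullWritable.class); // final key; TextOutputFormat drops it
    conf.setOutputValueClass(Text.class);
    // ... set mapper/reducer classes and paths, then JobClient.runJob(conf)
  }
}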


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Doug Cutting

Peter W. wrote:
one trillion links=(10k million links/10 links per page)=1000 million 
pages=one billion.


In English, a trillion usually means 10^12, not 10^10.

http://en.wikipedia.org/wiki/Trillion

Doug


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Ted Dunning


Sorry to be picky about the math, but 1 Trillion = 10^12 = million million.
At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9.  At 100
links per page, this gives 10B pages.
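
That is: 10^12 links / 10 links per page = 10^11 = 100 billion pages, and
10^12 links / 100 links per page = 10^10 = 10 billion pages.  The
one-billion figure comes from reading a trillion as 10^10 ("10k million")
rather than 10^12.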


On 2/19/08 2:25 PM, "Peter W." <[EMAIL PROTECTED]> wrote:

> Amazing milestone,
> 
> Looks like Y! had approximately 1B documents in the WebMap:
> 
> one trillion links=(10k million links/10 links per page)=1000 million
> pages=one billion.
> 
> If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has
> achieved one-tenth of its scale?
> 
> Good stuff,
> 
> Peter W.
> 
> 
> 
> 
> On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:
> 
>> The link inversion and ranking algorithms for Yahoo Search are now
>> being generated on Hadoop:
>> 
>> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-
>> largest-production-hadoop.html
>> 
>> Some Webmap size data:
>> 
>> * Number of links between pages in the index: roughly 1
>> trillion links
>> * Size of output: over 300 TB, compressed!
>> * Number of cores used to run a single Map-Reduce job: over 10,000
>> * Raw disk used in the production cluster: over 5 Petabytes
>> 
> 



Re: FileOutputFormat which does not write key value?

2008-02-19 Thread Ted Dunning


I use 15.1 and it does work there.  Pity if we lost that capability.  Having
to take a structure apart and put together a new one just to move one field
out is a real pain and significantly increases garbage allocations.


On 2/19/08 2:08 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> Either I am doing something wrong or this does not work (I am using 0.16.0):
> 
> My class:
> 
> public class PermutationReduce extends MapReduceBase implements
> Reducer<Text, Text, Text, Text> {
> 
> public void reduce(Text key, Iterator<Text> values,
> OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
> while (values.hasNext()) {
> output.collect(null, values.next());
> }
> }
> }
> 
> the Exception:
> 
> java.lang.NullPointerException
> at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java
> :948)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$2.collect(
> MapTask.java:489)
> at org.permutation.PermutationReduce.reduce(PermutationReduce.java:16)
> at org.permutation.PermutationReduce.reduce(PermutationReduce.java:1)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(
> MapTask.java:522)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(
> MapTask.java:493)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java
> :713)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java
> :132)
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
> at org.permutation.Starter.main(Starter.java:37)
> 
> Since all I need is just to output all mapper emits (every value which
> enters output collector in Mapper) I thought I could use IdentityReducer.
> But it seems that this will not give me any option to suppress key in
> output.
> 
> Regards,
> Lukas
> 
> On Feb 19, 2008 11:00 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> 
>> 
>> Give a key of null to the reducer's output collector.
>> 
>> 
>> On 2/19/08 1:52 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:
>> 
>>> Hi,
>>> 
>>> I don't care about the key value in the output file. Is there any way I
>>> can suppress the key in the output?
>>> Is there a way to tell (Text)OutputFormat to write only the value, not
>>> the key? Or can I pass my own implementation of RecordWriter into
>>> FileOutputFormat?
>>> 
>>> Regards,
>>> Lukas
>> 
>> 
> 



Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Peter W.

Amazing milestone,

Looks like Y! had approximately 1B documents in the WebMap:

one trillion links=(10k million links/10 links per page)=1000 million  
pages=one billion.


If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has  
achieved one-tenth of its scale?


Good stuff,

Peter W.




On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:

The link inversion and ranking algorithms for Yahoo Search are now  
being generated on Hadoop:


http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html


Some Webmap size data:

* Number of links between pages in the index: roughly 1  
trillion links

* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes





Re: FileOutputFormat which does not write key value?

2008-02-19 Thread Lukas Vlcek
Hi,

Either I am doing something wrong or this does not work (I am using 0.16.0):

My class:

public class PermutationReduce extends MapReduceBase implements
Reducer<Text, Text, Text, Text> {

public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
while (values.hasNext()) {
output.collect(null, values.next());
}
}
}

the Exception:

java.lang.NullPointerException
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java
:948)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$2.collect(
MapTask.java:489)
at org.permutation.PermutationReduce.reduce(PermutationReduce.java:16)
at org.permutation.PermutationReduce.reduce(PermutationReduce.java:1)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(
MapTask.java:522)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(
MapTask.java:493)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java
:713)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java
:132)
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
at org.permutation.Starter.main(Starter.java:37)

Since all I need is just to output all mapper emits (every value which
enters output collector in Mapper) I thought I could use IdentityReducer.
But it seems that this will not give me any option to suppress key in
output.

Regards,
Lukas

On Feb 19, 2008 11:00 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> Give a key of null to the reducer's output collector.
>
>
> On 2/19/08 1:52 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I don't care about the key value in the output file. Is there any way
> > I can suppress the key in the output?
> > Is there a way to tell (Text)OutputFormat to write only the value, not
> > the key? Or can I pass my own implementation of RecordWriter into
> > FileOutputFormat?
> >
> > Regards,
> > Lukas
>
>


-- 
http://blog.lukas-vlcek.com/


Re: FileOutputFormat which does not write key value?

2008-02-19 Thread Ted Dunning

Give a key of null to the reducer's output collector.


On 2/19/08 1:52 PM, "Lukas Vlcek" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I don't care about the key value in the output file. Is there any way I can
> suppress the key in the output?
> Is there a way to tell (Text)OutputFormat to write only the value, not the
> key? Or can I pass my own implementation of RecordWriter into
> FileOutputFormat?
> 
> Regards,
> Lukas



FileOutputFormat which does not write key value?

2008-02-19 Thread Lukas Vlcek
Hi,

I don't care about the key value in the output file. Is there any way I can
suppress the key in the output?
Is there a way to tell (Text)OutputFormat to write only the value, not the
key? Or can I pass my own implementation of RecordWriter into
FileOutputFormat?

Regards,
Lukas

-- 
http://blog.lukas-vlcek.com/


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Garth Patil
Hi Owen,
A very impressive feat, and definitely the shining star of Hadoop's
scalability. I'd be interested to know what other problems Yahoo! has
solved in the process of scaling these jobs up to 10k cores that are not
represented by parts of Hadoop and other tools included in the
distribution. I wonder if there are other cluster provisioning,
management, and monitoring tools that Yahoo! uses that have contributed
to and made possible this great success.
Thank you,
Garth

On Feb 19, 2008 1:30 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
> On Feb 19, 2008, at 11:55 AM, Eric Zhang wrote:
>
> > This is very impressive.  Congrats!
> > Which version of Hadoop is this running on and what's the input
> > data size?
>
> They are running Hadoop-0.16.0...
>
> -- Owen
>


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Owen O'Malley


On Feb 19, 2008, at 11:55 AM, Eric Zhang wrote:


This is very impressive.  Congrats!.
Which version of Hadoop is this running on and what's the input  
data size?


They are running Hadoop-0.16.0...

-- Owen


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Jeff Hammerbacher
This is awesome, Owen.  Congratulations to the whole team!

On Feb 19, 2008 1:21 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Owen O'Malley wrote:
> > The link inversion and ranking algorithms for Yahoo Search are now being
> > generated on Hadoop:
> >
> >
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
> >
> >
> > Some Webmap size data:
> >
> > * Number of links between pages in the index: roughly 1 trillion
> links
> > * Size of output: over 300 TB, compressed!
> > * Number of cores used to run a single Map-Reduce job: over 10,000
> > * Raw disk used in the production cluster: over 5 Petabytes
> >
> >
>
> Truly impressive. IMHO this is something the project should boast about,
> i.e. include this data point in the scalability / performance section.
>
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Andrzej Bialecki

Owen O'Malley wrote:
The link inversion and ranking algorithms for Yahoo Search are now being 
generated on Hadoop:


http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html 



Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes




Truly impressive. IMHO this is something the project should boast about, 
i.e. include this data point in the scalability / performance section.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Miles Osborne
that 10k number is probably a large under-estimate; perhaps add an extra
zero to get something closer.

still, impressive stuff.

Miles

On 19/02/2008, Toby DiPasquale <[EMAIL PROTECTED]> wrote:
>
> On Feb 19, 2008 12:58 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> > The link inversion and ranking algorithms for Yahoo Search are now
> > being generated on Hadoop:
> >
> > http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
> >
> > Some Webmap size data:
> >
> >  * Number of links between pages in the index: roughly 1 trillion
> > links
> >  * Size of output: over 300 TB, compressed!
> >  * Number of cores used to run a single Map-Reduce job: over 10,000
>
>
> I thought I had read on this list before that Yahoo! was using
> quad-core machines for their Hadoop clusters. Does this mean there are
> ~2,500 machines in the cluster referred to above?
>
> --
>
> Toby DiPasquale
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Toby DiPasquale
On Feb 19, 2008 12:58 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> The link inversion and ranking algorithms for Yahoo Search are now
> being generated on Hadoop:
>
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>
> Some Webmap size data:
>
>  * Number of links between pages in the index: roughly 1 trillion
> links
>  * Size of output: over 300 TB, compressed!
>  * Number of cores used to run a single Map-Reduce job: over 10,000

I thought I had read on this list before that Yahoo! was using
quad-core machines for their Hadoop clusters. Does this mean there are
~2,500 machines in the cluster referred to above?

-- 
Toby DiPasquale


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Lukas Vlcek
Impressive! Considering that Hadoop is open source software in an early
stage of development, written in Java, could this be the *REAL* reason why
Microsoft wants to buy Yahoo!? :-)

Lukas

On Feb 19, 2008 8:55 PM, Eric Zhang <[EMAIL PROTECTED]> wrote:

> This is very impressive.  Congrats!
>
> Which version of Hadoop is this running on and what's the input data size?
>
> Eric
>
> Owen O'Malley wrote:
> > The link inversion and ranking algorithms for Yahoo Search are now
> > being generated on Hadoop:
> >
> >
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
> >
> >
> > Some Webmap size data:
> >
> > * Number of links between pages in the index: roughly 1 trillion
> > links
> > * Size of output: over 300 TB, compressed!
> > * Number of cores used to run a single Map-Reduce job: over 10,000
> > * Raw disk used in the production cluster: over 5 Petabytes
> >
> >
>
>


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Eric Zhang
This is very impressive.  Congrats!


Which version of Hadoop is this running on and what's the input data size?

Eric

Owen O'Malley wrote:
The link inversion and ranking algorithms for Yahoo Search are now 
being generated on Hadoop:


http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html 



Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion 
links

* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes






Re: Questions about the MapReduce libraries and job schedulers inside JobTracker and JobClient running on Hadoop

2008-02-19 Thread Eric Zhang
The class has package-level access, so it is not displayed in the 
javadoc.  The source code comes with the Hadoop installation under 
${HADOOP_INSTALLATION_DIR}/src/java/org/apache/hadoop/mapred.


Eric
Andy Li wrote:

Thanks for both inputs.  My question actually focuses more on what Vivek has
mentioned.

I would like to work on the JobClient to see how it submits jobs to
different file system and
slaves in the same Hadoop cluster.

Not sure if there is a complete document to explain the scheduler underneath
Hadoop;
if not, I'll wrap up what I know and study from the source code and submit
it to the community
once it is done.  Review and comments are welcome.

For the code, I couldn't find JobInProgress from the API index.  Could
anyone provide me
a pointer to this?  Thanks.

On Fri, Feb 15, 2008 at 3:01 PM, Vivek Ratan <[EMAIL PROTECTED]> wrote:

  

I read Andy's question a little differently. For a given job, the
JobTracker
decides which tasks go to which TaskTracker (the TTs ask for a task to run
and the JT decides which task is the most appropriate). Currently, the JT
favors a task whose input data is on the same host as the TT (if there is
more than one such task, it picks the one with the largest input size).
It
also looks at failed tasks and certain other criteria. This is very basic
scheduling and there is a lot of scope for improvement. There currently is
a
proposal to support rack awareness, so that if the JT can't find a task
whose input data is on the same host as the TT, it looks for a task whose
data is on the same rack.

You can clearly get more ambitious with your scheduling algorithm. As you
mention, you could use other criteria for scheduling a task: available CPU
or memory, for example. You could assign tasks to hosts that are the most
'free', or aim to distribute tasks across racks, or try some other load
balancing techniques. I believe there are a few discussions on these
methods
on Jira, but I don't think there's anything concrete yet.

BTW, the code that decides what task to run is primarily in
JobInProgress::findNewTask().


-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Friday, February 15, 2008 1:54 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions about the MapReduce libraries and job schedulers
inside JobTracker and JobClient running on Hadoop


Core-user is the right place for this question.

Your description is mostly correct.  Jobs don't necessarily go to all of
your boxes in the cluster, but they may.

Non-uniform machine specs are a bit of a problem that is being (has been?)
addressed by allowing each machine to have a slightly different
hadoop-site.xml file.  That would allow different settings for storage
configuration and number of processes to run.

Even without that, you can level the load a bit by simply running more
jobs
on the weak machines than you would otherwise prefer.  Most map reduce
programs are pretty light on memory usage so all that happens is that you
get less throughput on the weak machines.  Since there are normally more
map
tasks than cores, this is no big deal; slow machines get fewer tasks and
toward the end of the job, their tasks are even replicated on other
machines
in case they can be done more quickly.


On 2/15/08 1:25 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]

wrote:




Hello,

My first time posting this in the news group.  My question sounds more
like a MapReduce question instead of Hadoop HDFS itself.

To my understanding, the JobClient will submit all Mapper and Reduce
classes in a uniform way to the cluster?  Can I assume this is more like
a uniform scheduler for all the tasks?

For example, if I have a 100 node cluster, 1 master (namenode), 99
slaves (datanodes).
When I do
"JobClient.runJob(jconf)"
the JobClient will uniformly distribute all Mapper and Reduce classes
to all 99 nodes.

In the slaves, they will all have the same hadoop-site.xml and
hadoop-default.xml.
Here comes the main concern: what if some of the nodes don't have the
same hardware spec, such as memory or CPU speed?  E.g., different batch
purchases and repairs over time can cause this.

Is there any way that the JobClient can be aware of this and submit
different number of tasks to different slaves during start-up?
For example, some slaves have 16-core CPUs instead of 8 cores.
The problem I see here is that for the 16 cores, only 8 cores are
used.

P.S. I'm looking into the JobClient source code and
JobProfile/JobTracker to see if this can be done.
But not sure if I am on the right track.

If this topic is more likely to be in the [EMAIL PROTECTED],
please let me know.  I'll send another one to that news group.

Regards,
-Andy

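
A sketch of the per-machine tuning Ted mentions above; this is an
illustration, not from the thread. Each TaskTracker reads its own
hadoop-site.xml, so a 16-core box can advertise more task slots than an
8-core one. The property name below is the one from 0.15/0.16-era
hadoop-default.xml; later releases split it into separate map and reduce
maxima, so check the hadoop-default.xml of your release.

<!-- hadoop-site.xml override on the 16-core machines only -->
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>16</value>
</property>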

Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Torsten Curdt

Wow! Congrats!

On 19.02.2008, at 18:58, Owen O'Malley wrote:

The link inversion and ranking algorithms for Yahoo Search are now  
being generated on Hadoop:


http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html


Some Webmap size data:

* Number of links between pages in the index: roughly 1  
trillion links

* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes





Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Owen O'Malley
The link inversion and ranking algorithms for Yahoo Search are now  
being generated on Hadoop:


http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html


Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion  
links

* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
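
(For scale: 300 TB of compressed output over roughly a trillion links
works out to about 300 bytes per link.)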



Re: external jar using eclipse-plugin?

2008-02-19 Thread Tamer Elsayed
It is "hadoop-0.15.3-eclipse-plugin".

Tamer

On 2/19/08, Christophe Taton <[EMAIL PROTECTED]> wrote:
>
> Hi Tamer,
>
> Can you tell me which version of the plug-in you use?
> Unfortunately, I have not tried this kind of configuration yet, but I'll
> work on having it work...
>
> Thanks,
> Christophe
>
> On Feb 18, 2008 10:39 PM, Tamer Elsayed <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > This is a question about using external jars when running a Hadoop
> > mapreduce
> > job using eclipse-plugin. In my situation I want to use Lucene jar file.
> > The
> > code compiles fine on my machine since the jar file is added to the
> > project
> > external jars, but when I run it on Hadoop cluster, it gives me the
> > following error:
> > "Exception in thread "main" java.lang.NoClassDefFoundError:
> > org.apache.lucene.search.IndexSearcher"
> > which means that the jar file is not seen. I have tried to load it to
> HDFS
> > and use DistributedCache.addArchiveToClassPath but got the same error.
> The
> > code that needs Lucene is in both the controller and the mapper classes.
> >
> > Any clue to how to resolve this?
> >
> > Thanks in advance,
> > Tamer
> >
>



-- 
Proud to be a follower of the "Best of Mankind"
"وَاذْكُرْ رَبَّكَ إِذَا نَسِيتَ وَقُلْ عَسَى أَنْ يَهْدِيَنِي رَبِّي
لأقْرَبَ مِنْ هَذَا رَشَدًا"
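
A hedged sketch of the usual fix of this era for the jar problem above,
not from the thread: ship the jar with the job and put it on the task
classpath via DistributedCache. The path and driver class below are
hypothetical, and addFileToClassPath expects the jar to already be in
HDFS. Note this only helps the map/reduce side; the controller (the main
class) still needs Lucene on its own local classpath, which may be why
the eclipse-plugin run fails even with the cache configured. Packing the
jar into a lib/ directory inside the job jar is the other common approach.

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class JarShippingExample {  // hypothetical driver class
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(JarShippingExample.class);
    // The jar must already be uploaded to HDFS at this (hypothetical) path:
    DistributedCache.addFileToClassPath(new Path("/libs/lucene-core.jar"), conf);
    // ... configure mapper, input/output paths, then JobClient.runJob(conf)
  }
}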