issue map/reduce job to linux hadoop cluster from MS Windows, Eclipse

2008-12-13 Thread Songting Chen
Is it possible to do that?

I can access files in HDFS by specifying the URI below.
FileSystem fileSys = FileSystem.get(new URI("hdfs://server:9000"), conf); 

But I don't know how to do that for JobConf.
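To make the question concrete, this is roughly the kind of thing I am hoping is possible (the mapred.job.tracker property and the 9001 port are guesses on my part, and the paths are just placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class RemoteJobSubmit {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(RemoteJobSubmit.class);
    // Point the client at the remote cluster instead of the local defaults.
    conf.set("fs.default.name", "hdfs://server:9000");  // NameNode, as in the HDFS example above
    conf.set("mapred.job.tracker", "server:9001");      // JobTracker address -- port is a guess
    FileInputFormat.setInputPaths(conf, new Path("/user/songting/in"));   // placeholder paths
    FileOutputFormat.setOutputPath(conf, new Path("/user/songting/out"));
    JobClient.runJob(conf);  // submit to the remote JobTracker and wait for completion
  }
}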

Thanks,
-Songting


Re: Simple data transformations in Hadoop?

2008-12-13 Thread Delip Rao
On Sat, Dec 13, 2008 at 9:32 PM, Stuart White  wrote:
> (I'm quite new to hadoop and map/reduce, so some of these questions
> might not make complete sense.)
>
> I want to perform simple data transforms on large datasets, and it
> seems Hadoop is an appropriate tool.  As a simple example, let's say I
> want to read every line of a text file, uppercase it, and write it
> out.
>
> First question: would Hadoop be an appropriate tool for something like this?

Yes. Very appropriate.

>
> What is the best way to model this type of work in Hadoop?

Start with Hadoop's WordCount example in the tutorial and modify it to
your requirements.

>
> I'm thinking my mappers will accept a Long key that represents the
> byte offset into the input file, and a Text value that represents the
> line in the file.
>
> I *could* simply uppercase the text lines and write them to an output
> file directly in the mapper (and not use any reducers).  So, there's a
> question: is it considered bad practice to write output files directly
> from mappers?

Technically, you could do this by opening the file writer in
configure(), doing the writes in map(), and closing the writer in close().
But to me this seems contorted when the Hadoop framework already has
something straightforward.

>
> Assuming it's advisable in this example to write a file directly in
> the mapper - how should the mapper create a unique output partition
> file name?
> Is there a way for a mapper to know its index in the total
> # of mappers?

Use mapred.task.id to create a unique file name per mapper.
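For example, a rough sketch of that direct-write approach with the old mapred API (the /output directory is just a placeholder):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UppercaseDirectWriteMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private FSDataOutputStream out;

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // mapred.task.id is unique per task attempt, so file names never collide.
      out = fs.create(new Path("/output/part-" + job.get("mapred.task.id")));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
      throws IOException {
    out.writeBytes(value.toString().toUpperCase() + "\n");  // write directly, bypassing collect()
  }

  public void close() throws IOException {
    out.close();
  }
}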

>
> Assuming it's inadvisable to write a file directly in the mapper - I
> can output the records to the reducers using the same key and using
> the uppercased data as the value.  Then, in my reducer, should I write
> a file?  Or should I collect() the records in the reducers and let
> hadoop write the output?
>
> If I let hadoop write the output, is there a way to prevent hadoop
> from writing the key to the output file?  I may want to perform
> several transformations, one-after-another, on a set of data, and I
> don't want to place a superfluous key at the front of every record for
> each pass of the data.
>

Just use collect() and TextOutputFormat. The key, as you correctly
noted, is the byte offset of the line in the input file; when you do
collect(key, value), the records reaching each reducer arrive sorted by
that key. To keep a superfluous key from being prefixed to your records,
emit a NullWritable (or null) key: TextOutputFormat then writes only the
value, so each pass's output is just the transformed lines, ready to be
fed into the next pass.
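
A minimal sketch of that for your uppercase example (old mapred API): emit NullWritable as the key, and TextOutputFormat writes only the value.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UppercaseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<NullWritable, Text> output, Reporter reporter)
      throws IOException {
    // NullWritable keys are skipped by TextOutputFormat, so only the line is written.
    output.collect(NullWritable.get(), new Text(value.toString().toUpperCase()));
  }
}

The job just needs setOutputKeyClass(NullWritable.class) and
setOutputValueClass(Text.class); with setNumReduceTasks(0) it can even run
map-only, and the part-* files contain only the uppercased lines.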

> I appreciate any feedback anyone has to offer.
>


Simple data transformations in Hadoop?

2008-12-13 Thread Stuart White
(I'm quite new to hadoop and map/reduce, so some of these questions
might not make complete sense.)

I want to perform simple data transforms on large datasets, and it
seems Hadoop is an appropriate tool.  As a simple example, let's say I
want to read every line of a text file, uppercase it, and write it
out.

First question: would Hadoop be an appropriate tool for something like this?

What is the best way to model this type of work in Hadoop?

I'm thinking my mappers will accept a Long key that represents the
byte offset into the input file, and a Text value that represents the
line in the file.

I *could* simply uppercase the text lines and write them to an output
file directly in the mapper (and not use any reducers).  So, there's a
question: is it considered bad practice to write output files directly
from mappers?

Assuming it's advisable in this example to write a file directly in
the mapper - how should the mapper create a unique output partition
file name?  Is there a way for a mapper to know its index in the total
# of mappers?

Assuming it's inadvisable to write a file directly in the mapper - I
can output the records to the reducers using the same key and using
the uppercased data as the value.  Then, in my reducer, should I write
a file?  Or should I collect() the records in the reducers and let
hadoop write the output?

If I let hadoop write the output, is there a way to prevent hadoop
from writing the key to the output file?  I may want to perform
several transformations, one-after-another, on a set of data, and I
don't want to place a superfluous key at the front of every record for
each pass of the data.

I appreciate any feedback anyone has to offer.


NumMapTasks and NumReduceTasks with MapRunnable

2008-12-13 Thread Marcus Herou
Hi.

We are finally in the beta stage with our crawler and have tested it with a
few hundred thousand URLs. However, it performs worse than if we ran it
on a local machine without connecting to a Hadoop JobTracker.
Each crawl job is much like a Nutch Fetcher job: it spawns X threads
that all read from the same RecordReader and fetch the URLs they are
assigned.
However, I am not able to utilize all of our 9 machines at the same time,
which is really what I want, since this is an externally IO-bound job
(remote servers).

With a crawl list of just 9 URLs (stupidly small, I know), how can I make
sure that every machine is used at least once?
With a crawl list of 900, how can I make sure at least 100 are crawled at
the same time across all machines?
And so on with much bigger crawl lists (which is why we need Hadoop in the
first place).

Just as I write this, I launched a job where I manually set numMapTasks
to 9, and it seems to be fruitful, quite a fast crawl actually :) However,
I wonder if this is how I should think with all MapRunnables (I set the
job up roughly as in the snippet below)?
The next job we call is PersistOutLinks, and yes, it goes through a massive
list of source->target links and saves them in a DB.
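
For reference, this is roughly how I set up the fetch job right now (class names are ours, so read them as placeholders); as far as I understand, setNumMapTasks is only a hint to the framework:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class FetcherJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FetcherJob.class);
    conf.setJobName("fetch");
    // conf.setMapRunnerClass(FetcherMapRunner.class);  // our MapRunnable implementation (not shown here)
    conf.setNumMapTasks(9);    // one map task per machine -- a hint, not a guarantee
    conf.setNumReduceTasks(0); // the fetcher has no reduce phase
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}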

This list is at least 100 times larger than the Fetcher list. Is it still
smart to hardcode a value of 9 for numMapTasks for this MapRunnable job?
Or should I create some form of InputFormat.getSplits based on the
crawl/outlink sizes, along the lines of the sketch below? Of course,
numMapTasks is not literally hardcoded; it is injected into the
Configuration from a properties file.
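
That is, would something like this be better than the fixed 9? (A rough sketch; the property names are made up, and FileInputFormat treats the requested split count only as a hint.)

import java.io.IOException;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: scale the number of splits (and therefore map tasks) with the size
// of the URL list instead of hardcoding 9. Property names are invented here.
public class CrawlInputFormat extends TextInputFormat {
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    long totalUrls = job.getLong("crawler.total.urls", 900);
    int urlsPerTask = job.getInt("crawler.urls.per.task", 100);
    int wanted = (int) Math.max(1L, totalUrls / urlsPerTask);
    // FileInputFormat uses this only as a hint; actual splits also depend on block sizes.
    return super.getSplits(job, wanted);
  }
}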

Kindly

//Marcus





-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/