Issuing a map/reduce job to a Linux hadoop cluster from MS Windows / Eclipse
Is it possible to do that? I can access files on HDFS by specifying the URI below:

    FileSystem fileSys = FileSystem.get(new URI("hdfs://server:9000"), conf);

But I don't know how to do the same for JobConf.

Thanks,
-Songting
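For the archives: with the old mapred API, a JobConf can be pointed at a remote cluster the same way, by setting the NameNode and JobTracker addresses on the configuration. A minimal hedged sketch follows - "server" and port 9001 are placeholders for the actual cluster, and the mapper/input/output setup is omitted; the usual stumbling block from Eclipse on Windows is that the job jar must still be built and shipped with the job:

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class RemoteSubmit {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Same idea as the FileSystem URI: point at the remote NameNode...
        conf.set("fs.default.name", "hdfs://server:9000");
        // ...and at the remote JobTracker (port 9001 is an assumption).
        conf.set("mapred.job.tracker", "server:9001");
        // The jar containing the job classes must be findable, e.g. via:
        conf.setJarByClass(RemoteSubmit.class);
        // Mapper, input and output paths elided; then submit as usual:
        JobClient.runJob(conf);
    }
}
```

Without a jar on the classpath (the common case when running straight from an Eclipse workspace), the remote TaskTrackers cannot load the job classes, so building the jar first is part of the workflow.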
Re: Simple data transformations in Hadoop?
On Sat, Dec 13, 2008 at 9:32 PM, Stuart White wrote:
> (I'm quite new to hadoop and map/reduce, so some of these questions
> might not make complete sense.)
>
> I want to perform simple data transforms on large datasets, and it
> seems Hadoop is an appropriate tool. As a simple example, let's say I
> want to read every line of a text file, uppercase it, and write it out.
>
> First question: would Hadoop be an appropriate tool for something like this?

Yes. Very appropriate.

> What is the best way to model this type of work in Hadoop?

Start with Hadoop's WordCount example in the tutorial and modify it to your requirements.

> I'm thinking my mappers will accept a Long key that represents the
> byte offset into the input file, and a Text value that represents the
> line in the file.
>
> I *could* simply uppercase the text lines and write them to an output
> file directly in the mapper (and not use any reducers). So, there's a
> question: is it considered bad practice to write output files directly
> from mappers?

Technically, you could do this by opening the file writer in configure(), doing the writes in map(), and closing the writer in close(). But to me this appears contorted when the Hadoop framework already has something straightforward.

> Assuming it's advisable in this example to write a file directly in
> the mapper - how should the mapper create a unique output partition
> file name? Is there a way for a mapper to know its index in the total
> # of mappers?

Use mapred.task.id to create a unique name per mapper.

> Assuming it's inadvisable to write a file directly in the mapper - I
> can output the records to the reducers using the same key and using
> the uppercased data as the value. Then, in my reducer, should I write
> a file? Or should I collect() the records in the reducers and let
> hadoop write the output?
>
> If I let hadoop write the output, is there a way to prevent hadoop
> from writing the key to the output file?
> I may want to perform several transformations, one-after-another, on
> a set of data, and I don't want to place a superfluous key at the
> front of every record for each pass of the data.

Just use collect() and TextOutputFormat. The key, as you correctly noted, is the byte offset of the line in the input file; it is only used to route and sort records, not to position the output. To keep a superfluous key from being prefixed to your records, emit a NullWritable (or null) key - TextOutputFormat then writes only the value. Another way to think of this is that with TextOutputFormat, the 'value's from collect() are simply appended, in order, to the reduce output.

> I appreciate any feedback anyone has to offer.
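To make the advice in this thread concrete, here is a hedged sketch against the old (0.18-era) mapred API: a map-only job (zero reducers), so each mapper's output is written directly by the framework under a unique part-NNNNN name, and NullWritable keys so TextOutputFormat emits the bare uppercased lines. Class and path names are illustrative, not from the thread:

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class Uppercase {
    public static class UpperMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<NullWritable, Text> out,
                        Reporter reporter) throws IOException {
            // NullWritable key: TextOutputFormat writes only the value,
            // so no key/tab is prefixed to each record.
            out.collect(NullWritable.get(),
                        new Text(line.toString().toUpperCase()));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(Uppercase.class);
        conf.setMapperClass(UpperMapper.class);
        conf.setNumReduceTasks(0);  // map-only: no shuffle, no reduce
        conf.setOutputKeyClass(NullWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```

With zero reduce tasks, the framework takes care of the unique per-mapper file names that the original question asked about, so there is no need to open writers by hand in configure()/close().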
Simple data transformations in Hadoop?
(I'm quite new to hadoop and map/reduce, so some of these questions might not make complete sense.)

I want to perform simple data transforms on large datasets, and it seems Hadoop is an appropriate tool. As a simple example, let's say I want to read every line of a text file, uppercase it, and write it out.

First question: would Hadoop be an appropriate tool for something like this? What is the best way to model this type of work in Hadoop?

I'm thinking my mappers will accept a Long key that represents the byte offset into the input file, and a Text value that represents the line in the file.

I *could* simply uppercase the text lines and write them to an output file directly in the mapper (and not use any reducers). So, there's a question: is it considered bad practice to write output files directly from mappers?

Assuming it's advisable in this example to write a file directly in the mapper - how should the mapper create a unique output partition file name? Is there a way for a mapper to know its index in the total # of mappers?

Assuming it's inadvisable to write a file directly in the mapper - I can output the records to the reducers using the same key and using the uppercased data as the value. Then, in my reducer, should I write a file? Or should I collect() the records in the reducers and let hadoop write the output?

If I let hadoop write the output, is there a way to prevent hadoop from writing the key to the output file? I may want to perform several transformations, one-after-another, on a set of data, and I don't want to place a superfluous key at the front of every record for each pass of the data.

I appreciate any feedback anyone has to offer.
NumMapTasks and NumReduceTasks with MapRunnable
Hi. We are finally in the beta stage with our crawler and have tested it with a few hundred thousand urls. However, it performs worse than if we ran it on a local machine without connecting to a hadoop JobTracker.

Each crawl job is fairly similar to a Nutch Fetcher job: it spawns X threads which all read from the same RecordReader and fetch the current url assigned. However, I am not able to utilize all of our 9 machines at the same time, which is really preferable since this is an externally IO-bound job (remote servers).

How can I, with a crawl list of just 9 urls (stupidly small, I know), make sure that all machines are used at least once? With a crawl list of 900, how can I make sure at least 100 are crawled at the same time on all machines? And so on with much bigger crawl lists (which is why we need hadoop anyway).

Just as I write this, I launched a job where I manually set numMapTasks to 9 and it seems to be fruitful - quite a fast crawl actually :) However, I wonder if this is how I should think with all MapRunnables?

The next job we call is PersistOutLinks, and yep, it goes through a massive list of source->target links and saves them in a DB. This list is at least 100 times larger than the Fetcher list. Is it still smart to hardcode the value 9 into numMapTasks for this MapRunnable job? Or should I create some form of InputFormat.getInputSplits based on the crawl/outlink sizes? Of course the numMapTasks are not actually hardcoded; they are injected into the Configuration from a properties file.

Kindly
//Marcus

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
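For reference, a hedged sketch of the two knobs in play here (old mapred API; the helper class and method names are invented for illustration). setNumMapTasks() is only a hint that the InputFormat may ignore, whereas NLineInputFormat guarantees one split - and hence one map task - per N lines of input, which maps naturally onto "N urls per fetcher":

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class CrawlJobSetup {
    public static JobConf configure(JobConf conf, int machines, int urlsPerMapper) {
        // Hint only: the InputFormat may still produce fewer (or more) splits,
        // e.g. a small crawl list can collapse into a single split.
        conf.setNumMapTasks(machines);

        // Hard guarantee: one map task per urlsPerMapper lines of the crawl
        // list, regardless of file size, so small lists still fan out.
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", urlsPerMapper);
        return conf;
    }
}
```

For the 900-url case in the message, linespermap of 100 would yield 9 map tasks; this scales with the list size, so it avoids hardcoding 9 into every job the way a bare numMapTasks value would.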