Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

Sky Fri, 20 Apr 2012 07:49:51 -0700

Thanks! That helped!

-----Original Message----- From: Michael Segel

Sent: Thursday, April 19, 2012 9:38 PM
To: common-user@hadoop.apache.org

Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

If the file is small enough you could read it in to a java object like a list and write your own input format that takes a list object as its input and then lets you specify the number of mappers.
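
To illustrate the suggestion above, here is a minimal plain-Java sketch (not Hadoop code) of cutting a small list of manifest URLs into fixed-size groups, each of which could be handed to one mapper. The names `ManifestSplitter` and `chunk`, and the group size of 10, are hypothetical. Hadoop also ships an `NLineInputFormat` that gives each mapper a fixed number of input lines, which may make a custom input format unnecessary.

```java
import java.util.ArrayList;
import java.util.List;

class ManifestSplitter {
    // Cut the input lines into consecutive groups of at most linesPerSplit entries.
    static List<List<String>> chunk(List<String> lines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerSplit) {
            int end = Math.min(start + linesPerSplit, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> manifests = new ArrayList<>();
        for (int i = 1; i <= 4000; i++) {
            manifests.add("storage:/root/" + i + ".manif.txt");
        }
        // Many small groups (10 lines each, an assumed value) rather than one
        // group per core, so a core that finishes early can pick up another group.
        List<List<String>> splits = chunk(manifests, 10);
        System.out.println(splits.size() + " splits");
    }
}
```

Choosing more groups than cores is one way to soften the "some cores finish early and stay idle" concern, and it does not require knowing the core count in advance.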


On Apr 19, 2012, at 11:34 PM, Sky wrote:

My file for the input to mapper is very small - as all it has is urls to list of manifests. The task for mappers is to fetch each manifest, and then fetch files using urls from the manifests and then process them. Besides passing around lists of files, I am not really accessing the disk. It should be RAM, network, and CPU (unzip, parse xml, extract attributes).
So is my only choice to break the input file and submit multiple files (if I have 15 cores, I should split the file with urls to 15 files? also how does it look in code?)? The two drawbacks are - some cores might finish early and stay idle, and I don’t know how to deal with dynamically increasing/decreasing cores.
Thx
- Sky

-----Original Message----- From: Michael Segel
Sent: Thursday, April 19, 2012 8:49 PM
To: common-user@hadoop.apache.org
Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
How 'large' or rather in this case small is your file?
If you're on a default system, the block sizes are 64MB. So if your file is ~<= 64MB, you end up with 1 block, and you will only have 1 mapper.
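
The arithmetic behind that remark can be sketched as follows. This is a rough estimate only: it assumes one split per block and ignores Hadoop's finer rules (configurable minimum/maximum split sizes and the slack allowed on the last split). The class name `SplitEstimate` and the figure of about 30 bytes per line are assumptions for illustration.

```java
class SplitEstimate {
    // Rough estimate: one split per block, rounding up; an empty file yields none.
    static long estimateSplits(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64MB default block size mentioned above
        long urlListSize = 4000L * 30;        // assumed: 4000 lines of roughly 30 bytes
        System.out.println("URL list: " + estimateSplits(urlListSize, blockSize) + " split(s)");
        System.out.println("200MB file: " + estimateSplits(200L * 1024 * 1024, blockSize) + " split(s)");
    }
}
```

Under these assumptions a 4000-line URL list is far below one block, which matches the single mapper observed.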
On Apr 19, 2012, at 10:10 PM, Sky wrote:
Thanks for your reply. After I sent my email, I found a fundamental defect in my understanding of how MR is distributed. I discovered that even though I was firing off 15 COREs, the map job - which is the most expensive part of my processing - was run only on 1 core.
To start my map job, I was creating a single file with following data:
1 storage:/root/1.manif.txt
2 storage:/root/2.manif.txt
3 storage:/root/3.manif.txt
...
4000 storage:/root/4000.manif.txt
I thought that each of the available COREs will be assigned a map job from top down from the same file one at a time, and as soon as one CORE is done, it would get the next map job. However, it looks like I need to split the file into as many pieces as there are COREs. Now while that’s clearly trivial to do, I am not sure how I can detect at runtime how many splits I need to do, and also to deal with adding new CORES at runtime. Any suggestions? (it doesn't have to be a file, it can be a list, etc).
This all would be much easier to debug, if somehow I could get my log4j logs for my mappers and reducers. I can see log4j for my main launcher, but not sure how to enable it for mappers and reducers.
Thx
- Akash


-----Original Message----- From: Robert Evans
Sent: Thursday, April 19, 2012 2:08 PM
To: common-user@hadoop.apache.org
Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
From what I can see your implementation seems OK, especially from a performance perspective. Depending on what storage: is, it is likely to be your bottleneck, not the hadoop computations.
Because you are writing files directly instead of relying on Hadoop to do it for you, you may need to deal with error cases that Hadoop will normally hide from you, and you will not be able to turn on speculative execution. Just be aware that a map or reduce task may have problems in the middle, and be relaunched. So when you are writing out your updated manifest be careful to not replace the old one until the new one is completely ready and will not fail, or you may lose data. You may also need to be careful in your reduce if you are writing directly to the file there too, but because it is not a read modify write, but just a write it is not as critical.
--Bobby Evans
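
One common way to follow the "do not replace the old manifest until the new one is completely ready" advice is to write to a temporary file and rename it into place. The sketch below uses the local filesystem through `java.nio.file`; whether the `storage:` scheme in this thread offers an equivalent atomic rename is unknown, so treat that as an assumption. The names `SafeManifestWriter`, `replaceManifest`, and the `.tmp` suffix are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

class SafeManifestWriter {
    static void replaceManifest(Path manifest, List<String> updatedLines) throws IOException {
        // Write next to the target so the rename stays on the same filesystem.
        Path tmp = manifest.resolveSibling(manifest.getFileName() + ".tmp");
        Files.write(tmp, updatedLines, StandardCharsets.UTF_8);
        // Swap it in only after the write succeeded; if the task died earlier,
        // the old manifest is still untouched.
        Files.move(tmp, manifest,
                StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("manifests");
        Path manifest = dir.resolve("1.manif.txt");
        Files.write(manifest, Arrays.asList("ebook 1334"), StandardCharsets.UTF_8);
        replaceManifest(manifest,
                Arrays.asList("ebook 1334 -> publisher=aaa year=bbb, ebook-version=2.01"));
        System.out.println(Files.readAllLines(manifest, StandardCharsets.UTF_8));
    }
}
```

If a relaunched task attempt could run at the same time as the original, a temporary name that is unique per attempt would avoid two writers colliding on the same `.tmp` file.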

On 4/18/12 4:56 PM, "Sky USC" <sky...@hotmail.com> wrote:
Please help me architect the design of my first significant MR task beyond "word count". My program works well, but I am trying to optimize performance to maximize use of available computing resources. I have 3 questions at the bottom.
Project description in an abstract sense (written in java):
* I have MM number of MANIFEST files available on storage:/root/1.manif.txt to 4000.manif.txt
* Each MANIFEST in turn contains variable number "EE" of URLs to EBOOKS (range could be 10,000 - 50,000 EBOOKS urls per MANIFEST) -- stored on storage:/root/1.manif/1223.folder/5443.Ebook.ebk
So we are talking about millions of ebooks

My task is to:
1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: publisher, year, ebook-version).
2. Update each of the EBOOK entry record in the manifest - with the 3 attributes (eg: ebook 1334 -> publisher=aaa year=bbb, ebook-version=2.01)
3. Create an output file such that the file named "<publisher>_<year>_<ebook-version>" contains a list of all "ebook urls" that met that criteria.
example:
File "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" contains:
storage:/root/1.manif/1223.folder/2143.Ebook.ebk
storage:/root/2.manif/2133.folder/5449.Ebook.ebk
storage:/root/2.manif/2133.folder/5450.Ebook.ebk
etc..

and File "storage:/root/summary/PENGUIN_2001_3.12.txt" contains:
storage:/root/19.manif/2223.folder/4343.Ebook.ebk
storage:/root/13.manif/9733.folder/2149.Ebook.ebk
storage:/root/21.manif/3233.folder/1110.Ebook.ebk

etc

4. finally, I also want to output statistics such that:
<publisher>_<year>_<ebook-version>  <COUNT_OF_URLs>
PENGUIN_2001_3.12     250,111
RANDOMHOUSE_1999_2.01  11,322
etc

Here is how I implemented:
* My launcher gets list of MM manifests
* My Mapper gets one manifest.
--- It reads the manifest, within a WHILE loop,
 --- fetches each EBOOK,  and obtain attributes from each ebook,
 --- updates the manifest for that ebook
--- context.write(new Text("RANDOMHOUSE_1999_2.01"), new Text("storage:/root/1.manif/1223.folder/2143.Ebook.ebk"))
--- Once all ebooks in the manifest are read, it saves the updated Manifest, and exits
* My Reducer gets the "RANDOMHOUSE_1999_2.01" and a list of ebooks urls.
--- It writes a new file "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" with all the storage urls for the ebooks
--- It also does a context.write(new Text("RANDOMHOUSE_1999_2.01"), new IntWritable(SUM_OF_ALL_EBOOK_URLS_FROM_THE_LIST))
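
The map/reduce flow described above can be mimicked in plain Java to check the logic without a cluster. This is only a sketch of the data flow, not Hadoop code: `SummarySketch`, `groupByKey`, and `countPerKey` are hypothetical names, and the key/URL pairs stand in for what the mapper emits with context.write().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SummarySketch {
    // "Shuffle": collect every ebook url under its <publisher>_<year>_<ebook-version> key.
    static Map<String, List<String>> groupByKey(List<String[]> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // "Reduce": the statistic per key is simply the size of its url list.
    static Map<String, Integer> countPerKey(Map<String, List<String>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = Arrays.asList(
            new String[] {"RANDOMHOUSE_1999_2.01", "storage:/root/1.manif/1223.folder/2143.Ebook.ebk"},
            new String[] {"PENGUIN_2001_3.12", "storage:/root/19.manif/2223.folder/4343.Ebook.ebk"},
            new String[] {"RANDOMHOUSE_1999_2.01", "storage:/root/2.manif/2133.folder/5449.Ebook.ebk"});
        Map<String, List<String>> grouped = groupByKey(mapOutput);
        for (Map.Entry<String, Integer> e : countPerKey(grouped).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```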
As I mentioned, it's working. I launch it on 15 elastic instances. I have three questions:
1. Is this the best way to implement the MR logic?
2. I don't know if each of the instances is getting one task or multiple tasks simultaneously for the MAP portion. If it is not getting multiple MAP tasks, should I go with the route of "multithreaded" reading of ebooks from each manifest? It's not efficient to read just one ebook at a time per machine. Is "Context.write()" threadsafe?
3. I can see log4j logs for main program, but no visibility into logs for Mapper or Reducer. Any idea?
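
For question 2, a multithreaded fetch inside one map task could look roughly like the plain-Java sketch below. It sidesteps the thread-safety question by funnelling every write through one synchronized method. Everything here is hypothetical: `ParallelFetchSketch`, `fetchAttributes` (a placeholder that does no real fetching), and `write` (a stand-in for context.write()). Hadoop also ships a `MultithreadedMapper` class aimed at this kind of IO-bound map work, which may be worth a look before hand-rolling threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelFetchSketch {
    private final List<String> output = new ArrayList<>();

    // Stand-in for context.write(); synchronized so only one thread writes at a time.
    private synchronized void write(String key, String url) {
        output.add(key + "\t" + url);
    }

    // Hypothetical placeholder for "fetch the ebook and extract its 3 attributes".
    private String fetchAttributes(String url) {
        return "RANDOMHOUSE_1999_2.01";
    }

    List<String> processManifest(List<String> ebookUrls, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String url : ebookUrls) {
            // Each ebook is fetched on a pool thread; only the write is serialized.
            pool.execute(() -> write(fetchAttributes(url), url));
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("fetches did not finish in time");
        }
        synchronized (this) {
            return new ArrayList<>(output);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            urls.add("storage:/root/1.manif/1223.folder/" + i + ".Ebook.ebk");
        }
        System.out.println(new ParallelFetchSketch().processManifest(urls, 4).size() + " records written");
    }
}
```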
