RE: Grouping Values for Reducer Input
I'm not familiar with setOutputValueGroupingComparator, but what about adding the doc# to the key and supplying your own hashing/Partitioner? So you'd emit something like:

  cat_doc5 -> 1
  cat_doc1 -> 1
  cat_doc5 -> 3

The hashing method would take everything before the "_" as the hash. The shuffle would still bring the catXXX keys together using your hashing, but sort them the way you need:

  cat_doc1 -> 1
  cat_doc5 -> 1
  cat_doc5 -> 3

Then the reduce task can count for each doc# within a "cat".

From: Streckfus, William [USA] [mailto:streckfus_will...@bah.com]
Sent: Monday, April 13, 2009 2:53 PM
To: core-user@hadoop.apache.org
Subject: Grouping Values for Reducer Input

Hi Everyone,

I'm working on a relatively simple MapReduce job with a slight complication regarding the ordering of the key/values heading into my reducer. The output from the mapper might be something like:

  cat -> doc5, 1
  cat -> doc1, 1
  cat -> doc5, 3
  ...

Here, 'cat' is my key and the value is the document ID and the count (my own WritableComparable). Originally I was going to create a HashMap in the reduce method, add an entry for each document ID, and sum the counts for each. Then I realized the method would be better if the values arrived in order, like so:

  cat -> doc1, 1
  cat -> doc5, 1
  cat -> doc5, 3
  ...

With this ordering I can keep summing until I reach a new document ID and collect the output at that point, avoiding extra data structures and object-creation costs. I tried setting JobConf.setOutputValueGroupingComparator(), but this didn't seem to do anything. In fact, I threw an exception from the Comparator I supplied, but it never showed up when running the job.
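The partitioning idea above could be sketched as follows in plain Java. This is only an illustration of the hashing logic, not a drop-in Hadoop class: TermPartitioner and its getPartition method are hypothetical names that mirror the shape of Hadoop's Partitioner.getPartition(key, value, numReduceTasks), with the composite key passed as a String.

```java
// Sketch: partition a composite "term_docID" key by the term alone, so
// every record for one term lands on the same reducer even though the
// full keys differ by document ID. Class and method names are
// illustrative, not part of any Hadoop API.
public class TermPartitioner {
    public static int getPartition(String compositeKey, int numReduceTasks) {
        int idx = compositeKey.indexOf('_');
        // Hash only the part before "_"; fall back to the whole key.
        String term = (idx >= 0) ? compositeKey.substring(0, idx) : compositeKey;
        // Mask off the sign bit so the partition index is non-negative.
        return (term.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

With this, cat_doc1 and cat_doc5 hash identically (both reduce to "cat"), so they reach the same reduce task, while the normal key sort keeps the doc IDs in order within the partition.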
My map output value consists of a UTF string and a long, so perhaps the Comparator I'm using (identical to Text.Comparator) is incorrect:

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
  }

In my final output I'm basically seeing the same word -> documentID output multiple times. So for the above example I have multiple lines with cat -> doc5, X.

Reducer method, just in case:

  public void reduce(Text key, Iterator<TermFrequencyWritable> values,
                     OutputCollector<Text, TermFrequencyWritable> output,
                     Reporter reporter) throws IOException {
      long sum = 0;
      String lastDocID = null;

      // Iterate through all values
      while (values.hasNext()) {
          TermFrequencyWritable value = values.next();

          // Encountered new document ID = record and reset
          if (!value.getDocumentID().equals(lastDocID)) {
              // Ignore first go through
              if (sum != 0) {
                  termFrequency.setDocumentID(lastDocID);
                  termFrequency.setFrequency(sum);
                  output.collect(key, termFrequency);
              }
              sum = 0;
              lastDocID = value.getDocumentID();
          }
          sum += value.getFrequency();
      }

      // Record last one
      termFrequency.setDocumentID(lastDocID);
      termFrequency.setFrequency(sum);
      output.collect(key, termFrequency);
  }

Any ideas (using Hadoop 0.19.1)?

Thanks,
- Bill
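The run-summing logic the reducer is aiming for can be isolated in plain Java, which makes the intended behavior easy to check: once values arrive sorted by document ID, one pass suffices, emitting a (docID, total) pair each time the ID changes. FrequencySummer and sumByDoc are hypothetical stand-ins for the Hadoop reduce() above; each input element is a {docID, count} pair.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;

// Sketch of the reducer's intended single-pass summing over values
// that are already sorted by document ID. Names are illustrative.
public class FrequencySummer {
    public static LinkedHashMap<String, Long> sumByDoc(List<String[]> sortedValues) {
        LinkedHashMap<String, Long> out = new LinkedHashMap<>();
        String lastDocID = null;
        long sum = 0;
        for (String[] v : sortedValues) {       // v = {docID, count}
            String docID = v[0];
            if (!docID.equals(lastDocID)) {
                // New run: emit the previous document's total, if any.
                if (lastDocID != null) {
                    out.put(lastDocID, sum);
                }
                lastDocID = docID;
                sum = 0;
            }
            sum += Long.parseLong(v[1]);
        }
        // Emit the final run.
        if (lastDocID != null) {
            out.put(lastDocID, sum);
        }
        return out;
    }
}
```

For the example input (doc1,1), (doc5,1), (doc5,3) this yields doc1=1 and doc5=4. Note the guard here is `lastDocID != null` rather than `sum != 0`, which sidesteps the edge case of a run that legitimately sums to zero.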
Orange Labs is hosting an event about recommendation engines - March 3rd
Hadoop fellows,

Orange Labs is hosting a forum about recommendation engines (not limited to video) in our South San Francisco lab. We are now looking for more people interested in bringing their own experience and perspective to the discussion. I am sure there are interesting things to learn from the Hadoop crowd processing data for recommendation purposes, and to show what large-scale middleware can bring to the picture.

It's on March 3rd at 4pm. It's free. We hope for 40-50+ people. The format is a few short presentations by key speakers and mostly open discussion. If interested, drop me an email with your company name and your interest in the topic. The event description is copied below. Sorry for the spam (I'll also spam the Mahout list, btw...).

Jeremy Huylebroeck
Orange Labs, France Telecom group.

Come and exchange with Netflix, Clerkdogs, Baynote, Modista and more...

Orange Labs Presents: Recommendation Services Spotlight
Tuesday, March 03, 2009, 4:00 PM - 6:00 PM (PT)
South San Francisco, CA

Recommendations are driving everything from what you watch on TV to what shoes you're wearing. Recommendation technologies are an area of intense interest for us here at Orange Labs San Francisco, and we want to hear what you are thinking about this topic. We are convening a forum to explore the latest trends and developments in the recommendations space, with leading thinkers such as Andreas Weigend (ex-Amazon Chief Scientist) and companies such as Netflix, Baynote, Modista and Clerkdogs, in what promises to be a fascinating domain dive. If you are a practitioner, or have an active interest in recommendation solutions, please join us for an interactive discussion on this exciting topic. Space is limited and priority will be given to active participants and researchers pursuing projects in this domain.
As an integrated operator with over 100 million customers for mobile, Internet, and TV services, Orange has an active development program in recommendation technologies, and participants in this Spotlight can expect to interact with Orange investigators currently working on video recommendations and related problems. Join the discussion!

Doors open at 4:00pm. Discussions start at 4:30pm.
Videos and slides of the HUG meetings?
Apparently Yahoo has been recording video/audio of the presentations at past HUG meetings. Are they available somewhere?
Anybody used AppNexus for hosting Hadoop app?
I discovered AppNexus yesterday. They offer hosting similar to Amazon EC2, with apparently more dedicated hardware and a better notion of where things are in the datacenter. Their web site says they are optimized for Hadoop applications. Has anybody tried it and could give some feedback? J.
RE: Hadoop Distributed Virtualisation
I see the VM approach as great for isolation, for customized Hadoop setups or tools required by the jobs, and for ease of IT management. There is a performance hit on CPU and I/O, but I never looked at the numbers. Has anybody? Basically, for now on EC2, for instance, if you need to go faster, just buy 50 more machines for a couple of hours... ;)