RE: Grouping Values for Reducer Input

2009-04-13 Thread jeremy.huylebroeck
I'm not familiar with setOutputValueGroupingComparator.
 
What about adding the doc# to the key and supplying your own
hashing/Partitioner?
So you would emit something like:
cat_doc5 -> 1
cat_doc1 -> 1
cat_doc5 -> 3
 
The hashing method would use everything before the "_" as the hash key.
 
 
The shuffle would still put the cat_xxx keys together on the same
reducer via your hashing, but sort them the way you need:
cat_doc1 -> 1
cat_doc5 -> 1
cat_doc5 -> 3
 
Then the reduce task can count, per doc#, within a "cat".
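
A minimal sketch of such a Partitioner against the 0.19-era API; the
class name and the value type are illustrative only (the thread's real
value type is Bill's own TermFrequencyWritable):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Partitions on the part of the key before "_" (e.g. "cat" in
// "cat_doc5"), so every cat_* record lands on the same reduce task
// while the full composite key still drives the sort order.
public class PrefixPartitioner implements Partitioner<Text, LongWritable> {

    public void configure(JobConf job) {
        // No configuration needed.
    }

    public int getPartition(Text key, LongWritable value, int numPartitions) {
        String s = key.toString();
        int cut = s.indexOf('_');
        String prefix = (cut == -1) ? s : s.substring(0, cut);
        // Mask off the sign bit so the modulo result is non-negative.
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered with conf.setPartitionerClass(PrefixPartitioner.class).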

 


From: Streckfus, William [USA] [mailto:streckfus_will...@bah.com] 
Sent: Monday, April 13, 2009 2:53 PM
To: core-user@hadoop.apache.org
Subject: Grouping Values for Reducer Input


Hi Everyone,
 
I'm working on a relatively simple MapReduce job with a slight
complication regarding the ordering of the key/values heading into the
reducer. The output from the mapper might be something like:
 
cat -> doc5, 1
cat -> doc1, 1
cat -> doc5, 3
...
 
Here, 'cat' is my key and the value is the document ID and the count (my
own WritableComparable). Originally I was going to create a HashMap in
the reduce method, add an entry for each document ID, and sum the counts
for each. I realized the method would be better if the values arrived in
order, like so:
 
cat -> doc1, 1
cat -> doc5, 1
cat -> doc5, 3
...
 
Using this style I can keep summing until I reach a new document ID and
collect the output at that point, thus avoiding extra data structures
and object-creation costs. I tried setting
JobConf.setOutputValueGroupingComparator(), but it didn't seem to do
anything. In fact, I threw an exception from the Comparator I supplied,
but it never showed up when running the job. My map output value
consists of a UTF string and a Long, so perhaps the Comparator I'm using
(identical to Text.Comparator) is incorrect:
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // Skip the vint length prefix at the start of each serialized
    // record, then compare the remaining bytes lexicographically.
    int n1 = WritableUtils.decodeVIntSize(b1[s1]);
    int n2 = WritableUtils.decodeVIntSize(b2[s2]);

    return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
}
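
For reference, in the 0.19 API the grouping comparator compares
serialized map output keys, not values: records whose keys compare equal
are folded into one reduce() call, but the order of the values is left
untouched. A minimal wiring sketch, with a hypothetical driver class:

JobConf conf = new JobConf(TermFrequencyJob.class); // hypothetical driver class
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(TermFrequencyWritable.class);

// Compares map output *keys* to decide which consecutive records share
// one reduce() call; it does not reorder the values themselves.
conf.setOutputValueGroupingComparator(Text.Comparator.class);

So for the values to arrive sorted by document ID, the doc ID has to
ride along in the key (as in the composite-key reply above), with the
partitioner and the grouping comparator both looking only at the word
part.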

In my final output I'm basically seeing the same word -> documentID pair
output multiple times. So for the above example I get multiple lines
with cat -> doc5, X.
 
Reducer method just in case:
public void reduce(Text key, Iterator<TermFrequencyWritable> values,
                   OutputCollector<Text, TermFrequencyWritable> output,
                   Reporter reporter) throws IOException {
    long sum = 0;
    String lastDocID = null;

    // Iterate through all values
    while (values.hasNext()) {
        TermFrequencyWritable value = values.next();

        // Encountered a new document ID => record and reset
        if (!value.getDocumentID().equals(lastDocID)) {
            // Ignore the first pass through
            if (sum != 0) {
                termFrequency.setDocumentID(lastDocID);
                termFrequency.setFrequency(sum);
                output.collect(key, termFrequency);
            }

            sum = 0;
            lastDocID = value.getDocumentID();
        }

        sum += value.getFrequency();
    }

    // Record the last one
    termFrequency.setDocumentID(lastDocID);
    termFrequency.setFrequency(sum);
    output.collect(key, termFrequency);
}

 
Any ideas? (Using Hadoop 0.19.1.)
 
Thanks,
- Bill


Orange Labs is hosting an event about recommendation engines - March 3rd

2009-02-25 Thread jeremy.huylebroeck

Hadoop fellows,

Orange Labs is hosting a forum about Recommendation Engines (not limited to 
video) in our South San Francisco lab.
We are now looking for more people interested in bringing their own experience 
and perspective to the discussion.

I am sure there are interesting things to learn from the Hadoop crowd
processing data for recommendation purposes, and to show what
large-scale middleware can bring to the picture.

It's on March 3rd at 4pm. It's free. We hope to see 40-50+ people.
The format is a few short presentations of key speakers and mostly open 
discussions. 

If interested, drop me an email with your company name and your
interest in the topic.
The event description is copied below.

Sorry for the spam.
(I'll also spam the Mahout list, btw...)

Jeremy Huylebroeck
Orange Labs, France Telecom group.


Come exchange ideas with Netflix, Clerkdogs, Baynote, Modista and more...
Orange Labs Presents: Recommendation Services Spotlight.
Tuesday, March 03, 2009 from 4:00 PM - 6:00 PM (PT)
South San Francisco, CA

Recommendations are driving everything from what you watch on TV to what shoes 
you're wearing. Recommendation technologies are an area of intense interest for 
us here at Orange Labs San Francisco, and we want to hear what you are thinking 
about this topic. 

We are convening a forum for exploring the latest trends and developments in 
the recommendations space, with leading thinkers such as Andreas Weigend
(former Amazon Chief Scientist), and companies such as Netflix, Baynote, Modista and
Clerkdogs in what promises to be a fascinating domain dive.

If you are a practitioner, or have an active interest in recommendations 
solutions, please join us for an interactive discussion on this exciting topic. 
Space is limited and priority will be given to active participants and 
researchers pursuing projects in this domain. 
 
As an integrated operator with over 100 million customers for mobile, Internet, 
and TV services, Orange has an active development program in recommendations 
technologies, and participants in this Spotlight can expect to interact with 
Orange investigators currently working on video recommendations and related 
problems. Join the discussion!

Doors open at 4:00pm
Discussions start at 4:30pm



Videos and slides of the HUG meetings?

2008-10-16 Thread jeremy.huylebroeck

Apparently Yahoo has been taking video/audio of all the presentations at
past HUG meetings.
Are they available somewhere?





Anybody used AppNexus for hosting Hadoop app?

2008-07-24 Thread jeremy.huylebroeck

I discovered AppNexus yesterday.
They offer hosting similar to Amazon EC2, apparently with more dedicated
hardware and a better notion of where things are in the datacenter.

Their web site says they are optimized for Hadoop applications.

Has anybody tried it and could give some feedback?
 
J.


RE: Hadoop Distributed Virtualisation

2008-06-06 Thread jeremy.huylebroeck
I see the VM approach as great for isolation, for customized Hadoop or
tools required by the jobs, and for ease of IT management.

The performance hit on CPU and I/O is there, but I never looked at the
numbers.
Has anybody?

Basically, for now, on EC2 for instance, if you need to go faster, you
just buy 50 more machines for a couple of hours... ;)