I have created a Jira for the Groovy integration.

https://issues.apache.org/jira/browse/HADOOP-2781

As soon as I can clear the licenses, I will post the code.


On 2/4/08 4:06 PM, "Colin Evans" <[EMAIL PROTECTED]> wrote:

> Hi Ted,
> I've been building out a similar framework in JavaScript (Rhino) for
> work that I've been doing at MetaWeb, and we've been thinking about open
> sourcing it too.  It's pretty clear that there are major benefits to
> using a dynamic scripting language with Hadoop.
> 
> I'd love to see how you've tackled this problem and would be interested
> in contributing work to this too.
> 
> -Colin
> 
> 
> 
> Ted Dunning wrote:
>> This is a great opportunity for me to talk about the Groovy support that I
>> have just gotten running.  I am looking for friendly testers as this code is
>> definitely not ready for full release.
>> 
>> The program you need in Groovy is this:
>> 
>> // define the map-reduce function by specifying map and reduce functions
>> logCount = Hadoop.mr(
>>    {key, value, out, report -> out.collect(value.split()[0], 1)},
>>    {keyword, counts, out, report ->
>>       sum = 0
>>       counts.each { sum += it }
>>       out.collect(keyword, sum)
>>    })
>> 
>> // apply the function to an input file and collect the results in a map
>> results = [:]
>> logCount(inputFileEitherLocallyOnHDFS).eachLine {
>>     line ->
>>       parts = line.split("\t")
>>       results[parts[0]] = parts[1] as int
>> }
>> 
>> // sort the entries in the map by descending count and print the results
>> for (x in results.entrySet().sort( {-it.value} )) {
>>    println x
>> }
>> 
>> // delete the temporary results
>> Hadoop.cleanup(results)
>> 
>> The important points here are:
>> 
>> 1) the Groovy binding lets you express the map-reduce part of your program
>> simply.
>> 
>> 2) collecting the results is trivial ... You don't have to worry about where
>> or how the results are kept.  You would use the same code to read a local
>> file as to read the results of the map-reduce computation.
>> 
>> 3) because of (2), you can do some computation locally (the sort) and some
>> in parallel (the counting).  You could easily translate the sort to a Hadoop
>> call as well.
>> 
>> I know that this doesn't quite answer the question because my groovy-hadoop
>> bridge isn't available yet, but it hopefully will spark some interest.
>> 
>> The question I would like to pose to the community is this:
>> 
>>   What is the best way to proceed with code like this that is not ready for
>> prime time, but is ready for others to contribute and possibly also use?
>> Should I follow the Jaql and Cascading course and build a separate
>> repository and web site or should I try to add this as a contrib package
>> like streaming?  Or should I just hand out source by hand for a little while
>> to get feedback?
>> 
>> 
>> On 2/4/08 2:04 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>> 
>>   
>>> Hi,
>>> 
>>> Can someone guide me on how to write a program, using the Hadoop
>>> framework, that analyzes the log files and finds the most frequently
>>> occurring keywords? The log file has the format -
>>> 
>>> keyword source dateId
>>> 
>>> Thanks,
>>> Tarandeep
>>>     
>> 
>>   
> 
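For readers without access to the Groovy bridge, the count-then-sort pipeline Ted describes can be sketched as a local, in-memory simulation in plain Java. This is only an illustration of the map/reduce logic, not the bridge or the real Hadoop API; the class and method names here are invented for the example, and it assumes whitespace-separated log lines with the keyword in the first field:

```java
import java.util.*;
import java.util.stream.*;

public class KeywordCount {
    // "Map" phase: emit (keyword, 1) for each log line, taking the first
    // whitespace-separated field as the keyword, and group by keyword.
    static Map<String, List<Integer>> map(List<String> lines) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines) {
            String keyword = line.split("\\s+")[0];
            grouped.computeIfAbsent(keyword, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // "Reduce" phase: sum the counts collected for each keyword.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> results = new HashMap<>();
        grouped.forEach((keyword, counts) ->
            results.put(keyword, counts.stream().mapToInt(Integer::intValue).sum()));
        return results;
    }

    // Local post-processing, as in the Groovy script: sort by descending count.
    static List<Map.Entry<String, Integer>> topKeywords(Map<String, Integer> results) {
        return results.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Fabricated sample data in the "keyword source dateId" format.
        List<String> log = Arrays.asList(
            "hadoop site1 20080204",
            "groovy site2 20080204",
            "hadoop site3 20080205");
        for (Map.Entry<String, Integer> e : topKeywords(reduce(map(log)))) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In a real Hadoop job the grouping between map and reduce is done by the framework's shuffle, and only the two closure bodies (emit and sum) would be user code, which is exactly what the Groovy binding exposes.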
