Re: hadoop: how to find top N frequently occurring words

Miles Osborne Mon, 04 Feb 2008 14:41:27 -0800

How stable is the code?  I could quite easily set an undergraduate project
to do something with it, for example processing query logs.


Miles

On 04/02/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>
> This is a great opportunity for me to talk about the Groovy support that I
> have just gotten running.  I am looking for friendly testers as this code is
> definitely not ready for full release.
>
> The program you need in Groovy is this:
>
> // define the map-reduce function by specifying map and reduce closures
> logCount = Hadoop.mr(
>    // map: emit (keyword, 1), where the keyword is the first field of the line
>    {key, value, out, report -> out.collect(value.split()[0], 1)},
>    // reduce: sum the counts collected for each keyword
>    {keyword, counts, out, report ->
>       sum = 0;
>       counts.each { sum += it }
>       out.collect(keyword, sum)
>    })
>
> // apply the function to an input file and collect the results in a map
> results = [:]
> logCount(inputFileEitherLocallyOnHDFS).eachLine {
>     line ->
>       parts = line.split("\t")
>       // store the count as a number so that sorting by value is numeric
>       results[parts[0]] = parts[1].toInteger()
> }
>
> // sort the entries in the map by descending count and print the results
> for (x in results.entrySet().sort( {-it.value} )) {
>    println x
> }
>
> // delete the temporary results
> Hadoop.cleanup(results)
>
> The important points here are:
>
> 1) the Groovy binding lets you express the map-reduce part of your program
> simply.
>
> 2) collecting the results is trivial ... You don't have to worry about where
> or how the results are kept.  You would use the same code to read a local
> file as to read the results of the map-reduce computation.
>
> 3) because of (2), you can do some computation locally (the sort) and some
> in parallel (the counting).  You could easily translate the sort to a Hadoop
> call as well.
>
> I know that this doesn't quite answer the question because my Groovy-Hadoop
> bridge isn't available yet, but hopefully it will spark some interest.
>
> The question I would like to pose to the community is this:
>
>   What is the best way to proceed with code like this that is not ready for
> prime time, but is ready for others to contribute and possibly also use?
> Should I follow the Jaql and Cascading course and build a separate
> repository and web site, or should I try to add this as a contrib package
> like streaming?  Or should I just hand out source by hand for a little while
> to get feedback?
>
>
> On 2/4/08 2:04 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > Can someone guide me on how to write a program using the Hadoop framework
> > that analyzes log files and finds the top N most frequently occurring
> > keywords? The log file has the format:
> >
> > keyword source dateId
> >
> > Thanks,
> > Tarandeep
>
>
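
For anyone who wants to try the counting logic before the Groovy-Hadoop
bridge is available, here is a minimal local sketch in plain Groovy. It does
not use Hadoop or the bridge's API at all; it only mirrors the three steps of
the quoted program (emit the first field, sum per keyword, sort by descending
count). The name topKeywords, the sample lines, and the assumption that the
"keyword source dateId" fields are whitespace-separated are illustrative, not
taken from the thread.

```groovy
// Local stand-in for the map-reduce job: count keywords and return the top n.
// Assumes each line looks like "keyword source dateId", whitespace-separated.
def topKeywords(List<String> lines, int n) {
    def counts = [:].withDefault { 0 }
    lines.each { line ->
        def trimmed = line.trim()
        if (trimmed) {
            def keyword = trimmed.split(/\s+/)[0]  // "map" step: pick the keyword
            counts[keyword] += 1                   // "reduce" step: sum the ones
        }
    }
    // Sort by descending count; break ties alphabetically by keyword.
    counts.entrySet()
          .sort { a, b -> b.value <=> a.value ?: a.key <=> b.key }
          .take(n)
          .collect { [it.key, it.value] }
}

// Hypothetical sample data in the format described in the original question.
def sample = [
    "hadoop web 20080204",
    "groovy web 20080204",
    "hadoop mail 20080203",
    "pig web 20080204",
    "hadoop web 20080202",
    "groovy mail 20080201",
]

topKeywords(sample, 2).each { println "${it[0]}\t${it[1]}" }
```

On a real cluster the counting would be the parallel part and only the final
sort and cut-off would run locally, as in point 3 of the quoted message.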


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
