How stable is the code?  I could quite easily set an undergraduate project
to do something with it, for example processing query logs.


On 04/02/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
> This is a great opportunity for me to talk about the Groovy support that I
> have just gotten running.  I am looking for friendly testers as this code
> is definitely not ready for full release.
>
> The program you need in Groovy is this:
>
> // define the map-reduce function by specifying map and reduce functions
> logCount =
>    {key, value, out, report -> out.collect(value.split()[0], 1)},
>    {keyword, counts, out, report ->
>       sum = 0;
>       counts.each { sum += it}
>       out.collect(keyword, sum)
>    })
> // apply the function to an input file and collect the results in a map
> results = [:]
> logCount(inputFileEitherLocallyOnHDFS).eachLine {
>     line ->
>       parts = line.split("\t")
>       results[parts[0]] = parts[1].toInteger()
> }
> // sort the entries in the map by descending count and print the results
> for (x in results.entrySet().sort( {-it.value} )) {
>    println x
> }
> // delete the temporary results
> Hadoop.cleanup(results)
>
> The important points here are:
>
> 1) The Groovy binding lets you express the map-reduce part of your program
>    simply.
>
> 2) Collecting the results is trivial.  You don't have to worry about where
>    or how the results are kept.  You would use the same code to read a
>    local file as to read the results of the map-reduce computation.
>
> 3) Because of (2), you can do some computation locally (the sort) and some
>    in parallel (the counting).  You could easily translate the sort to a
>    Hadoop call as well.
>
> I know that this doesn't quite answer the question because my groovy-hadoop
> bridge isn't available yet, but it hopefully will spark some interest.
>
> The question I would like to pose to the community is this:
>
>   What is the best way to proceed with code like this that is not ready for
> prime time, but is ready for others to contribute and possibly also use?
> Should I follow the Jaql and Cascading course and build a separate
> repository and web site, or should I try to add this as a contrib package
> like streaming?  Or should I just hand out source by hand for a little
> while to get feedback?
>
> On 2/4/08 2:04 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Can someone guide me on how to write a program using the Hadoop framework
> > that analyzes log files and finds the most frequently occurring
> > keywords?  The log file has the format:
> >
> > keyword source dateId
> >
> > Thanks,
> > Tarandeep
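
Since the groovy-hadoop bridge is not available yet, the counting and sorting
logic from the quoted program can be tried out locally first.  The following
is only a rough single-process sketch in plain Groovy, with no Hadoop
involved.  The sample data and the names sampleLog, mapLine and reduceCounts
are made up for illustration and are not part of the bridge or of Hadoop.

```groovy
// Local sketch of the keyword count; nothing here touches Hadoop.
// sampleLog is invented data in the "keyword source dateId" format.
def sampleLog = '''hadoop web 20080201
groovy web 20080201
hadoop mail 20080202
hadoop web 20080203
groovy mail 20080203
pig web 20080203
'''

// "map" step (hypothetical helper): emit the first whitespace-separated field
def mapLine = { line -> line.split()[0] }

// "reduce" step (hypothetical helper): add up the counts for one keyword
def reduceCounts = { counts -> counts.sum() }

// run the map step over every non-blank line
def lines = sampleLog.readLines().findAll { it.trim() }
def keywords = lines.collect(mapLine)

// group identical keywords together, standing in for Hadoop's shuffle phase
def grouped = keywords.groupBy { it }

// run the reduce step once per keyword, feeding it a 1 for each occurrence
def results = [:]
grouped.each { keyword, occurrences ->
    results[keyword] = reduceCounts(occurrences.collect { 1 })
}

// sort by descending count, as in the quoted program, and print
def top = results.entrySet().sort { -it.value }
top.each { println "${it.key}\t${it.value}" }
```

Only mapLine and reduceCounts correspond to work that would run in parallel
under Hadoop; the grouping and the final sort stand in for the shuffle and
for the local post-processing described in point (3) of the quoted message.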

The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
