How stable is the code? I could quite easily set an undergraduate project to do something with it, for example processing query logs.
Miles

On 04/02/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>
> This is a great opportunity for me to talk about the Groovy support that I
> have just gotten running.  I am looking for friendly testers as this code is
> definitely not ready for full release.
>
> The program you need in Groovy is this:
>
> // define the map-reduce function by specifying map and reduce functions
> logCount = Hadoop.mr(
>     {key, value, out, report -> out.collect(value.split()[0], 1)},
>     {keyword, counts, out, report ->
>         sum = 0
>         counts.each { sum += it }
>         out.collect(keyword, sum)
>     })
>
> // apply the function to an input file and collect the results in a map
> results = [:]
> logCount(inputFileEitherLocallyOnHDFS).eachLine {
>     line ->
>     parts = line.split("\t")
>     results[parts[0]] = parts[1].toInteger()
> }
>
> // sort the entries in the map by descending count and print the results
> for (x in results.entrySet().sort( {-it.value} )) {
>     println x
> }
>
> // delete the temporary results
> Hadoop.cleanup(results)
>
> The important points here are:
>
> 1) the Groovy binding lets you express the map-reduce part of your program
> simply.
>
> 2) collecting the results is trivial ... You don't have to worry about where
> or how the results are kept.  You would use the same code to read a local
> file as to read the results of the map-reduce computation.
>
> 3) because of (2), you can do some computation locally (the sort) and some
> in parallel (the counting).  You could easily translate the sort to a Hadoop
> call as well.
>
> I know that this doesn't quite answer the question because my groovy-hadoop
> bridge isn't available yet, but it hopefully will spark some interest.
>
> The question I would like to pose to the community is this:
>
> What is the best way to proceed with code like this that is not ready for
> prime time, but is ready for others to contribute and possibly also use?
> Should I follow the Jaql and Cascading course and build a separate
> repository and web site or should I try to add this as a contrib package
> like streaming?  Or should I just hand out source by hand for a little while
> to get feedback?
>
>
> On 2/4/08 2:04 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > Can someone guide me on how to write program using hadoop framework
> > that analyze the log files and find out the top most frequently
> > occurring keywords. The log file has the format -
> >
> > keyword source dateId
> >
> > Thanks,
> > Tarandeep
>
>

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.