On Feb 4, 2008 2:40 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> How stable is the code? I could quite easily set some undergraduate project
> to do something with it, for example process query logs
>
I started learning and using Hadoop a few days ago. The program I have is
similar to word count, except that it processes a query log in a special
format. I have another program that reads the output of the first one and
computes the top N keywords. I want to combine them into one program (a
single map-reduce).

-Taran

> Miles
>
>
> On 04/02/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> >
> > This is a great opportunity for me to talk about the Groovy support that I
> > have just gotten running. I am looking for friendly testers as this code
> > is definitely not ready for full release.
> >
> > The program you need in Groovy is this:
> >
> > // define the map-reduce function by specifying map and reduce functions
> > logCount = Hadoop.mr(
> >   // map: emit (keyword, 1), where the keyword is the first field of the line
> >   {key, value, out, report -> out.collect(value.split()[0], 1)},
> >   // reduce: sum the counts for each keyword
> >   {keyword, counts, out, report ->
> >     sum = 0
> >     counts.each { sum += it }
> >     out.collect(keyword, sum)
> >   })
> >
> > // apply the function to an input file and collect the results in a map
> > results = [:]
> > logCount(inputFileEitherLocallyOnHDFS).eachLine {
> >   line ->
> >     parts = line.split("\t")
> >     // store the count as a number so that it can be sorted numerically
> >     results[parts[0]] = parts[1].toInteger()
> > }
> >
> > // sort the entries in the map by descending count and print the results
> > for (x in results.entrySet().sort( {-it.value} )) {
> >   println x
> > }
> >
> > // delete the temporary results
> > Hadoop.cleanup(results)
> >
> > The important points here are:
> >
> > 1) the Groovy binding lets you express the map-reduce part of your program
> > simply.
> >
> > 2) collecting the results is trivial ... You don't have to worry about
> > where or how the results are kept. You would use the same code to read a
> > local file as to read the results of the map-reduce computation.
> >
> > 3) because of (2), you can do some computation locally (the sort) and some
> > in parallel (the counting). You could easily translate the sort to a
> > Hadoop call as well.
> >
> > I know that this doesn't quite answer the question because my
> > groovy-hadoop bridge isn't available yet, but it hopefully will spark
> > some interest.
> >
> > The question I would like to pose to the community is this:
> >
> > What is the best way to proceed with code like this that is not ready for
> > prime time, but is ready for others to contribute and possibly also use?
> > Should I follow the Jaql and Cascading course and build a separate
> > repository and web site or should I try to add this as a contrib package
> > like streaming? Or should I just hand out source by hand for a little
> > while to get feedback?
> >
> >
> > On 2/4/08 2:04 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > Can someone guide me on how to write a program using the Hadoop
> > > framework that analyzes log files and finds the most frequently
> > > occurring keywords? The log file has the format:
> > >
> > > keyword source dateId
> > >
> > > Thanks,
> > > Tarandeep
> > >
> >
>
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>
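
The two steps discussed in this thread (count each keyword, then pick the top N) can be tried out without a cluster. What follows is only a minimal plain-Java sketch of that logic; it uses neither the Groovy binding nor any Hadoop API. The class name `TopKeywords` and the methods `mapKeyword`, `countKeywords` and `topN` are hypothetical names made up for illustration, and the sketch assumes whitespace-separated lines of the form `keyword source dateId` that fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopKeywords {

    // "Map" step: the keyword is assumed to be the first whitespace-separated field.
    static String mapKeyword(String logLine) {
        return logLine.trim().split("\\s+")[0];
    }

    // "Reduce" step: sum one occurrence per line for each keyword.
    // Blank lines are skipped.
    static Map<String, Integer> countKeywords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            counts.merge(mapKeyword(line), 1, Integer::sum);
        }
        return counts;
    }

    // Top-N step: keep at most n entries in a min-heap ordered by count,
    // then return them sorted by descending count.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        Comparator<Map.Entry<String, Integer>> byCount = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the entry with the smallest count
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(byCount.reversed());
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the "keyword source dateId" format.
        List<String> lines = Arrays.asList(
            "hadoop web 20080204",
            "groovy web 20080204",
            "hadoop mail 20080204");
        for (Map.Entry<String, Integer> entry : topN(countKeywords(lines), 2)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

One possible way to fold both steps into a single map-reduce job is to have the reducer keep a bounded heap like the one in `topN` and emit its contents only once it has seen all keys; with a single reducer that heap would hold the global top N. This is a suggestion under those assumptions, not something the thread confirms.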