I have created a Jira issue for the Groovy integration:

https://issues.apache.org/jira/browse/HADOOP-2781

As soon as I can clear the licenses, I will post the code.

On 2/4/08 4:06 PM, "Colin Evans" <[EMAIL PROTECTED]> wrote:

> Hi Ted,
> I've been building out a similar framework in JavaScript (Rhino) for
> work that I've been doing at MetaWeb, and we've been thinking about open
> sourcing it too. It's pretty clear that there are major benefits to
> using a dynamic scripting language with Hadoop.
>
> I'd love to see how you've tackled this problem and would be interested
> in contributing work to this too.
>
> -Colin
>
> Ted Dunning wrote:
>> This is a great opportunity for me to talk about the Groovy support that I
>> have just gotten running. I am looking for friendly testers as this code is
>> definitely not ready for full release.
>>
>> The program you need in Groovy is this:
>>
>> // define the map-reduce function by specifying map and reduce functions
>> logCount = Hadoop.mr(
>>   {key, value, out, report -> out.collect(value.split()[0], 1)},
>>   {keyword, counts, out, report ->
>>     sum = 0
>>     counts.each { sum += it }
>>     out.collect(keyword, sum)
>>   })
>>
>> // apply the function to an input file and collect the results in a map
>> results = [:]
>> logCount(inputFileEitherLocallyOnHDFS).eachLine {
>>   line ->
>>     parts = line.split("\t")
>>     results[parts[0]] = parts[1].toInteger()
>> }
>>
>> // sort the entries in the map by descending count and print the results
>> for (x in results.entrySet().sort( {-it.value} )) {
>>   println x
>> }
>>
>> // delete the temporary results
>> Hadoop.cleanup(results)
>>
>> The important points here are:
>>
>> 1) the Groovy binding lets you express the map-reduce part of your program
>> simply.
>>
>> 2) collecting the results is trivial ... You don't have to worry about where
>> or how the results are kept.
>> You would use the same code to read a local
>> file as to read the results of the map-reduce computation.
>>
>> 3) because of (2), you can do some computation locally (the sort) and some
>> in parallel (the counting). You could easily translate the sort to a Hadoop
>> call as well.
>>
>> I know that this doesn't quite answer the question because my groovy-hadoop
>> bridge isn't available yet, but it hopefully will spark some interest.
>>
>> The question I would like to pose to the community is this:
>>
>> What is the best way to proceed with code like this that is not ready for
>> prime time, but is ready for others to contribute and possibly also use?
>> Should I follow the Jaql and Cascading course and build a separate
>> repository and web site or should I try to add this as a contrib package
>> like streaming? Or should I just hand out source by hand for a little while
>> to get feedback?
>>
>>
>> On 2/4/08 2:04 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> Can someone guide me on how to write a program using the Hadoop framework
>>> that analyzes log files and finds the most frequently
>>> occurring keywords? The log file has the format:
>>>
>>> keyword source dateId
>>>
>>> Thanks,
>>> Tarandeep
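
Until the bridge code is posted, the counting and sorting that the quoted script describes can be tried locally. What follows is a minimal sketch in plain Groovy with no Hadoop involved. The sample lines and the names logLines, counts, and ranked are made up for illustration, and whitespace-separated fields are an assumption about the log format. It does not use the Hadoop.mr call from the quoted message; it only does the same keyword count in memory.

```groovy
// Hypothetical sample input in the "keyword source dateId" layout.
// Whitespace-separated fields are an assumption, not something the thread states.
def logLines = [
    "hadoop web 20080204",
    "groovy web 20080204",
    "hadoop api 20080203",
    "rhino  web 20080204",
    "hadoop web 20080202",
    "groovy api 20080201"
]

// Map-like step: take the first field of each line as the keyword.
// Reduce-like step: countBy groups equal keywords and counts each group.
def counts = logLines.countBy { it.split()[0] }

// Local step: order the entries by descending count.
def ranked = counts.entrySet().sort { -it.value }

// Print one "keyword<TAB>count" line per keyword, most frequent first.
ranked.each { println "${it.key}\t${it.value}" }
```

With real data, only the counting step would need to run in parallel; the final sort works on one entry per distinct keyword and can usually stay local, which is the split described in point 3.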