The system as it stands supports the following major features:

- map-reduce programs can be constructed for interactive, local use or Hadoop-based execution
- map-reduce programs are functions that can be nested and composed
- inputs to map-reduce programs can be strings, lists of strings, local files or HDFS files
- outputs are stored in HDFS
- outputs can be consumed by multiple other functions

The current minor(ish) limitations include:

- combiners, partition functions and sorting aren't supported yet
- you can't pass conventional Java Mappers or Reducers to the framework
- only one input file can be given
- the system doesn't clean up after itself

These are all easily addressed and should be fixed over the next week or two.

The major limitations include:

- only one script can be specified
- additional jars cannot be submitted
- no explicit group/co-group syntactic sugar is provided

These will take a bit longer to resolve. I hope to incorporate jar-building code similar to that used by the streaming system to address most of this. The group/co-group support is just a matter of a bit of work.

Pig is very different from this Groovy integration. They are trying to build a new relational-algebra language; I am just trying to write map-reduce programs. They explicitly do not want to support general coding of functions, except in a very limited way or via integration of Java code, while that is my primary goal. The other big difference is that my system is simple enough that I was able to implement it with a week of coding (after a few weeks of noodling about how to make it possible at all).

On 2/4/08 3:28 PM, "Khalil Honsali" <[EMAIL PROTECTED]> wrote:

> Sorry for the unclarity.
>
> - I think I understand that Groovy is already usable and stable, but
> requires some testing? What other things are required?
> - What is the next step, i.e., is there a roadmap? What evolution / growth
> direction?
> - I haven't tried Pig, but it also seems to support submitting a function to
> be transformed to map/reduce, though Pig is higher level?
>
> PS:
> - maybe Groovy requires another mailing-list thread ...
>
> K.
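[Editorial aside: since the Groovy bridge itself isn't released, the "functions that can be nested and composed" feature can only be illustrated by analogy. The sketch below is a hypothetical local simulation in plain Java, not Ted's actual API: each "job" is modeled as an ordinary function, so one job's output can feed another.]

```java
import java.util.*;
import java.util.function.Function;

public class ComposeSketch {
    // A "map-reduce program" modeled as a plain function from input lines to a
    // count map, so its output can be consumed by other functions
    // (a stand-in for the real Hadoop.mr, which is not public yet).
    static Function<List<String>, Map<String, Integer>> wordCount = lines -> {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                counts.merge(w, 1, Integer::sum);
        return counts;
    };

    // A second stage consuming the first stage's output: keep counts >= 2.
    static Function<Map<String, Integer>, Map<String, Integer>> frequentOnly = counts -> {
        Map<String, Integer> out = new HashMap<>();
        counts.forEach((k, v) -> { if (v >= 2) out.put(k, v); });
        return out;
    };

    public static void main(String[] args) {
        // Composition: the output of one stage is the input of the next.
        Map<String, Integer> result =
            wordCount.andThen(frequentOnly).apply(Arrays.asList("a b a", "b c"));
        System.out.println(result);
    }
}
```

The point of the model is only that jobs compose like functions; the real system would stage intermediate results through HDFS rather than in-memory maps.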
> Honsali
>
>
> On 05/02/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>> Did you mean who, what, when, where and how?
>>
>> Who is me. I am the only author so far.
>>
>> What is a Groovy/Java program that supports running Groovy/Hadoop scripts.
>>
>> When is nearly now.
>>
>> Where is everywhere (this is the internet).
>>
>> How is an open question. I think that Doug's suggested evolution of Jira
>> with patches -> contrib -> sub-project is appropriate.
>>
>> On 2/4/08 2:59 PM, "Khalil Honsali" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi all, Mr. Dunning;
>>>
>>> I am interested in the Groovy idea, especially for processing text. I think
>>> it can be a good open-source alternative to Google's Sawzall.
>>>
>>> Please let me know the 5 Ws of the matter if possible.
>>>
>>> K. Honsali
>>>
>>> On 05/02/2008, Miles Osborne <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Sorry, I meant Groovy.
>>>>
>>>> Miles
>>>>
>>>> On 04/02/2008, Tarandeep Singh <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> On Feb 4, 2008 2:40 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
>>>>>> How stable is the code? I could quite easily set some undergraduate
>>>>>> project to do something with it, for example process query logs.
>>>>>>
>>>>> I started learning and using Hadoop a few days back. The program that I
>>>>> have is similar to word count, except that it processes a query log in a
>>>>> special format. I have another program that reads the output of this
>>>>> program and computes the top N keywords. I want to make it one program
>>>>> (a single map-reduce).
>>>>>
>>>>> -Taran
>>>>>
>>>>>> Miles
>>>>>>
>>>>>> On 04/02/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>> This is a great opportunity for me to talk about the Groovy support that I
>>>>>>> have just gotten running. I am looking for friendly testers, as this code
>>>>>>> is definitely not ready for full release.
>>>>>>>
>>>>>>> The program you need in Groovy is this:
>>>>>>>
>>>>>>> // define the map-reduce function by specifying map and reduce functions
>>>>>>> logCount = Hadoop.mr(
>>>>>>>     {key, value, out, report -> out.collect(value.split()[0], 1)},
>>>>>>>     {keyword, counts, out, report ->
>>>>>>>         sum = 0
>>>>>>>         counts.each { sum += it }
>>>>>>>         out.collect(keyword, sum)
>>>>>>>     })
>>>>>>>
>>>>>>> // apply the function to an input file and collect the results in a map
>>>>>>> results = [:]
>>>>>>> logCount(inputFileEitherLocallyOnHDFS).eachLine { line ->
>>>>>>>     parts = line.split('\t')
>>>>>>>     results[parts[0]] = parts[1]
>>>>>>> }
>>>>>>>
>>>>>>> // sort the entries in the map by descending count and print the results
>>>>>>> for (x in results.entrySet().sort { -it.value }) {
>>>>>>>     println x
>>>>>>> }
>>>>>>>
>>>>>>> // delete the temporary results
>>>>>>> Hadoop.cleanup(results)
>>>>>>>
>>>>>>> The important points here are:
>>>>>>>
>>>>>>> 1) the Groovy binding lets you express the map-reduce part of your program
>>>>>>> simply.
>>>>>>>
>>>>>>> 2) collecting the results is trivial ... you don't have to worry about where
>>>>>>> or how the results are kept. You would use the same code to read a local
>>>>>>> file as to read the results of the map-reduce computation.
>>>>>>>
>>>>>>> 3) because of (2), you can do some computation locally (the sort) and some
>>>>>>> in parallel (the counting). You could easily translate the sort to a Hadoop
>>>>>>> call as well.
>>>>>>>
>>>>>>> I know that this doesn't quite answer the question because my groovy-hadoop
>>>>>>> bridge isn't available yet, but it hopefully will spark some interest.
>>>>>>>
>>>>>>> The question I would like to pose to the community is this:
>>>>>>>
>>>>>>> What is the best way to proceed with code like this that is not ready for
>>>>>>> prime time, but is ready for others to contribute to and possibly also use?
>>>>>>> Should I follow the Jaql and Cascading course and build a separate
>>>>>>> repository and web site, or should I try to add this as a contrib package
>>>>>>> like streaming? Or should I just hand out source by hand for a little while
>>>>>>> to get feedback?
>>>>>>>
>>>>>>> On 2/4/08 2:04 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Can someone guide me on how to write a program using the Hadoop framework
>>>>>>>> that analyzes log files and finds the most frequently occurring keywords?
>>>>>>>> The log file has the format:
>>>>>>>>
>>>>>>>> keyword source dateId
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Tarandeep
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> The University of Edinburgh is a charitable body, registered in Scotland,
>>>>>> with registration number SC005336.
>>>>>
>>>>
>>>> --
>>>> The University of Edinburgh is a charitable body, registered in Scotland,
>>>> with registration number SC005336.
>>>>
>>
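[Editorial aside: Tarandeep's question at the bottom of the thread (count keywords in `keyword source dateId` log lines, then take the top N) can be sketched as a single pass in plain Java. This is a local simulation of what the map step (emit keyword, 1) and reduce step (sum per keyword) would do, not actual Hadoop code; names like `topN` are made up for illustration.]

```java
import java.util.*;

public class TopKeywords {
    // Simulates the map step (emit the keyword field of each log line with count 1)
    // and the reduce step (sum the counts per keyword), then sorts for the top N.
    static List<Map.Entry<String, Integer>> topN(List<String> logLines, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : logLines) {
            String keyword = line.split("\\s+")[0];  // log format: keyword source dateId
            counts.merge(keyword, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue());  // descending by count
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "hadoop site1 20080204",
            "groovy site2 20080204",
            "hadoop site3 20080205");
        topN(log, 2).forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
        // prints: hadoop 2, then groovy 1
    }
}
```

In a real Hadoop job the counting pass would run in parallel as map and reduce tasks, and only the final sort of the (usually small) keyword-count output would be done locally, as in Ted's Groovy example above.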