[ https://issues.apache.org/jira/browse/PIG-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Coveney updated PIG-2317:
----------------------------------
    Attachment: jruby_scripting_6.patch
                pigudf.rb
                pigjruby.rb

So! Made some new changes. There is now an accumulator interface.

{code}
class SUM2 < AccumulatorPigUDF
  output_schema "val:long"

  # called once per batch of input; state lives in the instance
  def exec items
    @sum ||= 0
    @sum += items.flatten.inject(:+)
  end

  # called once at the end to fetch the result
  def get
    @sum
  end
end
{code}

One interesting thing about the accumulator interface is that all of the state is handled inside of the ruby class, so if you want intermediate objects, it's all there. The cleanup step is just throwing away the instance, and a new one is instantiated if the interface is invoked again.

Algebraic UDFs are easier than ever.

{code}
class SUM < AlgebraicPigUDF
  output_schema "val:long"

  def initial item
    item
  end

  def intermed items
    items.flatten.inject(:+)
  end

  def final items
    intermed items
  end
end

class WORDCOUNT < AlgebraicPigUDF
  output_schema "val:long"

  def initial item
    item ? item.split.length : 0
  end

  def intermed items
    items.flatten.inject(:+)
  end

  def final items
    intermed items
  end
end
{code}

One of the more exciting changes (to me...) is that I have added DataBags as a native ruby object, so it's super easy to use them. If you require the pigudf package, you can call DataBag.new. Examples of how to use it follow:

{code}
jruby -J-Xmx1024m -S irb
{code}

This ensures that you have enough heap space.

{code}
require 'pigudf'
db=DataBag.new
{code}

db is now a databag! To test that it spills properly, we do...

{code}
(0..10000000).each {|x| db.add(x)}
{code}

On my computer, with the heap size we specified, it spilled once. But it spills!

Also, a note: arrays still convert to tuples, and db.add accepts either ONE argument or an array of arguments. The one-argument form is a convenience. I will probably make it varargs for conciseness. But that means you can do

{code}
db.add(1)
{code}

or

{code}
db.add([1])
{code}

After running the each above, you get:

{code}
ree-1.8.7-2010.02 :009 > db.size()
 => 10000001
{code}

Nice!

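To make the algebraic stages above a bit more concrete, here is a rough sketch in plain Ruby of how initial, intermed and final might compose. It does not use the patch at all: WordCountSketch is a stand-in with the same three methods as WORDCOUNT but no AlgebraicPigUDF base class, run_algebraic is a hypothetical driver, and nested arrays stand in for bags of tuples. How Pig actually batches the calls is an assumption here.

```ruby
# Stand-in for the WORDCOUNT example: same three methods, no Pig base class.
class WordCountSketch
  def initial(item)
    item ? item.split.length : 0
  end

  def intermed(items)
    items.flatten.inject(:+)
  end

  def final(items)
    intermed(items)
  end
end

# Hypothetical driver (not part of the patch): mimics a map -> combine -> reduce
# flow over plain arrays. Each value is wrapped in a one-element array to mimic
# a tuple, which is why intermed needs to flatten.
def run_algebraic(udf, rows, batch_size)
  # "map" side: initial is applied to each row
  mapped = rows.map { |row| [udf.initial(row)] }
  # "combiner": intermed is applied to each partial batch
  partials = mapped.each_slice(batch_size).map { |batch| [udf.intermed(batch)] }
  # "reducer": final is applied once to all of the partial results
  udf.final(partials)
end

rows = ["a b c", nil, "d e", "f"]
total = run_algebraic(WordCountSketch.new, rows, 2)
puts total  # prints 6
```

Since final just delegates to intermed, the result should not depend on how the rows are split into batches, which is the property that lets the work be pushed into a combiner.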
I need to look into how to get JRuby to generate better docs, but if you look at RubyDataBag.java in the patch you can see the API (anything marked with @JRubyMethod). I'll summarize here.

{code}
DataBag.new, DataBag.new db
{code}

DataBag has two initializers: the default initializer just creates an empty databag, and the second takes a databag and copies it over. There is also

{code}
db.add_all db2, db.copy db2
{code}

which pulls all of the data out of the given DataBag or RubyDataBag.

{code}
db.to_s, db.to_string, db.inspect
{code}

Return a string view. If you do db.to_s(true), you'll also see the contents (useful for debugging).

{code}
db.size, db.length
{code}

Number of elements in the bag.

{code}
db.add(elem) or db.add([e1,e2,e3])
{code}

Adds the elements to the bag.

{code}
db.distinct?, db.is_distinct?
{code}

Returns whether the bag is distinct.

{code}
db.sorted?, db.is_sorted?
{code}

Returns whether the bag is sorted.

{code}
db.clear
{code}

Clears the databag.

{code}
db.empty?
{code}

Returns whether the bag is empty.

{code}
db.each
{code}

One thing that I did with the DataBag implementation is that I had it include Enumerable and implement each. This means that all of the fun commands you like to use in ruby, like map and so on, should work. Also, for convenience, I implemented a flatten command:

{code}
db.flatten or db.flat_each
 => #<Enumerable::Enumerator:0x8939ec3 @__args__=[], @__object__=[DataBag: size: 10000001], @__method__=:flat_each>
{code}

What this does is create an object that accepts .each {block}, but flattens the value out of the Tuple before passing it to the block. This allows you to efficiently do things like db.flatten.inject(:+), because it pulls the element out of the tuple on each block invocation instead of doing the naive thing, which would be to create an array of the output. One thing to keep in mind, though, is that this only pulls out the first element of each tuple. I guess I could change that. Am undecided.

And lastly, there is...

{code}
db.iterator
{code}

Returns a BagIterator. This is basically a simplified access point that is very similar to bag, except with less power.

{code}
db.get, db.getNext, db.get_next
{code}

{code}
db.has_next?, db.hasNext, db.has_next, db.next?
{code}

It supports the exact same map semantics as bag does.

Phew! Ok. Definitely would love feedback. I'm going to work on making UDFs in-line, and need to write tests....

> Ruby/Jruby UDFs
> ---------------
>
>                 Key: PIG-2317
>                 URL: https://issues.apache.org/jira/browse/PIG-2317
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jacob Perkins
>            Assignee: Jacob Perkins
>            Priority: Minor
>             Fix For: 0.9.2
>
>         Attachments: PigUdf.rb, PigUdf.rb, jruby_scripting.patch,
> jruby_scripting_2_real.patch, jruby_scripting_3.patch,
> jruby_scripting_4.patch, jruby_scripting_5.patch, jruby_scripting_6.patch,
> pigjruby.rb, pigjruby.rb, pigjruby.rb, pigudf.rb
>
>
> It should be possible to write UDFs in Ruby. These UDFs will be registered in
> the same way as python and javascript UDFs.