[jira] [Updated] (PIG-2317) Ruby/Jruby UDFs

Jonathan Coveney (Updated) (JIRA) Thu, 20 Oct 2011 14:07:36 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jonathan Coveney updated PIG-2317:
----------------------------------

    Attachment: jruby_scripting_6.patch
                pigudf.rb
                pigjruby.rb

So! Made some new changes. There is now an accumulator interface.

{code}
class SUM2 < AccumulatorPigUDF
  output_schema "val:long"

  def exec items
    @sum||=0
    @sum+=items.flatten.inject(:+)
  end

  def get
    @sum
  end
end
{code}

One interesting thing about the accumulator interface is that all of the state 
is handled inside of the ruby class...so if you want intermediate objects, it's 
all there. The cleanup step is just throwing away the class, and then it will 
be reinstantiated if the interface is invoked again.
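
To make that lifecycle concrete, here is a minimal plain-Ruby sketch that runs without Pig. Sum2Sketch and run_accumulator are hypothetical stand-ins (they are not in the patch), and the batch-feeding driver is only my assumption about how the framework calls exec and get.

```ruby
# Hypothetical stand-in: mirrors the shape of SUM2 above, but does not
# inherit from AccumulatorPigUDF so that it runs without Pig on the classpath.
class Sum2Sketch
  def exec(items)
    @sum ||= 0
    # inject(0, :+) is used here so an empty batch adds 0 instead of nil
    @sum += items.flatten.inject(0, :+)
  end

  def get
    @sum
  end
end

# Hypothetical driver: assumes the framework feeds one key's values in
# several batches, asks for the result, then discards the instance.
def run_accumulator(klass, batches)
  udf = klass.new                      # fresh instance, so no leftover state
  batches.each { |batch| udf.exec(batch) }
  udf.get                              # udf goes out of scope: the "cleanup"
end

first  = run_accumulator(Sum2Sketch, [[[1], [2]], [[3]]])
second = run_accumulator(Sum2Sketch, [[[10]]])
puts first    # 6
puts second   # 10
```

Because every run gets a new instance, the second result is not polluted by the first, which is the whole point of keeping the state in the class.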

Algebraic UDFs are easier than ever.

{code}
class SUM < AlgebraicPigUDF
  output_schema "val:long"
  
  def initial item
    item
  end
  
  def intermed items
    items.flatten.inject(:+)
  end
  
  def final items
    intermed items
  end
end

class WORDCOUNT < AlgebraicPigUDF
  output_schema "val:long"

  def initial item
    item ? item.split.length : 0
  end

  def intermed items
    items.flatten.inject(:+)
  end
  
  def final items
    intermed items
  end
end
{code}
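
For anyone unfamiliar with how the three stages fit together, here is a rough plain-Ruby simulation. WordCountSketch and run_algebraic are hypothetical names (not in the patch), and the partition-then-combine driver is only an assumption about how Pig invokes initial, intermed and final.

```ruby
# Hypothetical stand-in with the same three methods as WORDCOUNT above,
# minus the AlgebraicPigUDF parent so it runs without Pig.
class WordCountSketch
  def initial(item)
    item ? item.split.length : 0
  end

  def intermed(items)
    # inject(0, :+) so that an empty partition yields 0 rather than nil
    items.flatten.inject(0, :+)
  end

  def final(items)
    intermed(items)
  end
end

# Hypothetical driver: initial runs once per record, intermed combines the
# partial results of one partition, final combines the partition totals.
def run_algebraic(udf, partitions)
  partials = partitions.map do |lines|
    udf.intermed(lines.map { |line| udf.initial(line) })
  end
  udf.final(partials)
end

partitions = [["a b c", nil], ["d e"], []]
total = run_algebraic(WordCountSketch.new, partitions)
puts total   # 5
```

The result does not depend on how the records are split across partitions, which is what makes the UDF safe to run algebraically.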

One of the more exciting changes (to me...) is that I have added DataBags as a 
native ruby object, so it's super easy to use them. If you require the pigudf 
package, you can call DataBag.new. Examples of how to use it follow:
{code}
jruby -J-Xmx1024m -S irb
{code}
This ensures that you have enough heap space.

{code}
require 'pigudf'
db=DataBag.new
{code}
db is now a databag! To test that it spills properly, we do...
{code}
(0..10000000).each {|x| db.add(x)}
{code}

On my computer, with the heap size we specified, it spilled once. But it 
spills! Also, a note: arrays still convert to tuples, and a bag can either 
accept ONE argument, or an array of arguments. The one argument thing is a 
convenience function. I will probably make it a varargs for conciseness. But 
that means you can do

{code}
db.add(1)
{code}

or

{code}
db.add([1])
{code}

After running the each above, you get:

{code}
ree-1.8.7-2010.02 :009 > db.size()
 => 10000001
{code}

Nice! I need to look into how to get JRuby to generate better docs, but if you 
look at RubyDataBag.java in the patch you can see the api (anything marked with 
@JRubyMethod). I'll summarize here.

{code}
DataBag.new, DataBag.new db
{code}
DataBag has two initializers: the default initializer just creates an empty 
databag, and the second takes a databag and copies it over. There is also

{code}
db.add_all db2, db.copy db2
{code}
which pulls all of the data out of the given DataBag or RubyDataBag.

{code}
db.to_s,db.to_string,db.inspect
{code}
Return a string view. If you do db.to_s(true), you'll also see the contents 
(useful for debugging).

{code}
db.size,db.length
{code}
number of elements in the bag

{code}
db.add(elem) or db.add([e1,e2,e3])
{code}
Add the elements to the bag

{code}
db.distinct?, db.is_distinct?
{code}
returns if the bag is distinct

{code}
db.sorted?, db.is_sorted?
{code}
returns if the bag is sorted

{code}
db.clear
{code}
clears the databag

{code}
db.empty?
{code}
returns if the bag is empty

{code}
db.each
{code}
One thing that I did with the DataBag implementation is that I had it include 
Enumerable, and implement each. This means that all of the fun commands you 
like to use in ruby like map and so on should work... Also, for convenience, I 
implement a flatten command:

{code}
db.flatten or db.flat_each
 => #<Enumerable::Enumerator:0x8939ec3 @__args__=[], @__object__=[DataBag: 
size: 10000001], @__method__=:flat_each> 
{code}
What this does is create an object that accepts .each {block}, but will flatten 
the value out of the Tuple before passing it to the block. This allows you to 
efficiently do things like db.flatten.inject(:+), because it is pulling the 
element out of the tuple on each block invocation instead of doing the naive 
thing which would be to create an array of the output. One thing to keep in 
mind though is that this only pulls out the first argument. I guess I could 
change that. Am undecided.
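
The idea can be sketched in plain Ruby without the patch. BagSketch below is a hypothetical array-backed stand-in, not RubyDataBag; it only assumes that every element is stored as a tuple (array) and that flat_each hands the first field of each tuple to the block.

```ruby
# Hypothetical stand-in for the bag: tuples are plain arrays kept in memory,
# so there is no spilling and no Pig dependency.
class BagSketch
  include Enumerable

  def initialize
    @tuples = []
  end

  # A lone value is wrapped into a one-field tuple; an array is taken as-is.
  def add(elem)
    @tuples << (elem.is_a?(Array) ? elem : [elem])
    self
  end

  # Enumerable is built on each, so map, select and friends come for free.
  def each(&block)
    return enum_for(:each) unless block
    @tuples.each(&block)
  end

  # Yields only the first field of every tuple, one at a time, without
  # building an intermediate array of unwrapped values.
  def flat_each
    return enum_for(:flat_each) unless block_given?
    @tuples.each { |tuple| yield tuple.first }
  end
end

bag = BagSketch.new
(1..4).each { |x| bag.add(x) }
bag.add([5, 99])                  # only the 5 is seen by flat_each

sum = bag.flat_each.inject(:+)
puts sum                          # 15
```

Note how the 99 is silently ignored, which is the "only pulls out the first argument" caveat in action.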

And lastly, there is...

{code}
db.iterator
{code}
Returns a BagIterator. This is basically a simplified access point that is very 
similar to bag, except with less power.

{code}
db.get, db.getNext, db.get_next
{code}

{code}
db.has_next?, db.hasNext, db.has_next, db.next?
{code}

and it supports the exact same map semantics as bag does.

Phew! Ok. Definitely would love feedback. I'm going to work on making UDFs 
in-line, and need to write tests....
                
> Ruby/Jruby UDFs
> ---------------
>
>                 Key: PIG-2317
>                 URL: https://issues.apache.org/jira/browse/PIG-2317
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jacob Perkins
>            Assignee: Jacob Perkins
>            Priority: Minor
>             Fix For: 0.9.2
>
>         Attachments: PigUdf.rb, PigUdf.rb, jruby_scripting.patch, 
> jruby_scripting_2_real.patch, jruby_scripting_3.patch, 
> jruby_scripting_4.patch, jruby_scripting_5.patch, jruby_scripting_6.patch, 
> pigjruby.rb, pigjruby.rb, pigjruby.rb, pigudf.rb
>
>
> It should be possible to write UDFs in Ruby. These UDFs will be registered in 
> the same way as python and javascript UDFs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
