[
https://issues.apache.org/jira/browse/PIG-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Coveney updated PIG-2317:
----------------------------------
Attachment: jruby_scripting_6.patch
pigudf.rb
pigjruby.rb
So! Made some new changes. There is now an accumulator interface.
{code}
class SUM2 < AccumulatorPigUDF
  output_schema "val:long"
  def exec items
    @sum ||= 0
    @sum += items.flatten.inject(:+)
  end
  def get
    @sum
  end
end
{code}
One interesting thing about the accumulator interface is that all of the state
is handled inside of the Ruby class, so if you want intermediate objects, they
are all there. The cleanup step just throws away the object, and a new one is
instantiated if the interface is invoked again.
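To make that lifecycle concrete, here is a minimal plain-Ruby sketch that runs without Pig. The AccumulatorPigUDF base class and the run_accumulator driver below are hypothetical stand-ins written for illustration; the real base class lives in pigudf.rb, and the real driver is on the Java side of the patch.

```ruby
# Hypothetical stand-in for the base class in pigudf.rb (assumption:
# output_schema is a class-level declaration that just records a string).
class AccumulatorPigUDF
  def self.output_schema(schema = nil)
    @schema = schema if schema
    @schema
  end
end

class SUM2 < AccumulatorPigUDF
  output_schema "val:long"

  # Called once per batch; state lives in the instance variable.
  def exec(items)
    @sum ||= 0
    @sum += items.flatten.inject(:+)
  end

  def get
    @sum
  end
end

# Hypothetical driver: one instance per key, fed batch by batch.
# "Cleanup" is simply letting the instance go out of scope.
def run_accumulator(udf_class, batches)
  udf = udf_class.new
  batches.each { |batch| udf.exec(batch) }
  udf.get
end

first  = run_accumulator(SUM2, [[[1], [2]], [[3]]])
second = run_accumulator(SUM2, [[[10]]])
```

Because each run gets a fresh instance, the second run starts from zero rather than carrying over the first sum.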
Algebraic UDFs are easier than ever.
{code}
class SUM < AlgebraicPigUDF
  output_schema "val:long"
  def initial item
    item
  end
  def intermed items
    items.flatten.inject(:+)
  end
  def final items
    intermed items
  end
end
class WORDCOUNT < AlgebraicPigUDF
  output_schema "val:long"
  def initial item
    item ? item.split.length : 0
  end
  def intermed items
    items.flatten.inject(:+)
  end
  def final items
    intermed items
  end
end
{code}
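For anyone who wants to see how the three stages compose, here is a rough plain-Ruby simulation. The AlgebraicPigUDF base class and the run_algebraic driver are hypothetical stand-ins, and the chunking is only my assumption about how partial results get grouped; Pig decides the real grouping at runtime.

```ruby
# Hypothetical stand-in for the base class in pigudf.rb.
class AlgebraicPigUDF
  def self.output_schema(schema = nil)
    @schema = schema if schema
    @schema
  end
end

class WORDCOUNT < AlgebraicPigUDF
  output_schema "val:long"

  def initial(item)
    item ? item.split.length : 0
  end

  def intermed(items)
    items.flatten.inject(:+)
  end

  def final(items)
    intermed items
  end
end

# Hypothetical driver: initial runs per record, intermed per chunk of
# partial results, final over the chunk totals. Each value is wrapped in
# a one-element array to mimic a tuple.
def run_algebraic(udf, records, chunk_size)
  partials = records.each_slice(chunk_size).map do |chunk|
    udf.intermed(chunk.map { |rec| [udf.initial(rec)] })
  end
  udf.final(partials.map { |partial| [partial] })
end

lines = ["a b c", nil, "d e", "f"]
total = run_algebraic(WORDCOUNT.new, lines, 2)
```

The point of the algebraic contract is that the total should not depend on how the records are chunked, which is easy to check by varying chunk_size.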
One of the more exciting changes (to me...) is that I have added DataBags as a
native Ruby object, so it's super easy to use them. If you require the
pigudf package, you can call DataBag.new. Examples of how to use it follow:
{code}
jruby -J-Xmx1024m -S irb
{code}
This ensures that you have enough heap space.
{code}
require 'pigudf'
db=DataBag.new
{code}
db is now a DataBag! To test that it spills properly, we do...
{code}
(0..10000000).each {|x| db.add(x)}
{code}
On my computer, with the heap size we specified, it spilled once. But it
spills! Also, a note: arrays still convert to tuples, and a bag can either
accept ONE argument, or an array of arguments. The one argument thing is a
convenience function. I will probably make it a varargs for conciseness. But
that means you can do
{code}
db.add(1)
{code}
or
{code}
db.add([1])
{code}
After running the each above, you get:
{code}
ree-1.8.7-2010.02 :009 > db.size()
=> 10000001
{code}
Nice! I need to look into how to get JRuby to generate better docs, but if you
look at RubyDataBag.java in the patch you can see the API (anything marked with
@JRubyMethod). I'll summarize here.
{code}
DataBag.new, DataBag.new db
{code}
DataBag has two initializers: the default initializer just creates an empty
databag, and the second takes a databag and copies it over. There is also
{code}
db.add_all db2, db.copy db2
{code}
which pulls all of the data out of the given DataBag or RubyDataBag.
{code}
db.to_s,db.to_string,db.inspect
{code}
These return a string view. If you do db.to_s(true), you'll also see the
contents (useful for debugging).
{code}
db.size,db.length
{code}
number of elements in the bag
{code}
db.add(elem) or db.add([e1,e2,e3])
{code}
Add the elements to the bag
{code}
db.distinct?, db.is_distinct?
{code}
returns if the bag is distinct
{code}
db.sorted?, db.is_sorted?
{code}
returns if the bag is sorted
{code}
db.clear
{code}
clears the databag
{code}
db.empty?
{code}
returns if the bag is empty
{code}
db.each
{code}
One thing that I did with the DataBag implementation is that I had it include
Enumerable, and implement each. This means that all of the fun commands you
like to use in Ruby, like map and so on, should work... Also, for convenience,
I implemented a flatten command
{code}
db.flatten or db.flat_each
=> #<Enumerable::Enumerator:0x8939ec3 @__args__=[], @__object__=[DataBag:
size: 10000001], @__method__=:flat_each>
{code}
what this does is create an object that accepts .each {block}, but will flatten
the value out of the Tuple before passing it to the block. This allows you to
efficiently do things like db.flatten.inject(:+), because it is pulling the
element out of the tuple on each block invocation instead of doing the naive
thing which would be to create an array of the output. One thing to keep in
mind though is that this only pulls out the first argument. I guess I could
change that. Am undecided.
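Here is a small plain-Ruby sketch of the same idea, for anyone who wants to play with it outside of JRuby. ToyBag is a hypothetical stand-in backed by an Array, not the RubyDataBag from the patch, so it says nothing about spilling; it only illustrates Enumerable plus a lazy flat_each that yields the first field of each tuple.

```ruby
# Hypothetical stand-in: tuples are plain arrays held in memory.
class ToyBag
  include Enumerable

  def initialize
    @tuples = []
  end

  # Accepts one value or an array of values; either way one tuple is stored.
  def add(elem)
    @tuples << (elem.is_a?(Array) ? elem : [elem])
    self
  end

  # Yields whole tuples, which is what Enumerable builds map, select, etc. on.
  def each(&block)
    return to_enum(:each) unless block
    @tuples.each(&block)
    self
  end

  # Yields only the first field of each tuple, one at a time, without
  # building an intermediate array of unwrapped values.
  def flat_each
    return to_enum(:flat_each) unless block_given?
    @tuples.each { |tuple| yield tuple.first }
    self
  end
  alias flatten flat_each
end

bag = ToyBag.new
(1..5).each { |x| bag.add(x) }
bag.add([6, 99])                      # second field is ignored by flatten

sum    = bag.flatten.inject(:+)       # sums first fields only
firsts = bag.flat_each.to_a
widths = bag.map { |tuple| tuple.length }
```

Note how the 99 in the last tuple never shows up in the flattened view, which is the "only pulls out the first argument" caveat in action.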
And lastly, there is...
{code}
db.iterator
{code}
returns a BagIterator. This is basically a simplified access point that is very
similar to bag, except with less power.
{code}
db.get, db.getNext, db.get_next
{code}
{code}
db.has_next?, db.hasNext, db.has_next, db.next?
{code}
and it supports the exact same map semantics as bag does.
Phew! Ok. Definitely would love feedback. I'm going to work on making UDFs
in-line, and need to write tests....
> Ruby/Jruby UDFs
> ---------------
>
> Key: PIG-2317
> URL: https://issues.apache.org/jira/browse/PIG-2317
> Project: Pig
> Issue Type: New Feature
> Reporter: Jacob Perkins
> Assignee: Jacob Perkins
> Priority: Minor
> Fix For: 0.9.2
>
> Attachments: PigUdf.rb, PigUdf.rb, jruby_scripting.patch,
> jruby_scripting_2_real.patch, jruby_scripting_3.patch,
> jruby_scripting_4.patch, jruby_scripting_5.patch, jruby_scripting_6.patch,
> pigjruby.rb, pigjruby.rb, pigjruby.rb, pigudf.rb
>
>
> It should be possible to write UDFs in Ruby. These UDFs will be registered in
> the same way as python and javascript UDFs.