On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:

> I think this is the same problem we were having earlier:
> http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
> 
> One workaround is to use defines to explicitly create different
> instances of your UDF, and use them separately.. it's ugly but it
> works.

Thanks, Dmitriy.

I tried doing something like:
define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();

at the top and then using each one only once.  That still produces the same 
error.  I guess in this case we'll just have to require that the field names be 
passed into the UDF rather than having it introspect them.  Ah well.  It would 
be nice to keep the introspection, but I don't really see another way around 
this bug with the shared UDF context.
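
That said, one thing we might still try to keep the introspection: pass a 
per-instance prefix into the UDF's constructor via the define, and use that 
prefix to namespace what gets stored in the UDF context.  If I'm reading the 
UDFContext API right, getUDFProperties(Class, String[]) folds the extra args 
into the property key, so two instances defined with different prefixes 
shouldn't stomp on each other.  Rough, untested sketch -- the class name and 
prefix values below are made up for illustration, not what's in pygmalion 
today:

package org.pygmalion.udf;  // hypothetical placement alongside ToCassandraBag

import java.io.IOException;
import java.util.Properties;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.util.UDFContext;

public class PrefixedToCassandraBag extends EvalFunc<DataBag> {
    private static final String FIELD_SCHEMA_KEY = "cassandra.input_field_schema";
    private final String prefix;

    // The prefix comes from the define, e.g. PrefixedToCassandraBag('first').
    public PrefixedToCassandraBag(String prefix) {
        this.prefix = prefix;
    }

    // One Properties bucket per (class, prefix) pair, so instances don't collide.
    private Properties props() {
        return UDFContext.getUDFContext()
                .getUDFProperties(getClass(), new String[] { prefix });
    }

    @Override
    public Schema outputSchema(Schema input) {
        // Capture the relation's aliases, keyed by this instance's prefix.
        StringBuilder names = new StringBuilder();
        for (Schema.FieldSchema field : input.getFields()) {
            if (names.length() > 0) names.append(',');
            names.append(field.alias);
        }
        props().setProperty(FIELD_SCHEMA_KEY, names.toString());
        return null;  // placeholder; the real UDF would build its bag schema here
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        // Read back the aliases stored for this instance only.
        String[] fieldNames = props().getProperty(FIELD_SCHEMA_KEY).split(",");
        DataBag out = BagFactory.getInstance().newDefaultBag();
        // ... pair fieldNames[i] with input.get(i) the way ToCassandraBag does today ...
        return out;
    }
}

and then in the script:

define ToCassandraBag1 org.pygmalion.udf.PrefixedToCassandraBag('first');
define ToCassandraBag2 org.pygmalion.udf.PrefixedToCassandraBag('second');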

> 
> D
> 
> On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <jeremy.hanna1...@gmail.com> 
> wrote:
>> We have a UDF that introspects the output schema, gets the field names 
>> there, and uses them in the exec method.
>> 
>> The UDF is found here: 
>> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
>> 
>> A simple example is found here: 
>> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
>> 
>> It takes the relation's aliases and uses them in the output so that the user 
>> doesn't have to specify them.  However, we've noticed that if there is more 
>> than one ToCassandraBag call in a Pig script, the calls sometimes run at the 
>> same time and store their field names under the same key in the UDF context: 
>> cassandra.input_field_schema.  We think that collision is the problem: we get 
>> array-index-out-of-bounds exceptions when running the script, but not when 
>> running the statements one at a time in grunt.
>> 
>> Is there a right way to do this kind of parameter passing so that we don't 
>> get these errors when there are multiple calls?
>> 
>> We thought of using the schema hash code as a suffix (e.g. 
>> cassandra.input_field_schema.12344321) but we don't have access to the 
>> schema in the exec method.
>> 
>> We thought of having the first parameter of the input tuple be a unique name 
>> that the script specifies, like 'filename.relationalias', as a convention to 
>> make the keys unique within the file.  However, outputSchema only has access 
>> to the schema, not the input tuple, so we can't get that value there.
>> 
>> Any ideas on how to keep these calls from stomping on each other within the 
>> Pig script?  Is there a recommended way to do this?
>> 
>> Thanks!
>> 
>> Jeremy
