On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:

> On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna
> <jeremy.hanna1...@gmail.com> wrote:
> 
>> 
>> On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:
>> 
>>> I think this is the same problem we were having earlier:
>>> http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
>>> 
>>> One workaround is to use defines to explicitly create different
>>> instances of your UDF, and use them separately.. it's ugly but it
>>> works.
>> 
>> Thanks Dmitriy.
>> 
>> I tried doing something like:
>> define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
>> define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();
>> 
> 
> This still does not work since you can't distinguish the two. The way I was
> thinking of doing this is to let the user pass in some unique string as a
> substitute for context:
> 
> define ToCassandraBag1 ToCassandraBag('1');
> define ToCassandraBag2 ToCassandraBag('2');

Ah yes.  I had misunderstood.  Thanks for the clarification.  Now the pig docs 
also make more sense in the Passing Configurations to UDFs section:
http://pig.apache.org/docs/r0.8.1/udf.html#Passing+Configurations+to+UDFs
It says:
"The UDF can pass its constructor arguments, or some other identifying strings. 
This allows each instantiation of the UDF to have a different properties object 
thus avoiding name space collisions between instantiations of the UDF."
and the HBaseStorage example was helpful to see that in action.
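
For anyone finding this thread later, here is a minimal sketch of the idea as I
understand it, not the actual pygmalion or Pig code. It is plain Java with no
Pig dependency: the CONTEXT map and getUdfProperties helper are hypothetical
stand-ins for Pig's shared UDF context, and ToCassandraBagSketch is a made-up
class name. In a real UDF the lookup would presumably go through
UDFContext.getUDFContext().getUDFProperties(getClass(), args) instead.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class UdfContextSketch {
    // Hypothetical stand-in for the shared UDF context: one Properties
    // object per (UDF class, identifying strings) pair.
    static final Map<String, Properties> CONTEXT = new HashMap<>();

    // Assumption: models a lookup keyed by class plus constructor arguments.
    public static Properties getUdfProperties(Class<?> udfClass, String[] args) {
        String key = udfClass.getName() + Arrays.toString(args);
        return CONTEXT.computeIfAbsent(key, k -> new Properties());
    }

    // Hypothetical UDF that takes a unique string in its constructor,
    // as in: define ToCassandraBag1 ToCassandraBag('1');
    public static class ToCassandraBagSketch {
        static final String SCHEMA_KEY = "cassandra.input_field_schema";
        private final String[] contextArgs;

        public ToCassandraBagSketch(String uniqueId) {
            this.contextArgs = new String[] { uniqueId };
        }

        // Would be called where the schema is visible (e.g. outputSchema).
        public void storeFieldNames(String fieldNames) {
            getUdfProperties(ToCassandraBagSketch.class, contextArgs)
                .setProperty(SCHEMA_KEY, fieldNames);
        }

        // Would be called where only the tuple is visible (e.g. exec).
        public String loadFieldNames() {
            return getUdfProperties(ToCassandraBagSketch.class, contextArgs)
                .getProperty(SCHEMA_KEY);
        }
    }

    public static void main(String[] args) {
        ToCassandraBagSketch first = new ToCassandraBagSketch("1");
        ToCassandraBagSketch second = new ToCassandraBagSketch("2");
        first.storeFieldNames("name,age");
        second.storeFieldNames("id,score,ts");
        // Each instance reads back its own field names under the same key.
        System.out.println(first.loadFieldNames());
        System.out.println(second.loadFieldNames());
    }
}
```

Both instances use the same property name (cassandra.input_field_schema), but
because each one gets its own properties object, the second call no longer
overwrites what the first one stored.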

Thanks both to Raghu and Dmitriy.

> 
> inside the UDF, you would use this arg to make a 'contextString' (see
> HBaseStorage.java for example use) to store any state.
> 
> ideally UDFs shouldn't have to do this.. They should have the same context
> info that is available for loaders and storage.
> 
> Raghu.
> 
> 
>> 
>> at the top and then using each one only once.  That still produces the same
>> error.  I guess in this case we'll just have to require the field names be
>> entered into the UDF and it won't introspect them.  Ah well.  Would be nice
>> to be able to use it but I don't really see another way around this bug with
>> the shared UDF context.
>> 
>>> 
>>> D
>>> 
>>> On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <jeremy.hanna1...@gmail.com>
>> wrote:
>>>> We have a UDF that introspects the output schema, gets the field
>> names there, and uses them in the exec method.
>>>> 
>>>> The UDF is found here:
>> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
>>>> 
>>>> A simple example is found here:
>> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
>>>> 
>>>> It takes the relation's aliases and uses them in the output so that the
>> user doesn't have to specify them.  However we've been noticing that if you
>> have more than one ToCassandraBag call in a pig script, sometimes they are
>> run at the same time and the key is the same in the UDF context:
>> cassandra.input_field_schema.  So we think there is an issue there (array
>> out of bounds exceptions when running the script, but when running in grunt
>> one at a time, it doesn't do that).
>>>> 
>>>> Is there a right way to do this parameter passing so that we don't get
>> these errors when multiple calls exist?
>>>> 
>>>> We thought of using the schema hash code as a suffix (e.g.
>> cassandra.input_field_schema.12344321) but we don't have access to the
>> schema in the exec method.
>>>> 
>>>> We thought of having the first parameter of the input tuple be a unique
>> name that the script specifies, like 'filename.relationalias' as a
>> convention to make them unique to the file.  However in the outputSchema, we
>> don't have access to the input tuple, just the schema itself, so it couldn't
>> get that value in there.
>>>> 
>>>> Any ideas on how to make this so the calls don't stomp on each other within
>> the pig script?  Is there a best way to do that?
>>>> 
>>>> Thanks!
>>>> 
>>>> Jeremy
>> 
>> 
