On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:

> On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna
> <jeremy.hanna1...@gmail.com> wrote:
>
>>
>> On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:
>>
>>> I think this is the same problem we were having earlier:
>>> http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
>>>
>>> One workaround is to use defines to explicitly create different
>>> instances of your UDF, and use them separately.. it's ugly but it
>>> works.
>>
>> Thanks Dmitriy.
>>
>> I tried doing something like:
>> define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
>> define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();
>>
>
> This still does not work since you can't distinguish the two. The way I was
> thinking of doing this is to let the user pass in some unique string as a
> substitute for context:
>
> define ToCassandraBag1 ToCassandraBag('1');
> define ToCassandraBag2 ToCassandraBag('2');

Ah yes. I had misunderstood. Thanks for the clarification. Now the Pig docs also make more sense in the Passing Configurations to UDFs section:
http://pig.apache.org/docs/r0.8.1/udf.html#Passing+Configurations+to+UDFs

It says: "The UDF can pass its constructor arguments, or some other identifying strings. This allows each instantiation of the UDF to have a different properties object thus avoiding name space collisions between instantiations of the UDF." The HBaseStorage example was helpful for seeing that in action.

Thanks both to Raghu and Dmitriy.

>
> inside the UDF, you would use this arg to make a 'contextString' (see
> HBaseStorage.java for example use) to store any state.
>
> ideally UDFs shouldn't have to do this.. They should have the same context
> info that is available for loaders and storage.
>
> Raghu.
>
>
>>
>> at the top and then using each one only once. That still produces the same
>> error. I guess in this case we'll just have to require the field names be
>> entered into the UDF and it won't introspect them. Ah well. Would be nice
>> to be able to use it but I don't really see another way around this bug with
>> the shared UDF context.
>>
>>>
>>> D
>>>
>>> On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
>>>> We have a UDF that introspects the output schema, gets the field
>>>> names there, and uses them in the exec method.
>>>>
>>>> The UDF is found here:
>>>> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
>>>>
>>>> A simple example is found here:
>>>> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
>>>>
>>>> It takes the relation's aliases and uses them in the output so that the
>>>> user doesn't have to specify them.
>>>> However we've been noticing that if you have more than one
>>>> ToCassandraBag call in a pig script, sometimes they are run at the
>>>> same time and the key is the same in the UDF context:
>>>> cassandra.input_field_schema. So we think there is an issue there
>>>> (array out of bounds exceptions when running the script, but when
>>>> running in grunt one at a time, it doesn't do that).
>>>>
>>>> Is there a right way to do this parameter passing so that we don't get
>>>> these errors when multiple calls exist?
>>>>
>>>> We thought of using the schema hash code as a suffix (e.g.
>>>> cassandra.input_field_schema.12344321) but we don't have access to the
>>>> schema in the exec method.
>>>>
>>>> We thought of having the first parameter of the input tuple be a unique
>>>> name that the script specifies, like 'filename.relationalias' as a
>>>> convention to make them unique to the file. However in the outputSchema,
>>>> we don't have access to the input tuple, just the schema itself, so it
>>>> couldn't get that value in there.
>>>>
>>>> Any ideas on how to make this so it doesn't stomp on each other within
>>>> the pig script? Is there a best way to do that?
>>>>
>>>> Thanks!
>>>>
>>>> Jeremy
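
For anyone finding this thread later, here is a minimal Java sketch of the context-string idea Raghu describes. It is not the pygmalion code and it does not use Pig's classes: SharedContext is a hypothetical stand-in for the job-wide UDF context, and BagUdfSketch, rememberSchema and recallSchema are made-up names for illustration. In a real UDF the equivalent lookup would presumably be UDFContext.getUDFContext().getUDFProperties(getClass(), new String[] {contextString}), which is roughly what HBaseStorage does, but treat that as an assumption and check it against your Pig version.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Small demo: two instances with different constructor strings do not collide.
class ContextStringDemo {
    public static void main(String[] args) {
        BagUdfSketch one = new BagUdfSketch("1"); // define ToCassandraBag1 ToCassandraBag('1');
        BagUdfSketch two = new BagUdfSketch("2"); // define ToCassandraBag2 ToCassandraBag('2');
        one.rememberSchema("key,name,age");
        two.rememberSchema("key,city");
        System.out.println(one.recallSchema()); // prints key,name,age
        System.out.println(two.recallSchema()); // prints key,city
    }
}

// Hypothetical stand-in for a shared, job-wide context (NOT Pig's UDFContext).
final class SharedContext {
    private static final Map<String, Properties> STORE = new HashMap<>();

    // Returns one Properties object per (class, identifying strings) pair.
    static synchronized Properties propertiesFor(Class<?> owner, String... ids) {
        String key = owner.getName() + "#" + String.join(",", ids);
        return STORE.computeIfAbsent(key, k -> new Properties());
    }
}

// Hypothetical UDF-like class; a real Pig UDF would extend EvalFunc instead.
final class BagUdfSketch {
    static final String SCHEMA_KEY = "cassandra.input_field_schema";

    // Unique string passed in from the script's define statement.
    private final String contextString;

    BagUdfSketch(String contextString) {
        this.contextString = contextString;
    }

    // Plays the role of outputSchema: store the field names for later.
    void rememberSchema(String fieldNames) {
        SharedContext.propertiesFor(BagUdfSketch.class, contextString)
                     .setProperty(SCHEMA_KEY, fieldNames);
    }

    // Plays the role of exec: read the field names back.
    String recallSchema() {
        return SharedContext.propertiesFor(BagUdfSketch.class, contextString)
                            .getProperty(SCHEMA_KEY);
    }
}
```

The property key stays cassandra.input_field_schema in both cases; what differs is the Properties object it lives in. Two instances built with the same string would share one Properties object, so the second write would overwrite the first, which is the collision described at the start of the thread.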