Follow-up question: how do you add it to the cache in a Pig script, and once
it's in there, can you access it from the UDF using regular Java file I/O?
That is, is it as simple as saying:

copyFromLocal $localFilePath udfFile.txt
DEFINE someudf org.someudf CACHE('udfFile.txt#udfFile.txt');

And then the UDF can read it using regular Java file streams/etc?
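
To make the question concrete, here is a minimal sketch of the reading side in
plain Java. It assumes the file ends up in the task's working directory under
the name given after the '#' (which is how Hadoop's distributed cache symlinks
behave when they are enabled), and that it holds one regular expression per
line. The class and method names are hypothetical, not part of Pig; in a real
UDF this logic would sit inside a class extending org.apache.pig.EvalFunc and
be called from exec(). It also uses java.nio.file, so it needs Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not a Pig class): loads one regex per line from a
// local file and applies the "a matches x or b matches x" check.
class RegexListMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    // localFileName is assumed to be the symlink name in the task's working
    // directory, e.g. "udfFile.txt".
    RegexListMatcher(String localFileName) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get(localFileName), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {          // skip blank lines
                    patterns.add(Pattern.compile(line));
                }
            }
        }
    }

    // Returns "something" if either field fully matches any pattern,
    // otherwise "something else".
    String classify(String a, String b) {
        for (Pattern p : patterns) {
            if (matches(p, a) || matches(p, b)) {
                return "something";
            }
        }
        return "something else";
    }

    // Whole-string match; null fields never match.
    private static boolean matches(Pattern p, String value) {
        return value != null && p.matcher(value).matches();
    }
}
```

Loading the file lazily on the first call rather than in the constructor is
probably safer, since the constructor may also run on the client, where the
cached file does not exist. If substring matching is wanted instead of
whole-string matching, matcher(value).find() would replace matches().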

Thanks for your help so far - the mailing list has been fairly kind to me in
this regard, especially considering my lack of Pig experience.

-Mark

On Fri, Apr 22, 2011 at 7:40 AM, Mark Laczin <[email protected]> wrote:

> I think I may have to go with your second option - but thanks for the info,
> I'll keep an eye on 0.9.0.
>
>
> On Thu, Apr 21, 2011 at 4:16 PM, Alan Gates <[email protected]> wrote:
>
>> Starting with Pig 0.9 (not yet released but you can build it off the
>> branch) a UDF can specify a file to put in the distributed cache.  You could
>> thus have your UDF pick up the file locally on your box and put it in the
>> distributed cache, and then read it from the distributed cache on the back
>> end.  If running with an un-released version isn't an option for you, you
>> could manually load the file into the distributed cache and then read it
>> from your UDF.
>>
>> Alan.
>>
>>
>> On Apr 21, 2011, at 8:18 AM, Mark Laczin wrote:
>>
>>> Does anyone know how to ship the config file in this situation?
>>> I'm encountering problems with file not found exceptions when trying to
>>> run this over a cluster.
>>>
>>> On Wed, Apr 20, 2011 at 1:03 PM, Mark Laczin <[email protected]>
>>> wrote:
>>>
>>>> I kind of solved it by reading in the data from my UDF constructor (it's
>>>> just a file with a list of like 10 regular expressions, so I did manual
>>>> file I/O), by passing the path (provided as a parameter), and then just
>>>> storing it (and then, looping over it and testing a, b by hand).  It's not
>>>> the MapReduce way, but it will work for this application, considering the
>>>> small size of the file.
>>>>
>>>> If anyone knows how my "patch" might fail, or if there is a better way -
>>>> feel free to speak up.
>>>>
>>>> -Mark
>>>>
>>>>
>>>> On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <[email protected]
>>>> >wrote:
>>>>
>>>>> You could try doing GROUP ALL on the contents of M, which would
>>>>> produce a single bag containing each record and then joining M with
>>>>> data using a surrogate constant key. Or CROSS would also work instead
>>>>> of the join I suspect. Then you'd have a tuple like this to work with:
>>>>>
>>>>> (a, b, M:bag)
>>>>>
>>>>> I'm not sure if things would blow up if M is too large to fit into
>>>>> memory in your UDF though.
>>>>>
>>>>>
>>>>> On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I'm trying to do something like this:
>>>>>> (if 'data' is a set of tuples loaded from a file containing fields a, b
>>>>>> and c)
>>>>>> (if 'M' is another set of tuples loaded from a file)
>>>>>>
>>>>>> data = FOREACH data GENERATE *, someUDF(a, b, M);
>>>>>>
>>>>>> What I'm looking for is to generate (in this case, a string) based on a
>>>>>> and b, using the contents of M inside the UDF.
>>>>>>
>>>>>> The UDF looks like this, in pseudocode:
>>>>>>
>>>>>> foreach element x in M {
>>>>>> if a matches x or b matches x {
>>>>>>  return "something"
>>>>>> }
>>>>>> }
>>>>>> return "something else"
>>>>>>
>>>>>> Is this possible?  I keep getting errors related to "Scalars can only be
>>>>>> used with projections" and the like.
>>>>>> The thing holding me back from using filters is that I won't know what's
>>>>>> in M until it's read, and since (in this case) they'll be regular
>>>>>> expressions, I'd need to be able to join/group with regex matching which
>>>>>> I don't think Pig can do.
>>>>>>
>>>>>> -Mark
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>
>
