Re: Question about bags and UDFs

Mark Laczin Fri, 22 Apr 2011 04:40:37 -0700

I think I may have to go with your second option - but thanks for the info,
I'll keep an eye on 0.9.0.


On Thu, Apr 21, 2011 at 4:16 PM, Alan Gates <[email protected]> wrote:

> Starting with Pig 0.9 (not yet released but you can build it off the
> branch) a UDF can specify a file to put in the distributed cache.  You could
> thus have your UDF pick up the file locally on your box and put it in the
> distributed cache, and then read it from the distributed cache on the back
> end.  If running with an un-released version isn't an option for you, you
> could manually load the file into the distributed cache and then read it
> from your UDF.
>
> Alan.
>
>
> On Apr 21, 2011, at 8:18 AM, Mark Laczin wrote:
>
>> Does anyone know how to ship the config file in this situation?
>> I'm encountering problems with file not found exceptions when trying to
>> run this over a cluster.
>>
>> On Wed, Apr 20, 2011 at 1:03 PM, Mark Laczin <[email protected]>
>> wrote:
>>
>>> I kind of solved it by reading in the data from my UDF constructor (it's
>>> just a file with a list of like 10 regular expressions, so I did manual
>>> file I/O): I pass the path in as a parameter, store the expressions, and
>>> then loop over them and test a and b by hand.  It's not the MapReduce
>>> way, but it will work for this application, considering the small size
>>> of the file.
>>>
>>> If anyone knows how my "patch" might fail, or if there is a better way -
>>> feel free to speak up.
>>>
>>> -Mark
>>>
>>>
>>> On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <[email protected]>
>>> wrote:
>>>
>>>> You could try doing GROUP ALL on the contents of M, which would
>>>> produce a single bag containing each record, and then joining M with
>>>> data using a surrogate constant key. I suspect CROSS would also work
>>>> instead of the join. Then you'd have a tuple like this to work with:
>>>>
>>>> (a, b, M:bag)
>>>>
>>>> I'm not sure if things would blow up if M is too large to fit into
>>>> memory in your UDF though.
>>>>
>>>>
>>>> On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]>
>>>> wrote:
>>>>
>>>>> I'm trying to do something like this:
>>>>> (if 'data' is a set of tuples loaded from a file containing fields a, b
>>>>> and c)
>>>>> (if 'M' is another set of tuples loaded from a file)
>>>>>
>>>>> data = FOREACH data GENERATE *, someUDF(a, b, M);
>>>>>
>>>>> What I'm looking for is to generate a value (in this case, a string)
>>>>> based on a and b, using the contents of M inside the UDF.
>>>>>
>>>>> The UDF looks like this, in pseudocode:
>>>>>
>>>>> foreach element x in M {
>>>>>   if a matches x or b matches x {
>>>>>     return "something"
>>>>>   }
>>>>> }
>>>>> return "something else"
>>>>>
>>>>> Is this possible?  I keep getting errors related to "Scalars can only
>>>>> be used with projections" and the like.
>>>>> The thing holding me back from using filters is that I won't know
>>>>> what's in M until it's read, and since (in this case) they'll be regular
>>>>> expressions, I'd need to be able to join/group with regex matching,
>>>>> which I don't think Pig can do.
>>>>>
>>>>> -Mark
>>>>>
>>>>>
>>>>
>>>
>>>
>
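
For anyone finding this thread later, the matching logic from the pseudocode in the original question could be sketched roughly as below. This is only an illustration, not code posted by anyone in the thread: the class name RegexListMatcher and the method classify are hypothetical, it assumes the contents of M have already been read into a list with one regular expression per line, and it assumes "matches" means a whole-string match. In a real UDF this logic would sit inside exec(), with the list filled from the file in the constructor (or from the distributed cache, as Alan describes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: holds the compiled expressions from M.
final class RegexListMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    // Compiles one regular expression per entry; blank entries are skipped.
    RegexListMatcher(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
    }

    // Mirrors the pseudocode: "something" if a or b matches any x in M.
    String classify(String a, String b) {
        for (Pattern p : patterns) {
            if (matches(p, a) || matches(p, b)) {
                return "something";
            }
        }
        return "something else";
    }

    // Whole-string match (an assumption); null fields never match.
    private static boolean matches(Pattern p, String value) {
        return value != null && p.matcher(value).matches();
    }
}
```

Compiling the patterns once up front, rather than on every call, keeps the per-record cost down to the matching itself, which seems reasonable for a list of about 10 expressions.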
