Re: Question about bags and UDFs

Mark Laczin Thu, 21 Apr 2011 08:18:52 -0700

Does anyone know how to ship the config file in this situation?
I'm getting file-not-found exceptions when I try to run this on a
cluster.


On Wed, Apr 20, 2011 at 1:03 PM, Mark Laczin <[email protected]> wrote:

> I more or less solved it by reading the data in my UDF constructor. The
> path is passed in as a parameter, and I read the file with manual file I/O
> (it's just a list of about 10 regular expressions), store the patterns, and
> then loop over them, testing a and b by hand.  It's not the MapReduce way,
> but it will work for this application, considering the small size of the
> file.
>
> If anyone knows how my "patch" might fail, or if there is a better way -
> feel free to speak up.
>
> -Mark
>
>
> On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <[email protected]> wrote:
>
>> You could try doing GROUP ALL on the contents of M, which would
>> produce a single bag containing every record, and then joining M with
>> data using a surrogate constant key. I suspect CROSS would also work
>> instead of the join. Then you'd have a tuple like this to work with:
>>
>> (a, b, M:bag)
>>
>> I'm not sure if things would blow up if M is too large to fit into
>> memory in your UDF though.
>>
>>
>> On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]> wrote:
>> > I'm trying to do something like this:
>> > (if 'data' is a set of tuples loaded from a file containing fields a, b
>> > and c)
>> > (if 'M' is another set of tuples loaded from a file)
>> >
>> > data = FOREACH data GENERATE *, someUDF(a, b, M);
>> >
>> > What I'm looking for is to generate a value (in this case, a string)
>> > based on a and b, using the contents of M inside the UDF.
>> >
>> > The UDF looks like this, in pseudocode:
>> >
>> > foreach element x in M {
>> >  if a matches x or b matches x {
>> >    return "something"
>> >  }
>> > }
>> > return "something else"
>> >
>> > Is this possible?  I keep getting errors related to "Scalars can only be
>> > used with projections" and the like.
>> > The thing holding me back from using filters is that I won't know what's
>> > in M until it's read, and since (in this case) they'll be regular
>> > expressions, I'd need to be able to join/group with regex matching, which
>> > I don't think Pig can do.
>> >
>> > -Mark
>> >
>>
>
>
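
For reference, the constructor-based workaround described in the quoted message of Apr 20 (read a small list of regular expressions once, then test a and b against each one) might look roughly like the sketch below. This is not the code from the thread. The class name RegexListMatcher, the fromFile helper, and the choice of full-string matching are all assumptions made for illustration, and the class deliberately avoids any Pig classes so it can run on its own. In an actual UDF the same logic would presumably sit in the constructor and exec method of an EvalFunc<String> subclass. Note that plain file I/O like this resolves the path on whichever machine runs the code, which would be consistent with the file-not-found errors reported at the top of this message when running on a cluster.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not from the thread): holds a small list of compiled
// regular expressions and classifies a pair of fields against them.
class RegexListMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    // Compile each non-blank line once, up front, instead of per record.
    RegexListMatcher(List<String> regexes) {
        for (String regex : regexes) {
            String trimmed = regex.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
    }

    // Assumed file layout: one regular expression per line.
    // The path is read from the local filesystem of the machine running this code.
    static RegexListMatcher fromFile(String path) {
        try {
            return new RegexListMatcher(
                    Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read pattern file: " + path, e);
        }
    }

    // Mirrors the pseudocode in the original question: return "something" as
    // soon as a or b matches any pattern, otherwise "something else".
    String classify(String a, String b) {
        for (Pattern p : patterns) {
            if (matches(p, a) || matches(p, b)) {
                return "something";
            }
        }
        return "something else";
    }

    // Assumption: "matches" means the whole field must match the pattern.
    // Null fields never match.
    private static boolean matches(Pattern p, String value) {
        return value != null && p.matcher(value).matches();
    }

    public static void main(String[] args) {
        RegexListMatcher m = new RegexListMatcher(Arrays.asList("foo.*", "[0-9]+"));
        System.out.println(m.classify("foobar", "x"));
        System.out.println(m.classify("x", "y"));
    }
}
```

If substring matching is wanted instead of whole-field matching, p.matcher(value).find() could replace matches(). Compiling the patterns once keeps the per-record cost to a short loop, which seems reasonable for a list of about 10 expressions.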
