I kind of solved it by reading the data in from my UDF's constructor. The
file is just a list of about 10 regular expressions, so I did manual file
I/O: the path is passed in as a constructor parameter, the patterns are
stored, and then I loop over them and test a and b by hand. It's not the
MapReduce way, but it will work for this application, considering the small
size of the file.
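
For anyone finding this later, here is a minimal sketch of that approach,
with the Pig wiring left out so it stands on its own. The class name
RegexListMatcher and the method classify are made up for illustration; in
an actual UDF I'd assume this logic sits in an EvalFunc<String> subclass,
with exec() pulling a and b out of the input tuple. It also assumes one
regex per line and full-string matching. One thing worth checking: the path
has to be readable wherever the constructor runs, which may not hold for a
local file on a cluster.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: holds the regex list read at construction time.
final class RegexListMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    // Read one regular expression per line; blank lines are skipped.
    RegexListMatcher(String path) throws IOException {
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                // Compile once here so each call only has to run the matchers.
                patterns.add(Pattern.compile(trimmed));
            }
        }
    }

    // Mirrors the pseudocode: the first pattern matching a or b wins.
    String classify(String a, String b) {
        for (Pattern p : patterns) {
            if (matches(p, a) || matches(p, b)) {
                return "something";
            }
        }
        return "something else";
    }

    // Null fields never match; matches() requires the whole string to match.
    private static boolean matches(Pattern p, String s) {
        return s != null && p.matcher(s).matches();
    }
}
```

Compiling the patterns once in the constructor keeps the per-record cost
down to the matching itself, which is fine for a list this small.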

If anyone knows how my "patch" might fail, or if there is a better way -
feel free to speak up.

-Mark

On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <[email protected]> wrote:

> You could try doing GROUP ALL on the contents of M, which would
> produce a single bag containing each record, and then joining M with
> data using a surrogate constant key. Or CROSS would also work instead
> of the join, I suspect. Then you'd have a tuple like this to work with:
>
> (a, b, M:bag)
>
> I'm not sure if things would blow up if M is too large to fit into
> memory in your UDF though.
>
>
> On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]>
> wrote:
> > I'm trying to do something like this:
> > (if 'data' is a set of tuples loaded from a file containing fields a, b
> > and c)
> > (if 'M' is another set of tuples loaded from a file)
> >
> > data = FOREACH data GENERATE *, someUDF(a, b, M);
> >
> > What I'm looking for is to generate a value (in this case, a string)
> > based on a and b, using the contents of M inside the UDF.
> >
> > The UDF looks like this, in pseudocode:
> >
> > foreach element x in M {
> >  if a matches x or b matches x {
> >    return "something"
> >  }
> > }
> > return "something else"
> >
> > Is this possible?  I keep getting errors related to "Scalars can only be
> > used with projections" and the like.
> > The thing holding me back from using filters is that I won't know what's
> > in M until it's read, and since (in this case) they'll be regular
> > expressions, I'd need to be able to join/group with regex matching, which
> > I don't think Pig can do.
> >
> > -Mark
> >
>
