I more or less solved it by reading the data in my UDF's constructor. The file is just a list of about 10 regular expressions, so I did the file I/O by hand: the path is passed to the constructor as a parameter, the patterns are stored in the UDF, and then I loop over them and test a and b against each one myself. It's not the MapReduce way, but given how small the file is, it will work for this application.
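For reference, the approach might look roughly like the sketch below. It is a minimal illustration, not the actual UDF: the class name RegexListMatcher and the two-argument exec method are hypothetical, and the class is kept free of Pig dependencies so it compiles on its own. A real UDF would presumably extend org.apache.pig.EvalFunc<String> and pull a and b out of the input Tuple. It also assumes the pattern file has one regex per line, is readable at the given path wherever the UDF gets instantiated, and that "matches" means a whole-string match.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the UDF. A real Pig UDF would extend
// org.apache.pig.EvalFunc<String> and read a and b from the input Tuple.
class RegexListMatcher {
    // Patterns are compiled once, in the constructor, and reused for every record.
    private final List<Pattern> patterns = new ArrayList<>();

    // The path arrives as a String constructor parameter.
    // Assumption: one regular expression per line; blank lines are skipped.
    RegexListMatcher(String patternFilePath) throws IOException {
        for (String line : Files.readAllLines(Paths.get(patternFilePath), StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
    }

    // Mirrors the pseudocode: "something" if a or b matches any pattern.
    String exec(String a, String b) {
        for (Pattern p : patterns) {
            if (matchesWhole(p, a) || matchesWhole(p, b)) {
                return "something";
            }
        }
        return "something else";
    }

    // Whole-string match; null fields never match.
    private static boolean matchesWhole(Pattern p, String s) {
        return s != null && p.matcher(s).matches();
    }
}
```

Compiling the patterns in the constructor keeps the per-record work down to the loop itself. If substring matching is wanted instead, matches() could be swapped for find().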
If anyone knows how my "patch" might fail, or if there is a better way - feel free to speak up.

-Mark

On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <[email protected]> wrote:
> You could try doing GROUP ALL on the contents of M, which would
> produce a single bag containing each record and then joining M with
> data using a surrogate constant key. Or CROSS would also work instead
> of the join I suspect. Then you'd have a tuple like this to work with:
>
> (a, b, M:bag)
>
> I'm not sure if things would blow up if M is too large to fit into
> memory in your UDF though.
>
>
> On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]> wrote:
> > I'm trying to do something like this:
> > (if 'data' is a set of tuples loaded from a file containing fields a, b and c)
> > (if 'M' is another set of tuples loaded from a file)
> >
> > data = FOREACH data GENERATE *, someUDF(a, b, M);
> >
> > What I'm looking for is to generate (in this case, a string) based on a and
> > b, using the contents of M inside the UDF.
> >
> > The UDF looks like this, in pseudocode:
> >
> > foreach element x in M {
> >   if a matches x or b matches x {
> >     return "something"
> >   }
> > }
> > return "something else"
> >
> > Is this possible? I keep getting errors related to "Scalars can only be
> > used with projections" and the like.
> > The thing holding me back from using filters is that I won't know what's in
> > M until it's read, and since (in this case) they'll be regular expressions,
> > I'd need to be able to join/group with regex matching which I don't think
> > Pig can do.
> >
> > -Mark
