Does anyone know how to ship the config file in this situation? I'm getting file-not-found exceptions when I try to run this on a cluster.
On Wed, Apr 20, 2011 at 1:03 PM, Mark Laczin <[email protected]> wrote:
> I kind of solved it by reading in the data from my UDF constructor (it's
> just a file with a list of like 10 regular expressions, so I did manual file
> I/O), by passing the path (provided as a parameter), and then just storing
> it (and then, looping over it and testing a, b by hand). It's not the
> MapReduce way, but it will work for this application, considering the small
> size of the file.
>
> If anyone knows how my "patch" might fail, or if there is a better way -
> feel free to speak up.
>
> -Mark
>
>
> On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <[email protected]> wrote:
>
>> You could try doing GROUP ALL on the contents of M, which would
>> produce a single bag containing each record and then joining M with
>> data using a surrogate constant key. Or CROSS would also work instead
>> of the join I suspect. Then you'd have a tuple like this to work with:
>>
>> (a, b, M:bag)
>>
>> I'm not sure if things would blow up if M is too large to fit into
>> memory in your UDF though.
>>
>>
>> On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]> wrote:
>> > I'm trying to do something like this:
>> > (if 'data' is a set of tuples loaded from a file containing fields a, b
>> > and c)
>> > (if 'M' is another set of tuples loaded from a file)
>> >
>> > data = FOREACH data GENERATE *, someUDF(a, b, M);
>> >
>> > What I'm looking for is to generate (in this case, a string) based on a
>> > and b, using the contents of M inside the UDF.
>> >
>> > The UDF looks like this, in pseudocode:
>> >
>> > foreach element x in M {
>> >   if a matches x or b matches x {
>> >     return "something"
>> >   }
>> > }
>> > return "something else"
>> >
>> > Is this possible? I keep getting errors related to "Scalars can only be
>> > used with projections" and the like.
>> >
>> > The thing holding me back from using filters is that I won't know what's
>> > in M until it's read, and since (in this case) they'll be regular
>> > expressions, I'd need to be able to join/group with regex matching which
>> > I don't think Pig can do.
>> >
>> > -Mark
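
For reference, the pseudocode loop in the original message, combined with the constructor-time file read Mark describes, might look roughly like the following in plain Java. This is only a sketch under assumptions: the class name RegexListMatcher and the methods fromFile and classify are made up for illustration, the file is assumed to hold one regular expression per line in UTF-8, and "matches" is taken to mean a full-string match. It deliberately has no Pig dependency; a real UDF would extend Pig's EvalFunc and call something like classify from exec. It also does not address how the file reaches the task nodes, which is the open question at the top of the thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: holds a small list of regular
// expressions and applies the loop from the pseudocode in the original message.
class RegexListMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    // Compile each non-blank line once, up front, so the per-record call is cheap.
    RegexListMatcher(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
    }

    // Read one regular expression per line from a file (assumed UTF-8).
    static RegexListMatcher fromFile(Path path) throws IOException {
        return new RegexListMatcher(Files.readAllLines(path, StandardCharsets.UTF_8));
    }

    // Returns "something" if a or b fully matches any pattern, else "something else".
    String classify(String a, String b) {
        for (Pattern p : patterns) {
            if (matches(p, a) || matches(p, b)) {
                return "something";
            }
        }
        return "something else";
    }

    // Null-safe full-string match; use find() instead for substring matching.
    private static boolean matches(Pattern p, String s) {
        return s != null && p.matcher(s).matches();
    }

    // Small demonstration with inline patterns instead of a file.
    public static void main(String[] args) {
        RegexListMatcher m = new RegexListMatcher(Arrays.asList("foo.*", "bar[0-9]+"));
        System.out.println(m.classify("foobar", "x"));
        System.out.println(m.classify("x", "y"));
    }
}
```

Compiling the patterns once in the constructor mirrors the "read it in the UDF constructor and store it" approach; with a list of about 10 expressions, the linear scan per record should be cheap.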
