I think I may have to go with your second option - but thanks for the info, I'll keep an eye on 0.9.0.
On Thu, Apr 21, 2011 at 4:16 PM, Alan Gates <[email protected]> wrote:

> Starting with Pig 0.9 (not yet released but you can build it off the
> branch) a UDF can specify a file to put in the distributed cache. You could
> thus have your UDF pick up the file locally on your box and put it in the
> distributed cache, and then read it from the distributed cache on the back
> end. If running with an un-released version isn't an option for you, you
> could manually load the file into the distributed cache and then read it
> from your UDF.
>
> Alan.
>
> On Apr 21, 2011, at 8:18 AM, Mark Laczin wrote:
>
>> Does anyone know how to ship the config file in this situation?
>> I'm encountering problems with file not found exceptions when trying to
>> run this over a cluster.
>>
>> On Wed, Apr 20, 2011 at 1:03 PM, Mark Laczin <[email protected]> wrote:
>>
>>> I kind of solved it by reading in the data from my UDF constructor (it's
>>> just a file with a list of like 10 regular expressions, so I did manual
>>> file I/O), by passing the path (provided as a parameter), and then just
>>> storing it (and then, looping over it and testing a, b by hand). It's not
>>> the MapReduce way, but it will work for this application, considering the
>>> small size of the file.
>>>
>>> If anyone knows how my "patch" might fail, or if there is a better way -
>>> feel free to speak up.
>>>
>>> -Mark
>>>
>>> On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <[email protected]> wrote:
>>>
>>>> You could try doing GROUP ALL on the contents of M, which would
>>>> produce a single bag containing each record and then joining M with
>>>> data using a surrogate constant key. Or CROSS would also work instead
>>>> of the join I suspect. Then you'd have a tuple like this to work with:
>>>>
>>>> (a, b, M:bag)
>>>>
>>>> I'm not sure if things would blow up if M is too large to fit into
>>>> memory in your UDF though.
>>>>
>>>> On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]> wrote:
>>>>
>>>>> I'm trying to do something like this:
>>>>> (if 'data' is a set of tuples loaded from a file containing fields a, b
>>>>> and c)
>>>>> (if 'M' is another set of tuples loaded from a file)
>>>>>
>>>>> data = FOREACH data GENERATE *, someUDF(a, b, M);
>>>>>
>>>>> What I'm looking for is to generate (in this case, a string) based on a
>>>>> and b, using the contents of M inside the UDF.
>>>>>
>>>>> The UDF looks like this, in pseudocode:
>>>>>
>>>>> foreach element x in M {
>>>>>     if a matches x or b matches x {
>>>>>         return "something"
>>>>>     }
>>>>> }
>>>>> return "something else"
>>>>>
>>>>> Is this possible? I keep getting errors related to "Scalars can only be
>>>>> used with projections" and the like.
>>>>> The thing holding me back from using filters is that I won't know what's
>>>>> in M until it's read, and since (in this case) they'll be regular
>>>>> expressions, I'd need to be able to join/group with regex matching which
>>>>> I don't think Pig can do.
>>>>>
>>>>> -Mark
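
For anyone reading this thread later, the workaround described above (read the list of regular expressions once, then test a and b against each) might look roughly like the sketch below. It is plain Java with no Pig dependency, so it only illustrates the matching logic. The class name RegexListMatcher and its methods are made up for this example, and it assumes "matches" means the whole field must match the pattern. In an actual UDF this logic would presumably sit inside a class extending EvalFunc<String>, with the file read in the constructor, and the file would still have to be readable on every task node, which is the file-not-found problem discussed above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: holds the compiled regexes from "M".
final class RegexListMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    // Compile each non-blank line as one regular expression.
    RegexListMatcher(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
    }

    // Read the pattern file once, e.g. from the UDF constructor.
    static RegexListMatcher fromFile(String path) throws IOException {
        return new RegexListMatcher(
                Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8));
    }

    // Mirrors the pseudocode: return "something" as soon as any pattern
    // matches a or b, otherwise "something else".
    String label(String a, String b) {
        for (Pattern p : patterns) {
            if (matchesWhole(p, a) || matchesWhole(p, b)) {
                return "something";
            }
        }
        return "something else";
    }

    // Assumption: "matches" means the whole field matches; nulls never match.
    private static boolean matchesWhole(Pattern p, String s) {
        return s != null && p.matcher(s).matches();
    }
}
```

Compiling the patterns once in the constructor keeps the per-record cost to a loop over about ten precompiled regexes, which fits the small file size mentioned above. If substring matching is wanted instead, Matcher.find() could replace Matcher.matches().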
