Starting with Pig 0.9 (not yet released, but you can build it from the branch), a UDF can specify a file to put in the distributed cache. Your UDF could thus pick up the file locally on your box, put it in the distributed cache, and then read it from the distributed cache on the back end. If running an unreleased version isn't an option for you, you could manually load the file into the distributed cache and then read it from your UDF.

Alan.

On Apr 21, 2011, at 8:18 AM, Mark Laczin wrote:

Does anyone know how to ship the config file in this situation?
I'm encountering problems with file not found exceptions when trying to run
this over a cluster.

On Wed, Apr 20, 2011 at 1:03 PM, Mark Laczin <[email protected]> wrote:

I kind of solved it by reading the data in my UDF constructor: the path is passed in as a parameter, and since the file is just a list of about 10 regular expressions, I did manual file I/O, stored the list, and then looped over it, testing a and b by hand. It's not the MapReduce way, but it will work for this application, considering the small size of the file.
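
For anyone following along, the approach described above might look roughly like the sketch below. It is plain Java with no Pig dependency, and the names (RegexListMatcher, fromFile, classify) are made up for illustration; in a real UDF the loading would sit in the constructor and the loop in exec(). It assumes one regular expression per line, and that the path is readable wherever the object is constructed, which is exactly what goes wrong on a cluster if the file only exists on the client machine.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: holds a small list of compiled regular expressions
// and tests two values against them, as the pseudocode in this thread does.
final class RegexListMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    // Compile each non-blank entry once, up front, so the per-record
    // check does not recompile the expressions.
    RegexListMatcher(List<String> regexes) {
        for (String regex : regexes) {
            String trimmed = regex.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
    }

    // Assumed file layout: one regular expression per line, UTF-8.
    static RegexListMatcher fromFile(String path) throws IOException {
        return new RegexListMatcher(
                Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8));
    }

    // Returns "something" if a or b fully matches any pattern,
    // otherwise "something else".
    String classify(String a, String b) {
        for (Pattern p : patterns) {
            if (fullyMatches(p, a) || fullyMatches(p, b)) {
                return "something";
            }
        }
        return "something else";
    }

    // Null fields never match; matches() requires the whole value to match.
    private static boolean fullyMatches(Pattern p, String value) {
        return value != null && p.matcher(value).matches();
    }
}
```

Note that matches() requires the whole field to match the expression; if a substring match is what you want, find() would be the call to use instead.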

If anyone knows how my "patch" might fail, or if there is a better way -
feel free to speak up.

-Mark


On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <[email protected]> wrote:

You could try doing GROUP ALL on the contents of M, which would produce a single bag containing each record, and then joining M with data using a surrogate constant key. Or CROSS would also work instead of the join, I suspect. Then you'd have a tuple like this to work with:

(a, b, M:bag)

I'm not sure if things would blow up if M is too large to fit into
memory in your UDF though.


On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]> wrote:

I'm trying to do something like this:
(if 'data' is a set of tuples loaded from a file containing fields a, b and c)
(if 'M' is another set of tuples loaded from a file)

data = FOREACH data GENERATE *, someUDF(a, b, M);

What I'm looking for is to generate a value (in this case, a string) based on a and b, using the contents of M inside the UDF.

The UDF looks like this, in pseudocode:

foreach element x in M {
  if a matches x or b matches x {
    return "something"
  }
}
return "something else"

Is this possible? I keep getting errors related to "Scalars can only be used with projections" and the like.
The thing holding me back from using filters is that I won't know what's in M until it's read, and since (in this case) they'll be regular expressions, I'd need to be able to join/group with regex matching, which I don't think Pig can do.

-Mark




