Hi,

Let's say I have a large data set, A, with rows like:

user, verb, action, location

Example:
joe, said, I had a nice day, Tokyo
jane, paid, two dollars for a nice cup of coffee, Melbourne
jack, watched, an interesting movie, New York
jamie, said, I am interested in hiking, Austin

Another, smaller data set, B, has a list of regexes to match against
the "action" field, and each regex has another attribute associated
with it, say, the category of action.
Example:
.*interest.*, explore
.*bank.*, account
.*tax.*, account
.*play.*, sports

What I want is that if "action" matches "regex", then sets A and B are
joined such that I end up with the tuple (user, verb, category of
action, location).

Right now, I have done this using a Java UDF where each A::action gets
evaluated against each B::regex for a match. If there is a match, the
UDF returns the desired tuple.
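
To make the matching step concrete, here is a minimal sketch of the
lookup logic in plain Java. It is not the actual UDF: the class
RegexLookup and its method names are hypothetical, the Pig wrapper is
left out, and it assumes the rules from B are loaded and compiled once
up front and that a regex has to match the whole "action" string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the real UDF.
class RegexLookup {
    // One row of B: a compiled regex plus its category.
    private static final class Rule {
        final Pattern pattern;
        final String category;

        Rule(String regex, String category) {
            // Compile once here, not once per row of A.
            this.pattern = Pattern.compile(regex);
            this.category = category;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void addRule(String regex, String category) {
        rules.add(new Rule(regex, category));
    }

    // Returns the category of every rule whose regex matches the
    // whole action string (empty list if nothing matches).
    List<String> categoriesFor(String action) {
        List<String> matched = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.pattern.matcher(action).matches()) {
                matched.add(rule.category);
            }
        }
        return matched;
    }

    // Builds the lookup from the example rows of B shown above.
    static RegexLookup sample() {
        RegexLookup lookup = new RegexLookup();
        lookup.addRule(".*interest.*", "explore");
        lookup.addRule(".*bank.*", "account");
        lookup.addRule(".*tax.*", "account");
        lookup.addRule(".*play.*", "sports");
        return lookup;
    }

    public static void main(String[] args) {
        RegexLookup lookup = sample();
        // prints [explore]
        System.out.println(lookup.categoriesFor("an interesting movie"));
        // prints []
        System.out.println(lookup.categoriesFor("I had a nice day"));
    }
}
```

This is still one pass over all of B for every row of A, so it only
shows the shape of the problem, not a fix for it.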

However, performance is slow. I am wondering if there is a better
strategy to do what I think is essentially a lookup table. I have seen
threads where replicated join has been recommended but obviously a
simple "join" isn't going to work for regex matching.

Any recommendations?

Thanks,

Xuri
