Hi,

Let's say I have a large data set, A, with records like:

user, verb, action, location

Example:

joe, said, I had a nice day, Tokyo
jane, paid, two dollars for a nice cup of coffee, Melbourne
jack, watched, an interesting movie, New York
jamie, said, I am interested in hiking, Austin

Another, smaller data set, B, has a list of regexes to match against "action", and each regex has another attribute associated with it, say, the category of action.

Example:

.*interest.*, explore
.*bank.*, account
.*tax.*, account
.*play.*, sports

What I want is: if "action" matches "regex", join sets A and B so that I end up with the tuple (user, verb, category of action, location).

Right now I do this with a Java UDF in which each A::action is evaluated against each B::regex for a match; on a match, it returns the desired tuple. However, performance is slow.

I am wondering if there is a better strategy for what I think is essentially a lookup table. I have seen threads where a replicated join has been recommended, but obviously a simple "join" isn't going to work for regex matching.

Any recommendations?

Thanks,
Xuri
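
P.S. To make the question concrete, here is a simplified, stand-alone sketch of the kind of per-record lookup described above. It is not the actual UDF: the class and method names (RegexLookup, addRule, categoriesFor) are made up for illustration, and it assumes B is small enough to hold in memory and that each regex is compiled once up front rather than once per record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; names are not from any real UDF or from Pig.
final class RegexLookup {

    // One row of data set B: a precompiled regex and its category.
    private static final class Rule {
        final Pattern pattern;
        final String category;

        Rule(String regex, String category) {
            this.pattern = Pattern.compile(regex); // compile once, reuse for every record
            this.category = category;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Register one (regex, category) row from B.
    void addRule(String regex, String category) {
        rules.add(new Rule(regex, category));
    }

    // Return the category of every rule whose regex matches the whole action string.
    // This is still a linear scan over all rules for each record of A.
    List<String> categoriesFor(String action) {
        List<String> matched = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.pattern.matcher(action).matches()) {
                matched.add(rule.category);
            }
        }
        return matched;
    }
}
```

With the four example rules loaded, categoriesFor("an interesting movie") would yield the single category "explore", and a non-matching action such as "I had a nice day" would yield an empty list. The cost is still (rows in A) x (rows in B) regex evaluations, which is the part I would like to avoid.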
