Bill, that doesn't work if he's trying to do a join to a table of blacklisted patterns.

Jan, because of the fundamental way Map-Reduce works, joins only work on equality conditions. If your blacklist is not huge (just a few megs perhaps?), you can put the file containing your blacklist in HDFS, use the cache directive to make sure your worker nodes are prepped to use it efficiently, and then write a UDF that takes one of your strings and runs it through the blacklist to check whether any entries match. You could then filter by this UDF. This could be done reasonably efficiently.

Check out LookupInFiles (in the piggybank) for something similar.

-D

On Thu, Feb 25, 2010 at 11:03 AM, Bill Graham <[email protected]> wrote:

> You could specify a condition using the RegexMatch or RegexExtract UDF
> in piggybank:
>
> http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexMatch.java
>
> http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexExtract.java
>
> On Thu, Feb 25, 2010 at 10:17 AM, Jan Zimmek <[email protected]> wrote:
>
> > hi,
> >
> > i recently found pig, really like it and want to use it for one of our
> > actual projects.
> >
> > getting the basics running was easy, but now i am struggling with a
> > problem.
> >
> > i am trying to get customers whose email is not blacklisted.
> >
> > blacklist entries can be specified as:
> >
> > [email protected]
> >
> > or wildcarded
> >
> > @domain.de
> >
> > in sql i would solve this by:
> >
> > ----
> >
> > select
> >   *
> > from
> >   customer c
> >   left join blacklist b
> >   on
> >     c.email like concat("%",b.email)
> > where
> >   b.email is null
> >
> > ----
> >
> > this is the structure of my input files:
> >
> > raw_customer = LOAD 'customer.csv' USING PigStorage('\t') AS (id: long, email: chararray);
> > raw_blacklist = LOAD 'blacklist.csv' USING PigStorage('\t') AS (email: chararray);
> >
> > how would i solve this using pig ? - especially handling the "like %"
> > condition.
> >
> > i already looked into udf, but need some advice how to implement this.
> >
> > any help would be really appreciated.
> >
> > regards,
> > jan
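
For reference, the blacklist check suggested in the reply at the top (load the list once, then test each email against it) could look roughly like the sketch below. It only covers the matching step and uses plain Java with no Pig dependency. The class name BlacklistMatcher and its methods are hypothetical, not something from Pig or the piggybank. It assumes every blacklist entry is a suffix pattern, mirroring the `like concat("%", b.email)` condition in the SQL, and it assumes case-insensitive comparison, which the thread does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of Pig or piggybank): holds the blacklist in
// memory and answers "does this email end with any blacklisted entry?".
final class BlacklistMatcher {
    private final List<String> suffixes = new ArrayList<>();

    // entries: one blacklist line each, e.g. a full address or "@domain.de".
    BlacklistMatcher(Iterable<String> entries) {
        for (String raw : entries) {
            // Assumption: compare case-insensitively and ignore blank lines.
            String entry = raw.trim().toLowerCase(Locale.ROOT);
            if (!entry.isEmpty()) {
                suffixes.add(entry);
            }
        }
    }

    // Same semantics as SQL's: email LIKE CONCAT('%', entry), for any entry.
    boolean isBlacklisted(String email) {
        if (email == null) {
            return false; // a null email cannot match any pattern
        }
        String normalized = email.trim().toLowerCase(Locale.ROOT);
        for (String suffix : suffixes) {
            if (normalized.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }
}
```

In an actual UDF this logic would presumably sit inside a class extending Pig's FilterFunc, with the blacklist file read once on the first call and the script keeping only the rows for which the function returns false. The linear scan is fine for a list of a few megabytes at most; for a much larger list, looking up the "@domain" part of each email in a hash set would avoid scanning every entry.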
