Re: filter/join by sql like "%pattern" condition

Dmitriy Ryaboy Thu, 25 Feb 2010 11:15:17 -0800

Bill, that doesn't work if he's trying to do a join to a table of
blacklisted patterns.


Jan, because of the fundamental way Map-Reduce works, joins in Pig work only
on equality conditions, so a "like" condition cannot be used as a join key.
If your blacklist is not huge (perhaps just a few megabytes), you can put the
file containing it in HDFS, use the cache directive to make sure your worker
nodes are prepped to use it efficiently, and then write a UDF that takes one
of your strings and runs it through the blacklist to check whether any entry
matches. You can then filter by this UDF, which can be done reasonably
efficiently. Check out LookupInFiles (in the piggybank) for something similar.
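
As a rough sketch of the matching logic such a UDF could delegate to, here
is a small plain-Java helper. The class name BlacklistMatcher and its methods
are hypothetical (they are not part of Pig or piggybank), and the sketch
assumes the blacklist lines have already been read into memory, that matching
should be case-insensitive, and that an entry matches when the email ends
with it, as in the SQL "like concat('%', b.email)" condition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Pig or piggybank: holds the blacklist
// in memory and checks whether an email ends with any blacklist entry.
final class BlacklistMatcher {
    private final List<String> suffixes = new ArrayList<>();

    // entries: one blacklist line each, either a full address or a
    // wildcarded suffix such as "@domain.de". Blank lines are skipped.
    BlacklistMatcher(Iterable<String> entries) {
        for (String entry : entries) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                // Assumption: matching is case-insensitive.
                suffixes.add(trimmed.toLowerCase(Locale.ROOT));
            }
        }
    }

    // Returns true if the email ends with any blacklist entry. A null
    // email is treated as not blacklisted, mirroring the SQL left join,
    // where a null email matches no blacklist row.
    boolean isBlacklisted(String email) {
        if (email == null) {
            return false;
        }
        String candidate = email.trim().toLowerCase(Locale.ROOT);
        for (String suffix : suffixes) {
            if (candidate.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }
}
```

A filter UDF would build one BlacklistMatcher from the cached file the first
time it is called and then return !isBlacklisted(email) for each tuple. The
linear scan is fine for a small list; for a larger one, splitting exact
addresses and domain suffixes into hash sets would likely be faster.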

-D



On Thu, Feb 25, 2010 at 11:03 AM, Bill Graham <[email protected]> wrote:

> You could specify a condition using the RegexMatch or RegexExtract UDF
> in piggybank:
>
>
> http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexMatch.java
>
>
> http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexExtract.java
>
> On Thu, Feb 25, 2010 at 10:17 AM, Jan Zimmek <[email protected]>
> wrote:
>
> > hi,
> >
> > i recently found pig, really like it and want to use it for one of our
> > actual projects.
> >
> > getting the basics running was easy, but now i am struggling with a
> > problem.
> >
> > i am trying to get customers whose email is not blacklisted.
> >
> > blacklist entries can be specified as:
> >
> > [email protected]
> >
> > or wildcarded
> >
> > @domain.de
> >
> > in sql i would solve this by:
> >
> > ----
> >
> > select
> >  *
> > from
> >  customer c
> > left join blacklist b
> > on
> >  c.email like concat("%",b.email)
> > where
> >  b.email is null
> >
> > ----
> >
> > this is the structure of my input files:
> >
> > raw_customer = LOAD 'customer.csv' USING PigStorage('\t')
> >     AS (id: long, email: chararray);
> > raw_blacklist = LOAD 'blacklist.csv' USING PigStorage('\t')
> >     AS (email: chararray);
> >
> >
> > how would i solve this using pig ? - especially handling the "like %"
> > condition.
> >
> > i already looked into udf, but need some advice how to implement this.
> >
> >
> > any help would be really appreciated.
> >
> > regards,
> > jan
> >
> >
>
