On Thu, Dec 03, 2009 at 03:12:05PM -0800, Julien wrote:
> Well, then I'd need some help with that...
> 
> Again, it's easy with single search keywords, but I haven't found a
> solution for combined searches like twitter+stream or photo+Paris...
> because I would have to compare each combination of tokens in the
> tweet...
> 
> Can someone give more details.

I don't mean to be flogging my site today, but take a look at
http://fishtwits.com for the results I'm producing (just click the logo
at the top of the page to view the full site without logging in).  Any
tweets from users followed by FishTwits are scanned for fishing-related
terms and all such terms found in the tweet are displayed below it.  At
this moment, for instance, the first displayed tweet shows matches for
both "Fly Fishing" and "Sole".

This is accomplished with the following Perl code (edited to remove
parts which aren't directly relevant):

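# File-scoped state, elided from this excerpt and declared here as an
# assumption: @topic_list is filled from the topic table elsewhere.
my ($topic_regex, @topic_list);

sub load_from_text {
  my ($class, $text) = @_;

  # Build the combined regex once, on first use, and cache it.
  unless ($topic_regex) {
    require Regexp::Assemble;
    my $ra = Regexp::Assemble->new(
               chomp => 0,
               anchor_word_begin => 1,  # word boundary before a match
               anchor_word_end => 1,    # word boundary after a match
             );
    for my $topic (@topic_list) {
      $ra->add(lc $topic);
    }
    $topic_regex = $ra->re;
  }

  # Terms and text are both lowercased, so matching ignores case.
  $text = lc $text;
  # A /g match in list context returns every matching substring.
  my @topics = $text =~ /$topic_regex/g;

  return sort @topics;
}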

It first uses Regexp::Assemble to build a $topic_regex [1] which will
match any of the words/phrases found in the topic table, then does a
global match of $text (the body of the tweet being examined) against
$topic_regex, capturing all matches into the array @topics, which is
then sorted and returned to the caller.

After the match is performed, @topics contains every search term which
is matched, no matter how many there may be, which should fill your
requirement for "combined searches", unless I'm misunderstanding it.
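If a combined search means "only tweets that match all of these terms",
the check can be run against the returned list.  A minimal sketch,
assuming the query arrives as a "+"-separated string; matches_combined
and the sample data are hypothetical names for illustration, not part
of the FishTwits code:

```perl
use strict;
use warnings;

# Hypothetical helper: true if every term of a combined search such as
# "photo+paris" appears among the topics matched in a single tweet.
sub matches_combined {
  my ($query, @topics) = @_;
  my %seen   = map { lc($_) => 1 } @topics;
  my @wanted = grep { length } split /\+/, lc $query;
  return 0 unless @wanted;          # an empty query matches nothing
  for my $term (@wanted) {
    return 0 unless $seen{$term};
  }
  return 1;
}

# Sample data standing in for what load_from_text returned for one tweet.
my @topics = ('fly fishing', 'sole');
for my $query ('fly fishing+sole', 'sole+trout') {
  my $result = matches_combined($query, @topics) ? 'match' : 'no match';
  print "$query: $result\n";
}
```

Since the regex has already found every term in one pass, the combined
check is only a hash lookup per query term; no token combinations in
the tweet need to be compared.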

If you mean you would want that "Fly Fishing", "Sole" tweet to return
three hits rather than two ("Fly Fishing", "Sole", "Fly Fishing+Sole"),
that's easy enough to create from @topics: just generate every
combination of the terms which the individual tweet matched.
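One way to generate those combined hits is sketched below, using a
bitmask to pick each subset of the matched terms.  The function name
topic_combinations is hypothetical, and joining with "+" is an
assumption taken from the "Fly Fishing+Sole" notation above:

```perl
use strict;
use warnings;

# Hypothetical helper: every combination of two or more matched topics,
# joined with "+".  Terms keep their input order inside a combination,
# so "a+b" is produced but "b+a" is not.
sub topic_combinations {
  my @topics = @_;
  my @combos;
  # Each bitmask from 1 to 2**n - 1 selects one non-empty subset.
  for my $mask (1 .. (1 << scalar(@topics)) - 1) {
    my @subset = @topics[ grep { $mask & (1 << $_) } 0 .. $#topics ];
    push @combos, join('+', @subset) if @subset > 1;
  }
  return @combos;
}

my @topics = ('fly fishing', 'sole');
print join(', ', @topics, topic_combinations(@topics)), "\n";
# prints: fly fishing, sole, fly fishing+sole
```

The number of subsets doubles with each extra term, which is fine for
the handful of terms a single tweet matches but would not scale to the
whole topic list.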


[1]  If you're only dealing with 10 or so keywords, you'd probably be
just as well off building the regex by hand.  The main reason I'm using
Regexp::Assemble to do it on the fly is because manually creating and
then maintaining a regex that will efficiently match any of 1300 terms
would be a nightmare.
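
For the small-list case, a hand-built version might look like the
sketch below.  It uses only core Perl; the keyword list and the name
find_keywords are made up for illustration, and a plain alternation
stands in for what Regexp::Assemble would otherwise optimise:

```perl
use strict;
use warnings;

# Hypothetical keyword list, small enough to maintain by hand.
my @keywords = ('fly fishing', 'sole', 'trout', 'fly');

# Longest first, so "fly fishing" is tried before "fly" at one position.
my $alternation = join '|',
                  map  { quotemeta($_) }
                  sort { length($b) <=> length($a) || $a cmp $b } @keywords;

# \b on both sides does the job of anchor_word_begin/anchor_word_end.
my $keyword_regex = qr/\b(?:$alternation)\b/;

sub find_keywords {
  my ($text) = @_;
  $text = lc $text;
  # A /g match in list context returns every matching substring.
  my @found = $text =~ /$keyword_regex/g;
  return sort @found;
}

my @hits = find_keywords('Fly fishing for trout, then pan-fried sole');
print join(', ', @hits), "\n";
# prints: fly fishing, sole, trout
```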

-- 
Dave Sherohman
