[twitter-dev] Re: Track streaming : how to match tweets?

Julien Mon, 07 Dec 2009 15:45:27 -0800

Hum... ok... sad, but I have an idea. Please tell me if this is
stupid.

So, for each tweet I receive, I know what searches it _may_ match.
Right?
So, with all these "candidate" queries, what I can do is run them
against the regular search API (as long as they're complex). If the
results from that polling include the tweet, then I know that the search
matches, and I don't have to build anything on top of what you built.


Let's take an example :
- If I have a search for "starbuck AND free near:94123"
- I track "starbuck" with the streaming API
- Whenever you guys send me a tweet for this track
- I check internally all the queries that may match "starbuck"
- I perform them on your API
- If the tweet you sent me is in the results, then I know this tweet
is valid,
- If not, I discard it.
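
The steps above could be sketched roughly as follows, in Python. Everything
here is an assumption made for illustration: the `candidates_for` and
`confirmed_queries` helpers, the shape of the tweet dict, and the `search`
callable, which stands in for a call to the search API and is expected to
return the ids of the tweets matching a query. It is a sketch of the idea,
not a tested client.

```python
import re
from typing import Callable, Dict, Iterable, List


def candidates_for(text: str, queries_by_track: Dict[str, List[str]]) -> List[str]:
    """Return the stored queries whose tracked keyword appears in the tweet text."""
    words = set(re.findall(r"\w+", text.lower()))
    found: List[str] = []
    for track, queries in queries_by_track.items():
        if track.lower() in words:
            found.extend(queries)
    return found


def confirmed_queries(
    tweet: dict,
    queries_by_track: Dict[str, List[str]],
    search: Callable[[str], Iterable[int]],  # hypothetical wrapper around the search API
) -> List[str]:
    """Keep only the candidate queries whose search results contain this tweet."""
    confirmed: List[str] = []
    for query in candidates_for(tweet["text"], queries_by_track):
        if tweet["id"] in set(search(query)):
            confirmed.append(query)
    return confirmed
```

If `confirmed_queries` returns an empty list, the tweet is discarded. If the
search index lags behind the stream, a tweet may not show up in the results
right away, so a miss might need a retry before it is dropped.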

My only concern here is the 20k/hour limit. I think this is still
doable, because
1) we will only make queries to the search API when we receive
notifications
2) we will only make queries to the search API for complex queries
(i.e. AND, +, "" or near:)
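
As a rough sketch of these two points, a query could be treated as complex
when it contains one of the operators listed above, and the search calls
could be counted against the hourly limit. The names `is_complex` and
`HourlyBudget` are made up for illustration, the operator list is not the
search API's real grammar, and counting per clock hour is an assumption
about how the limit is applied.

```python
import re

# Operators taken from the list above: AND, +, quoted phrases, near:
_COMPLEX = re.compile(r'\bAND\b|\+|"|\bnear:')


def is_complex(query: str) -> bool:
    """True if the query uses an operator that plain keyword tracking cannot check."""
    return _COMPLEX.search(query) is not None


class HourlyBudget:
    """Counts search calls made in the current clock hour against a fixed limit."""

    def __init__(self, limit: int = 20000) -> None:
        self.limit = limit
        self.hour = None  # index of the hour currently being counted
        self.used = 0

    def try_spend(self, now: float) -> bool:
        """Record one call at time `now` (seconds); return False if over the limit."""
        hour = int(now // 3600)
        if hour != self.hour:
            # A new hour started: reset the counter.
            self.hour, self.used = hour, 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

A simple keyword query skips the search API entirely; a complex one is only
verified when `try_spend(time.time())` returns True.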

The pros :
- whenever you guys change/add stuff to your search DSL, I don't have to
change anything on my side.

How does that sound?

Thanks John anyway for your great help!

Julien


On Dec 5, 3:32 pm, John Kalucki <j...@twitter.com> wrote:
> This could only make sense if the Streaming API supported "search engine
> logic". Currently Streaming only supports keyword matching -- you have to
> post-process to add additional predicate operators beyond OR. You can
> reproduce the keyword match in a few lines of code, and the rest is
> (currently) all up to you anyway. Just remember that a given tweet could
> have triggered multiple predicates.
>
> Beyond being a low priority feature, rendering and delivering custom
> responses per user would be a performance risk. We currently can support a
> very large number of filter clients per server, and we want to preserve this
> performance.
>
> -John Kalucki http://twitter.com/jkalucki
> Services, Twitter Inc.
>
>
>
> On Sat, Dec 5, 2009 at 3:18 AM, Julien <julien.genest...@gmail.com> wrote:
> > Thanks Dave,
>
> > I think I get it from your example... yet, in our case, we have
> > several thousands of keywords, and many many complex searches (with
> > filter:, "and", "or", near: ... and so on).
>
> > I keep thinking that instead of re-implementing on my side the search
> > engine logic that Twitter has, it would be simpler for them to also
> > send the matching keywords. An even more elegant solution (yet
> > slightly more complex) would be to be able to pass parameters along
> > with the search I give, such as a unique search_id (that I can store
> > on my side) and then, instead of giving me the matched keywords/search
> > terms, they could just give me back that search_id. That would be
> > something like this :
>
> > Right now it is :
> > POST  http://stream.twitter.com/1/statuses/filter.json
> > track=paris,twitter+superfeedr,"julien
> > near:france"
>
> > It would be awesome if I could do :
> > POST  http://stream.twitter.com/1/statuses/filter.json
> > track={"paris":"my_search_1","twitter
> > +superfeedr":"my_search_2","julien near:france":"my_search_3"}
>
> > And then, upon notifications, they would just pass me this search key
> > my_search_xx
>
> > I know and understand that it implies a little bit of work for Twitter,
> > but it also removes the pain from each subscriber to this streaming
> > API who has to re-implement again and again the "search engine" from
> > Twitter.
>
> > On Dec 4, 11:33 am, Dave Sherohman <d...@fishtwits.com> wrote:
> > > On Thu, Dec 03, 2009 at 03:12:05PM -0800, Julien wrote:
> > > > Well, then I'd need some help with that...
>
> > > > Again, it's easy with single search keywords, but I haven't found a
> > > > solution for combined searches like twitter+stream or photo+Paris...
> > > > because I would have to compare each combination of tokens in the
> > > > tweet...
>
> > > > Can someone give more details?
>
> > > I don't mean to be flogging my site today, but take a look
> > > at http://fishtwits.com for the results I'm producing (just click the logo
> > > at the top of the page to view the full site without logging in):  Any
> > > tweets from users followed by FishTwits are scanned for fishing-related
> > > terms and all such terms found in the tweet are displayed below it.  At
> > > this moment, for instance, the first displayed tweet shows matches for
> > > both "Fly Fishing" and "Sole".
>
> > > This is accomplished with the following Perl code (edited to remove
> > > parts which aren't directly relevant):
>
> > > sub load_from_text {
> > >   my ($class, $text) = @_;
>
> > >   unless($topic_regex) {
> > >     require Regexp::Assemble;
> > >     my $ra = Regexp::Assemble->new(
> > >                chomp => 0,
> > >                anchor_word_begin => 1,
> > >                anchor_word_end => 1,
> > >              );
> > >     for my $topic (@topic_list) {
> > >       $ra->add(lc $topic);
> > >     }
> > >     $topic_regex = $ra->re;
> > >   }
>
> > >   $text = lc $text;
> > >   my @topics = $text =~ /$topic_regex/g;
>
> > >   return sort @topics;
>
> > > }
>
> > > It first uses Regexp::Assemble to build a $topic_regex[1] which will
> > > match any of the words/phrases found in the topic table, then does a
> > > global match of $text (the body of the tweet being examined) against
> > > $topic_regex, capturing all matches into the array @topics, which is
> > > then sorted and returned to the caller.
>
> > > After the match is performed, @topics contains every search term which
> > > is matched, no matter how many there may be, which should fill your
> > > requirement for "combined searches", unless I'm misunderstanding it.
>
> > > If you mean you would want that "Fly Fishing", "Sole" tweet to return
> > > three hits rather than two ("Fly Fishing", "Sole", "Fly Fishing+Sole"),
> > > that's easy enough to create from @topics, just generate every
> > > combination of the terms which the individual tweet matched.
>
> > > [1]  If you're only dealing with 10 or so keywords, you'd probably be
> > > just as well off building the regex by hand.  The main reason I'm using
> > > Regexp::Assemble to do it on the fly is because manually creating and
> > > then maintaining a regex that will efficiently match any of 1300 terms
> > > would be a nightmare.
>
> > > --
> > > Dave Sherohman
