Hi Matt,
I have tried the language parameter of Twitter search and found the
results very unreliable. For example:
http://search.twitter.com/search?lang=all&q=tweetjobsearch returns 10
results (all in English), but
http://search.twitter.com/search?lang=en&q=tweetjobsearch returns only
3.
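For anyone who wants to reproduce the comparison, here is a small Python sketch that builds the two query URLs. The helper name search_url is my own, not part of any Twitter library; it only constructs the URL and does not fetch anything.

```python
from urllib.parse import urlencode

# Endpoint taken from the two URLs quoted above.
SEARCH_BASE = "http://search.twitter.com/search"

def search_url(query, lang="all"):
    """Build a search URL for a query and lang value (hypothetical helper)."""
    return SEARCH_BASE + "?" + urlencode({"lang": lang, "q": query})

all_url = search_url("tweetjobsearch", lang="all")
en_url = search_url("tweetjobsearch", lang="en")
print(all_url)
print(en_url)
```

Fetching both URLs and counting the results is then enough to see the mismatch.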
I searched this list and it seems you are using an n-gram based algorithm
(http://groups.google.com/group/twitter-development-talk/msg/565313d7b36e8d65).
I have found that n-gram algorithms work very well for language
detection, but the quality of the training data can make a big
difference.
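To make the idea concrete, here is a minimal Python sketch of character n-gram detection using a rank-order ("out-of-place") comparison of n-gram profiles. It is not Twitter's implementation and not the code in my detector; the function names, the penalty value, and the two toy training sentences are made up for illustration. Samples this small are exactly the kind of weak training data that hurts accuracy in practice.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Return the most frequent character n-grams (1..n_max), most common first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams unknown to the language cost max_penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))

def detect(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Toy training data (assumption: real profiles come from a large corpus).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the cat "
          "sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt in den garten und dann sitzt "
          "die katze neben dem hund auf der matte mit den anderen tieren",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(detect("the dog and the cat", PROFILES))
print(detect("der hund und die katze", PROFILES))
```

Swapping the toy sentences for profiles built from a large corpus is the part that matters most for quality.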
Recently I wrote a language detector (in Ruby) myself:
http://github.com/feedbackmine/language_detector/tree/master
It uses Wikipedia data for training, and in my limited experience it
works well. Using Wikipedia data was not my idea; all credit goes to
Kevin Burton
(http://feedblog.org/2005/08/19/ngram-language-categorization-source/).
Just thought you might be interested.
@feedbackmine
http://twitter.com/feedbackmine
On Mar 31, 11:22 am, Matt Sanford wrote:
> Hi there,
>
> Can you provide an example URL where since_id isn't working so I
> can try and reproduce the issue? As for language, the language
> identifier is not 100% accurate and sometimes makes mistakes. Hopefully
> not too many, but it definitely does make them.
>
> Thanks;
> — Matt Sanford / @mzsanford
>
> On Mar 31, 2009, at 08:14 AM, codepuke wrote:
>
> > Hi all;
>
> > I see a few people complaining about the since_id not working. I too
> > have the same issue - I am currently storing the last executed id and
> > having to check new tweets to make sure their id is greater than my
> > last processed id as a temporary workaround.
>
> > I have also noticed that the filter by language param also doesn't
> > seem to be working 100% - I notice a few Chinese tweets, as well as
> > tweets having a null value for language...
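
On the since_id workaround codepuke describes above, a minimal Python sketch of that client-side filter could look like the following. The filter_new helper and the shape of a tweet (a dict with an "id" key) are assumptions for illustration, not the Search API's actual response format.

```python
def filter_new(tweets, last_id):
    """Keep tweets with id greater than last_id; also return the new last id."""
    fresh = [t for t in tweets if t["id"] > last_id]
    # If nothing is new, the stored id stays unchanged.
    new_last = max((t["id"] for t in fresh), default=last_id)
    return fresh, new_last

# Hypothetical batch of search results, including one already-seen tweet.
batch = [
    {"id": 101, "text": "a"},
    {"id": 99, "text": "b"},
    {"id": 105, "text": "c"},
]
fresh, last_id = filter_new(batch, last_id=100)
print([t["id"] for t in fresh], last_id)
```

The returned id is what gets stored and passed in again on the next poll.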