Re: [twitter-dev] Farsi Twitter App

2010-07-06 Thread Lucas Vickers
Thank you everyone.

You've given me quite a few good options to look into.

Lucas





Re: [twitter-dev] Farsi Twitter App

2010-07-05 Thread Jean-Charles Campagne
Hello Lucas,

We do not yet provide exactly what you are looking for, but for now
we may be able to help with the language-filtering part.
We provide an API for language and location filtering for
micro-messages (Tweets and Facebook messages, etc.).

You'll find more info on the API website: http://developer.semiocast.com

Regarding the feature you are looking for, we have asked Twitter for
permission to redistribute a filtered API, so that we will be
able to provide something closer to what you are looking for. You can
more or less achieve the same today with the current state of our API,
but it will take more plumbing on your side.


Best regards,
Jean-Charles Campagne
Semiocast




Re: [twitter-dev] Farsi Twitter App

2010-07-04 Thread Pascal Jürgens
Interesting. Your method is similar to the breadth-first crawl that many people 
do (for example, see the academic paper by Kwak et al. 2010).

You have to keep in mind, however, that you are only crawling the giant
component of the network, the connected part. If there are any Turkish users
who form a *separate* subpopulation that is not connected to the rest,
you won't find them.

You could easily find those with a sample stream. I have to admit, though, that
while the number of non-connected users is probably not large, no one has
really tested that so far.
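
To illustrate the point about the giant component, here is a minimal Python
sketch of a breadth-first crawl over a toy follower graph. The graph, the user
names, and the crawl helper are made up for illustration; a real crawl would
fetch friends and followers from the API instead of reading a dictionary.

```python
from collections import deque


def crawl(graph, seeds):
    """Breadth-first crawl: return every user reachable from the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        user = queue.popleft()
        for neighbour in graph.get(user, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical toy network: a-b-c-d are connected, x-y form a separate island.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["c"],
    "x": ["y"],
    "y": ["x"],
}

reached = crawl(TOY_GRAPH, ["a"])
missed = set(TOY_GRAPH) - reached
print(sorted(reached))  # ['a', 'b', 'c', 'd']
print(sorted(missed))   # ['x', 'y']
```

Users x and y only follow each other, so a crawl seeded at a never reaches
them, however long it runs. A sample stream would still surface their tweets.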

Pascal




Re: [twitter-dev] Farsi Twitter App

2010-07-04 Thread Furkan Kuru
You are right. Separate subpopulations are out of our reach.

Apart from following/friendship connections, we also look at mentions and
follow them.
If a newcomer or someone from another population mentions one of the people in
our network, their tweet will reach us, and we can test and add them as well.

Thank you, I will look at the paper.







-- 
Furkan Kuru


Re: [twitter-dev] Farsi Twitter App

2010-07-03 Thread Pascal Jürgens
Hi Lucas,

As someone who has tackled a similar problem, my recommendation would be to
track users. To get results quickly (rather than every few hours via
user timeline calls), you need streaming access, which is a bit more
complicated. I implemented such a system to track the German-speaking
population of Twitter users, and it works extremely well.

1) Get access to the sample stream (5% or 15% type) (warning: the 15% stream is
~10GB+ a day)

2) Construct an efficient cascading language filter, i.e.:
- first test the computationally cheap AND precise attributes, such as a list
of known Farsi-only keywords or the location box
- if those attribute tests are negative, perform more computationally expensive
tests
- if in doubt, count it as non-Farsi! False positives will kill you if you
sample a very small population!

3) With said filter, identify the accounts using Farsi

4) Perform a first-degree network sweep and scan all their friends+followers,
since those are more likely to speak Farsi as well

5) Compile a list of those known users

6) Track those users with the shadow role stream (80,000 users) or higher.

If your language detection code is not efficient enough, you might want to
include a cheap, fast and precise negative filter of known non-Farsi
attributes. Test that one before all the others and you should be able to
filter out a large part of the volume.
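
The cascade in step 2, together with the negative pre-filter, can be sketched
in Python as below. This is only an illustration under stated assumptions, not
the system described above: the letter set, the tiny stop-word list, the 0.5
threshold, and the function names are all hypothetical. Note also that the
four letters used as a cheap positive cue occur in Urdu, Pashto and Kurdish
too, so on real data that stage is less precise than the sketch suggests.

```python
import re

# Cheap negative filter: anything without Arabic-script characters is rejected.
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF]")

# Cheap positive cue (assumption): pe, che, zhe, gaf are used in Persian
# but not in standard Arabic.
PERSIAN_ONLY_LETTERS = set("\u067e\u0686\u0698\u06af")

# Hypothetical mini stop-word list: ast, in, ke, ra, baraye.
FARSI_STOPWORDS = {
    "\u0627\u0633\u062a",
    "\u0627\u06cc\u0646",
    "\u06a9\u0647",
    "\u0631\u0627",
    "\u0628\u0631\u0627\u06cc",
}


def slow_test(text, threshold=0.5):
    """Stand-in for an expensive detector: share of tokens that are stop words."""
    tokens = text.split()
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in FARSI_STOPWORDS)
    return hits / len(tokens) >= threshold


def is_farsi(text):
    # Stage 0: cheap negative filter, run before everything else.
    if not ARABIC_SCRIPT.search(text):
        return False
    # Stage 1: cheap positive test.
    if PERSIAN_ONLY_LETTERS & set(text):
        return True
    # Stage 2: expensive test; anything still in doubt counts as non-Farsi.
    return slow_test(text)


print(is_farsi("hello world"))                    # False (stage 0)
print(is_farsi("\u06af\u0644"))                   # True  (stage 1, contains gaf)
print(is_farsi("\u0645\u0631\u062d\u0628\u0627")) # False (Arabic word, no cue)
```

Ordering the stages from cheapest to most expensive means most of the stream
is discarded by a single regular-expression search, and the slow test only
runs on the small remainder.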


Don't hesitate to ask for any clarification!

Pascal Juergens
Graduate Student / Mass Communication
U of Mainz, Germany




Re: [twitter-dev] Farsi Twitter App

2010-07-03 Thread John Kalucki
It's great to hear that someone implemented all this. There's a similar
technique documented here:
http://dev.twitter.com/pages/streaming_api_concepts, under "By Language and
Country". My suggestion was to start with a list of stop words to build your
user corpus, but I don't know how well Farsi works with track, so the random
sample method might indeed be better.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.








Re: [twitter-dev] Farsi Twitter App

2010-07-03 Thread Pascal Jürgens
John,

Yes, thanks a lot for the design proposal; that is what inspired my own
system. I am not primarily filtering by language, however, but by country, so
I'm using time zone and location data together with a list of cities from
http://www.geonames.org/

The manual cross-check in my thesis shows that this gets you close to 1 in
specificity and above 0.7 in sensitivity.
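
For readers unfamiliar with the two measures: sensitivity is the share of true
members of the target population that the filter catches, and specificity is
the share of non-members it correctly rejects. A small Python sketch follows;
the counts are invented purely to show the arithmetic and are not taken from
the thesis.

```python
def sensitivity(true_pos, false_neg):
    """Share of actual positives that were detected: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of actual negatives that were rejected: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical manual cross-check of 1,100 accounts.
tp, fn = 70, 30    # target-country accounts: found vs. missed
tn, fp = 990, 10   # other accounts: rejected vs. wrongly accepted

print(sensitivity(tp, fn))  # 0.7
print(specificity(tn, fp))  # 0.99
```

When the target population is a tiny fraction of the stream, even a small drop
in specificity produces more false positives than true hits, which is why the
filter is tuned for specificity first.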

From my experience, the key is to develop efficient language-specific tests
with as low an error rate as possible (this, sadly, largely excludes
conventional models such as SVMs and HMMs, because tweets are so short and
full of weird punctuation).

Pascal




Re: [twitter-dev] Farsi Twitter App

2010-07-03 Thread Furkan Kuru
We have implemented the Turkish version: Twitturk
http://twitturk.com/home/lang/en

We skipped the first three steps and instead started with a few Turkish users,
crawled the whole network, and tested for each new user whether the description
or latest tweets were in Turkish.

We have identified almost 100,000 Turkish users so far.

Using the Streaming API, we collect their tweets and identify the popular
people, keywords, and top tweets (the most retweeted ones) among Turkish users.
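
The ranking step at the end can be sketched in a few lines of Python. This is
an illustration only, not Twitturk's code: the tweet dictionaries, the
"retweets" field name, and the stop-word list are hypothetical.

```python
from collections import Counter


def top_tweets(tweets, n=3):
    """Return the n most retweeted tweets, most retweeted first."""
    return sorted(tweets, key=lambda t: t["retweets"], reverse=True)[:n]


def top_keywords(tweets, stopwords, n=3):
    """Return the n most frequent words across all tweets, ignoring stop words."""
    counts = Counter(
        word
        for tweet in tweets
        for word in tweet["text"].lower().split()
        if word not in stopwords
    )
    return [word for word, _ in counts.most_common(n)]


# Hypothetical collected tweets.
TWEETS = [
    {"text": "Galatasaray won the match", "retweets": 12},
    {"text": "the match was great", "retweets": 3},
    {"text": "Istanbul traffic again", "retweets": 7},
]

print([t["retweets"] for t in top_tweets(TWEETS, n=2)])   # [12, 7]
print(top_keywords(TWEETS, stopwords={"the", "was"}, n=1))  # ['match']
```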






-- 
Furkan Kuru


[twitter-dev] Farsi Twitter App

2010-07-02 Thread Lucas Vickers
Hello,

I am trying to create an app that will show tweets and trends in
Farsi, for native speakers.  I would like to somehow get a sample
'garden hose' of Farsi-based tweets, but I am unable to come up with
an elegant solution.

I see the following options:

- Sample all tweets, and run a language detection algorithm on the
tweet to determine which are/could be Farsi.
  * Problem: only a very small % of the tweets will be in Farsi

- Use the location filter to try and sample tweets from countries that
are known to speak Farsi, and then run a language detection algorithm
on the tweets.
  * Problem: I seem to be limited on the size of the coordinate box I
can provide.  I cannot even cover all of Iran, for example.

- Filter on a standard Farsi term.
  * Problem: will limit my results to only tweets with this term

- Search for language = Farsi
  * Problem: Not a stream, I will need to keep searching.

Of the options I mentioned, I think what makes the most sense is
to search for tweets where language=farsi and use since_id to
keep my results new.  Given this method, I have three questions:
1 - I imagine since_id is the highest tweet_id from the previous
result set?
2 - How often can I search (given API limits, of course) in order to
ensure I get new data?
3 - Will the language filter give me users whose default
language is Farsi, or will it actually find tweets in Farsi?
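
The since_id bookkeeping from question 1 can be sketched in Python as follows,
assuming since_id means "return only tweets with an ID greater than this
value", which is how the Search API documents it. fake_search and ARCHIVE are
hypothetical stand-ins for a real search call; no actual endpoint, language
parameter, or rate limit is modeled here.

```python
# Hypothetical stand-in for the tweets a search backend knows about.
ARCHIVE = [{"id": i, "text": f"tweet {i}"} for i in (101, 105, 110, 120)]


def fake_search(since_id):
    """Pretend search call: return tweets with an ID greater than since_id."""
    return [tweet for tweet in ARCHIVE if tweet["id"] > since_id]


def poll_once(search, since_id):
    """Run one search and advance the cursor to the highest ID seen so far."""
    results = search(since_id)
    if results:
        since_id = max(tweet["id"] for tweet in results)
    return results, since_id


first, cursor = poll_once(fake_search, since_id=104)
print([t["id"] for t in first], cursor)    # [105, 110, 120] 120

second, cursor = poll_once(fake_search, since_id=cursor)
print([t["id"] for t in second], cursor)   # [] 120
```

Keeping the cursor unchanged when a poll returns nothing means the next poll
picks up exactly where the last non-empty one left off.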

I am aware that the user can select their native language in the user
profile, but I also know this is not 100% reliable.

Can anyone think of a more elegant solution?
Are there any hidden/experimental language type filters available to
us?

Thanks!
Lucas