[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
Raffi, A slightly more advanced request: would it be possible to attach a list of the significant words and phrases present in the tweet? We could then use this info to categorize tweets and even build a trends list from the tweets aggregated by our apps. In one of our apps, we use the Yahoo Terms Extraction service to extract phrases from tweets. Thanks, Karthik
Re: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
Disambiguating short URLs and delivering the true URL and title would be a real plus, not just for developers, but for the target of a URL. While it does add a load to twitter's servers, it will save many, many useless hits to the target.

Imagine 100,000 Twitter apps resolving each short URL found in a tweet. All of them doing it within seconds of the tweet arriving via the streaming API. It would be an automatic DOS against every site mentioned in a tweet. If this sounds hyperbolic, read the APIwiki docs that say 2,000 followers is an expected max. Ha!

On Fri, May 14, 2010 at 9:15 AM, Zhami wrote:
> +1 for it being optional as well -- keep the bandwidth to a minimum
> for scenarios where it's not needed.
>
> +1 for having short URLs' original (long) URL provided (perhaps also
> an option?)
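The load argument above can be illustrated with a small sketch. It assumes a hypothetical `expand` callable that performs the real lookup of a short URL (it is not part of any Twitter API, and `fake_expand` and the `sho.rt` address are made up so the sketch runs offline). The only point is that a shared cache turns many lookups of the same short URL into a single hit on the target site.

```python
from typing import Callable, Dict


class ExpansionCache:
    """Resolve each short URL at most once and reuse the answer."""

    def __init__(self, expand: Callable[[str], str]) -> None:
        # `expand` is a hypothetical function that follows the redirect
        # of a short URL and returns the long URL.
        self._expand = expand
        self._resolved: Dict[str, str] = {}
        self.upstream_hits = 0  # number of real lookups performed

    def resolve(self, short_url: str) -> str:
        if short_url not in self._resolved:
            self.upstream_hits += 1
            self._resolved[short_url] = self._expand(short_url)
        return self._resolved[short_url]


def fake_expand(short_url: str) -> str:
    # Stand-in for a real HTTP lookup (assumption, for illustration only).
    table = {"http://sho.rt/abc": "http://example.com/a-long-article"}
    return table[short_url]


cache = ExpansionCache(fake_expand)
# Many consumers asking about the same short URL share one lookup.
results = {cache.resolve("http://sho.rt/abc") for _ in range(100_000)}
print(results, cache.upstream_hits)
```

Without the cache, every one of those requests would reach the target site on its own.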
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
+1 for it being optional as well -- keep the bandwidth to a minimum for scenarios where it's not needed. +1 for having short URLs' original (long) URL provided (perhaps also an option?)
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
I understand. And I don't have anything against it (even if it will be the default), as long as it will be optional. And we all appreciate the library (and its Java implementation: http://github.com/mzsanford/twitter-text-java).

On May 14, 3:47 pm, Raffi Krikorian wrote:
> all we're trying to do is help people standardize on how they parse stuff.
> making sure you can represent what is a hash tag, a url, a username, etc.,
> in the same way that twitter.com does it, can be difficult.
>
> --
> Raffi Krikorian
> Twitter Platform Team
> http://twitter.com/raffi
Re: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
> > Besides, if this is the library used for web, you're not doing it > right. :) > For example, to mention URL parsing only, you don't check for valid > domain names (e.g. www.test.failure is matched as URL), > some characters are not recognized as part of a link (e.g. "|" in > "http://translate.google.com/?hl=en#auto|en|bonjour")... > all we're trying to do is help people standardize on how they parse stuff. making sure you can represent what is a hash tag, a url, a username, etc., in the same way that twitter.com does it, can be difficult. -- Raffi Krikorian Twitter Platform Team http://twitter.com/raffi
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
+1 for making this optional. It's faster for mobile apps to do this themselves than download it. Besides, if this is the library used for web, you're not doing it right. :) For example, to mention URL parsing only, you don't check for valid domain names (e.g. www.test.failure is matched as URL), some characters are not recognized as part of a link (e.g. "|" in "http://translate.google.com/?hl=en#auto|en|bonjour")...
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
Yes, this would be very cool. Any ideas on when this would be rolled out?

1) It would be nice to have the profile_image_url in it as well. I can imagine a lot of nice visual enhancements with that.

2) +1 for making it optional. A lot of people are suggesting additional stuff, so maybe it would be even nicer to not just have an include/don't include param, but to be able to specify which data you would like to have included...

jarón

On May 14, 6:29 am, Rich wrote:
> +1 for it being optional as well. Whilst I will probably use it, it's
> nice to be able to keep the bandwidth download to a minimum for
> scenarios where it's not needed
>
> On May 14, 1:52 am, Naveen Ayyagari wrote:
>
> > +1 on the additional parameter to optionally request the data. Every
> > byte counts for mobile device battery life and download time.
>
> > --Naveen Ayyagari
> > @knight9
>
> > On May 13, 8:13 pm, Dewald Pretorius wrote:
>
> > > Raffi,
>
> > > This is all good, but can you please make the inclusion in the tweet
> > > payload optional? Meaning, only include it if it is requested by an
> > > additional parameter?
>
> > > I, and I'm sure a lot of others, are already parsing the tweet text.
> > > This is just going to consume additional bandwidth and not add any
> > > value for us. It will add value for folks who are not already doing
> > > the parsing or don't know how. So, they can just request this
> > > additional payload.
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
+1 for it being optional as well. Whilst I will probably use it, it's nice to be able to keep the bandwidth download to a minimum for scenarios where it's not needed.

On May 14, 1:52 am, Naveen Ayyagari wrote:
> +1 on the additional parameter to optionally request the data. Every
> byte counts for mobile device battery life and download time.
>
> --Naveen Ayyagari
> @knight9
>
> On May 13, 8:13 pm, Dewald Pretorius wrote:
>
> > Raffi,
>
> > This is all good, but can you please make the inclusion in the tweet
> > payload optional? Meaning, only include it if it is requested by an
> > additional parameter?
>
> > I, and I'm sure a lot of others, are already parsing the tweet text.
> > This is just going to consume additional bandwidth and not add any
> > value for us. It will add value for folks who are not already doing
> > the parsing or don't know how. So, they can just request this
> > additional payload.
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
Indeed, it would be great to see this in the preview of UserStreams :)
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
+1 on the additional parameter to optionally request the data. Every byte counts for mobile device battery life and download time.

--Naveen Ayyagari
@knight9

On May 13, 8:13 pm, Dewald Pretorius wrote:
> Raffi,
>
> This is all good, but can you please make the inclusion in the tweet
> payload optional? Meaning, only include it if it is requested by an
> additional parameter?
>
> I, and I'm sure a lot of others, are already parsing the tweet text.
> This is just going to consume additional bandwidth and not add any
> value for us. It will add value for folks who are not already doing
> the parsing or don't know how. So, they can just request this
> additional payload.
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
Raffi, This is all good, but can you please make the inclusion in the tweet payload optional? Meaning, only include it if it is requested by an additional parameter? I, and I'm sure a lot of others, are already parsing the tweet text. This is just going to consume additional bandwidth and not add any value for us. It will add value for folks who are not already doing the parsing or don't know how. So, they can just request this additional payload.
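A rough sketch of the opt-in behaviour requested here, with everything hedged: the function name `build_payload`, the `include_entities` argument, and the `entities` key are illustrative assumptions, not the parameter or payload the platform team has announced. By default the parsed data is left out. A caller who wants it names the kinds of entity to include.

```python
from typing import Any, Dict, Iterable, Optional


def build_payload(tweet: Dict[str, Any],
                  include_entities: Optional[Iterable[str]] = None) -> Dict[str, Any]:
    """Return the tweet without parsed entities unless the caller asks for them."""
    # Start from everything except the (hypothetical) "entities" block.
    payload = {key: value for key, value in tweet.items() if key != "entities"}
    if include_entities is not None:
        wanted = set(include_entities)
        # Keep only the entity kinds that were explicitly requested.
        payload["entities"] = {
            kind: items
            for kind, items in tweet.get("entities", {}).items()
            if kind in wanted
        }
    return payload


sample_tweet = {
    "text": "check out http://dev.twitter.com #hot",
    "entities": {
        "urls": [{"url": "http://dev.twitter.com"}],
        "hashtags": [{"text": "#hot"}],
    },
}

lean = build_payload(sample_tweet)                      # no extra bytes
only_tags = build_payload(sample_tweet, ["hashtags"])   # just the hashtags
print(lean)
print(only_tags)
```

Clients that already parse the text pay nothing extra, and clients that want the parsed data ask for exactly the parts they need.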
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
On May 13, 11:11 pm, Raffi Krikorian wrote:
> hey glenn.
>
> i think something went wrong in the copy and paste -- there should have been
> a space between the URL and the hashtag.

My bad. Back in my box then.

Cheers,
--
Glenn Gillen
http://glenngillen.com/
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
Raffi:

On May 13, 2:25 pm, Raffi Krikorian wrote:
> as shown above, we'll be parsing out all mentioned users, all lists, all
> included URLs, and all hashtags

This is an interesting step forward. The internationalisation considerations can be sticky, though.

I did some entity-parsing from tweets as part of my "Twanguages" project (a language census of Twitter). One discovery was that people are in fact using hashtags with non-Latin scripts. Another is that some people are using the '#' character without intending to create a hashtag (e.g. "we are #2 in line"). How will your entity parsing handle non-Latin hashtags, Latin-character hashtags with accented characters, and strings starting with '#' not intended as hashtags?

Also note that URLs can now have non-Latin top-level domain names as well as second-level domain names and other path parts. For instance, http://وزارة-الأتصالات.مصر is a valid URL in the .مصر top-level domain. Will your entity parsing code handle such URLs?

In any case, it would be very helpful if the platform team would document exactly what regular expressions govern the entities you recognise. I might not agree with your definition of hashtag syntax, but at least I want to know what it is. See for example the running questions on how to measure the length of a status message.

> matt sanford
> (@mzsanford) on our internationalization team released the twitter-text
> library (http://github.com/mzsanford/twitter-text-rb) to help making parsing
> easier and standardized (in fact, we use this library ourselves), but we on
> the Platform team wondered if we could make this even easier for our
> developers. ...

I wasn't aware of this, and I'll take a look. Thank you for the tip!

-- Jim DeLaHunt, Vancouver, Canada
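To make the hashtag questions above concrete, here is a minimal sketch of one possible rule. It is an assumption for illustration, not the pattern used by twitter.com or twitter-text: a '#' that is not preceded by a word character, followed by Unicode word characters of which at least one is a letter. Under that rule "#2" is not a hashtag, while non-Latin and accented tags are.

```python
import re
from typing import List

# Assumed rule (not Twitter's): '#' not preceded by a word character,
# then word characters containing at least one letter.
# [^\W\d_] means "a word character that is neither a digit nor '_'",
# i.e. a letter in any script, because \w is Unicode-aware for str patterns.
HASHTAG = re.compile(r"(?<!\w)#(\w*[^\W\d_]\w*)")


def find_hashtags(text: str) -> List[str]:
    """Return the hashtag bodies (without '#') found in text."""
    return [match.group(1) for match in HASHTAG.finditer(text)]


print(find_hashtags("we are #2 in line"))
print(find_hashtags("#日本語 and #caf\u00e9 today"))
print(find_hashtags("see section abc#def"))
```

One known gap in this sketch: Python's `\w` does not match combining marks, so scripts that rely on them would need a broader character class.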
Re: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
yeah - i'm extremely sensitive to that not happening again. i'll keep that in mind. i expect there may be another draft floated around before we start to roll this out.

On Thu, May 13, 2010 at 11:14 PM, Rich wrote:
> I can see the "text" tag inside some of the entities causing some
> developers some problems as it's the same tag name as the status. Of
> course all of us should be able to handle it, but just look what
> happened with the extra user id tag inside a status
>
> On May 13, 11:11 pm, Raffi Krikorian wrote:
> > hey glenn.
> >
> > i think something went wrong in the copy and paste -- there should have been
> > a space between the URL and the hashtag.
> >
> > On Thu, May 13, 2010 at 11:02 PM, glenn gillen wrote:
> > > Raffi,
> >
> > > This follows on nicely from the presentation at Warblecamp last week
> > > discussing how difficult it is to do this right, and I think a
> > > consistent approach across all clients (including twitter.com,
> > > mobile.twitter, and 3rd party apps) should be priority number 1.
> > > However looking at your example:
> >
> > > On May 13, 10:25 pm, Raffi Krikorian wrote:
> > > > {
> > > >   "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
> > > >
> > > >     {
> > > >       "url" : "http://dev.twitter.com",
> > > >       "indices" : [38, 64]
> > > >     },
> > > >   ],
> > > >   "hashtags" : [
> > > >     {
> > > >       "text" : "#hot",
> > > >       "indices" : [66, 69]
> > > >       "url" : "http://search.twitter.com/search?q=%23hot"
> > > >     }
> > > >   ]
> > > > }
> >
> > > Without looking at how twitter.com would currently handle that
> > > example, I would have expected the url to be "http://dev.twitter.com/
> > > #hot" and for the tweet to contain no hashtag. If the hashtag always
> > > takes precedence I'd have no way to link to the following without
> > > using a URL shortener: http://oauth.net/core/1.0a/#anchor41
> >
> > > --
> > > Glenn Gillen
> > > http://glenngillen.com/
> >
> > --
> > Raffi Krikorian
> > Twitter Platform Team
> > http://twitter.com/raffi
>

--
Raffi Krikorian
Twitter Platform Team
http://twitter.com/raffi
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
I can see the "text" tag inside some of the entities causing some developers some problems, as it's the same tag name as in the status. Of course all of us should be able to handle it, but just look what happened with the extra user id tag inside a status.

On May 13, 11:11 pm, Raffi Krikorian wrote:
> hey glenn.
>
> i think something went wrong in the copy and paste -- there should have been
> a space between the URL and the hashtag.
>
> On Thu, May 13, 2010 at 11:02 PM, glenn gillen wrote:
> > Raffi,
>
> > This follows on nicely from the presentation at Warblecamp last week
> > discussing how difficult it is to do this right, and I think a
> > consistent approach across all clients (including twitter.com,
> > mobile.twitter, and 3rd party apps) should be priority number 1.
> > However looking at your example:
>
> > On May 13, 10:25 pm, Raffi Krikorian wrote:
> > > {
> > >   "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
> > >
> > >     {
> > >       "url" : "http://dev.twitter.com",
> > >       "indices" : [38, 64]
> > >     },
> > >   ],
> > >   "hashtags" : [
> > >     {
> > >       "text" : "#hot",
> > >       "indices" : [66, 69]
> > >       "url" : "http://search.twitter.com/search?q=%23hot"
> > >     }
> > >   ]
> > > }
>
> > Without looking at how twitter.com would currently handle that
> > example, I would have expected the url to be "http://dev.twitter.com/
> > #hot" and for the tweet to contain no hashtag. If the hashtag always
> > takes precedence I'd have no way to link to the following without
> > using a URL shortener: http://oauth.net/core/1.0a/#anchor41
>
> > --
> > Glenn Gillen
> > http://glenngillen.com/
>
> --
> Raffi Krikorian
> Twitter Platform Team
> http://twitter.com/raffi
Re: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
hey glenn.

i think something went wrong in the copy and paste -- there should have been a space between the URL and the hashtag.

On Thu, May 13, 2010 at 11:02 PM, glenn gillen wrote:
> Raffi,
>
> This follows on nicely from the presentation at Warblecamp last week
> discussing how difficult it is to do this right, and I think a
> consistent approach across all clients (including twitter.com,
> mobile.twitter, and 3rd party apps) should be priority number 1.
> However looking at your example:
>
> On May 13, 10:25 pm, Raffi Krikorian wrote:
> > {
> >   "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
> >
> >     {
> >       "url" : "http://dev.twitter.com",
> >       "indices" : [38, 64]
> >     },
> >   ],
> >   "hashtags" : [
> >     {
> >       "text" : "#hot",
> >       "indices" : [66, 69]
> >       "url" : "http://search.twitter.com/search?q=%23hot"
> >     }
> >   ]
> > }
>
> Without looking at how twitter.com would currently handle that
> example, I would have expected the url to be "http://dev.twitter.com/
> #hot" and for the tweet to contain no hashtag. If the hashtag always
> takes precedence I'd have no way to link to the following without
> using a URL shortener: http://oauth.net/core/1.0a/#anchor41
> --
> Glenn Gillen
> http://glenngillen.com/
>

--
Raffi Krikorian
Twitter Platform Team
http://twitter.com/raffi
RE: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
Glenn Gillen wrote:
> Without looking at how twitter.com would currently handle that example, I
> would have expected the url to be "http://dev.twitter.com/ #hot" and for the
> tweet to contain no hashtag. If the hashtag always takes precedence I'd have no
> way to link to the following without using a URL shortener:
> http://oauth.net/core/1.0a/#anchor41

I think you are overlooking the space between the last slash and "#hot". URLs cannot contain (un-encoded) spaces.

Regards,
Brian
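The precedence question can be sketched as follows. This is an illustration under simplifying assumptions, not the twitter-text behaviour: a URL is taken to be `http://` or `https://` followed by everything up to the next whitespace, URLs are extracted first, and a '#' counts as a hashtag only when it falls outside every URL span. The names `extract_urls_and_hashtags`, `URL_PATTERN` and `TAG_PATTERN` are made up for this example.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[str, int, int]  # (matched text, start index, end index)

# Simplified assumption: a URL runs until the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")
# '#' not preceded by a word character, then one or more word characters.
TAG_PATTERN = re.compile(r"(?<!\w)#(\w+)")


def extract_urls_and_hashtags(text: str) -> Dict[str, List[Span]]:
    """Find URLs first, then hashtags that lie outside every URL."""
    urls = [(m.group(0), m.start(), m.end()) for m in URL_PATTERN.finditer(text)]

    def inside_url(position: int) -> bool:
        return any(start <= position < end for _, start, end in urls)

    hashtags = [
        (m.group(1), m.start(), m.end())
        for m in TAG_PATTERN.finditer(text)
        if not inside_url(m.start())
    ]
    return {"urls": urls, "hashtags": hashtags}


with_space = extract_urls_and_hashtags("check out http://dev.twitter.com #hot")
fragment = extract_urls_and_hashtags("see http://oauth.net/core/1.0a/#anchor41")
print(with_space)
print(fragment)
```

With a space before "#hot" the tweet yields one URL and one hashtag. With no space, the "#anchor41" part stays inside the URL as its fragment and no hashtag is reported. Trailing punctuation would be swallowed by this naive URL pattern, so a real parser needs a stricter rule.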
[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
Raffi,

This follows on nicely from the presentation at Warblecamp last week discussing how difficult it is to do this right, and I think a consistent approach across all clients (including twitter.com, mobile.twitter, and 3rd party apps) should be priority number 1. However looking at your example:

On May 13, 10:25 pm, Raffi Krikorian wrote:
> {
>   "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
>
>     {
>       "url" : "http://dev.twitter.com",
>       "indices" : [38, 64]
>     },
>   ],
>   "hashtags" : [
>     {
>       "text" : "#hot",
>       "indices" : [66, 69]
>       "url" : "http://search.twitter.com/search?q=%23hot"
>     }
>   ]
> }

Without looking at how twitter.com would currently handle that example, I would have expected the url to be "http://dev.twitter.com/ #hot" and for the tweet to contain no hashtag. If the hashtag always takes precedence I'd have no way to link to the following without using a URL shortener: http://oauth.net/core/1.0a/#anchor41

--
Glenn Gillen
http://glenngillen.com/