[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Karthik
Raffi,

A somewhat advanced request: would it be possible to attach a list of
significant words and phrases present in the tweet? We could then use
this info to categorize tweets and even build a trends list from the
tweets aggregated by our apps.

In one of our apps, we use the Yahoo Terms Extraction service to extract
phrases from tweets.

Thanks,
Karthik



Re: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Adam Green
Disambiguating short URLs and delivering the true URL and title would
be a real plus, not just for developers, but for the target of a URL.
While it does add a load to twitter's servers, it will save many, many
useless hits to the target.

Imagine 100,000 Twitter apps resolving each short URL found in a
tweet. All of them doing it within seconds of the tweet arriving via
the streaming API. It would be an automatic DOS against every site
mentioned in a tweet.
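
To illustrate the load argument: if a short URL is resolved once in a
shared place, the target site sees one hit no matter how many consumers
ask for the expansion. A minimal Python sketch, assuming a hypothetical
resolver function (stubbed here so it runs without network access); the
class and function names are made up for illustration and are not part
of any Twitter API.

```python
from typing import Callable, Dict


class ExpansionCache:
    """Resolve each short URL at most once and reuse the answer."""

    def __init__(self, resolve: Callable[[str], str]) -> None:
        self._resolve = resolve            # hypothetical resolver, e.g. one HTTP request
        self._cache: Dict[str, str] = {}
        self.lookups = 0                   # times the resolver was actually called

    def expand(self, short_url: str) -> str:
        if short_url not in self._cache:
            self.lookups += 1
            self._cache[short_url] = self._resolve(short_url)
        return self._cache[short_url]


# Stand-in data and resolver so the sketch runs offline (assumed values).
FAKE_REDIRECTS = {"http://sho.rt/abc": "http://example.com/a-long-article"}


def fake_resolve(url: str) -> str:
    return FAKE_REDIRECTS.get(url, url)


cache = ExpansionCache(fake_resolve)
for _ in range(1000):                      # 1000 consumers asking about the same tweet
    cache.expand("http://sho.rt/abc")
print(cache.lookups)                       # number of hits the target site would see
```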

If this sounds hyperbolic, read the APIwiki docs that say 2,000
followers is an expected max. Ha!


On Fri, May 14, 2010 at 9:15 AM, Zhami  wrote:
> +1 for it being optional as well -- keep the bandwidth to a minimum
> for scenarios where it's not needed.
>
> +1 for having short URLs' original (long) URL provided (perhaps also
> an option?)
>


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Zhami
+1 for it being optional as well -- keep the bandwidth to a minimum
for scenarios where it's not needed.

+1 for having short URLs' original (long) URL provided (perhaps also
an option?)


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Edi
I understand. And I don't have anything against it (even if it is the
default), as long as it is optional.
And we all appreciate the library (and its Java implementation:
http://github.com/mzsanford/twitter-text-java).


On May 14, 3:47 pm, Raffi Krikorian  wrote:
> all we're trying to do is help people standardize on how they parse stuff.
>  making sure you can represent what is a hash tag, a url, a username, etc.,
> in the same way that twitter.com does it, can be difficult.
>
> --
> Raffi Krikorian
> Twitter Platform Team
> http://twitter.com/raffi


Re: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Raffi Krikorian
>
> Besides, if this is the library used for web, you're not doing it
> right. :)
> For example, to mention URL parsing only, you don't check for valid
> domain names (e.g. www.test.failure is matched as URL),
> some characters are not recognized as part of a link (e.g. "|" in
> "http://translate.google.com/?hl=en#auto|en|bonjour")...
>
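
The two complaints quoted above can be made concrete with a small Python
sketch. This is not how twitter-text works; it is only an illustration,
assuming a tiny hand-written TLD allow-list (a real one would come from
the published list of top-level domains) and a deliberately permissive
pattern that accepts any non-whitespace character, including "|", after
the host.

```python
import re

# Assumption: illustrative allow-list only, far from complete.
KNOWN_TLDS = {"com", "net", "org"}

# Optional scheme, a dotted host, then any run of non-whitespace
# starting with "/", "?" or "#". "|" is therefore kept in the link.
# Trailing punctuation would also be kept; a real parser must trim it.
CANDIDATE = re.compile(
    r"(?:https?://)?"
    r"(?P<host>[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
    r"(?P<rest>[/?#]\S*)?"
)


def find_urls(text):
    """Return URL-like substrings whose last host label is a known TLD."""
    urls = []
    for match in CANDIDATE.finditer(text):
        tld = match.group("host").rsplit(".", 1)[1].lower()
        if tld in KNOWN_TLDS:
            urls.append(match.group(0))
    return urls


print(find_urls("see www.test.failure now"))
print(find_urls("try http://translate.google.com/?hl=en#auto|en|bonjour"))
```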

all we're trying to do is help people standardize on how they parse stuff.
 making sure you can represent what is a hash tag, a url, a username, etc.,
in the same way that twitter.com does it, can be difficult.

-- 
Raffi Krikorian
Twitter Platform Team
http://twitter.com/raffi


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Edi
+1 for making this optional.
It's faster for mobile apps to do this themselves than download it.

Besides, if this is the library used for web, you're not doing it
right. :)
For example, to mention URL parsing only, you don't check for valid
domain names (e.g. www.test.failure is matched as URL),
some characters are not recognized as part of a link (e.g. "|" in
"http://translate.google.com/?hl=en#auto|en|bonjour")...


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread jaronbarends
Yes, this would be very cool. Any ideas on when this would be rolled
out?

1) It would be nice to have the profile_image_url in it as well. I can
imagine a lot of nice visual enhancements with that.

2) +1 for making it optional. A lot of people are suggesting
additional stuff, so maybe it would even be nicer to not just have an
include/don't include param, but to be able to specify which data you
would like to have included...
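
As a rough sketch of what "specify which data you would like" could look
like on the serving side, in Python. Everything here is hypothetical:
the `include` argument and the section names are invented for this
example and do not describe an existing or planned API parameter.

```python
# Hypothetical names throughout: neither `include` nor these section
# names come from the Twitter API; they only illustrate opt-in sections.
OPTIONAL_SECTIONS = {"entities", "profile_image_url"}


def build_payload(tweet, include=()):
    """Return the core fields plus only the optional sections requested."""
    payload = {"id": tweet["id"], "text": tweet["text"]}
    for section in include:
        if section in OPTIONAL_SECTIONS and section in tweet:
            payload[section] = tweet[section]
    return payload


tweet = {
    "id": 1,
    "text": "hey @raffi tell @noradio to check out http://dev.twitter.com #hot",
    "entities": {"hashtags": [{"text": "hot"}]},
    "profile_image_url": "http://example.com/avatar.png",
}

print(build_payload(tweet))                        # core fields only
print(build_payload(tweet, include=["entities"]))  # core fields plus entities
```

Clients that already parse the text pay nothing extra, and clients that
want more ask for exactly the sections they need.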

jarón

On May 14, 6:29 am, Rich  wrote:
> +1 for it being optional as well.  Whilst I will probably use it, it's
> nice to be able to keep the bandwidth download to a minimum for
> scenarios where it's not needed
>
> On May 14, 1:52 am, Naveen Ayyagari  wrote:
>
> > +1 on the additional parameter to optionally request the data. Every
> > byte counts for mobile device battery life and download time.
>
> > --Naveen Ayyagari
> > @knight9
>
> > On May 13, 8:13 pm, Dewald Pretorius  wrote:
>
> > > Raffi,
>
> > > This is all good, but can you please make the inclusion in the tweet
> > > payload optional? Meaning, only include it if it is requested by an
> > > additional parameter?
>
> > > I, and I'm sure a lot of others, are already parsing the tweet text.
> > > This is just going to consume additional bandwidth and not add any
> > > value for us. It will add value for folks who are not already doing
> > > the parsing or don't know how. So, they can just request this
> > > additional payload.


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Rich
+1 for it being optional as well.  Whilst I will probably use it, it's
nice to be able to keep the bandwidth download to a minimum for
scenarios where it's not needed

On May 14, 1:52 am, Naveen Ayyagari  wrote:
> +1 on the additional parameter to optionally request the data. Every
> byte counts for mobile device battery life and download time.
>
> --Naveen Ayyagari
> @knight9
>
> On May 13, 8:13 pm, Dewald Pretorius  wrote:
>
>
>
> > Raffi,
>
> > This is all good, but can you please make the inclusion in the tweet
> > payload optional? Meaning, only include it if it is requested by an
> > additional parameter?
>
> > I, and I'm sure a lot of others, are already parsing the tweet text.
> > This is just going to consume additional bandwidth and not add any
> > value for us. It will add value for folks who are not already doing
> > the parsing or don't know how. So, they can just request this
> > additional payload.


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Adam
Indeed, it would be great to see this in the preview of UserStreams :)


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Naveen Ayyagari
+1 on the additional parameter to optionally request the data. Every
byte counts for mobile device battery life and download time.

--Naveen Ayyagari
@knight9


On May 13, 8:13 pm, Dewald Pretorius  wrote:
> Raffi,
>
> This is all good, but can you please make the inclusion in the tweet
> payload optional? Meaning, only include it if it is requested by an
> additional parameter?
>
> I, and I'm sure a lot of others, are already parsing the tweet text.
> This is just going to consume additional bandwidth and not add any
> value for us. It will add value for folks who are not already doing
> the parsing or don't know how. So, they can just request this
> additional payload.


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Dewald Pretorius
Raffi,

This is all good, but can you please make the inclusion in the tweet
payload optional? Meaning, only include it if it is requested by an
additional parameter?

I, and I'm sure a lot of others, are already parsing the tweet text.
This is just going to consume additional bandwidth and not add any
value for us. It will add value for folks who are not already doing
the parsing or don't know how. So, they can just request this
additional payload.


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread glenn gillen
On May 13, 11:11 pm, Raffi Krikorian  wrote:
> hey glenn.
>
> i think something went wrong in the copy and paste -- there should have been
> a space between the URL and the hashtag.

My bad. Back in my box then.

Cheers,
--
Glenn Gillen
http://glenngillen.com/


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Jim DeLaHunt
Raffi:



On May 13, 2:25 pm, Raffi Krikorian  wrote:
> as shown above, we'll be parsing out all mentioned users, all lists, all
> included URLs, and all hashtags

This is an interesting step forward.  The internationalisation
considerations can be sticky, though.  I did some entity-parsing from
tweets as part of my "Twanguages" project (a language census of
Twitter). One discovery was that people are in fact using hashtags with
non-Latin scripts. Another is that some people are using the '#'
character without intending to create a hashtag (e.g. "we are #2 in
line"). How will your entity parsing handle non-Latin hashtags,
Latin-character hashtags with accented characters, and strings starting
with '#' not intended as hashtags?
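
One possible answer to these three cases, sketched in Python. This is an
assumption about what a reasonable rule might be, not the rule
twitter.com or twitter-text applies: a hashtag is '#' not preceded by a
word character, followed by a run of Unicode word characters that
contains at least one letter.

```python
import re

# In Python 3, \w on str patterns is Unicode-aware, so accented Latin,
# CJK and Arabic letters all count as word characters.
# [^\W\d_] means "a word character that is neither a digit nor '_'",
# i.e. a letter; the lookahead requires one somewhere in the tag.
# Known gap: combining marks are not matched by \w, so a tag written
# with separate diacritic marks may be cut short.
HASHTAG = re.compile(r"(?<!\w)#(?=\w*[^\W\d_])(\w+)")


def find_hashtags(text):
    """Return hashtag texts (without the leading '#')."""
    return HASHTAG.findall(text)


print(find_hashtags("we are #2 in line"))
print(find_hashtags("loving #caf\u00e9 and #\u65e5\u672c\u8a9e today"))
```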

Also note that URLs can now have non-Latin top-level domain names as
well as second-level domain names and other path parts. For instance,
http://وزارة-الأتصالات.مصر is a valid URL in the .مصر top-level
domain. Will your entity parsing code handle such URLs?
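
For reference, such a host name can be converted to its ASCII form with
the Python standard library alone. A small sketch; note that the
built-in "idna" codec implements the older IDNA 2003 rules, so treat it
as an illustration rather than a complete answer.

```python
from urllib.parse import urlsplit

url = "http://وزارة-الأتصالات.مصر"

# Pull out the host, then convert each label to its ASCII ("xn--") form.
host = urlsplit(url).hostname
ascii_host = host.encode("idna").decode("ascii")

print(ascii_host)
```

A parser that only accepts ASCII letters in host names would miss this
URL entirely, which is the concern raised above.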

In any case, it would be very helpful if the platform team would
document exactly what regular expressions govern the entities you
recognise. I might not agree with your definition of hashtag syntax,
but at least I want to know what it is.  See for example the running
questions on how to measure the length of a status message.

> matt sanford
> (@mzsanford) on our internationalization team released the twitter-text
> library (http://github.com/mzsanford/twitter-text-rb) to help making parsing
> easier and standardized (in fact, we use this library ourselves), but we on
> the Platform team wondered if we could make this even easier for our
> developers. ...

I wasn't aware of this, and I'll take a look.  Thank you for the tip!
— Jim DeLaHunt, Vancouver, Canada


Re: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Raffi Krikorian
yeah - i'm extremely sensitive to that not happening again.  i'll keep that
in mind.  i expect there may be another draft floated around before we start
to roll this out.

On Thu, May 13, 2010 at 11:14 PM, Rich  wrote:

> I can see the  inside some of the entities tag causing some
> developers some problems as it's the same tag name as the status.  Of
> course all of us should be able to handle it, but just look what
> happened with the extra user id tag inside a status
>
> On May 13, 11:11 pm, Raffi Krikorian  wrote:
> > hey glenn.
> >
> > i think something went wrong in the copy and paste -- there should have been
> > a space between the URL and the hashtag.
> >
> >
> >
> >
> >
> > On Thu, May 13, 2010 at 11:02 PM, glenn gillen wrote:
> > > Raffi,
> >
> > > This follows on nicely from the presentation at Warblecamp last week
> > > discussing how difficult it is to do this right, and I think a
> > > consistent approach across all clients (including twitter.com,
> > > mobile.twitter, and 3rd party apps) should be priority number 1.
> > > However looking at your example:
> >
> > > On May 13, 10:25 pm, Raffi Krikorian  wrote:
> > > > {
> > > >  "text" : "hey @raffi tell @noradio to check out
> > >http://dev.twitter.com#hot",
> > > > 
> > > > {
> > > >   "url" : "http://dev.twitter.com",
> > > >   "indices" : [38, 64]
> > > > },
> > > >   ],
> > > >   "hashtags" : [
> > > > {
> > > >   "text" : "#hot",
> > > >   "indices" : [66, 69]
> > > >   "url" : "http://search.twitter.com/search?q=%23hot"
> > > > }
> > > >   ]
> > > >  }
> >
> > > Without looking at how twitter.com would currently handle that
> > > example, I would have expected the url to be "http://dev.twitter.com/
> > > #hot" and for the tweet to contain no hashtag. If the hashtag always
> > > takes precedence I'd have no way to link to the following without
> > > > using a URL shortener: http://oauth.net/core/1.0a/#anchor41
> > > --
> > > Glenn Gillen
> > > > http://glenngillen.com/
> >
> > --
> > Raffi Krikorian
> > Twitter Platform Team
> > http://twitter.com/raffi
>



-- 
Raffi Krikorian
Twitter Platform Team
http://twitter.com/raffi


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Rich
I can see the  inside some of the entities tag causing some
developers some problems as it's the same tag name as the status.  Of
course all of us should be able to handle it, but just look what
happened with the extra user id tag inside a status

On May 13, 11:11 pm, Raffi Krikorian  wrote:
> hey glenn.
>
> i think something went wrong in the copy and paste -- there should have been
> a space between the URL and the hashtag.
>
>
>
>
>
> On Thu, May 13, 2010 at 11:02 PM, glenn gillen  wrote:
> > Raffi,
>
> > This follows on nicely from the presentation at Warblecamp last week
> > discussing how difficult it is to do this right, and I think a
> > consistent approach across all clients (including twitter.com,
> > mobile.twitter, and 3rd party apps) should be priority number 1.
> > However looking at your example:
>
> > On May 13, 10:25 pm, Raffi Krikorian  wrote:
> > > {
> > >  "text" : "hey @raffi tell @noradio to check out
> > >http://dev.twitter.com#hot",
> > > 
> > >     {
> > >       "url" : "http://dev.twitter.com";,
> > >       "indices" : [38, 64]
> > >     },
> > >   ],
> > >   "hashtags" : [
> > >     {
> > >       "text" : "#hot",
> > >       "indices" : [66, 69]
> > >       "url" : "http://search.twitter.com/search?q=%23hot";
> > >     }
> > >   ]
> > >  }
>
> > Without looking at how twitter.com would currently handle that
> > example, I would have expected the url to be "http://dev.twitter.com/
> > #hot" and for the tweet to contain no hashtag. If the hashtag always
> > takes precedence I'd have no way to link to the following without
> > using a URL shortener: http://oauth.net/core/1.0a/#anchor41
> > --
> > Glenn Gillen
> > http://glenngillen.com/
>
> --
> Raffi Krikorian
> Twitter Platform Team
> http://twitter.com/raffi


Re: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Raffi Krikorian
hey glenn.

i think something went wrong in the copy and paste -- there should have been
a space between the URL and the hashtag.

On Thu, May 13, 2010 at 11:02 PM, glenn gillen  wrote:

> Raffi,
>
> This follows on nicely from the presentation at Warblecamp last week
> discussing how difficult it is to do this right, and I think a
> consistent approach across all clients (including twitter.com,
> mobile.twitter, and 3rd party apps) should be priority number 1.
> However looking at your example:
>
> On May 13, 10:25 pm, Raffi Krikorian  wrote:
> > {
> >  "text" : "hey @raffi tell @noradio to check out
> http://dev.twitter.com#hot",
> > 
> > {
> >   "url" : "http://dev.twitter.com";,
> >   "indices" : [38, 64]
> > },
> >   ],
> >   "hashtags" : [
> > {
> >   "text" : "#hot",
> >   "indices" : [66, 69]
> >   "url" : "http://search.twitter.com/search?q=%23hot";
> > }
> >   ]
> >  }
>
> Without looking at how twitter.com would currently handle that
> example, I would have expected the url to be "http://dev.twitter.com/
> #hot" and for the tweet to contain no hashtag. If the hashtag always
> takes precedence I'd have no way to link to the following without
> using a URL shortener: http://oauth.net/core/1.0a/#anchor41
> --
> Glenn Gillen
> http://glenngillen.com/
>



-- 
Raffi Krikorian
Twitter Platform Team
http://twitter.com/raffi


RE: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Brian Smith
Glenn Gillen wrote:
> Without looking at how twitter.com would currently handle that example, I
> would have expected the url to be "http://dev.twitter.com/ #hot" and for the
> tweet to contain no hashtag. If the hashtag always takes precedence I'd have no
> way to link to the following without using a URL shortener:
> http://oauth.net/core/1.0a/#anchor41

I think you are overlooking the space between the last slash and "#hot".
URLs cannot contain (un-encoded) spaces.
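
The whitespace rule can be sketched in a few lines of Python. This is a
simplified illustration, not the twitter-text grammar: a URL is taken to
be "http://" or "https://" followed by non-whitespace, and a hashtag
must start the text or follow whitespace, so a "#fragment" glued to a
URL stays part of the URL.

```python
import re

URL = re.compile(r"https?://\S+")
# (?<!\S) means "not preceded by a non-whitespace character":
# the '#' must be at the start of the text or right after whitespace.
HASHTAG = re.compile(r"(?<!\S)#(\w+)")


def parse(text):
    """Return (urls, hashtags) found in text under the simplified rules."""
    urls = [m.group(0) for m in URL.finditer(text)]
    hashtags = [m.group(1) for m in HASHTAG.finditer(text)]
    return urls, hashtags


print(parse("check out http://dev.twitter.com #hot"))
print(parse("check out http://dev.twitter.com#hot"))
print(parse("spec: http://oauth.net/core/1.0a/#anchor41"))
```

With the space present, the first call yields one URL and one hashtag;
without it, the other two calls yield a URL with a fragment and no
hashtag.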

Regards,
Brian



[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread glenn gillen
Raffi,

This follows on nicely from the presentation at Warblecamp last week
discussing how difficult it is to do this right, and I think a
consistent approach across all clients (including twitter.com,
mobile.twitter, and 3rd party apps) should be priority number 1.
However looking at your example:

On May 13, 10:25 pm, Raffi Krikorian  wrote:
> {
>  "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
> 
>     {
>       "url" : "http://dev.twitter.com",
>       "indices" : [38, 64]
>     },
>   ],
>   "hashtags" : [
>     {
>       "text" : "#hot",
>       "indices" : [66, 69]
>       "url" : "http://search.twitter.com/search?q=%23hot"
>     }
>   ]
>  }

Without looking at how twitter.com would currently handle that
example, I would have expected the url to be "http://dev.twitter.com/
#hot" and for the tweet to contain no hashtag. If the hashtag always
takes precedence I'd have no way to link to the following without
using a URL shortener: http://oauth.net/core/1.0a/#anchor41
--
Glenn Gillen
http://glenngillen.com/