[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread jaronbarends
Yes, this would be very cool. Any ideas on when this would be rolled
out?

1) It would be nice to have the profile_image_url in it as well. I can
imagine a lot of nice visual enhancements with that.

2) +1 for making it optional. A lot of people are suggesting
additional stuff, so maybe it would be even nicer to have not just an
include/don't-include param, but a way to specify which data you
would like to have included...
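
A minimal sketch of the "specify which data you want" idea, assuming a
single comma-separated parameter. The endpoint URL, the parameter name
`include`, and the payload names are all hypothetical; nothing in this
thread says how (or whether) Twitter will expose such an option.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.com/statuses/show.json"


def build_request_url(status_id, include=()):
    """Return a request URL that asks only for the named optional payloads."""
    params = {"id": status_id}
    if include:
        # One comma-separated parameter instead of one boolean flag per payload.
        # "include" is an assumed name, not a documented API parameter.
        params["include"] = ",".join(sorted(include))
    return BASE_URL + "?" + urlencode(params)
```

A client that wants nothing extra calls `build_request_url(123)`; one that
wants entities and avatars passes `include={"entities", "profile_image_url"}`.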

jarón

On May 14, 6:29 am, Rich rhyl...@gmail.com wrote:
 +1 for it being optional as well.  Whilst I will probably use it, it's
 nice to be able to keep the bandwidth download to a minimum for
 scenarios where it's not needed

 On May 14, 1:52 am, Naveen Ayyagari nav...@getsocialscope.com wrote:

  +1 on the additional parameter to optionally request the data. Every
  byte counts for mobile device battery life and download time.

  --Naveen Ayyagari
  @knight9

  On May 13, 8:13 pm, Dewald Pretorius dpr...@gmail.com wrote:

   Raffi,

   This is all good, but can you please make the inclusion in the tweet
   payload optional? Meaning, only include it if it is requested by an
   additional parameter?

   I, and I'm sure a lot of others, are already parsing the tweet text.
   This is just going to consume additional bandwidth and not add any
   value for us. It will add value for folks who are not already doing
   the parsing or don't know how. So, they can just request this
   additional payload.


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Edi
+1 for making this optional.
It's faster for mobile apps to do this themselves than download it.

Besides, if this is the library used for web, you're not doing it
right. :)
For example, to mention URL parsing only, you don't check for valid
domain names (e.g. www.test.failure is matched as URL),
some characters are not recognized as part of a link (e.g. | in
http://translate.google.com/?hl=en#auto|en|bonjour)...
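
The two complaints above can be sketched as follows. This is an
illustrative pattern only, not the twitter-text regular expression: the
TLD set is a tiny assumed sample (a real check would use the full IANA
list), and the path rule is simply "everything up to whitespace", which
is what lets `|` stay inside the link.

```python
import re

# Tiny illustrative sample; a real check would use the full IANA TLD list.
KNOWN_TLDS = {"com", "net", "org", "edu", "gov"}

URL_RE = re.compile(
    r"(?:https?://)?"                   # optional scheme
    r"((?:[a-z0-9-]+\.)+([a-z]{2,}))"   # host; group 2 is the last label
    r"(/\S*)?",                         # path/query/fragment up to whitespace
    re.IGNORECASE,
)


def find_urls(text):
    """Return URL-like matches whose last host label is a known TLD."""
    return [
        m.group(0)
        for m in URL_RE.finditer(text)
        if m.group(2).lower() in KNOWN_TLDS
    ]
```

With this sketch `www.test.failure` is rejected because `failure` is not in
the sample TLD set, and the translate.google.com link is returned whole,
including the `|` characters in its fragment.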


Re: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Raffi Krikorian

 Besides, if this is the library used for web, you're not doing it
 right. :)
 For example, to mention URL parsing only, you don't check for valid
 domain names (e.g. www.test.failure is matched as URL),
 some characters are not recognized as part of a link (e.g. | in
 http://translate.google.com/?hl=en#auto|en|bonjour)...


all we're trying to do is help people standardize on how they parse stuff.
 making sure you can represent what is a hash tag, a url, a username, etc.,
in the same way that twitter.com does it, can be difficult.

-- 
Raffi Krikorian
Twitter Platform Team
http://twitter.com/raffi


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Edi
I understand. And I don't have anything against it (even if it will be
default), as long as it will be optional.
And we all appreciate the library (and its Java implementation:
http://github.com/mzsanford/twitter-text-java).


On May 14, 3:47 pm, Raffi Krikorian ra...@twitter.com wrote:
 all we're trying to do is help people standardize on how they parse stuff.
  making sure you can represent what is a hash tag, a url, a username, etc.,
 in the same way that twitter.com does it, can be difficult.

 --
 Raffi Krikorian
 Twitter Platform Team
 http://twitter.com/raffi


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Zhami
+1 for it being optional as well -- keep the bandwidth to a minimum
for scenarios where it's not needed.

+1 for having short URLs' original (long) URL provided (perhaps also
an option?)


Re: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Adam Green
Disambiguating short URLs and delivering the true URL and title would
be a real plus, not just for developers, but for the target of a URL.
While it does add a load to twitter's servers, it will save many, many
useless hits to the target.

Imagine 100,000 Twitter apps resolving each short URL found in a
tweet. All of them doing it within seconds of the tweet arriving via
the streaming API. It would be an automatic DOS against every site
mentioned in a tweet.

If this sounds hyperbolic, read the APIwiki docs that say 2,000
followers is an expected max. Ha!
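
The saving described above comes from resolving each short URL once and
sharing the answer. A minimal sketch of that effect, assuming a
caller-supplied `resolver` callable (hypothetical; in practice it would
follow the redirect over HTTP):

```python
class ShortUrlCache:
    """Resolve each short URL at most once and reuse the answer."""

    def __init__(self, resolver):
        self._resolver = resolver   # callable: short URL -> long URL
        self._cache = {}
        self.upstream_hits = 0      # how many times the resolver was called

    def resolve(self, short_url):
        if short_url not in self._cache:
            self.upstream_hits += 1
            self._cache[short_url] = self._resolver(short_url)
        return self._cache[short_url]
```

However many consumers ask for the same short URL, `upstream_hits` grows by
one, which is the difference between one lookup and 100,000.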


On Fri, May 14, 2010 at 9:15 AM, Zhami stu...@yellowhelium.com wrote:
 +1 for it being optional as well -- keep the bandwidth to a minimum
 for scenarios where it's not needed.

 +1 for having short URLs' original (long) URL provided (perhaps also
 an option?)



[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-14 Thread Karthik
Raffi,

A slightly more advanced request: would it be possible to attach a list
of significant words and phrases present in the tweet? We could then use
this info to categorize tweets and even build a trends list from the
tweets aggregated by our apps.

In one of our apps, we use the Yahoo Terms Extraction service to extract
phrases from tweets.

Thanks,
Karthik



[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread glenn gillen
Raffi,

This follows on nicely from the presentation at Warblecamp last week
discussing how difficult it is to do this right, and I think a
consistent approach across all clients (including twitter.com,
mobile.twitter, and 3rd party apps) should be priority number 1.
However looking at your example:

On May 13, 10:25 pm, Raffi Krikorian ra...@twitter.com wrote:
 {
   "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
   snip
     {
       "url" : "http://dev.twitter.com",
       "indices" : [38, 64]
     },
   ],
   "hashtags" : [
     {
       "text" : "#hot",
       "indices" : [66, 69]
       "url" : "http://search.twitter.com/search?q=%23hot"
     }
   ]
 }

Without looking at how twitter.com would currently handle that
example, I would have expected the url to be http://dev.twitter.com/
#hot and for the tweet to contain no hashtag. If the hashtag always
takes precedence I'd have no way to link to the following without
using a URL shortener: http://oauth.net/core/1.0a/#anchor41
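
For reference, this is roughly how a client might consume such a payload:
a sketch that wraps each entity in a link using its `indices`. It assumes
the end index is inclusive (which is what the four-character `#hot` span
in the quoted example suggests, but the thread does not confirm it) and
that every entity carries a `url`. The function name is hypothetical.

```python
from html import escape


def linkify(text, entities):
    """Wrap each entity span of `text` in an HTML link.

    `entities` is a list of dicts with "indices": [start, end] and "url".
    Assumption: `end` is inclusive.
    """
    out = []
    pos = 0
    for ent in sorted(entities, key=lambda e: e["indices"][0]):
        start, end = ent["indices"]
        out.append(escape(text[pos:start]))          # plain text before the entity
        label = escape(text[start:end + 1])          # the entity as written
        out.append('<a href="%s">%s</a>' % (escape(ent["url"]), label))
        pos = end + 1
    out.append(escape(text[pos:]))                   # trailing plain text
    return "".join(out)
```

Because the client only slices by index, whichever rule decides URL versus
hashtag precedence lives entirely on the server side.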
--
Glenn Gillen
http://glenngillen.com/


RE: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Brian Smith
Glenn Gillen wrote:
 Without looking at how twitter.com would currently handle that example, I
 would have expected the url to be http://dev.twitter.com/ #hot and for the
 tweet to contain no hashtag. If the hashtag always takes precedence I'd
 have no way to link to the following without using a URL shortener:
 http://oauth.net/core/1.0a/#anchor41

I think you are overlooking the space between the last slash and #hot.
URLs cannot contain (un-encoded) spaces.
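
A small sketch of that distinction, using only whitespace splitting and the
standard library's `urlsplit`. The helper names are hypothetical and this
is not how twitter-text tokenizes; it only shows that a `#` inside a URL is
a fragment, while a `#` after a space starts a separate token.

```python
from urllib.parse import urlsplit


def url_tokens(text):
    """Return whitespace-delimited tokens that look like http(s) URLs."""
    # A URL ends at the first un-encoded space, so splitting on whitespace
    # is enough to separate it from a following hashtag.
    return [tok for tok in text.split() if tok.startswith(("http://", "https://"))]


def fragment_of(url):
    """Return the part after '#' in a URL, or '' if there is none."""
    return urlsplit(url).fragment
```

For `http://oauth.net/core/1.0a/#anchor41` the fragment is `anchor41` and
the link stays whole; for `http://dev.twitter.com/ #hot` the URL token is
just `http://dev.twitter.com/` with an empty fragment.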

Regards,
Brian



Re: [twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Raffi Krikorian
hey glenn.

i think something went wrong in the copy and paste -- there should have been
a space between the URL and the hashtag.

On Thu, May 13, 2010 at 11:02 PM, glenn gillen gl...@rubypond.com wrote:

 Raffi,

 This follows on nicely from the presentation at Warblecamp last week
 discussing how difficult it is to do this right, and I think a
 consistent approach across all clients (including twitter.com,
 mobile.twitter, and 3rd party apps) should be priority number 1.
 However looking at your example:

 On May 13, 10:25 pm, Raffi Krikorian ra...@twitter.com wrote:
  {
    "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
    snip
      {
        "url" : "http://dev.twitter.com",
        "indices" : [38, 64]
      },
    ],
    "hashtags" : [
      {
        "text" : "#hot",
        "indices" : [66, 69]
        "url" : "http://search.twitter.com/search?q=%23hot"
      }
    ]
  }

 Without looking at how twitter.com would currently handle that
 example, I would have expected the url to be http://dev.twitter.com/
 #hot and for the tweet to contain no hashtag. If the hashtag always
 takes precedence I'd have no way to link to the following without
 using a URL shortener: http://oauth.net/core/1.0a/#anchor41
 --
 Glenn Gillen
 http://glenngillen.com/




-- 
Raffi Krikorian
Twitter Platform Team
http://twitter.com/raffi


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Rich
I can see the text tag inside some of the entities causing problems for
some developers, as it has the same tag name as the status text.  Of
course all of us should be able to handle it, but just look at what
happened with the extra user id tag inside a status.
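
A sketch of how that collision can bite, using Python's ElementTree. The
XML shape below is an assumption made up for illustration; the thread only
shows a JSON example and does not specify the XML layout.

```python
import xml.etree.ElementTree as ET

# Assumed layout: a status whose entities also contain <text> elements.
SAMPLE = """
<status>
  <text>hey @raffi check out #hot</text>
  <entities>
    <hashtags>
      <hashtag><text>#hot</text></hashtag>
    </hashtags>
  </entities>
</status>
""".strip()


def status_text(status):
    """Safe: find() with a bare tag name only looks at direct children."""
    return status.find("text").text


def all_text_elements(status):
    """Risky: iter() walks the whole subtree, nested entity text included."""
    return [el.text for el in status.iter("text")]
```

A parser that grabs "every element named text" (or the last one seen in a
streaming parse) picks up `#hot` as well as the tweet body; scoping the
lookup to direct children of the status avoids that.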

On May 13, 11:11 pm, Raffi Krikorian ra...@twitter.com wrote:
 hey glenn.

 i think something went wrong in the copy and paste -- there should have been
 a space between the URL and the hashtag.





 On Thu, May 13, 2010 at 11:02 PM, glenn gillen gl...@rubypond.com wrote:
  Raffi,

  This follows on nicely from the presentation at Warblecamp last week
  discussing how difficult it is to do this right, and I think a
  consistent approach across all clients (including twitter.com,
  mobile.twitter, and 3rd party apps) should be priority number 1.
  However looking at your example:

  On May 13, 10:25 pm, Raffi Krikorian ra...@twitter.com wrote:
   {
     "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
     snip
       {
         "url" : "http://dev.twitter.com",
         "indices" : [38, 64]
       },
     ],
     "hashtags" : [
       {
         "text" : "#hot",
         "indices" : [66, 69]
         "url" : "http://search.twitter.com/search?q=%23hot"
       }
     ]
   }

  Without looking at how twitter.com would currently handle that
  example, I would have expected the url to be http://dev.twitter.com/
  #hot and for the tweet to contain no hashtag. If the hashtag always
  takes precedence I'd have no way to link to the following without
  using a URL shortener: http://oauth.net/core/1.0a/#anchor41
  --
  Glenn Gillen
 http://glenngillen.com/

 --
 Raffi Krikorian
 Twitter Platform Team
 http://twitter.com/raffi


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Jim DeLaHunt
Raffi:



On May 13, 2:25 pm, Raffi Krikorian ra...@twitter.com wrote:
 as shown above, we'll be parsing out all mentioned users, all lists, all
 included URLs, and all hashtags

This is an interesting step forward.  The internationalisation
considerations can be sticky, though.  I did some entity-parsing from
tweets as part of my Twanguages project (a language census of
Twitter). One discovery was that people are in fact using hashtags with
non-Latin scripts. Another is that some people are using the '#'
character without intending to create a hashtag (e.g. we are #2 in
line). How will your entity parsing handle non-Latin hashtags,
Latin-character hashtags with accented characters, and strings starting
with '#' not intended as hashtags?
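
One possible rule that covers these three cases, sketched with Python's
Unicode-aware `re`: a hashtag is `#` followed by word characters of any
script, at least one of which is a letter. This is an assumption for
illustration, not Twitter's definition, and it makes no attempt to skip
`#` fragments inside URLs.

```python
import re

# (?<!\w)    the '#' must not be glued to a preceding word character
# \w*        any Unicode letters, digits or underscores
# [^\W\d_]   at least one character that is a letter (word char, not digit/_)
HASHTAG_RE = re.compile(r"(?<!\w)#(\w*[^\W\d_]\w*)")


def extract_hashtags(text):
    """Return hashtag bodies (without '#') found in `text`."""
    return HASHTAG_RE.findall(text)
```

Under this rule "#2" in "we are #2 in line" is not a hashtag, while tags
written with accented Latin letters or in non-Latin scripts are matched the
same way as ASCII ones.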

Also note that URLs can now have non-Latin top-level domain names as
well as second-level domain names and other path parts. For instance,
http://وزارة-الأتصالات.مصر is a valid URL in the .مصر top-level
domain. Will your entity parsing code handle such URLs?

In any case, it would be very helpful if the platform team would
document exactly what regular expressions govern the entities you
recognise. I might not agree with your definition of hashtag syntax,
but at least I want to know what it is.  See for example the running
questions on how to measure the length of a status message. 

 matt sanford
 (@mzsanford) on our internationalization team released the twitter-text
 library (http://github.com/mzsanford/twitter-text-rb) to help making parsing
 easier and standardized (in fact, we use this library ourselves), but we on
 the Platform team wondered if we could make this even easier for our
 developers. ...

I wasn't aware of this, and I'll take a look.  Thank you for the tip!
— Jim DeLaHunt, Vancouver, Canada


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread glenn gillen
On May 13, 11:11 pm, Raffi Krikorian ra...@twitter.com wrote:
 hey glenn.

 i think something went wrong in the copy and paste -- there should have been
 a space between the URL and the hashtag.

My bad. Back in my box then.

Cheers,
--
Glenn Gillen
http://glenngillen.com/


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Dewald Pretorius
Raffi,

This is all good, but can you please make the inclusion in the tweet
payload optional? Meaning, only include it if it is requested by an
additional parameter?

I, and I'm sure a lot of others, are already parsing the tweet text.
This is just going to consume additional bandwidth and not add any
value for us. It will add value for folks who are not already doing
the parsing or don't know how. So, they can just request this
additional payload.


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Naveen Ayyagari
+1 on the additional parameter to optionally request the data. Every
byte counts for mobile device battery life and download time.

--Naveen Ayyagari
@knight9


On May 13, 8:13 pm, Dewald Pretorius dpr...@gmail.com wrote:
 Raffi,

 This is all good, but can you please make the inclusion in the tweet
 payload optional? Meaning, only include it if it is requested by an
 additional parameter?

 I, and I'm sure a lot of others, are already parsing the tweet text.
 This is just going to consume additional bandwidth and not add any
 value for us. It will add value for folks who are not already doing
 the parsing or don't know how. So, they can just request this
 additional payload.


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Adam
Indeed, it would be great to see this in the preview of UserStreams :)


[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

2010-05-13 Thread Rich
+1 for it being optional as well.  Whilst I will probably use it, it's
nice to be able to keep the bandwidth download to a minimum for
scenarios where it's not needed

On May 14, 1:52 am, Naveen Ayyagari nav...@getsocialscope.com wrote:
 +1 on the additional parameter to optionally request the data. Every
 byte counts for mobile device battery life and download time.

 --Naveen Ayyagari
 @knight9

 On May 13, 8:13 pm, Dewald Pretorius dpr...@gmail.com wrote:



  Raffi,

  This is all good, but can you please make the inclusion in the tweet
  payload optional? Meaning, only include it if it is requested by an
  additional parameter?

  I, and I'm sure a lot of others, are already parsing the tweet text.
  This is just going to consume additional bandwidth and not add any
  value for us. It will add value for folks who are not already doing
  the parsing or don't know how. So, they can just request this
  additional payload.