[twitter-dev] Re: parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

Jim DeLaHunt Thu, 13 May 2010 15:17:00 -0700

Raffi:

On May 13, 2:25 pm, Raffi Krikorian <ra...@twitter.com> wrote:
> as shown above, we'll be parsing out all mentioned users, all lists, all
> included URLs, and all hashtags....

This is an interesting step forward.  The internationalisation
considerations can be sticky, though.  I did some entity-parsing from
tweets as part of my "Twanguages" project (a language census of
Twitter). One discovery was that people are in fact using hashtags with
non-Latin scripts. Another is that some people are using the '#'
character without intending to create a hashtag (e.g. "we are #2 in
line"). How will your entity parsing handle non-Latin hashtags,
Latin-character hashtags with accented characters, and strings starting
with '#' not intended as hashtags?
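
To make the three cases concrete, here is a minimal sketch in Python. It
is not Twitter's rule and not the twitter-text pattern. The function name
extract_hashtags and the rule itself ("a '#' not preceded by a word
character, followed by word characters including at least one letter")
are assumptions made purely for illustration.

```python
import re

# Assumed rule, for illustration only:
#   - '#' must not directly follow a word character or '&'
#     (so "abc#def" and HTML entities like "&#39;" are skipped)
#   - the tag body is a run of Unicode word characters
#   - the body must contain at least one letter, so "#2" is rejected
# In Python 3, \w is Unicode-aware for str patterns, so accented Latin,
# CJK, and Arabic letters all count as word characters.
HASHTAG_RE = re.compile(r"(?<![\w&])#(\w*[^\W\d_]\w*)")


def extract_hashtags(text):
    """Return hashtag bodies (without the '#') found in text."""
    return HASHTAG_RE.findall(text)


if __name__ == "__main__":
    print(extract_hashtags("we are #2 in line"))
    print(extract_hashtags("#caf\u00e9 #\u65e5\u672c\u8a9e #\u0645\u0635\u0631"))
```

One known gap in this sketch: Python's \w does not match combining
marks, so tags in scripts that rely on them (Devanagari vowel signs, for
example) would be cut short. That is exactly the kind of detail a
published definition would settle.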

Also note that URLs can now have non-Latin top-level domain names as
well as second-level domain names and other path parts. For instance,
http://وزارة-الأتصالات.مصر is a valid URL in the .مصر top-level
domain. Will your entity parsing code handle such URLs?
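
For what it is worth, a URL matcher that only accepts ASCII host names
will miss such URLs unless the host is first converted to its ASCII
("xn--") form. A minimal sketch of that conversion in Python follows. The
helper name host_to_ascii is hypothetical, and Python's built-in "idna"
codec implements the older IDNA 2003 rules, so treat this as an
illustration rather than a recommendation.

```python
from urllib.parse import urlsplit


def host_to_ascii(url):
    """Return the host of url in its ASCII-compatible (Punycode) form.

    Hypothetical helper for illustration. Uses Python's built-in
    "idna" codec, which follows IDNA 2003.
    """
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError("no host found in URL: %r" % url)
    # Each dot-separated label is encoded separately; labels that are
    # already pure ASCII pass through unchanged.
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(host_to_ascii("http://b\u00fccher.example/path"))
```

The same call can be applied to the .مصر URL above; any label containing
non-ASCII characters comes back with an "xn--" prefix.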

In any case, it would be very helpful if the platform team would
document exactly what regular expressions govern the entities you
recognise. I might not agree with your definition of hashtag syntax,
but at least I want to know what it is.  See for example the running
questions on how to measure the length of a status message.

>.... matt sanford
> (@mzsanford) on our internationalization team released the twitter-text
> library (http://github.com/mzsanford/twitter-text-rb) to help making parsing
> easier and standardized (in fact, we use this library ourselves), but we on
> the Platform team wondered if we could make this even easier for our
> developers. ...

I wasn't aware of this, and I'll take a look.  Thank you for the tip!
— Jim DeLaHunt, Vancouver, Canada
