[twitter-dev] parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

Raffi Krikorian Thu, 13 May 2010 14:26:06 -0700

tweet text can mention other users and lists, and can contain URLs and
hashtags -- in fact, something like 50% of tweets contain at least
one of those.  developers who want to understand the tweet text have to
parse it to extract those entities (which gets really hard when dealing
with unicode characters) and then potentially have to make another REST
call to resolve that data.  matt sanford
(@mzsanford) on our internationalization team released the twitter-text
library (http://github.com/mzsanford/twitter-text-rb) to help make parsing
easier and standardized (in fact, we use this library ourselves), but we on
the Platform team wondered if we could make this even easier for our
developers.
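
to make the parsing problem concrete, here is a rough sketch of the kind of naive client-side extraction described above.  the three patterns are illustrative assumptions only; they are not what twitter-text does, and they get things wrong in exactly the ways that make this hard:

```python
import re

# Deliberately simplistic patterns (assumptions for illustration only;
# these are NOT the rules twitter-text uses).
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")


def naive_entities(text):
    """Pull mentions, hashtags and URLs out of tweet text with plain regexes."""
    return {
        "user_mentions": MENTION.findall(text),
        "hashtags": HASHTAG.findall(text),
        "urls": URL.findall(text),
    }


if __name__ == "__main__":
    # The URL pattern swallows "#hot" while the hashtag pattern also
    # reports it, so the two results disagree about where the URL ends.
    print(naive_entities(
        "hey @raffi tell @noradio to check out http://dev.twitter.com#hot"))
    # An email address is misread as a mention of "example".
    print(naive_entities("mail me@example.com"))
```

every client that rolls its own patterns like this ends up with slightly different answers for the same tweet.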


as part of our JSON and XML payloads, we are going to start supporting an
entities attribute that will contain this parsed and structured data.
you'll see it like so:

{
 "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
 ...
 "entities" : {
  "user_mentions" : [
    {
      "id" : 8285392,
      "screen_name" : "raffi",
      "indices" : [4, 9]
    },
    {
      "id" : 3191321,
      "screen_name" : "noradio",
      "indices" : [16, 23]
    }
  ],
  "urls" : [
    {
      "url" : "http://dev.twitter.com",
      "indices" : [38, 59]
    }
  ],
  "hashtags" : [
    {
      "text" : "#hot",
      "indices" : [60, 63],
      "url" : "http://search.twitter.com/search?q=%23hot"
    }
  ]
 }
 ...
}

or like so:

<status>
  <text>hey @raffi tell @noradio to check out http://dev.twitter.com#hot</text>
  ...
  <entities>
    <user_mentions>
      <user_mention start="4" end="9">
        <id>8285392</id>
        <screen_name>raffi</screen_name>
      </user_mention>
      <user_mention start="16" end="23">
        <id>3191321</id>
        <screen_name>noradio</screen_name>
      </user_mention>
    </user_mentions>
    <urls>
      <url start="38" end="59">
        <url>http://dev.twitter.com</url>
      </url>
    </urls>
    <hashtags>
      <hashtag start="60" end="63">
        <text>#hot</text>
        <url>http://search.twitter.com/search?q=%23hot</url>
      </hashtag>
    </hashtags>
  </entities>
  ...
</status>
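
for clients consuming the XML form, a minimal sketch of reading the user mentions might look like the following.  it uses python's standard xml.etree.ElementTree, a trimmed-down copy of the sample status, and assumes the end attribute is inclusive (which is what the mention offsets in the sample suggest); the function name user_mentions is made up for this example:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample status; only user mentions are kept.
SAMPLE = """<status>
  <text>hey @raffi tell @noradio to check out http://dev.twitter.com</text>
  <entities>
    <user_mentions>
      <user_mention start="4" end="9">
        <id>8285392</id>
        <screen_name>raffi</screen_name>
      </user_mention>
      <user_mention start="16" end="23">
        <id>3191321</id>
        <screen_name>noradio</screen_name>
      </user_mention>
    </user_mentions>
  </entities>
</status>"""


def user_mentions(status_xml):
    """Return one dict per <user_mention>, with the text span it covers."""
    root = ET.fromstring(status_xml)
    text = root.findtext("text")
    mentions = []
    for node in root.findall("./entities/user_mentions/user_mention"):
        start = int(node.get("start"))
        end = int(node.get("end"))
        mentions.append({
            "id": int(node.findtext("id")),
            "screen_name": node.findtext("screen_name"),
            # Assumption: "end" is inclusive, so add 1 when slicing.
            "span": text[start:end + 1],
        })
    return mentions


if __name__ == "__main__":
    for mention in user_mentions(SAMPLE):
        print(mention)
```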

as shown above, we'll be parsing out all mentioned users, all lists, all
included URLs, and all hashtags.  for users, we'll provide their user ID,
and for hashtags we'll provide the query you can run against the search
API.  and, for all of them, we'll also tell you the character positions
at which the entity starts and stops -- that should really take the
burden of parsing the text properly off you guys.
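
as a rough illustration of what the offsets buy you, here is a sketch that turns the user mentions in a JSON payload into links without re-parsing the text.  the helper names (entity_text, link_mentions) are invented for this example, the end offset is assumed to be inclusive (matching the mention offsets in the sample), and the payload is a cut-down copy of the sample that only carries user mentions:

```python
import json

# Cut-down copy of the sample payload; only user mentions are included.
PAYLOAD = json.loads("""
{
  "text": "hey @raffi tell @noradio to check out http://dev.twitter.com",
  "entities": {
    "user_mentions": [
      {"id": 8285392, "screen_name": "raffi", "indices": [4, 9]},
      {"id": 3191321, "screen_name": "noradio", "indices": [16, 23]}
    ]
  }
}
""")


def entity_text(text, entity):
    """Slice the text an entity covers (assumes an inclusive end offset)."""
    start, end = entity["indices"]
    return text[start:end + 1]


def link_mentions(status):
    """Wrap each mentioned user in a link to their profile page."""
    text = status["text"]
    mentions = status.get("entities", {}).get("user_mentions", [])
    # Rewrite from right to left so earlier offsets stay valid.
    # Note: no HTML escaping is done here; a real client would need it.
    for m in sorted(mentions, key=lambda m: m["indices"][0], reverse=True):
        start, end = m["indices"]
        anchor = '<a href="http://twitter.com/{}">{}</a>'.format(
            m["screen_name"], text[start:end + 1])
        text = text[:start] + anchor + text[end + 1:]
    return text


if __name__ == "__main__":
    print(link_mentions(PAYLOAD))
```

working from the last entity back to the first is a design choice: each replacement changes the length of the string, so going right to left means the offsets of the entities still to be processed never shift.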

this entities block will probably be extended later, and these entities are
just the start.  have we missed anything?  is there anything else you would
like to see?  as always - just drop us a note, and look for these entities
to start slowly rolling out.

-- 
Raffi Krikorian
Twitter Platform Team
http://twitter.com/raffi
