Revision: 17948
http://sourceforge.net/p/gate/code/17948
Author: ian_roberts
Date: 2014-05-11 00:08:34 +0000 (Sun, 11 May 2014)
Log Message:
-----------
Fleshing out the documentation for TwitIE.
Modified Paths:
--------------
userguide/branches/release-8.0/misc-creole.tex
userguide/branches/release-8.0/sections.map
userguide/branches/release-8.0/tao_main.tex
Added Paths:
-----------
userguide/branches/release-8.0/social-media.tex
Modified: userguide/branches/release-8.0/misc-creole.tex
===================================================================
--- userguide/branches/release-8.0/misc-creole.tex	2014-05-10 22:35:37 UTC (rev 17947)
+++ userguide/branches/release-8.0/misc-creole.tex	2014-05-11 00:08:34 UTC (rev 17948)
@@ -3225,74 +3225,6 @@
and if one GATE document should be created per CSV file or per row within a
file.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\sect[sec:creole:tweet]{Twitter processing}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-The \verb!Twitter! plugin contains several tools useful for processing tweets.
-This plugin depends on the \verb!Tagger_Stanford! plugin, which must be loaded
-first.
-%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsect[sec:creole:tweetformat]{Twitter JSON format}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%
-This plugin contains a format analyser for JSON files from the Twitter
-API\footnote{\url{https://dev.twitter.com/docs/platform-objects/tweets}}.
-Loading the plugin registers the document format with GATE, so that it will be
-automatically associated with files whose names end in ``\verb!.json!'';
-otherwise you need to specify \verb!text/x-json-twitter! for the document
-mimeType parameter. This will work both when directly creating a single new
-GATE document and when populating a corpus.
-
-Each tweet object's \verb!text! value is converted into the document content,
-which is covered with a \emph{Tweet} annotations whose features represent
-(recursively when appropriate, using \emph{Map} and \emph{List}) all the other
-key-value pairs in the tweet object. \textbf{Note:} these recursive values are
-difficult to work with in JAPE; the special corpus population tool described
-next allows important key-sequences to be ``brought up'' to the document
-content and the top level of the annotation features.
-
-Multiple tweet objects in the same JSON file are separated by blank lines
-(which are not covered by \emph{Tweet} annotations).
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsect[sec:creole:population]{Corpus population from JSON files}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-Loading this plugin adds a ``Populate from Twitter JSON files'' option to the
-GATE Corpus right-click menu. Selecting this option bringgs up a dialog that
-allows you to select one or more files of tweets in the Twitter API's JSON
-format and set the following options to populate the corpus.
-%%
-\begin{description}
-\item[Encoding] The default here is UTF-8 (regardless of your Java default) to
- conform to Twitter JSON.
-\item[One document per tweet] If this box is ticked, each tweet will produce a
- separate document. If not (the default), each input file will produce one
- GATE document.
-\item[Content keys] The values of these JSON keys are converted into strings and
- concatenated into each tweet's document content. Colon-delimited strings
- specify nested keys, e.g., ``\texttt{user:name}'' will yield the value of the
- \texttt{name} key in the map that is the value of the \texttt{user} key.
- Missing key sequences are ignored. Each span of text will be covered by an
- annotation whose type is the key sequence.
-\item[Feature keys] The key sequences and values of these JSON keys (where
- present) are turned into feature names and values on the tweet's main
- \texttt{Tweet} annotation.
-\item[Save configuration] This button saves the current options in an XML file
- for re-use later.
-\item[Load configuration] This button sets the options according to a saved XML
- configuration.
-\end{description}
-%%
-Every tweet is covered by a \texttt{Tweet} annotation with features specified by
-the ``feature keys'' option. Multiple tweets in the same GATE document are
-separated by a blank line (two newlines).
-
-Corpus population from Twitter JSON files is also accessible programmatically
-when this plugin is loaded, using the public static void method
-\texttt{gate.corpora.twitter.Population.populateCorpus(final Corpus corpus, URL
-  inputUrl, String encoding, List<String> contentKeys, List<String> featureKeys,
- int tweetsPerDoc)}.
-%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\sect[sec:creole:termraider]{TermRaider term extraction tools}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TermRaider is a set of term extraction and scoring tools developed in the NeOn
Modified: userguide/branches/release-8.0/sections.map
===================================================================
--- userguide/branches/release-8.0/sections.map	2014-05-10 22:35:37 UTC (rev 17947)
+++ userguide/branches/release-8.0/sections.map	2014-05-11 00:08:34 UTC (rev 17948)
@@ -183,3 +183,6 @@
sec:mvc sec:design:mvc
sec:interfaces sec:design:interfaces
sec:exceptions sec:design:exceptions
+## Twitter
+sec:creole:tweet sec:social:twitter
+sec:creole:tweetformat sec:social:twitter:format
Added: userguide/branches/release-8.0/social-media.tex
===================================================================
--- userguide/branches/release-8.0/social-media.tex	                        (rev 0)
+++ userguide/branches/release-8.0/social-media.tex	2014-05-11 00:08:34 UTC (rev 17948)
@@ -0,0 +1,188 @@
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% social-media.tex
+%
+% Ian Roberts, May 2014
+%
+% $Id: uima.tex,v 1.3 2006/10/21 11:44:47 ian Exp $
+%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\chapt[chap:social]{Tools for Social Media Data}
+\markboth{Tools for Social Media Data}{Tools for Social Media Data}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\nnormalsize
+
+Social media provides data that is highly valuable to many organizations, for
+example as a way to track public opinion about a company's products or to
+discover attitudes towards ``hot topics'' and breaking news stories. However,
+processing social media text presents a set of unique challenges, and text
+processing tools designed to work on longer, better-formed texts such as
+news articles tend to perform badly on social media. Obtaining reasonable
+results on short, inconsistent and ungrammatical texts such as these requires
+tools that are specifically tuned to deal with them.
+
+This chapter discusses the tools provided by GATE for use with social media
+data.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:social:twitter]{Tools for Twitter}
+
+The \verb!Twitter! plugin contains several tools useful for processing tweets:
+tools to load documents into GATE from the JSON format provided by the Twitter
+APIs, a tokeniser and POS tagger tuned specifically for Tweets, a tool to
+split up multi-word hashtags, and an example named entity recognition
+application called {\em TwitIE} which demonstrates all these components
+working together. The plugin depends on the \verb!Tagger_Stanford! plugin,
+which must be loaded first.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsect[sec:social:twitter:format]{Twitter JSON format}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%
+Twitter provides APIs to search for Tweets according to various criteria, and
+to collect streams of Tweets in real-time. These APIs return the Tweets in a
+structured JSON format%
+\footnote{\url{https://dev.twitter.com/docs/platform-objects/tweets}} which
+includes the text of the Tweet plus a large amount of supporting metadata.
+The GATE \verb!Twitter! plugin contains a format analyser for this JSON format
+which allows you to load a file of one or more JSON Tweets into a GATE
+document. Loading the plugin registers the document format with GATE, so that
+it will be automatically associated with files whose names end in
+``\verb!.json!''; otherwise you need to specify \verb!text/x-json-twitter! for
+the document mimeType parameter. This will work both when directly creating a
+single new GATE document and when populating a corpus.
+
+Each tweet object's \verb!text! value is converted into the document content,
+which is covered with a \emph{Tweet} annotation whose features represent
+(recursively when appropriate, using \emph{Map} and \emph{List}) all the other
+key-value pairs in the tweet object. \textbf{Note:} these recursive values are
+difficult to work with in JAPE; the special corpus population tool described
+next allows important key-sequences to be ``brought up'' to the document
+content and the top level of the annotation features.
+
+Multiple tweet objects in the same JSON file are separated by blank lines
+(which are not covered by \emph{Tweet} annotations).
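
The mapping from tweet objects to document content and \emph{Tweet} annotation
spans can be sketched as follows. This is an illustrative Python sketch, not
the plugin's actual implementation, and the two tweet objects are invented
minimal examples:

```python
import json

# Two minimal tweet objects as the Twitter API would return them
# (almost all metadata fields omitted for brevity).
raw = '''
{"id": 1, "text": "Hello GATE!", "user": {"name": "Alice"}}
{"id": 2, "text": "Tweets are short.", "user": {"name": "Bob"}}
'''

# Concatenate each tweet's "text" value into the document content,
# separated by a blank line, and record the span that a Tweet
# annotation would cover, with the remaining key-value pairs kept
# as that annotation's features.
content, spans = "", []
for line in filter(None, map(str.strip, raw.splitlines())):
    tweet = json.loads(line)
    start = len(content)
    content += tweet["text"]
    spans.append((start, len(content), tweet))  # Tweet annotation span
    content += "\n\n"                           # blank-line separator

print(spans[0][:2])  # span of the first Tweet annotation: (0, 11)
```

The blank lines between tweets fall outside every recorded span, matching the
behaviour described above.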
+
+As well as the document format parser to load Tweets into a single GATE
+document, the plugin provides a ``Populate from Twitter JSON files'' option on
+the GATE Corpus right-click menu. Selecting this option brings up a dialog
+that allows you to select one or more files of tweets in the Twitter API's
+JSON format and set the following options to populate the corpus.
+%%
+\begin{description}
+\item[Encoding] The default here is UTF-8 (regardless of your Java default) to
+ conform to Twitter JSON.
+\item[One document per tweet] If this box is ticked (the default), each tweet
+ will produce a separate document. If not, each {\em input file} will produce
+ one GATE document.
+\item[Content keys] The values of these JSON keys are converted into strings
+  and concatenated into each tweet's document content. Colon-delimited strings
+ specify nested keys, e.g., ``\texttt{user:name}'' will yield the value of the
+ \texttt{name} key in the map that is the value of the \texttt{user} key.
+ Missing key sequences are ignored. Each span of text will be covered by an
+ annotation whose type is the key sequence.
+\item[Feature keys] The key sequences and values of these JSON keys (where
+ present) are turned into feature names and values on the tweet's main
+ \texttt{Tweet} annotation.
+\item[Save configuration] This button saves the current options in an XML file
+ for re-use later.
+\item[Load configuration] This button sets the options according to a saved XML
+ configuration.
+\end{description}
+%%
+Every tweet is covered by a \texttt{Tweet} annotation with features specified by
+the ``feature keys'' option. Multiple tweets in the same GATE document are
+separated by a blank line (two newlines).
+
+Corpus population from Twitter JSON files is also accessible programmatically
+when this plugin is loaded, using the public static void method
+\texttt{gate.corpora.twitter.Population.populateCorpus(final Corpus corpus, URL
+  inputUrl, String encoding, List<String> contentKeys, List<String> featureKeys,
+ int tweetsPerDoc)}.
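
The colon-delimited key-sequence lookup used by the ``content keys'' and
``feature keys'' options can be sketched as follows (an illustrative Python
sketch with an invented helper name and an invented minimal tweet, not the
plugin's Java implementation):

```python
import json

# A minimal tweet object with one level of nesting.
tweet = json.loads(
    '{"text": "Hi!", "user": {"name": "Alice", "screen_name": "alice1"}}'
)

def lookup(obj, key_sequence):
    """Resolve a colon-delimited key sequence such as 'user:name'
    against a nested JSON object. Returns None when any key in the
    sequence is absent, mirroring the population tool's behaviour of
    ignoring missing key sequences."""
    for key in key_sequence.split(":"):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

print(lookup(tweet, "user:name"))    # Alice
print(lookup(tweet, "coordinates"))  # None (missing key sequence ignored)
```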
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:social:twitter:prs]{Low-level PRs for Tweets}
+
+The \verb!Twitter! plugin provides a number of low-level language processing
+components that are specifically tuned to Twitter data.
+
+The ``Twitter Tokenizer'' PR is a specialization of the ANNIE English Tokeniser
+for use with Tweets. There are a number of differences in the way this
+tokeniser divides up the text compared to the default ANNIE PR:
+%
+\begin{itemize}
+\item URLs and abbreviations (such as ``gr8'' or ``2day'') are treated as a
+ single token.
+\item User mentions (\verb!@username!) are two tokens, one for the \verb!@! and
+ one for the username.
+\item Hashtags are likewise two tokens (the hash and the tag), but see below
+ for another component that can split up multi-word hashtags.
+\item ``Emoticons'' such as \verb!:-D! can be treated as a single token. This
+  requires a gazetteer of emoticons to be run before the tokeniser; an example
+  gazetteer is provided in the Twitter plugin.
+\end{itemize}
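
The tokenisation rules listed above can be sketched with a single regular
expression (an illustrative Python sketch, not the actual tokeniser shipped
with the plugin, which is considerably more sophisticated):

```python
import re

# User mentions and hashtags become two tokens ('@'/'#' plus the name),
# while URLs and abbreviations such as "gr8" stay as single tokens.
TOKEN = re.compile(r"""
      https?://\S+        # a URL is a single token
    | [@#]                # '@' or '#' is its own token...
    | \w+                 # ...followed by the username/tag; also plain words
    | [^\w\s]             # any other punctuation character
""", re.VERBOSE)

print(TOKEN.findall("@alice check http://gate.ac.uk #nlp gr8"))
```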
+
+The ``Tweet Normaliser'' PR uses a spelling correction dictionary to correct
+mis-spellings and a Twitter-specific dictionary to expand common abbreviations
+and substitutions. It replaces the \verb!string! feature on matching tokens
+with the normalised form, preserving the original string value in the
+\verb!origString! feature.
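
The normaliser's handling of the \verb!string! and \verb!origString! features
can be sketched as follows (an illustrative Python sketch; the dictionary
entries and the \texttt{normalise} helper are invented, and the plugin ships
its own dictionary resources):

```python
# Invented sample of a Twitter-specific abbreviation dictionary.
NORM = {"gr8": "great", "2day": "today", "u": "you"}

def normalise(tokens):
    """For each token string, replace the 'string' feature with the
    normalised form where the dictionary matches, preserving the
    original text in 'origString'."""
    out = []
    for string in tokens:
        feats = {"string": NORM.get(string.lower(), string)}
        if string.lower() in NORM:
            feats["origString"] = string
        out.append(feats)
    return out

print(normalise(["c", "u", "2day"]))
```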
+
+The ``Twitter POS Tagger'' PR uses the Stanford Tagger
+(section~\ref{sec:misc:creole:stanford}) with a model trained on Tweets. The
+POS tagger can take advantage of expanded strings produced by the normaliser
+PR.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:social:twitter:hashtag]{Handling multi-word hashtags}
+
+When rendering a Tweet on the web, Twitter automatically converts contiguous
+sequences of alpha-numeric characters following a hash (\verb!#!) into links
+to search for other Tweets that include the same string. Thus ``hashtags''
+have rapidly become the de-facto standard way to mark a Tweet as relating to a
+particular theme, event, brand name, etc. Since hashtags cannot contain white
+space, it is common for users to form hashtags by running together a number of
+separate words, sometimes in ``camel case'' form but sometimes simply all in
+lower (or upper) case, for example ``\#worldgonemad'' (as search queries on
+Twitter are not case-sensitive).
+
+The ``Hashtag Tokenizer'' PR attempts to recover the original discrete words
+from such multi-word hashtags. It uses a large gazetteer of common English
+words, organization names, locations, etc. as well as slang words and
+contractions without the use of apostrophes (since hashtags are alphanumeric,
+words like ``wouldn't'' tend to be expressed as ``wouldnt'' without the
+apostrophe). Camel-cased hashtags (\verb!#CamelCasedHashtag!) are split at
+case changes.
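
The camel-case part of this splitting can be sketched with a short regular
expression (an illustrative Python sketch of the case-change rule only; the
real Hashtag Tokenizer additionally uses its gazetteers to segment all-lower
or all-upper hashtags such as ``\#worldgonemad''):

```python
import re

def split_camel_hashtag(hashtag):
    """Split a camel-cased hashtag at case changes, keeping runs of
    capitals (acronyms) and digits together."""
    body = hashtag.lstrip("#")
    return re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", body)

print(split_camel_hashtag("#CamelCasedHashtag"))  # ['Camel', 'Cased', 'Hashtag']
```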
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:social:twitie]{The TwitIE Pipeline}
+
+The Twitter plugin includes a sample ready-made application called TwitIE,
+which combines the PRs described above with additional resources borrowed from
+ANNIE and the TextCat language identification PR to produce a general-purpose
+named entity recognition pipeline for use with Tweets. TwitIE includes the
+following components:
+
+\begin{itemize}
+\item Annotation Set Transfer to transfer Tweet annotations from the Original
+ markups annotation set. For documents loaded using the JSON document format
+ or corpus population logic, this means that each Tweet will be covered by a
+ separate Tweet annotation in the final output of TwitIE.
+\item \emph{Language identification} PR (see
+ section~\ref{sec:misc-creole:language-identification}) using language models
+ trained on English, French, German, Dutch and Spanish Tweets. This creates a
+ feature \verb!lang! on each Tweet annotation giving the detected language.
+\item \emph{Twitter tokenizer} described above, including a gazetteer of
+ emoticons.
+\item \emph{Hashtag tokenizer} to split up hashtags consisting of multiple
+ words.
+\item The standard ANNIE \emph{gazetteer} and \emph{sentence splitter}.
+\item \emph{Normaliser} and \emph{POS tagger} described above.
+\item Named entity JAPE grammars, based largely on the ANNIE defaults but with
+ some customizations.
+\end{itemize}
+
+Full details of the TwitIE pipeline can be found in \cite{bontcheva2013twitie}.
+
+% vim:ft=tex
Modified: userguide/branches/release-8.0/tao_main.tex
===================================================================
--- userguide/branches/release-8.0/tao_main.tex	2014-05-10 22:35:37 UTC (rev 17947)
+++ userguide/branches/release-8.0/tao_main.tex	2014-05-11 00:08:34 UTC (rev 17948)
@@ -689,6 +689,7 @@
\input{ontologies} %final for book (a couple of small overfulls in listings,
but not bad)
\input{language-creole} %final for book
\input{domain-creole}
+\input{social-media}
\input{parsers} %final for book
\input{machine-learning} %final for book
\input{alignment} %final for book
This was sent by the SourceForge.net collaborative development platform, the
world's largest Open Source development site.
_______________________________________________
GATE-cvs mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/gate-cvs