Revision: 17948
http://sourceforge.net/p/gate/code/17948
Author: ian_roberts
Date: 2014-05-11 00:08:34 +0000 (Sun, 11 May 2014)
Log Message:
-----------
Fleshing out the documentation for TwitIE.
Modified Paths:
--------------
userguide/branches/release-8.0/misc-creole.tex
userguide/branches/release-8.0/sections.map
userguide/branches/release-8.0/tao_main.tex
Added Paths:
-----------
userguide/branches/release-8.0/social-media.tex
Modified: userguide/branches/release-8.0/misc-creole.tex
===================================================================
--- userguide/branches/release-8.0/misc-creole.tex	2014-05-10 22:35:37 UTC (rev 17947)
+++ userguide/branches/release-8.0/misc-creole.tex	2014-05-11 00:08:34 UTC (rev 17948)
@@ -3225,74 +3225,6 @@
and if one GATE document should be created per CSV file or per row within a
file.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\sect[sec:creole:tweet]{Twitter processing}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-The \verb!Twitter! plugin contains several tools useful for processing tweets.
-This plugin depends on the \verb!Tagger_Stanford! plugin, which must be loaded
-first.
-%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsect[sec:creole:tweetformat]{Twitter JSON format}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%
-This plugin contains a format analyser for JSON files from the Twitter
-API\footnote{\url{https://dev.twitter.com/docs/platform-objects/tweets}}.
-Loading the plugin registers the document format with GATE, so that it will be
-automatically associated with files whose names end in ``\verb!.json!'';
-otherwise you need to specify \verb!text/x-json-twitter! for the document
-mimeType parameter. This will work both when directly creating a single new
-GATE document and when populating a corpus.
-
-Each tweet object's \verb!text! value is converted into the document content,
-which is covered with a \emph{Tweet} annotations whose features represent
-(recursively when appropriate, using \emph{Map} and \emph{List}) all the other
-key-value pairs in the tweet object. \textbf{Note:} these recursive values are
-difficult to work with in JAPE; the special corpus population tool described
-next allows important key-sequences to be ``brought up'' to the document
-content and the top level of the annotation features.
-
-Multiple tweet objects in the same JSON file are separated by blank lines
-(which are not covered by \emph{Tweet} annotations).
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsect[sec:creole:population]{Corpus population from JSON files}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-Loading this plugin adds a ``Populate from Twitter JSON files'' option to the
-GATE Corpus right-click menu. Selecting this option bringgs up a dialog that
-allows you to select one or more files of tweets in the Twitter API's JSON
-format and set the following options to populate the corpus.
-%%
-\begin{description}
-\item[Encoding] The default here is UTF-8 (regardless of your Java default) to
- conform to Twitter JSON.
-\item[One document per tweet] If this box is ticked, each tweet will produce a
- separate document. If not (the default), each input file will produce one
- GATE document.
-\item[Content keys] The values of these JSON keys are converted into strings and
- concatenated into each tweet's document content. Colon-delimited strings
- specify nested keys, e.g., ``\texttt{user:name}'' will yield the value of the
- \texttt{name} key in the map that is the value of the \texttt{user} key.
- Missing key sequences are ignored. Each span of text will be covered by an
- annotation whose type is the key sequence.
-\item[Feature keys] The key sequences and values of these JSON keys (where
- present) are turned into feature names and values on the tweet's main
- \texttt{Tweet} annotation.
-\item[Save configuration] This button saves the current options in an XML file
- for re-use later.
-\item[Load configuration] This button sets the options according to a saved XML
- configuration.
-\end{description}
-%%
-Every tweet is covered by a \texttt{Tweet} annotation with features specified by
-the ``feature keys'' option. Multiple tweets in the same GATE document are
-separated by a blank line (two newlines).
-
-Corpus population from Twitter JSON files is also accessible programmatically
-when this plugin is loaded, using the public static void method
-\texttt{gate.corpora.twitter.Population.populateCorpus(final Corpus corpus, URL
-  inputUrl, String encoding, List<String> contentKeys, List<String> featureKeys,
- int tweetsPerDoc)}.
-%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\sect[sec:creole:termraider]{TermRaider term extraction tools}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TermRaider is a set of term extraction and scoring tools developed in the NeOn
Modified: userguide/branches/release-8.0/sections.map
===================================================================
--- userguide/branches/release-8.0/sections.map	2014-05-10 22:35:37 UTC (rev 17947)
+++ userguide/branches/release-8.0/sections.map	2014-05-11 00:08:34 UTC (rev 17948)
@@ -183,3 +183,6 @@
sec:mvc sec:design:mvc
sec:interfaces sec:design:interfaces
sec:exceptions sec:design:exceptions
+## Twitter
+sec:creole:tweet sec:social:twitter
+sec:creole:tweetformat sec:social:twitter:format
Added: userguide/branches/release-8.0/social-media.tex
===================================================================
--- userguide/branches/release-8.0/social-media.tex	                        (rev 0)
+++ userguide/branches/release-8.0/social-media.tex	2014-05-11 00:08:34 UTC (rev 17948)
@@ -0,0 +1,188 @@
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% social-media.tex
+%
+% Ian Roberts, May 2014
+%
+% $Id: uima.tex,v 1.3 2006/10/21 11:44:47 ian Exp $
+%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\chapt[chap:social]{Tools for Social Media Data}
+\markboth{Tools for Social Media Data}{Tools for Social Media Data}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\nnormalsize
+
+Social media provides data that is highly valuable to many organizations, for
+example as a way to track public opinion about a company's products or to
+discover attitudes towards ``hot topics'' and breaking news stories. However,
+processing social media text presents a set of unique challenges, and text
+processing tools designed to work on longer, better-formed texts such as
+news articles tend to perform badly on social media. Obtaining reasonable
+results on short, inconsistent and ungrammatical texts such as these requires
+tools that are specifically tuned to deal with them.
+
+This chapter discusses the tools provided by GATE for use with social media
+data.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:social:twitter]{Tools for Twitter}
+
+The \verb!Twitter! plugin contains several tools useful for processing tweets:
+tools to load documents into GATE from the JSON format provided by the Twitter
+APIs, a tokeniser and POS tagger tuned specifically for Tweets, a tool to
+split up multi-word hashtags, and an example named entity recognition
+application called {\em TwitIE} which demonstrates all these components
+working together. The plugin depends on the \verb!Tagger_Stanford! plugin,
+which must be loaded first.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsect[sec:social:twitter:format]{Twitter JSON format}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%
+Twitter provides APIs to search for Tweets according to various criteria, and
+to collect streams of Tweets in real-time. These APIs return the Tweets in a
+structured JSON format%
+\footnote{\url{https://dev.twitter.com/docs/platform-objects/tweets}} which
+includes the text of the Tweet plus a large amount of supporting metadata.
+The GATE \verb!Twitter! plugin contains a format analyser for this JSON format
+which allows you to load a file of one or more JSON Tweets into a GATE
+document. Loading the plugin registers the document format with GATE, so that
+it will be automatically associated with files whose names end in
+``\verb!.json!''; otherwise you need to specify \verb!text/x-json-twitter! for
+the document mimeType parameter. This will work both when directly creating a
+single new GATE document and when populating a corpus.
+
+Each tweet object's \verb!text! value is converted into the document content,
+which is covered with a \emph{Tweet} annotation whose features represent
+(recursively when appropriate, using \emph{Map} and \emph{List}) all the other
+key-value pairs in the tweet object. \textbf{Note:} these recursive values are
+difficult to work with in JAPE; the special corpus population tool described
+next allows important key-sequences to be ``brought up'' to the document
+content and the top level of the annotation features.
+
+Multiple tweet objects in the same JSON file are separated by blank lines
+(which are not covered by \emph{Tweet} annotations).
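
The mapping from tweet objects to document content and \emph{Tweet} annotation
spans can be sketched as follows. This is an illustrative Python sketch, not
the plugin's actual implementation, and the two tweet objects are invented
minimal examples:

```python
import json

# Two minimal tweet objects as the Twitter API would return them
# (almost all metadata fields omitted for brevity).
raw = '''
{"id": 1, "text": "Hello GATE!", "user": {"name": "Alice"}}
{"id": 2, "text": "Tweets are short.", "user": {"name": "Bob"}}
'''

# Concatenate each tweet's "text" value into the document content,
# separated by a blank line, and record the span that a Tweet
# annotation would cover, with the remaining key-value pairs kept
# as that annotation's features.
content, spans = "", []
for line in filter(None, map(str.strip, raw.splitlines())):
    tweet = json.loads(line)
    start = len(content)
    content += tweet["text"]
    spans.append((start, len(content), tweet))  # Tweet annotation span
    content += "\n\n"                           # blank-line separator

print(spans[0][:2])  # span of the first Tweet annotation: (0, 11)
```

The blank lines between tweets fall outside every recorded span, matching the
behaviour described above.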
+
+As well as the document format parser to load Tweets into a single GATE
+document, the plugin provides a ``Populate from Twitter JSON files'' option on
+the GATE Corpus right-click menu. Selecting this option brings up a dialog
+that allows you to select one or more files of tweets in the Twitter API's
+JSON format and set the following options to populate the corpus.
+%%
+\begin{description}
+\item[Encoding] The default here is UTF-8 (regardless of your Java default) to
+ conform to Twitter JSON.
+\item[One document per tweet] If this box is ticked (the default), each tweet
+ will produce a separate document. If not, each {\em input file} will produce
+ one GATE document.
+\item[Content keys] The values of these JSON keys are converted into strings
+  and concatenated into each tweet's document content. Colon-delimited strings
+ specify nested keys, e.g., ``\texttt{user:name}'' will yield the value of the
+ \texttt{name} key in the map that is the value of the \texttt{user} key.
+ Missing key sequences are ignored. Each span of text will be covered by an
+ annotation whose type is the key sequence.
+\item[Feature keys] The key sequences and values of these JSON keys (where
+ present) are turned into feature names and values on the tweet's main
+ \texttt{Tweet} annotation.
+\item[Save configuration] This button saves the current options in an XML file
+ for re-use later.
+\item[Load configuration] This button sets the options according to a saved XML
+ configuration.
+\end{description}
+%%
+Every tweet is covered by a \texttt{Tweet} annotation with features specified by
+the ``feature keys'' option. Multiple tweets in the same GATE document are
+separated by a blank line (two newlines).
+
+Corpus population from Twitter JSON files is also accessible programmatically
+when this plugin is loaded, using the public static void method
+\texttt{gate.corpora.twitter.Population.populateCorpus(final Corpus corpus, URL
+  inputUrl, String encoding, List<String> contentKeys, List<String> featureKeys,
+ int tweetsPerDoc)}.
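
The colon-delimited key-sequence lookup used by the ``content keys'' and
``feature keys'' options can be sketched as follows (an illustrative Python
sketch with an invented helper name and an invented minimal tweet, not the
plugin's Java implementation):

```python
import json

# A minimal tweet object with one level of nesting.
tweet = json.loads(
    '{"text": "Hi!", "user": {"name": "Alice", "screen_name": "alice1"}}'
)

def lookup(obj, key_sequence):
    """Resolve a colon-delimited key sequence such as 'user:name'
    against a nested JSON object. Returns None when any key in the
    sequence is absent, mirroring the population tool's behaviour of
    ignoring missing key sequences."""
    for key in key_sequence.split(":"):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

print(lookup(tweet, "user:name"))    # Alice
print(lookup(tweet, "coordinates"))  # None (missing key sequence ignored)
```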
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:social:twitter:prs]{Low-level PRs for Tweets}
+
+The \verb!Twitter! plugin provides a number of low-level language processing
+components that are specifically tuned to Twitter data.
+
+The ``Twitter Tokenizer'' PR is a specialization of the ANNIE English Tokeniser
+for use with Tweets. There are a number of differences in the way this
+tokeniser divides up the text compared to the default ANNIE PR:
+%
+\begin{itemize}
+\item URLs and abbreviations (such as ``gr8'' or ``2day'') are treated as a
+ single token.
+\item User mentions (\verb!@username!) are two tokens, one for the \verb!@! and
+ one for the username.
+\item Hashtags are likewise two tokens (the hash and the tag), but see below
+ for another component that can split up multi-word hashtags.
+\item ``Emoticons'' such as \verb!:-D! can be treated as a single token. This
+  requires a gazetteer of emoticons to be run before the tokeniser; an example
+  gazetteer is provided in the Twitter plugin.
+\end{itemize}
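
The tokenisation rules listed above can be sketched with a single regular
expression (an illustrative Python sketch, not the actual tokeniser shipped
with the plugin, which is considerably more sophisticated):

```python
import re

# User mentions and hashtags become two tokens ('@'/'#' plus the name),
# while URLs and abbreviations such as "gr8" stay as single tokens.
TOKEN = re.compile(r"""
      https?://\S+        # a URL is a single token
    | [@#]                # '@' or '#' is its own token...
    | \w+                 # ...followed by the username/tag; also plain words
    | [^\w\s]             # any other punctuation character
""", re.VERBOSE)

print(TOKEN.findall("@alice check http://gate.ac.uk #nlp gr8"))
```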
+
+The ``Tweet Normaliser'' PR uses a spelling correction dictionary to correct
+mis-spellings and a Twitter-specific dictionary to expand common abbreviations
+and substitutions. It replaces the \verb!string! feature on matching tokens
+with the normalised form, preserving the original string value in the
+\verb!origString! feature.
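
The normaliser's handling of the \verb!string! and \verb!origString! features
can be sketched as follows (an illustrative Python sketch; the dictionary
entries and the \texttt{normalise} helper are invented, and the plugin ships
its own dictionary resources):

```python
# Invented sample of a Twitter-specific abbreviation dictionary.
NORM = {"gr8": "great", "2day": "today", "u": "you"}

def normalise(tokens):
    """For each token string, replace the 'string' feature with the
    normalised form where the dictionary matches, preserving the
    original text in 'origString'."""
    out = []
    for string in tokens:
        feats = {"string": NORM.get(string.lower(), string)}
        if string.lower() in NORM:
            feats["origString"] = string
        out.append(feats)
    return out

print(normalise(["c", "u", "2day"]))
```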
+
+The ``Twitter POS Tagger'' PR uses the Stanford Tagger
+(section~\ref{sec:misc:creole:stanford}) with a model trained on Tweets. The
+POS tagger can take advantage of expanded strings produced by the normaliser
+PR.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:social:twitter:hashtag]{Handling multi-word hashtags}
+
+When rendering a Tweet on the web, Twitter automatically converts contiguous
+sequences of alpha-numeric characters following a hash (\verb!#!) into links
+to search for other Tweets that include the same string. Thus ``hashtags''
+have rapidly become the de-facto standard way to mark a Tweet as relating to a
+particular theme, event, brand name, etc. Since hashtags cannot contain white
+space, it is common for users to form hashtags by running together a number of
+separate words, sometimes in ``camel case'' form but sometimes simply all in
+lower (or upper) case, for example ``\#worldgonemad'' (as search queries on
+Twitter are not case-sensitive).
+
+The ``Hashtag Tokenizer'' PR attempts to recover the original discrete words
+from such multi-word hashtags. It uses a large gazetteer of common English
+words, organization names, locations, etc. as well as slang words and
+contractions without the use of apostrophes (since hashtags are alphanumeric,
+words like ``wouldn't'' tend to be expressed as ``wouldnt'' without the
+apostrophe). Camel-cased hashtags (\verb!#CamelCasedHashtag!) are split at
+case changes.
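
The camel-case part of this splitting can be sketched with a short regular
expression (an illustrative Python sketch of the case-change rule only; the
real Hashtag Tokenizer additionally uses its gazetteers to segment all-lower
or all-upper hashtags such as ``\#worldgonemad''):

```python
import re

def split_camel_hashtag(hashtag):
    """Split a camel-cased hashtag at case changes, keeping runs of
    capitals (acronyms) and digits together."""
    body = hashtag.lstrip("#")
    return re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", body)

print(split_camel_hashtag("#CamelCasedHashtag"))  # ['Camel', 'Cased', 'Hashtag']
```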
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:social:twitie]{The TwitIE Pipeline}
+
+The Twitter plugin includes a sample ready-made application called TwitIE,
+which combines the PRs described above with additional resources borrowed from
+ANNIE and the TextCat language identification PR to produce a general-purpose
+named entity recognition pipeline for use with Tweets. TwitIE includes the
+following components:
+
+\begin{itemize}
+\item Annotation Set Transfer to transfer Tweet annotations from the Original
+ markups annotation set. For documents loaded using the JSON document format
+ or corpus population logic, this means that each Tweet will be covered by a
+ separate Tweet annotation in the final output of TwitIE.
+\item \emph{Language identification} PR (see
+ section~\ref{sec:misc-creole:language-identification}) using language models
+ trained on English, French, German, Dutch and Spanish Tweets. This creates a
+ feature \verb!lang! on each Tweet annotation giving the detected language.
+\item \emph{Twitter tokenizer} described above, including a gazetteer of
+ emoticons.
+\item \emph{Hashtag tokenizer} to split up hashtags consisting of multiple
+ words.
+\item The standard ANNIE \emph{gazetteer} and \emph{sentence splitter}.
+\item \emph{Normaliser} and \emph{POS tagger} described above.
+\item Named entity JAPE grammars, based largely on the ANNIE defaults but with
+ some customizations.
+\end{itemize}
+
+Full details of the TwitIE pipeline can be found in \cite{bontcheva2013twitie}.
+
+% vim:ft=tex
Modified: userguide/branches/release-8.0/tao_main.tex
===================================================================
--- userguide/branches/release-8.0/tao_main.tex	2014-05-10 22:35:37 UTC (rev 17947)
+++ userguide/branches/release-8.0/tao_main.tex	2014-05-11 00:08:34 UTC (rev 17948)
@@ -689,6 +689,7 @@
\input{ontologies} %final for book (a couple of small overfulls in listings,
but not bad)
\input{language-creole} %final for book
\input{domain-creole}
+\input{social-media}
\input{parsers} %final for book
\input{machine-learning} %final for book
\input{alignment} %final for book
This was sent by the SourceForge.net collaborative development platform, the
world's largest Open Source development site.
_______________________________________________
GATE-cvs mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/gate-cvs