Revision: 18448
http://sourceforge.net/p/gate/code/18448
Author: ian_roberts
Date: 2014-11-08 17:26:05 +0000 (Sat, 08 Nov 2014)
Log Message:
-----------
Documentation for the latest set of JSON input/output changes.
Modified Paths:
--------------
userguide/trunk/social-media.tex
Added Paths:
-----------
userguide/trunk/save-as-json.png
Added: userguide/trunk/save-as-json.png
===================================================================
(Binary files differ)
Index: userguide/trunk/save-as-json.png
===================================================================
--- userguide/trunk/save-as-json.png 2014-11-08 15:19:27 UTC (rev 18447)
+++ userguide/trunk/save-as-json.png 2014-11-08 17:26:05 UTC (rev 18448)
Property changes on: userguide/trunk/save-as-json.png
___________________________________________________________________
Added: svn:mime-type
## -0,0 +1 ##
+image/png
\ No newline at end of property
Modified: userguide/trunk/social-media.tex
===================================================================
--- userguide/trunk/social-media.tex 2014-11-08 15:19:27 UTC (rev 18447)
+++ userguide/trunk/social-media.tex 2014-11-08 17:26:05 UTC (rev 18448)
@@ -31,14 +31,14 @@
The \verb!Twitter! plugin contains several tools useful for processing tweets.
This plugin depends on the \verb!Stanford_CoreNLP! plugin, which must be loaded
-first. This includes tools to load documents into GATE from the JSON format
-provided by the Twitter APIs, a tokeniser and POS tagger tuned specifically for
-Tweets, a tool to split up multi-word hashtags, and an example named entity
-recognition application called {\em TwitIE} which demonstrates all these
-components working together.
+first. This includes tools to load and save documents in GATE using the JSON
+format provided by the Twitter APIs, a tokeniser and POS tagger tuned
+specifically for Tweets, a tool to split up multi-word hashtags, and an example
+named entity recognition application called {\em TwitIE} which demonstrates all
+these components working together.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsect[sec:social:twitter:format]{Twitter JSON format}
+\sect[sec:social:twitter:format]{Twitter JSON format}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
Twitter provides APIs to search for Tweets according to various criteria, and
@@ -48,19 +48,35 @@
includes the text of the Tweet plus a large amount of supporting metadata.
The GATE \verb!Twitter! plugin contains a format analyser for this JSON format
which allows you to load a file of one or more JSON Tweets into a GATE
-document. Loading the plugin registers the document format with GATE, so that
-it will be automatically associated with files whose names end in
-``\verb!.json!''; otherwise you need to specify \verb!text/x-json-twitter! for
-the document mimeType parameter. This will work both when directly creating a
-single new GATE document and when populating a corpus.
+document. The format analyser can handle multiple Tweets in one file,
+represented as any of:
+\begin{itemize}
+\item a top-level JSON array \verb![{...},{...}]!
+\item a top-level JSON object containing properties ``search\_metadata'' and
+ ``statuses'', where the ``statuses'' property is an array of Tweets (this is
+ the format returned by a call to Twitter's ``search'' API)
+\item or simply concatenated together, optionally with white space or newline
+ characters between adjacent objects (this is the format returned by Twitter's
+ streaming APIs).
+\end{itemize}
+Loading the plugin registers the
+document format with GATE, so that it will be automatically associated with
+files whose names end in ``\verb!.json!''; otherwise you need to specify
+\verb!text/x-json-twitter! for the document mimeType parameter. This will work
+both when directly creating a single new GATE document and when populating a
+corpus.
-Each tweet object's \verb!text! value is converted into the document content,
+Each tweet object's \verb!text! value is converted into the document
+content\footnote{HTML entity references \texttt{\&amp;}, \texttt{\&lt;} and
+\texttt{\&gt;} are decoded into the corresponding characters},
which is covered with a \emph{Tweet} annotation whose features represent
(recursively when appropriate, using \emph{Map} and \emph{List}) all the other
key-value pairs in the tweet object. \textbf{Note:} these recursive values are
difficult to work with in JAPE; the special corpus population tool described
next allows important key-sequences to be ``brought up'' to the document content
-and the top level of the annotation features.
+and the top level of the annotation features. Any entities described by the
+standoff markup ``entities'' JSON property will be converted into their
+corresponding GATE annotations (see below for details).
Multiple tweet objects in the same JSON file are separated by blank lines (which
are not covered by \emph{Tweet} annotations).
@@ -77,6 +93,9 @@
\item[One document per tweet] If this box is ticked (the default), each tweet
will produce a separate document. If not, each {\em input file} will produce
one GATE document.
+\item[Annotations for ``entities''] If this box is ticked (the default), any
+ entities described by the standoff markup ``entities'' JSON property will be
+ converted into their corresponding GATE annotations (see below).
\item[Content keys] The values of these JSON keys are converted into strings and
concatenated into each tweet's document content. Colon-delimited strings
specify nested keys, e.g., ``\texttt{user:name}'' will yield the value of the
@@ -92,17 +111,126 @@
configuration.
\end{description}
%%
-Every tweet is covered by a \texttt{Tweet} annotation with features specified by
-the ``feature keys'' option. Multiple tweets in the same GATE document are
-separated by a blank line (two newlines).
+Again, the input can be in any of the three formats discussed above (an array
+of Tweets, a search result, or a stream of concatenated objects).
+Every tweet in the resulting GATE documents is covered by a \texttt{Tweet}
+annotation with features specified by the ``feature keys'' option. Multiple
+tweets in the same GATE document are separated by a blank line (two newlines).
Corpus population from Twitter JSON files is also accessible programmatically
when this plugin is loaded, using the public static void method
\texttt{gate.corpora.twitter.Population.populateCorpus(final Corpus corpus, URL
inputUrl, String encoding, List<String> contentKeys, List<String> featureKeys,
- int tweetsPerDoc)}.
+ int tweetsPerDoc, boolean processEntities)}.
+\subsect[sec:social:twitter:entities]{Entity annotations in JSON}
+
+Twitter's JSON format provides a mechanism to represent annotations over the
+Tweet text as standoff markup, via a JSON property named ``entities''. The
+value of this property is an object with one property for each entity
+\emph{type}, whose value is a list of objects representing the individual
+annotations. Within each individual entity object, the ``indices'' property
+gives start and end character offsets of the annotation within the Tweet text.
+
+\begin{verbatim}
+{
+ "text":"@some_user this is a nice #example",
+ "entities":{
+ "user_mentions":[
+ {
+ "indices":[0,10],
+ "screen_name":"some_user",
+ ...
+ }
+ ],
+ "hashtags":[
+ {
+ "indices":[26,34],
+ "text":"example"
+ }
+ ]
+ }
+}
+\end{verbatim}
+
+Both the single document format parser and the corpus population tool are able
+to convert this structure into GATE annotations. The entity type (e.g.
+\verb!user_mentions!) becomes the annotation type, the \verb!indices! property
+provides the offsets, and the other properties become features of the generated
+annotation.
+
+By default, the entity annotations are created in the ``Original markups''
+annotation set, as is the usual convention for annotations generated by a
+document format. However, if the entity type contains a colon character (e.g.
+\verb!"Key:Person":[...]!) then the portion before the colon is taken to be an
+annotation set name and the portion after the colon is the annotation type (in
+this example, a ``Person'' annotation in the ``Key'' annotation set). An
+empty annotation set name (i.e. \verb!":Person"!) creates the corresponding
+annotations in the default annotation set. This scheme is designed to be
+compatible with the GATE JSON export mechanism described in the next section.
+
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:social:twitter:export]{Exporting GATE documents as JSON}
+
+Loading the \verb!Twitter! plugin also adds a ``GATE JSON'' option to the
+``Save as\ldots'' right-click menu on documents and corpora, to export GATE
+documents in the Twitter-style JSON format. This tool can save a document or
+corpus of documents as a single file where each Tweet in the document or corpus
+is represented as a JSON object, and the set of objects is represented either
+as a single top-level JSON array (\verb![{...},{...}]!) or simply as one object
+per line (as per Twitter's streaming APIs). This exporter can be used for any
+GATE document, not just for documents that were initially loaded from Twitter
+JSON format, and can be used as a much more compact alternative to GATE XML, or
+as an easy-to-parse interchange format to pass GATE-annotated documents to
+non-GATE tools.
+
+The format is the same as Twitter's -- the text becomes a property ``text'' in
+the JSON, and annotations are represented as standoff markup in the
+``entities'' property, which is an object whose keys are annotation types and
+whose corresponding values are arrays of objects representing the annotations.
+
+\begin{figure}[htb]
+ \centering
+ \includegraphics[width=0.8\textwidth]{save-as-json.png}
+ \caption{Options dialog for saving a document or corpus as JSON}
+ \label{fig:social:save-as-json}
+\end{figure}
+
+The available options for the JSON exporter are shown in
+figure~\ref{fig:social:save-as-json}. In detail, they are:
+\begin{description}
+\item[documentAnnotationASName/documentAnnotationType] the annotation set and
+ type that should be treated as covering each span of text that should be output
+ as a separate JSON object. By default this is annotations of type ``Tweet'' in
+ the ``Original markups'' set (i.e. the annotations covering individual Tweets
+ parsed by the JSON document format parser or corpus population tool). If a
+ document contains any annotations of the specified type then one JSON object
+ will be output for each such annotation $X$, with the text and entity
+ annotations constrained to the span of $X$. In addition, features of $X$
+ will become top-level properties of the resulting JSON object. Text that is
+ not covered by any such annotation will not be saved. If there are no
+ document annotations found in a particular document (or if the
+ documentAnnotationType parameter is unset) then the whole of the document
+ text will be output as a single JSON object.
+\item[entitiesAnnotationSetName] the primary annotation set that should be
+ scanned for entity annotations.
+\item[annotationTypes] the entity annotation types to output.
+\item[exportAsArray] if true, output the objects as a top-level JSON array. If
+ false (the default), output the JSON objects directly at the top level,
+ separated by newlines.
+\end{description}
+
+Annotation types to be saved can be specified in two ways. Plain annotation
+type names such as ``Person'' will be taken from the specified
+\emph{entitiesAnnotationSetName}, but if a type name contains a colon character
+(e.g. ``Key:Person'') then the portion before the colon is treated as the
+annotation set name and the portion after the colon as the annotation type.
+The full name including the colon will be used as the type label in the
+``entities'' object, so if the resulting JSON were re-loaded into GATE the
+annotations would be re-created in the same annotation sets they originally
+came from.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\sect[sec:social:twitter:prs]{Low-level PRs for Tweets}
The \verb!Twitter! plugin provides a number of low-level language processing
components that are specifically tuned to Twitter data.
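
The input conventions documented in this revision (three ways of packaging multiple Tweets, the ``indices'' offsets, and the colon-separated annotation set prefix) can be illustrated outside GATE. The following is a minimal Python sketch, not the plugin's implementation: the function names `split_tweets` and `entity_annotations` are hypothetical, the empty string is an assumed stand-in for GATE's default annotation set, and the plugin's decoding of HTML entity references in the text is not modelled.

```python
import json

# Assumption: set name used when the entity type has no colon prefix,
# following the convention described in the documentation above.
ORIGINAL_MARKUPS = "Original markups"


def split_tweets(raw):
    """Return the tweet objects in raw JSON text, whichever of the three
    documented layouts is used (array, search result, concatenated)."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while True:
        # raw_decode does not skip leading white space, so do it here.
        while pos < len(raw) and raw[pos].isspace():
            pos += 1
        if pos >= len(raw):
            break
        value, pos = decoder.raw_decode(raw, pos)
        values.append(value)

    tweets = []
    for value in values:
        if isinstance(value, list):
            # Top-level JSON array of tweets.
            tweets.extend(value)
        elif (isinstance(value, dict)
              and "search_metadata" in value and "statuses" in value):
            # Search API result: tweets live in the "statuses" array.
            tweets.extend(value["statuses"])
        else:
            # One object from a stream of concatenated tweets.
            tweets.append(value)
    return tweets


def entity_annotations(tweet):
    """Turn the "entities" standoff markup of one tweet into tuples of
    (set name, type, start, end, covered text, features)."""
    text = tweet["text"]
    annotations = []
    for label, entities in tweet.get("entities", {}).items():
        if ":" in label:
            # "Key:Person" -> set "Key", type "Person";
            # ":Person" -> "" (standing in for the default set).
            set_name, ann_type = label.split(":", 1)
        else:
            set_name, ann_type = ORIGINAL_MARKUPS, label
        for entity in entities:
            start, end = entity["indices"]
            # Every property other than "indices" becomes a feature.
            features = {k: v for k, v in entity.items() if k != "indices"}
            annotations.append(
                (set_name, ann_type, start, end, text[start:end], features))
    return annotations
```

Applied to the example Tweet from the documentation, `entity_annotations` pairs the `user_mentions` entry with the span covering `@some_user` and the `hashtags` entry with the span covering `#example`. Splitting the label on the first colon only means that any further colons stay in the annotation type; whether the plugin behaves the same way is not stated in the text above.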