[gate-cvs] SF.net SVN: gate:[18448] userguide/trunk

ian_roberts Sat, 08 Nov 2014 09:26:45 -0800

Revision: 18448
          http://sourceforge.net/p/gate/code/18448
Author:   ian_roberts
Date:     2014-11-08 17:26:05 +0000 (Sat, 08 Nov 2014)
Log Message:
-----------
Documentation for the latest set of JSON input/output changes.


Modified Paths:
--------------
    userguide/trunk/social-media.tex

Added Paths:
-----------
    userguide/trunk/save-as-json.png

Added: userguide/trunk/save-as-json.png
===================================================================
(Binary files differ)

Index: userguide/trunk/save-as-json.png
===================================================================
--- userguide/trunk/save-as-json.png    2014-11-08 15:19:27 UTC (rev 18447)
+++ userguide/trunk/save-as-json.png    2014-11-08 17:26:05 UTC (rev 18448)

Property changes on: userguide/trunk/save-as-json.png
___________________________________________________________________
Added: svn:mime-type
## -0,0 +1 ##
+image/png
\ No newline at end of property
Modified: userguide/trunk/social-media.tex
===================================================================
--- userguide/trunk/social-media.tex    2014-11-08 15:19:27 UTC (rev 18447)
+++ userguide/trunk/social-media.tex    2014-11-08 17:26:05 UTC (rev 18448)
@@ -31,14 +31,14 @@
 
 The \verb!Twitter! plugin contains several tools useful for processing tweets.
 This plugin depends on the \verb!Stanford_CoreNLP! plugin, which must be loaded
-first.  This includes tools to load documents into GATE from the JSON format
-provided by the Twitter APIs, a tokeniser and POS tagger tuned specifically for
-Tweets, a tool to split up multi-word hashtags, and an example named entity
-recognition application called {\em TwitIE} which demonstrates all these
-components working together.
+first.  This includes tools to load and save documents in GATE using the JSON
+format provided by the Twitter APIs, a tokeniser and POS tagger tuned
+specifically for Tweets, a tool to split up multi-word hashtags, and an example
+named entity recognition application called {\em TwitIE} which demonstrates all
+these components working together.
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsect[sec:social:twitter:format]{Twitter JSON format}
+\sect[sec:social:twitter:format]{Twitter JSON format}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %
 Twitter provides APIs to search for Tweets according to various criteria, and
@@ -48,19 +48,35 @@
 includes the text of the Tweet plus a large amount of supporting metadata.
 The GATE \verb!Twitter! plugin contains a format analyser for this JSON format
 which allows you to load a file of one or more JSON Tweets into a GATE
-document.  Loading the plugin registers the document format with GATE, so that
-it will be automatically associated with files whose names end in
-``\verb!.json!''; otherwise you need to specify \verb!text/x-json-twitter! for
-the document mimeType parameter.  This will work both when directly creating a
-single new GATE document and when populating a corpus.
+document.  The format analyser can handle multiple Tweets in one file,
+represented as any of:
+\begin{itemize}
+\item a top-level JSON array \verb![{...},{...}]!
+\item a top-level JSON object containing properties ``search\_metadata'' and
+  ``statuses'', where the ``statuses'' property is an array of Tweets (this is
+  the format returned by a call to Twitter's ``search'' API)
+\item or simply concatenated together, optionally with white space or newline
+  characters between adjacent objects (this is the format returned by Twitter's
+  streaming APIs).
+\end{itemize}
+Loading the plugin registers the
+document format with GATE, so that it will be automatically associated with
+files whose names end in ``\verb!.json!''; otherwise you need to specify
+\verb!text/x-json-twitter! for the document mimeType parameter.  This will work
+both when directly creating a single new GATE document and when populating a
+corpus.
 
-Each tweet object's \verb!text! value is converted into the document content,
+Each tweet object's \verb!text! value is converted into the document
+content\footnote{HTML entity references \texttt{\&amp;}, \texttt{\&lt;} and
+\texttt{\&gt;} are decoded into the corresponding characters},
 which is covered with a \emph{Tweet} annotations whose features represent
 (recursively when appropriate, using \emph{Map} and \emph{List}) all the other
 key-value pairs in the tweet object.  \textbf{Note:} these recursive values are
 difficult to work with in JAPE; the special corpus population tool described
 next allows important key-sequences to be ``brought up'' to the document content
-and the top level of the annotation features.
+and the top level of the annotation features.  Any entities described by the
+standoff markup ``entities'' JSON property will be converted into their
+corresponding GATE annotations (see below for details).
 
 Multiple tweet objects in the same JSON file are separated by blank lines (which
 are not covered by \emph{Tweet} annotations).
@@ -77,6 +93,9 @@
 \item[One document per tweet] If this box is ticked (the default), each tweet
   will produce a separate document.  If not, each {\em input file} will produce
   one GATE document.
+\item[Annotations for ``entities''] If this box is ticked (the default), any
+  entities described by the standoff markup ``entities'' JSON property will be
+  converted into their corresponding GATE annotations (see below).
 \item[Content keys] The values of these JSON keys are converted into strings and
   concatenated into each tweet's document content.  Colon-delimited strings
   specify nested keys, e.g., ``\texttt{user:name}'' will yield the value of the
@@ -92,17 +111,126 @@
   configuration.
 \end{description}
 %%
-Every tweet is covered by a \texttt{Tweet} annotation with features specified by
-the ``feature keys'' option.  Multiple tweets in the same GATE document are
-separated by a blank line (two newlines).
+Again, the input can be in any of the three formats discussed above (an array
+of Tweets, a search result, or a stream of concatenated objects).
+Every tweet in the resulting GATE documents is covered by a \texttt{Tweet}
+annotation with features specified by the ``feature keys'' option.  Multiple
+tweets in the same GATE document are separated by a blank line (two newlines).
 
 Corpus population from Twitter JSON files is also accessible programmatically
 when this plugin is loaded, using the public static void method
 \texttt{gate.corpora.twitter.Population.populateCorpus(final Corpus corpus, URL
   inputUrl, String encoding, List<String> contentKeys, List<String> featureKeys,
-  int tweetsPerDoc)}.
+  int tweetsPerDoc, boolean processEntities)}.
 
+\subsect[sec:social:twitter:entities]{Entity annotations in JSON}
+
+Twitter's JSON format provides a mechanism to represent annotations over the
+Tweet text as standoff markup, via a JSON property named ``entities''.  The
+value of this property is an object with one property for each entity
+\emph{type}, whose value is a list of objects representing the individual
+annotations.  Within each individual entity object, the ``indices'' property
+gives start and end character offsets of the annotation within the Tweet text.
+
+\begin{verbatim}
+{
+  "text":"@some_user this is a nice #example",
+  "entities":{
+    "user_mentions":[
+      {
+        "indices":[0,10],
+        "screen_name":"some_user",
+        ...
+      }
+    ],
+    "hashtags":[
+      {
+        "indices":[26,34],
+        "text":"example"
+      }
+    ]
+  }
+}
+\end{verbatim}
+
+Both the single document format parser and the corpus population tool are able
+to convert this structure into GATE annotations.  The entity type (e.g.
+\verb!user_mentions!) becomes the annotation type, the \verb!indices!  property
+provides the offsets, and the other properties become features of the generated
+annotation.
+
+By default, the entity annotations are created in the ``Original markups''
+annotation set, as is the usual convention for annotations generated by a
+document format.  However, if the entity type contains a colon character (e.g.
+\verb!"Key:Person":[...]!) then the portion before the colon is taken to be an
+annotation set name and the portion after the colon is the annotation type (in
+this example, a ``Person'' annotation in the ``Key'' annotation set).  An
+empty annotation set name (i.e. \verb!":Person"!) creates the corresponding
+annotations in the default annotation set.  This scheme is designed to be
+compatible with the GATE JSON export mechanism described in the next section.
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:social:twitter:export]{Exporting GATE documents as JSON}
+
+Loading the \verb!Twitter! plugin also adds a ``GATE JSON'' option to the
+``Save as\ldots'' right-click menu on documents and corpora, to export GATE
+documents in the Twitter-style JSON format.  This tool can save a document or
+corpus of documents as a single file where each Tweet in the document or corpus
+is represented as a JSON object, and the set of objects are represented either
+as a single top-level JSON array (\verb![{...},{...}]!) or simply as one object
+per line (as per Twitter's streaming APIs).  This exporter can be used for any
+GATE document, not just for documents that were initially loaded from Twitter
+JSON format, and can be used as a much more compact alternative to GATE XML, or
+as an easy-to-parse interchange format to pass GATE-annotated documents to
+non-GATE tools.
+
+The format is the same as Twitter's -- the text becomes a property ``text'' in
+the JSON, and annotations are represented as standoff markup in the
+``entities'' property, which is an object whose keys are annotation types and
+whose corresponding values are arrays of objects representing the annotations.
+
+\begin{figure}[htb]
+  \centering
+  \includegraphics[width=0.8\textwidth]{save-as-json.png}
+  \caption{Options dialog for saving a document or corpus as JSON}
+  \label{fig:social:save-as-json}
+\end{figure}
+
+The available options for the JSON exporter are shown in
+figure~\ref{fig:social:save-as-json}.  In detail, they are:
+\begin{description}
+\item[documentAnnotationASName/documentAnnotationType] the annotation set and
+  type that should be treated as covering each span of text that should be output
+  as a separate JSON object.  By default this is annotations of type ``Tweet'' in
+  the ``Original markups'' set (i.e. the annotations covering individual Tweets
+  parsed by the JSON document format parser or corpus population tool).  If a
+  document contains any annotations of the specified type then one JSON object
+  will be output for each such annotation $X$, with the text and entity
+  annotations constrained to the span of $X$.  In addition, features of $X$
+  will become top-level properties of the resulting JSON object.  Text that is
+  not covered by any such annotation will not be saved.  If there are no
+  document annotations found in a particular document (or if the
+  documentAnnotationType parameter is unset) then the whole of the document
+  text will be output as a single JSON object.
+\item[entitiesAnnotationSetName] the primary annotation set that should be
+  scanned for entity annotations.
+\item[annotationTypes] the entity annotation types to output.
+\item[exportAsArray] if true, output the objects as a top-level JSON array.  If
+  false (the default), output the JSON objects directly at the top level,
+  separated by newlines.
+\end{description}
+
+Annotation types to be saved can be specified in two ways.  Plain annotation
+type names such as ``Person'' will be taken from the specified
+\emph{entitiesAnnotationSetName}, but if a type name contains a colon character
+(e.g. ``Key:Person'') then the portion before the colon is treated as the
+annotation set name and the portion after the colon as the annotation type.
+The full name including the colon will be used as the type label in the
+``entities'' object, so if the resulting JSON were re-loaded into GATE the
+annotations would be re-created in the same annotation sets they originally
+came from.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \sect[sec:social:twitter:prs]{Low-level PRs for Tweets}
 
 The \verb!Twitter! plugin provides a number of low-level language processing components that are specifically tuned to Twitter data.
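
Illustrative note (not part of the commit): the input handling and entity
conversion documented in this change can be sketched in a few lines of Python.
This is a minimal sketch under stated assumptions, not the plugin's actual
Java implementation. The function names iter_tweets and entity_annotations are
hypothetical, annotations are modelled as plain tuples rather than GATE
objects, and the sketch ignores the HTML entity decoding mentioned in the
footnote, so offsets are taken directly against the raw "text" value.

```python
import json


def iter_tweets(raw):
    """Yield tweet objects from any of the three documented layouts:
    a top-level array, a search result, or concatenated objects."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(raw)
    while pos < end:
        # raw_decode does not skip leading white space, so do it here.
        while pos < end and raw[pos].isspace():
            pos += 1
        if pos >= end:
            break
        value, pos = decoder.raw_decode(raw, pos)
        if isinstance(value, list):
            # Top-level JSON array of tweets.
            yield from value
        elif (isinstance(value, dict)
              and "search_metadata" in value and "statuses" in value):
            # Search API result: the tweets live in "statuses".
            yield from value["statuses"]
        else:
            # A single tweet object from a concatenated stream.
            yield value


def entity_annotations(tweet, default_set="Original markups"):
    """Turn the "entities" property into
    (set_name, type, start, end, covered_text, features) tuples.
    An empty set_name stands for the default annotation set."""
    text = tweet.get("text", "")
    result = []
    for label, items in tweet.get("entities", {}).items():
        if ":" in label:
            # "Key:Person" -> set "Key", type "Person"; ":Person" -> set "".
            set_name, ann_type = label.split(":", 1)
        else:
            set_name, ann_type = default_set, label
        for item in items:
            start, stop = item["indices"]
            # Every property other than "indices" becomes a feature.
            features = {k: v for k, v in item.items() if k != "indices"}
            result.append(
                (set_name, ann_type, start, stop, text[start:stop], features))
    return result
```

Applied to the example tweet in the new documentation, the user_mentions
entry with indices [0,10] covers "@some_user" and the hashtags entry with
indices [26,34] covers "#example".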
