Revision: 18530
          http://sourceforge.net/p/gate/code/18530
Author:   ian_roberts
Date:     2015-01-12 12:33:05 +0000 (Mon, 12 Jan 2015)
Log Message:
-----------
Documentation for the JSON streaming input and output handlers

Modified Paths:
--------------
    gcp/trunk/doc/batch-def.tex
    gcp/trunk/doc/gcp-guide.pdf

Modified: gcp/trunk/doc/batch-def.tex
===================================================================
--- gcp/trunk/doc/batch-def.tex 2015-01-12 02:19:51 UTC (rev 18529)
+++ gcp/trunk/doc/batch-def.tex 2015-01-12 12:33:05 UTC (rev 18530)
@@ -17,13 +17,16 @@
   \verb!<report file="../report.xml" />!
 
 \item[input] (required) specifies the input handler which will be the source of
-  documents to process.
+  documents to process.  Most handlers load documents one by one based on their
+  IDs, but certain handlers operate in a \emph{streaming} mode, processing a
+  block of documents in one pass.
 
 \item[output] (zero or more) specifies what to do with the documents once they
   have been processed.
 
-\item[documents] (required) specifies the document IDs to be processed, as any
-  combination of the child elements:
+\item[documents] (required, except when using a streaming input handler)
+  specifies the document IDs to be processed, as any combination of the child
+  elements:
   \bde
   \item[id] a single document ID. \verb!<id>bbc/article001.html</id>!
   \item[documentEnumerator] an enumerator that generates a list of IDs.  The
@@ -99,6 +102,14 @@
   (\url{http://crawler.archive.org}).
 \eit
 
+and one \emph{streaming} handler:
+
+\bit
+\item \verb!gate.cloud.io.json.JSONStreamingInputHandler! to read a stream of
+  documents from a single large JSON file (for example a collection of Tweets
+  from Twitter's streaming API).
+\eit
+
 \subsection{The {\tt FileInputHandler}}
 
 \verb!FileInputHandler! reads documents from individual files on the
@@ -231,6 +242,53 @@
 names are prefixed with ``http\_header\_'' and ARC/WARC record headers with
 ``arc\_header\_''.
 
+\subsection{The streaming JSON input handler}
+\label{sec:batch-def:json-input}
+
+An increasing number of services, most notably Twitter and social media
+aggregators such as DataSift, provide their data in JSON format.  Twitter
+offers streaming APIs that deliver Tweets as a continuous stream of JSON
+objects concatenated together, while DataSift typically delivers a large JSON
+array of documents.  The streaming JSON input handler can process either
+format, treating each JSON object in the ``stream'' as a separate GATE
+document.
+
+The \verb!gate.cloud.io.json.JSONStreamingInputHandler! accepts the following
+attributes:
+
+\bde
+\item[srcFile] the file containing the JSON objects (either as a top-level
+  array or simply concatenated together, optionally separated by whitespace).
+\item[idPointer] the ``path'' within each JSON object of the property that
+  represents the document identifier.  This is an expression in the \emph{JSON
+  Pointer}\footnote{\url{http://tools.ietf.org/html/draft-ietf-appsawg-json-pointer-03}}
+  language.  It consists of a forward slash followed by a sequence of property
+  names separated by further slashes.  A suitable value for the Twitter JSON
+  format would be \verb!/id_str! (the property named ``\verb!id_str!'' of the
+  object), and for DataSift \verb!/interaction/id! (the top-level object has an
+  ``interaction'' property whose value is an object; we want the ``id''
+  property of \emph{that} object).  Any object that does not have a property at
+  the specified path will be ignored.
+\item[compression] (optional) the compression format used by the
+  \verb!srcFile!, if any.  If the value is ``none'' (the default) then the file
+  is assumed not to be compressed; if the value is ``gzip'' then Java's native
+  GZIP decompression utilities will be used; otherwise the value is taken to be
+  the command line for a native decompression program that expects compressed
+  data on stdin and produces decompressed data on stdout, for example
+  \verb!"lzop -dc"! or \verb!"bunzip2"!.
+\item[mimeType] (optional but highly recommended) the value to pass as the
+  ``mimeType'' parameter when creating a GATE Document from the JSON string.
+  This will be used by GATE to select an appropriate document format parser, so
+  for Twitter JSON you should use \verb!"text/x-json-twitter"! and for DataSift
+  \verb!"text/x-json-datasift"!.  Note that the GATE plugin defining the
+  relevant format parser \emph{must} be loaded as part of your GATE
+  application.
+\ede
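+
+For example, given a simplified DataSift-style object (hypothetical values,
+for illustration only)
+
+\begin{verbatim}
+{ "interaction": { "id": "abc123", "content": "..." } }
+\end{verbatim}
+
+the pointer \verb!/interaction/id! selects the document ID ``abc123'', whereas
+\verb!/id_str! would match nothing and the object would be ignored.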
+
+This is a streaming handler -- it will process all documents in the JSON bundle
+and does \emph{not} require a \verb!documents! section in the batch
+specification.  As with other input handlers, when restarting a failed batch,
+documents that were successfully processed in the previous run will be skipped.
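+
+Assuming the handler class is given via the \verb!class! attribute as with the
+other input handlers, a minimal \verb!<input>! element for Twitter data
+(hypothetical file name) might look like:
+
+\begin{verbatim}
+<input class="gate.cloud.io.json.JSONStreamingInputHandler"
+       srcFile="tweets.json.gz"
+       idPointer="/id_str"
+       compression="gzip"
+       mimeType="text/x-json-twitter" />
+\end{verbatim}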
+
 \section{Specifying the Output Handlers}
 
 Output handlers are responsible for taking the GATE Documents that have been
@@ -292,8 +350,12 @@
   XCES standoff format.  Annotation offsets in XCES refer to the plain text as
   saved by a \verb!PlainTextOutputHandler!.
 \item \verb!gate.cloud.io.file.JSONOutputHandler! to save documents in a JSON
-  format modelled on that used by Twitter to represent "entities" in Tweets.
-\item \verb!gate.cloid.io.file.SerializedObjectOutputHandler! to save documents
+  format modelled on that used by Twitter to represent ``entities'' in Tweets.
+\item \verb!gate.cloud.io.json.JSONStreamingOutputHandler! to save documents
+  in the same JSON format as the previous handler, but concatenated together
+  in one or more output batches rather than saving each document in its own
+  individual output file.
+\item \verb!gate.cloud.io.file.SerializedObjectOutputHandler! to save documents
   using Java's built in \emph{object serialization} protocol (with optional
   compression).  This handler ignores annotation filters, and always writes
   the complete document.  This is the same mechanism used by GATE's
@@ -393,7 +455,7 @@
 This handler supports a number of additional \verb!<output>! attributes to
 control the format.
 
-\begin{description}
+\bde
 \item[groupEntitiesBy] controls how the annotations are grouped under the
   ``entities'' object.  Permitted values are ``type'' (the default) or ``set''.
   Grouping by ``type'' produces output like the example above, with one entry
@@ -424,8 +486,41 @@
   top-level properties (alongside ``text'' and ``entities'') of the generated
   JSON object.  This option is intended to support round-trip processing of
   documents that were originally loaded from JSON by GATE's Twitter support.
-\end{description}
+\ede
 
+The \verb!JSONStreamingOutputHandler! writes the same JSON format, but instead
+of storing each GATE document in its own individual file on disk, this handler
+creates one large file (or several ``chunks'') and writes documents to this
+file in one stream, separated by newlines.  In addition to the parameters
+described above, this handler adds two further parameters:
+
+\bde
+\item[pattern] (optional, default \verb!part-%03d!) the pattern from which
+  chunk file names are created.  This is a standard Java \verb!String.format!
+  pattern string which will be instantiated with a single integer argument, so
+  it should include a single \verb!%d!-based placeholder.  Output file names
+  are generated by instantiating the pattern with successive numbers starting
+  from 0 and passing the result to the configured naming strategy until a file
+  name is found that does not already exist.  With the default naming strategy
+  this effectively means \verb!{dir}/{pattern}{fileExtension}!, e.g.
+  \verb!output/part-003.json.gz!.
+\item[chunkSize] (optional, default \verb!99000000!) approximate maximum size
+  in bytes of a single output file, after which the handler will close the
+  current file and start the next chunk.  The file size is checked after every
+  MB of uncompressed data, so each chunk should be no more than 1MB larger than
+  the configured chunk size.  The default chunkSize is 99 million bytes, which
+  should produce chunks of no more than 100MB.
+\ede
+
+This handler, like the \verb!JSONStreamingInputHandler!, can cope with a wider
+variety of compression formats than the standard one-file-per-document output
+handlers.  A value other than ``none'' or ``gzip'' for the ``compression''
+parameter will be taken as the command line for a native compression program
+that expects raw data on its stdin and produces compressed data on stdout, for
+example \verb!"bzip2"! or \verb!"lzop"! (with the default naming strategy, the
+configured fileExtension should take the compression format into account, e.g.
+\verb!".json.lzo"!).
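+
+Putting these together, a sketch of an \verb!<output>! element for this
+handler (assuming the \verb!dir! and \verb!fileExtension! attributes of the
+default naming strategy, with hypothetical values) might be:
+
+\begin{verbatim}
+<output class="gate.cloud.io.json.JSONStreamingOutputHandler"
+        dir="output" fileExtension=".json.gz"
+        compression="gzip"
+        pattern="part-%03d"
+        chunkSize="99000000" />
+\end{verbatim}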
+
 \subsection{The M\'{i}mir Output Handler}
 
 GCP also provides \verb!gate.cloud.io.mimir.MimirOutputHandler! to send
 annotated documents to a M\'{i}mir server for indexing.  This handler supports
 the following \verb!<output>! attributes:
@@ -486,8 +581,9 @@
 
 \section{Specifying the Documents to Process}
 
-The final section of the batch definition specifies which document IDs GCP
-should process.  The IDs can be specified in two ways:
+If you are not using a streaming input handler then the final section of the
+batch definition specifies which document IDs GCP should process.  The IDs can
+be specified in two ways:
 
 \bit
 \item Directly in the XML as \verb!<id>doc/id/here</id>! elements.

Modified: gcp/trunk/doc/gcp-guide.pdf
===================================================================
(Binary files differ)

This was sent by the SourceForge.net collaborative development platform, the 
world's largest Open Source development site.

