Revision: 18530
http://sourceforge.net/p/gate/code/18530
Author: ian_roberts
Date: 2015-01-12 12:33:05 +0000 (Mon, 12 Jan 2015)
Log Message:
-----------
Documentation for the JSON streaming input and output handlers
Modified Paths:
--------------
gcp/trunk/doc/batch-def.tex
gcp/trunk/doc/gcp-guide.pdf
Modified: gcp/trunk/doc/batch-def.tex
===================================================================
--- gcp/trunk/doc/batch-def.tex 2015-01-12 02:19:51 UTC (rev 18529)
+++ gcp/trunk/doc/batch-def.tex 2015-01-12 12:33:05 UTC (rev 18530)
@@ -17,13 +17,16 @@
\verb!<report file="../report.xml" />!
\item[input] (required) specifies the input handler which will be the source of
- documents to process.
+ documents to process. Most handlers load documents one by one based on their
+ IDs, but certain handlers operate in a \emph{streaming} mode, processing a
+ block of documents in one pass.
\item[output] (zero or more) specifies what to do with the documents once they
have been processed.
-\item[documents] (required) specifies the document IDs to be processed, as any
- combination of the child elements:
+\item[documents] (required, except when using a streaming input handler)
+ specifies the document IDs to be processed, as any combination of the child
+ elements:
\bde
\item[id] a single document ID. \verb!<id>bbc/article001.html</id>!
\item[documentEnumerator] an enumerator that generates a list of IDs. The
@@ -99,6 +102,14 @@
(\url{http://crawler.archive.org}).
\eit
+and one \emph{streaming} handler:
+
+\bit
+\item \verb!gate.cloud.io.json.JSONStreamingInputHandler! to read a stream of
+ documents from a single large JSON file (for example a collection of Tweets
+ from Twitter's streaming API).
+\eit
+
\subsection{The {\tt FileInputHandler}}
\verb!FileInputHandler! reads documents from individual files on the
@@ -231,6 +242,53 @@
names are prefixed with ``http\_header\_'' and ARC/WARC record headers with
``arc\_header\_''.
+\subsection{The streaming JSON input handler}
+\label{sec:batch-def:json-input}
+
+An increasing number of services, most notably Twitter and social media
+aggregators such as DataSift, provide their data in JSON format. Twitter
+offers streaming APIs that deliver Tweets as a continuous stream of JSON
+objects concatenated together, while DataSift typically delivers a large JSON
+array of documents. The streaming JSON input handler can process either
+format, treating each JSON object in the ``stream'' as a separate GATE
+document.
+
+The \verb!gate.cloud.io.json.JSONStreamingInputHandler! accepts the following
+attributes:
+
+\bde
+\item[srcFile] the file containing the JSON objects (either as a top-level
+ array or simply concatenated together, optionally separated by whitespace).
+\item[idPointer] the ``path'' within each JSON object of the property that
+ represents the document identifier. This is an expression in the \emph{JSON
+ Pointer}\footnote{\url{http://tools.ietf.org/html/draft-ietf-appsawg-json-pointer-03}}
+ language. It must start with a forward slash and then a sequence of property
+ names separated by further slashes. A suitable value for the Twitter JSON
+ format would be \verb!/id_str! (the property named ``\verb!id_str!'' of the
+ object), and for DataSift \verb!/interaction/id! (the top-level object has an
+ ``interaction'' property whose value is an object; we want the ``id''
+ property of \emph{that} object). Any object that does not have a property at
+ the specified path will be ignored.
+\item[compression] (optional) the compression format used by the
+ \verb!srcFile!, if any. If the value is ``none'' (the default) then the file
+ is assumed not to be compressed; if the value is ``gzip'' then Java's native
+ GZIP decompression utilities will be used; otherwise the value is taken to be
+ the command line for a native decompression program that expects compressed
+ data on stdin and will produce decompressed data on stdout, for example
+ \verb!"lzop -dc"! or \verb!"bunzip2"!.
+\item[mimeType] (optional but highly recommended) the value to pass as the
+ ``mimeType'' parameter when creating a GATE Document from the JSON string.
+ This will be used by GATE to select an appropriate document format parser, so
+ for Twitter JSON you should use \verb!"text/x-json-twitter"! and for DataSift
+ \verb!"text/x-json-datasift"!. Note that the GATE plugin defining the
+ relevant format parser \emph{must} be loaded as part of your GATE
+ application.
+\ede
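+
+As an illustrative sketch (assuming the handler implementation is selected
+with a \verb!class! attribute, and with a made-up file name), an input
+definition for a gzip-compressed stream of Tweets might look like the
+following:
+
+\begin{verbatim}
+<!-- illustrative values only -->
+<input class="gate.cloud.io.json.JSONStreamingInputHandler"
+       srcFile="tweets.json.gz"
+       compression="gzip"
+       idPointer="/id_str"
+       mimeType="text/x-json-twitter" />
+\end{verbatim}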
+
+This is a streaming handler -- it will process all documents in the JSON bundle
+and does \emph{not} require a \verb!documents! section in the batch
+specification. As with other input handlers, when restarting a failed batch,
+documents that were successfully processed in the previous run will be skipped.
+
\section{Specifying the Output Handlers}
Output handlers are responsible for taking the GATE Documents that have been
@@ -292,8 +350,12 @@
XCES standoff format. Annotation offsets in XCES refer to the plain text as
saved by a \verb!PlainTextOutputHandler!.
\item \verb!gate.cloud.io.file.JSONOutputHandler! to save documents in a JSON
- format modelled on that used by Twitter to represent "entities" in Tweets.
-\item \verb!gate.cloid.io.file.SerializedObjectOutputHandler! to save documents
+ format modelled on that used by Twitter to represent ``entities'' in Tweets.
+\item \verb!gate.cloud.io.json.JSONStreamingOutputHandler! to save documents in
+ the same JSON format as the previous handler, but concatenated together in
+ one or more output batches rather than saving each document in its own
+ individual output file.
+\item \verb!gate.cloud.io.file.SerializedObjectOutputHandler! to save documents
using Java's built in \emph{object serialization} protocol (with optional
compression). This handler ignores annotation filters, and always writes
the complete document. This is the same mechanism used by GATE's
@@ -393,7 +455,7 @@
This handler supports a number of additional \verb!<output>! attributes to
control the format.
-\begin{description}
+\bde
\item[groupEntitiesBy] controls how the annotations are grouped under the
``entities'' object. Permitted values are ``type'' (the default) or ``set''.
Grouping by ``type'' produces output like the example above, with one entry
@@ -424,8 +486,41 @@
top-level properties (alongside ``text'' and ``entities'') of the generated
JSON object. This option is intended to support round-trip processing of
documents that were originally loaded from JSON by GATE's Twitter support.
-\end{description}
+\ede
+The \verb!JSONStreamingOutputHandler! writes the same JSON format, but instead
+of storing each GATE document in its own individual file on disk, this handler
+creates one large file (or several ``chunks'') and writes documents to this
+file in one stream, separated by newlines. In addition to the parameters
+described above, this handler adds two further parameters:
+
+\bde
+\item[pattern] (optional, default \verb!part-%03d!) the pattern from which
+  chunk file names are created. This is a standard Java \verb!String.format!
+ pattern string which will be instantiated with a single integer parameter, so
+ should include a single \verb!%d!-based placeholder. Output file names are
+ generated by instantiating the pattern with successive numbers starting from
+ 0 and passing the result to the configured naming strategy until a file name
+ is found that does not already exist. With the default naming strategy this
+ effectively means \verb!{dir}/{pattern}{fileExtension}!, e.g.
+ \verb!output/part-003.json.gz!
+\item[chunkSize] (optional, default \verb!99000000!) approximate maximum size
+  in bytes of a single output file, after which the handler will close the
+  current file and start the next chunk. The file size is checked after every
+  MB of uncompressed data, so each chunk should be no more than 1MB larger than
+  the configured chunk size. The default chunkSize is 99 million bytes, which
+ should produce chunks of no more than 100MB.
+\ede
+
+This handler, like the \verb!JSONStreamingInputHandler!, can cope with a wider
+variety of compression formats than the standard one-file-per-document output
+handlers. A value other than ``none'' or ``gzip'' for the ``compression''
+parameter will be taken as the command line for a native compression program
+that expects raw data on its stdin and produces compressed data on stdout, for
+example \verb!"bzip2"! or \verb!"lzop"! (with the default naming strategy, the
+configured fileExtension should take the compression format into account, e.g.
+\verb!".json.lzo"!).
+
\subsection{The M\'{i}mir Output Handler}
GCP also provides \verb!gate.cloud.io.mimir.MimirOutputHandler! to send
annotated documents to a M\'{i}mir server for indexing. This handler supports
the following \verb!<output>! attributes:
@@ -486,8 +581,9 @@
\section{Specifying the Documents to Process}
-The final section of the batch definition specifies which document IDs GCP
-should process. The IDs can be specified in two ways:
+If you are not using a streaming input handler, then the final section of the
+batch definition specifies which document IDs GCP should process. The IDs can
+be specified in two ways:
\bit
\item Directly in the XML as \verb!<id>doc/id/here</id>! elements.
Modified: gcp/trunk/doc/gcp-guide.pdf
===================================================================
(Binary files differ)