Revision: 18757
http://sourceforge.net/p/gate/code/18757
Author: ian_roberts
Date: 2015-06-05 15:56:46 +0000 (Fri, 05 Jun 2015)
Log Message:
-----------
Updated GCP documentation to include gcp-direct script and a changelog for 2.5
Modified Paths:
--------------
gcp/trunk/doc/batch-def.tex
gcp/trunk/doc/gcp-guide.pdf
gcp/trunk/doc/install-and-run.tex
gcp/trunk/doc/introduction.tex
Modified: gcp/trunk/doc/batch-def.tex
===================================================================
--- gcp/trunk/doc/batch-def.tex 2015-06-05 01:20:22 UTC (rev 18756)
+++ gcp/trunk/doc/batch-def.tex 2015-06-05 15:56:46 UTC (rev 18757)
@@ -270,11 +270,16 @@
the specified path will be ignored.
\item[compression] (optional) the compression format used by the
\verb!srcFile!, if any. If the value is ``none'' (the default) then the file
- is assumed not to be compressed, if the value is ``gzip'' then Java's native
- GZIP decompression utilities will be used, otherwise the value is taken to be
+ is assumed not to be compressed, if the value is one of the compression formats
+ supported by Apache Commons Compress (``gz''\footnote{For backwards
+ compatibility, ``gzip'' is treated as an alias for ``gz''}, ``bzip2'',
+ ``xz'', ``lzma'', ``snappy-raw'', ``snappy-framed'', ``pack200'', ``z'') then
+ it will be unpacked using that library. If the value is ``any'' then the
+ handler uses the auto-detection capabilities of Commons Compress to attempt
+ to detect the appropriate compression format. Any other value is taken to be
the command line for a native decompression program that expects compressed
data on stdin and will produce decompressed data on stdout, for example
- \verb!"lzop -dc"! or \verb!"bunzip2"!.
+ \verb!"lzop -dc"!.
\item[mimeType] (optional but highly recommended) the value to pass as the
``mimeType'' parameter when creating a GATE Document from the JSON string.
This will be used by GATE to select an appropriate document format parser, so
Modified: gcp/trunk/doc/gcp-guide.pdf
===================================================================
(Binary files differ)
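
The decision order for the ``compression'' value described in the batch-def.tex change above can be sketched roughly as follows. This is only an illustration in Python, not GCP's actual implementation: the helper name classify_compression, the strategy labels, the exact-match (case-sensitive) comparison, and the use of shlex to split the native command line are all assumptions made for the example.

```python
import shlex

# Format names that the documentation lists as handled by Apache Commons Compress.
COMMONS_COMPRESS_FORMATS = {
    "gz", "bzip2", "xz", "lzma",
    "snappy-raw", "snappy-framed", "pack200", "z",
}


def classify_compression(value="none"):
    """Return a (strategy, detail) pair for a ``compression`` value.

    Hypothetical helper for illustration only; it is not part of GCP.
    """
    if value == "none":
        # Default: the source file is read as-is.
        return ("uncompressed", None)
    if value == "gzip":
        # Backwards-compatible alias mentioned in the footnote.
        value = "gz"
    if value in COMMONS_COMPRESS_FORMATS:
        return ("commons-compress", value)
    if value == "any":
        # Leave format detection to the library.
        return ("auto-detect", None)
    # Anything else: a native command that reads compressed data on stdin
    # and writes decompressed data on stdout. Splitting with shlex is an
    # assumption; the documentation does not say how the command is parsed.
    return ("native-command", shlex.split(value))
```

Under these assumptions, classify_compression("lzop -dc") falls through every named case and is treated as a native command, while "gzip" is normalised to "gz" before the lookup.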
Modified: gcp/trunk/doc/install-and-run.tex
===================================================================
--- gcp/trunk/doc/install-and-run.tex 2015-06-05 01:20:22 UTC (rev 18756)
+++ gcp/trunk/doc/install-and-run.tex 2015-06-05 15:56:46 UTC (rev 18757)
@@ -20,11 +20,22 @@
\section{Running GCP}
-Once GCP is installed you can run it using the \verb!gcp-cli.jar! executable
+Once GCP is installed you can run it in one of two ways:
+\bit
+\item using the \verb!gcp-cli.jar! executable
JAR file in the installation directory (or the \verb!gcp.sh! bash script, which
-simply calls \verb!java -jar gcp-cli.jar!). This tool takes a number of
-optional arguments:
+simply calls \verb!java -jar gcp-cli.jar!)
+\item using the \verb!gcp-direct.sh! bash script.
+\eit
+\subsection{Using {\tt gcp-cli.jar}}
+
+The usual way to run GCP is to write one or more {\em batch definition} XML
+files (see chapter~\ref{chap:batch-def} for details) defining the application
+you want to run, the documents to process, and the output formats to produce.
+You then pass these batch definitions to \verb!gcp-cli.jar! for processing.
+The CLI tool takes a number of optional arguments:
+
\bde
\item[-m] Specifies the maximum Java heap size, in the format expected by the
usual \verb!-Xmx! Java option, e.g. \verb!-m 10G! for a 10GB heap limit. The
@@ -41,6 +52,12 @@
\verb!-Djava.io.tmpdir=/home/bigtmp!. \verb!-D! options specified before the
\verb!-jar! apply to the virtual machine running the CLI, those specified
after \verb!-jar gcp-cli.jar! will be passed to the batch runner processes.
+ If you have an installed copy of GATE Developer you may wish to set
+ \verb!-Dgate.home=...! to point to your installation. This is required if
+ your saved GATE application refers to standard GATE plugins (using
+ \verb!$gatehome$! paths in the xgapp), but is optional if the application is
+ self-contained -- GCP includes its own copy of GATE Embedded and does not
+ require a separate installed copy of the core libraries.
\ede
The tool will determine the location of where GCP is installed in the
@@ -99,4 +116,45 @@
the script to exit at the end of the batch it is currently processing (or
immediately if it is currently idle).
+
+\subsection{Using {\tt gcp-direct.sh}}
+\label{sec:running:gcp-direct}
+
+The \verb!gcp-direct.sh! script can be used for simple cases where you want to
+process all the files under one particular directory and output the resulting
+annotations in GATE XML or FastInfoset format. For this specific case it is
+not necessary to write an XML batch descriptor; you can specify the required
+parameters using command line options to \verb!gcp-direct.sh!:
+
+\bde
+\item[-t] the number of parallel threads to use.
+\item[-x] the path to the saved GATE application that you want to run.
+\item[-f] the output format to use for saving results, must be either ``xml''
+ (GATE XML format) or ``finf'' (FastInfoset format). To use FastInfoset the
+ GATE \verb!Format_FastInfoset! plugin must be loaded by the saved
+ application.
+\item[-i] the directory in which to look for the input files. All files in
+ this directory and any subdirectories will be processed (except for standard
+ backup and temporary file name patterns and source control metadata -- see
+ \url{http://ant.apache.org/manual/dirtasks.html#defaultexcludes} for
+ details).
+\item[-o] the directory in which to place the output files. Each input file
+ will generate an output file with the same name in the output directory.
+\ede
+
+Additionally, you can specify \verb!-D! and \verb!-X! options which will be
+passed through to the Java VM, for example you can set the maximum amount of
+heap memory that the JVM can use with an option like \verb!-Xmx2G!.
+
+The \verb!gcp-direct.sh! script is deliberately opinionated, in order to reduce
+the number of different options that need to be set, and it has a number of
+hard-coded assumptions. It assumes that your input documents use the UTF-8
+character encoding, that the correct document format parser to use can be
+determined from the file extension, and that you always want to save \emph{all}
+the annotations that your application generates. If you need to process
+documents in a different encoding, you have more complex output requirements
+(XCES, JSON, M\'{i}mir, \ldots) or want to output only a subset of the GATE
+annotations from each document, then you should write a batch definition in XML
+and use \verb!gcp-cli.jar! as discussed above.
+
% vim:ft=tex
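
As a rough illustration of the gcp-direct.sh options documented in the hunk above, the following Python sketch assembles an argument list from the five documented flags (-t, -x, -f, -i, -o) plus optional -D/-X JVM options. The helper name gcp_direct_command, the relative script path, the ordering of the options, and the placement of the JVM options at the end are assumptions made for the example; the file and directory names in the usage note are placeholders.

```python
def gcp_direct_command(threads, application, output_format,
                       input_dir, output_dir, jvm_options=()):
    """Build an argv list for gcp-direct.sh (hypothetical helper).

    Only the flags named in the documentation are used; their order and
    the position of the JVM options are assumptions of this sketch.
    """
    # The documentation allows exactly two output formats.
    if output_format not in ("xml", "finf"):
        raise ValueError("output format must be 'xml' or 'finf'")
    # Only -D and -X options are described as being passed to the Java VM.
    for option in jvm_options:
        if not option.startswith(("-D", "-X")):
            raise ValueError("unsupported JVM option: " + option)
    return [
        "gcp-direct.sh",
        "-t", str(threads),      # number of parallel threads
        "-x", application,       # saved GATE application
        "-f", output_format,     # "xml" or "finf"
        "-i", input_dir,         # input directory (searched recursively)
        "-o", output_dir,        # output directory
        *jvm_options,
    ]
```

For instance, gcp_direct_command(4, "app.xgapp", "xml", "in", "out", ["-Xmx2G"]) yields a list that could be handed to subprocess.run, assuming the script is on the current path. Requesting any other output format (such as JSON) raises an error, which mirrors the advice to use a batch definition with gcp-cli.jar in that case.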
Modified: gcp/trunk/doc/introduction.tex
===================================================================
--- gcp/trunk/doc/introduction.tex 2015-06-05 01:20:22 UTC (rev 18756)
+++ gcp/trunk/doc/introduction.tex 2015-06-05 15:56:46 UTC (rev 18757)
@@ -134,6 +134,24 @@
This section summarises the main changes between releases of GCP
+\subsection{2.5 (June 2015)}
+
+\bit
+\item Now depends on GATE Embedded 8.1
+\item Introduced ``streaming'' style input and output handlers for JSON
+ data (e.g. from Twitter), which can read a series of documents from
+ a single JSON input file, and write JSON results to a single concatenated
+ output file (sections~\ref{sec:batch-def:json-input} and
+ \ref{sec:batch-def:file-output-handlers}).
+\item Introduced the \verb!gcp-direct.sh! script to cover simple invocations
+ of GCP without the need to write a batch definition XML file
+ (section~\ref{sec:running:gcp-direct}).
+\item For ``controller-aware''
+ PRs\footnote{\url{http://gate.ac.uk/gate/doc/javadoc/gate/creole/ControllerAwarePR.html}},
+ the various callbacks are now invoked just once per batch rather than before
+ and after every single document.
+\eit
+
\subsection{2.4 (May 2014)}
\bit