Revision: 17998
http://sourceforge.net/p/gate/code/17998
Author: ian_roberts
Date: 2014-05-15 16:59:08 +0000 (Thu, 15 May 2014)
Log Message:
-----------
2.4 changelog
Modified Paths:
--------------
gcp/trunk/doc/batch-def.tex
gcp/trunk/doc/introduction.tex
Modified: gcp/trunk/doc/batch-def.tex
===================================================================
--- gcp/trunk/doc/batch-def.tex 2014-05-15 16:36:54 UTC (rev 17997)
+++ gcp/trunk/doc/batch-def.tex 2014-05-15 16:59:08 UTC (rev 17998)
@@ -172,6 +172,7 @@
assumes that the document ID is the path of an entry in the ZIP file.
\subsection{The {\tt ARCInputHandler} and {\tt WARCInputHandler}}
+\label{sec:batch-def:arc}
These two input handlers read documents out of ARC- and WARC format web archive
files as produced by the Heritrix web crawler and other similar tools. They
Modified: gcp/trunk/doc/introduction.tex
===================================================================
--- gcp/trunk/doc/introduction.tex 2014-05-15 16:36:54 UTC (rev 17997)
+++ gcp/trunk/doc/introduction.tex 2014-05-15 16:59:08 UTC (rev 17998)
@@ -134,6 +134,26 @@
This section summarises the main changes between releases of GCP
+\subsection{2.4 (May 2014)}
+
+\bit
+\item Now depends on GATE Embedded 8.0
+\item Added input handler for WARC format archives, to complement the existing
+ ARC handler (section~\ref{sec:batch-def:arc}).
+\item ARC and WARC handlers can optionally load individual records from
+ remotely hosted archives using HTTP requests with a ``Range'' header. This
+ facility can be used with publicly-hosted data sets such as Common
+ Crawl\footnote{\url{http://www.commoncrawl.org}}. To support this
+ functionality, document identifiers in a batch definition can now take XML
+ attributes as well as the actual string identifier (exactly how such
+ attributes are used is up to the handler implementations).
+\item Added output handler to save documents in a JSON format modelled on that
+ used by Twitter to represent ``entities'' (e.g. username mentions) in Tweets.
+\item Efficiency improvements in the M\'{i}mir output handler, to send
+ documents to the server in batches rather than opening a new HTTP connection
+ for every document.
+\eit
+
\subsection{2.3 (November 2012)}
\bit