Revision: 18875
          http://sourceforge.net/p/gate/code/18875
Author:   johann_p
Date:     2015-08-17 23:12:19 +0000 (Mon, 17 Aug 2015)
Log Message:
-----------
Make gcp-direct.sh deal properly with the environment variable GATE_HOME.
Change the behavior of gcp-direct.sh so that the file extension of the 
output file replaces the file extension of the input file, rather than
getting appended. Add the parameter "replaceExtension" to the 
SimpleNamingStrategy, for replacing the extension instead of appending.
Add the options -ci (gzip-compressed input) and -co  (gzip-compressed
output) to gcp-direct.sh. Make sure gcp-direct.sh exits if no documents
need to be processed.  Update documentation.
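
The extension handling described above can be sketched as a small
stand-alone Java class. This is only an illustration based on the
SimpleNamingStrategy change in the diff that follows: the class name
ExtensionSketch and the method outputPath are hypothetical, and the sketch
assumes the path is always an output path (the committed code additionally
checks its isOutput flag).

```java
/** Hypothetical stand-alone sketch of the naming rule; not the GCP class itself. */
class ExtensionSketch {

  /**
   * Returns the output path for a relative input path.
   * With replaceExtension == false the extension is simply appended.
   * With replaceExtension == true a trailing ".gz" is stripped first and
   * the last remaining extension (if any) is replaced.
   */
  static String outputPath(String path, String fileExtension,
                           boolean replaceExtension) {
    // no extension configured: leave the path untouched
    if (fileExtension == null || fileExtension.isEmpty()) {
      return path;
    }
    // old behaviour: always append
    if (!replaceExtension) {
      return path + fileExtension;
    }
    // strip a trailing .gz so that "name.xml.gz" loses both extensions
    if (path.endsWith(".gz") && path.length() > 3) {
      path = path.substring(0, path.length() - 3);
    }
    int dotIndex = path.lastIndexOf('.');
    // a dot at position 0 marks a hidden file, not an extension
    if (dotIndex > 0) {
      return path.substring(0, dotIndex) + fileExtension;
    }
    // no dot, or only a leading dot: append
    return path + fileExtension;
  }

  public static void main(String[] args) {
    System.out.println(outputPath("docs/report.xml", ".finf", true));
    System.out.println(outputPath("docs/report.xml.gz", ".finf", true));
    System.out.println(outputPath("docs/report.xml", ".finf", false));
  }
}
```

Note that this rule looks at the last dot in the whole relative path, so in
this sketch a path such as dir.v1/file, whose only dot is in a directory
name, is turned into dir.finf.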

Modified Paths:
--------------
    gcp/trunk/doc/batch-def.tex
    gcp/trunk/doc/gcp-guide.pdf
    gcp/trunk/doc/install-and-run.tex
    gcp/trunk/gcp-direct.sh
    gcp/trunk/src/gate/cloud/batch/BatchRunner.java
    gcp/trunk/src/gate/cloud/io/IOConstants.java
    gcp/trunk/src/gate/cloud/io/file/SimpleNamingStrategy.java

Modified: gcp/trunk/doc/batch-def.tex
===================================================================
--- gcp/trunk/doc/batch-def.tex 2015-08-17 01:20:04 UTC (rev 18874)
+++ gcp/trunk/doc/batch-def.tex 2015-08-17 23:12:19 UTC (rev 18875)
@@ -382,7 +382,11 @@
 strategy} to map from document IDs to output file names.  The default strategy
 is the same \verb!SimpleNamingStrategy! configured with a base \verb!dir! and a
 \verb!fileExtension!, treating the document ID as a path relative to the given
-directory and appending the given extension.  This is appropriate when using a
+directory and appending the given extension. If the \verb!replaceExtension! parameter
+is set to \verb!"true"! then the \verb!fileExtension!, if specified, replaces
+any existing file extension of the intput path.
+
+This is appropriate when using a
 file or ZIP input handler but for batches that use an \verb!ARCInputHandler! a
 different strategy is required.
 

Modified: gcp/trunk/doc/gcp-guide.pdf
===================================================================
(Binary files differ)

Modified: gcp/trunk/doc/install-and-run.tex
===================================================================
--- gcp/trunk/doc/install-and-run.tex   2015-08-17 01:20:04 UTC (rev 18874)
+++ gcp/trunk/doc/install-and-run.tex   2015-08-17 23:12:19 UTC (rev 18875)
@@ -18,7 +18,9 @@
 tested and are known to work.  GCJ is known {\em not} to work.  You will also
 need Apache Ant version 1.7.0 or later.  Run ``ant distro'' to build a ZIP file
 containing the binary distribution, and unzip that file somewhere to create
-your GCP installation.
+your GCP installation. If only \verb!gcp-direct.sh! is used, it is sufficient
+to just compile the sources by running \verb!ant! -- for this it is not 
+necessary to create the ZIP file and create a separate GCP installation. 
 
 \section{Running GCP}
 
@@ -27,7 +29,9 @@
 \item using the \verb!gcp-cli.jar! executable
 JAR file in the installation directory (or the \verb!gcp.sh! bash script, which
 simply calls \verb!java -jar gcp-cli.jar!)
-\item using the \verb!gcp-direct.sh! bash script.
+\item using the \verb!gcp-direct.sh! bash script. \emph{Note:} the \verb!gcp-direct.sh!
+script can also be used directly from within the original source directory
+after compiling the sources by running \verb!ant!. 
 \eit
 
 \subsection{Using {\tt gcp-cli.jar}}
@@ -144,6 +148,12 @@
   will generate an output file with the same name in the output directory.
 \item[-r] (optional) path to the report file for this batch -- if omitted
   GCP will use \verb!report.xml! in the current directory.
+\item[-ci] the input files are all gzip-compressed
+\item[-co] the output files should all be gzip-compressed. This only makes
+sense if \verb!-f xml! is also specified since the default output format
+\verb!finf! already is a compressed format. If this option is specified, the
+output file name gets the extension \verb!.gz! appended, in addition to 
+any other extension it already may have.
 \ede
 
 Additionally, you can specify \verb!-D! and \verb!-X! options which will be

Modified: gcp/trunk/gcp-direct.sh
===================================================================
--- gcp/trunk/gcp-direct.sh     2015-08-17 01:20:04 UTC (rev 18874)
+++ gcp/trunk/gcp-direct.sh     2015-08-17 23:12:19 UTC (rev 18875)
@@ -52,7 +52,7 @@
 fi
 
 # Pass on GATE_HOME if set
-if [ "${GATE_HOME}" == "" ]; then
+if [ "${GATE_HOME}" != "" ]; then
   jvmparams=( -Dgate.home="${GATE_HOME}" )
 fi
 

Modified: gcp/trunk/src/gate/cloud/batch/BatchRunner.java
===================================================================
--- gcp/trunk/src/gate/cloud/batch/BatchRunner.java     2015-08-17 01:20:04 UTC (rev 18874)
+++ gcp/trunk/src/gate/cloud/batch/BatchRunner.java     2015-08-17 23:12:19 UTC (rev 18875)
@@ -574,12 +574,15 @@
     options.addOption("r","reportFile",true,"Report file (optional, default: report.xml");
     options.addOption("t","numberThreads",true,"Number of threads to use (required)");
     options.addOption("I","batchId",true,"Batch ID (optional, default: GCP");
+    options.addOption("ci","compressedInput",false,"Input files are gzip-compressed");
+    options.addOption("co","compressedOutput",false,"Output files are gzip-compressed");
     options.addOption("h","help",false,"Print this help information");
     BasicParser parser = new BasicParser();
     
     int numThreads = 0;
     File batchFile = null;  
     boolean invokedByGcpCli = true;
+    String outFormat = "finf";
     
     CommandLine line = null;
     try {
@@ -617,12 +620,12 @@
         batchFile = new File(line.getOptionValue('b'));
       }
       if(line.hasOption('f')) {
-        String format = line.getOptionValue('f');
-        if(!format.equals("xml") && !format.equals("finf")) {
+        outFormat = line.getOptionValue('f');
+        if(!outFormat.equals("xml") && !outFormat.equals("finf")) {
           log.error("Output format (option 'f') must be either 'xml' or 'finf'");
           System.exit(1);
         }
-      }
+      } // if we have option 'f', otherwise use the preset default
       numThreads = Integer.parseInt(line.getOptionValue('t'));
     }
     if(batchFile != null) {
@@ -707,6 +710,15 @@
       Gate.setUserSessionFile(new File(gcpGate, "empty.session"));
       Gate.init();
       
+      // If we run from gcp-direct, we try to load the Format_FastInfoset plugin.
+      // This is needed if we write to format finf and the application does not load the plugin,
+      // but also if we process finf format documents as input.
+      // If we cannot load the plugin here, the thread will log the error and continue which
+      // is good because the application could 
+      // still load the plugin later - with normal GCP, this would be the only way to use that format.
+      if(invokedByGcpCli == false) {
+        gate.Utils.loadPlugin("Format_FastInfoset");
+      }
       BatchRunner instance = new BatchRunner(numThreads);
 
       // depending on how we got invoked, create the batch from either 
@@ -745,7 +757,11 @@
           String inputHandlerClassName = "gate.cloud.io.file.FileInputHandler";
           Map<String,String> configData = new HashMap<String, String>();
           configData.put(IOConstants.PARAM_DOCUMENT_ROOT, line.getOptionValue('i'));
-          configData.put(IOConstants.PARAM_COMPRESSION,"none");
+          if(line.hasOption("ci")) {
+            configData.put(IOConstants.PARAM_COMPRESSION,"gzip");            
+          } else {
+            configData.put(IOConstants.PARAM_COMPRESSION,"none");
+          }
           configData.put(IOConstants.PARAM_ENCODING, "UTF-8");
           configData.put(IOConstants.PARAM_FILE_EXTENSION,"");
           Class<? extends InputHandler> inputHandlerClass =
@@ -757,21 +773,27 @@
           // log.info("Have input handler: "+inputHandler);
           aBatch.setInputHandler(inputHandler);
           // set the output Handler
-          String outputHandlerClassName;
-          if(!line.hasOption('f') || line.getOptionValue('f').equals("finf")) {
+          String outputHandlerClassName = null;
+          if(outFormat.equals("finf")) {
             outputHandlerClassName = "gate.cloud.io.file.FastInfosetOutputHandler";
-          } else if(line.hasOption('f') && line.getOptionValue('f').equals("xml")) {
+          } else if(outFormat.equals("xml")) {
             outputHandlerClassName = "gate.cloud.io.file.GATEStandOffFileOutputHandler";
+          } 
+          configData = new HashMap<String, String>();
+          configData.put(IOConstants.PARAM_DOCUMENT_ROOT, line.getOptionValue('o'));
+          String outExt = ".finf";
+          if(outFormat.equals("xml")) {
+            outExt = ".xml";
+          }
+          if(line.hasOption("co")) {
+            configData.put(IOConstants.PARAM_COMPRESSION,"gzip");            
+            outExt = outExt + ".gz";
           } else {
-            // this should never happen, since we have checked the format
-            // earlier
-            outputHandlerClassName = null;
+            configData.put(IOConstants.PARAM_COMPRESSION,"none");
           }
-          configData = new HashMap<String, String>();
-          configData.put(IOConstants.PARAM_DOCUMENT_ROOT, line.getOptionValue('o'));
-          configData.put(IOConstants.PARAM_COMPRESSION,"none");
+          configData.put(IOConstants.PARAM_FILE_EXTENSION,outExt);
           configData.put(IOConstants.PARAM_ENCODING, "UTF-8");
-          configData.put(IOConstants.PARAM_FILE_EXTENSION,"");
+          configData.put(IOConstants.PARAM_REPLACE_EXTENSION, "true");
           Class<? extends OutputHandler> ouputHandlerClass =
           Class.forName(outputHandlerClassName, true, Gate.getClassLoader())
                  .asSubclass(OutputHandler.class);
@@ -811,9 +833,15 @@
       log.info("Loading time (seconds): "+(loadingFinishedTime-startTime)/1000.0);
       log.info("Launching batch:\n" + aBatch);
       
-      instance.runBatch(aBatch);
-      instance.shutdownWhenFinished(true);
-      instance.exitWhenFinished(true);
+      int size = aBatch.getUnprocessedDocumentIDs().length;
+      // if this is run from gcp-direct and there are no unprocessed documents, do nothing
+      if(!invokedByGcpCli && size == 0) {
+        log.info("No documents to process, exiting");
+      } else {
+        instance.runBatch(aBatch);
+        instance.shutdownWhenFinished(true);
+        instance.exitWhenFinished(true);
+      }
     } catch(Exception e) {
       log.error("Error starting up batch " + batchFile, e);
       System.exit(1);

Modified: gcp/trunk/src/gate/cloud/io/IOConstants.java
===================================================================
--- gcp/trunk/src/gate/cloud/io/IOConstants.java        2015-08-17 01:20:04 UTC (rev 18874)
+++ gcp/trunk/src/gate/cloud/io/IOConstants.java        2015-08-17 23:12:19 UTC (rev 18875)
@@ -47,6 +47,14 @@
   public static final String PARAM_FILE_EXTENSION = "fileExtension";
 
   /**
+   * If this is true, any given extension is used to replace any existing file extension.
+   * If this configuration option is true, then PARAM_FILE_EXTENSION is not empty and the 
+   * file name already does have an extension (something following a dot but not including a dot),
+   * then the existing extension is replaced with the new extension. 
+   */
+  public static final String PARAM_REPLACE_EXTENSION = "replaceExtension";
+
+  /**
    * Parameter name for file name prefixes (used e.g. when enumerating files 
    * inside a directory and making them appear under a higer-level top 
    * directory).

Modified: gcp/trunk/src/gate/cloud/io/file/SimpleNamingStrategy.java
===================================================================
--- gcp/trunk/src/gate/cloud/io/file/SimpleNamingStrategy.java  2015-08-17 01:20:04 UTC (rev 18874)
+++ gcp/trunk/src/gate/cloud/io/file/SimpleNamingStrategy.java  2015-08-17 23:12:19 UTC (rev 18875)
@@ -14,6 +14,7 @@
 import static gate.cloud.io.IOConstants.PARAM_BATCH_FILE_LOCATION;
 import static gate.cloud.io.IOConstants.PARAM_DOCUMENT_ROOT;
 import static gate.cloud.io.IOConstants.PARAM_FILE_EXTENSION;
+import static gate.cloud.io.IOConstants.PARAM_REPLACE_EXTENSION;
 import gate.cloud.batch.DocumentID;
 import gate.util.GateException;
 
@@ -40,9 +41,25 @@
    * The file extension that should be appended to the ID.
    */  
   protected String fileExtension;
+  
+  /**
+   * If this is true, any given extension is used to replace any existing file extension.
+   * If this configuration option is true, then PARAM_FILE_EXTENSION is not empty and the 
+   * file name already does have an extension (something following a dot but not including a dot),
+   * then the existing extension is replaced with the new extension. 
+   * <p>
+   * Currently, if the input extension is .gz then if this is preced by another extension,
+   * both extensions are replaced; for example if the name is filename.xml.gz then ".xml.gz" 
+   * is replaced. 
+   */
+  protected boolean replaceExtension = false;
+  
+  protected boolean isOutput = false;
 
   public void config(boolean isOutput, Map<String, String> configData) throws IOException,
           GateException {
+    
+    this.isOutput = isOutput;
     //doc root
     String docRootStr = configData.get(PARAM_DOCUMENT_ROOT);
     if(docRootStr == null || docRootStr.trim().length() == 0){
@@ -96,13 +113,32 @@
     
     //extension
     fileExtension = configData.get(PARAM_FILE_EXTENSION);
+    
+    replaceExtension = Boolean.parseBoolean(configData.get(PARAM_REPLACE_EXTENSION));
+    
   }
 
   public File toFile(DocumentID id) throws IOException {
     try {
       String path = relativePathFor(id);
-      if(fileExtension != null) {
-        path += fileExtension;
+      if(fileExtension != null && !fileExtension.equals("")) {
+        if(isOutput && replaceExtension) {
+          // first off, strip away any .gz extension
+          if(path.endsWith(".gz") && path.length() > 3) {
+            path = path.substring(0,path.length()-3);
+          }
+          int dotIndex = path.lastIndexOf('.');
+          // it is only an extension if it is not the first character          
+          if(dotIndex > 0) {
+            // replace the existing extension with the new one
+            path = path.substring(0,dotIndex) + fileExtension;
+          } else {
+            // if it is a file without a dot or starting with a dot, append the new extension
+            path += fileExtension;
+          }
+        } else {
+          path += fileExtension;
+        }
       }
       URI u = new URI(null, null, path, null);
       URI docUri = documentRoot.resolve(u);
@@ -133,7 +169,8 @@
     StringBuilder text = new StringBuilder();
     text.append("\n\t\tClass:         " + this.getClass().getName() + "\n");
     text.append("\t\tDocument root:  " + documentRoot + "\n");
-    text.append("\t\tFile extension: " + fileExtension);
+    text.append("\t\tFile extension: " + fileExtension+ "\n");
+    text.append("\t\tReplace extension: " + replaceExtension);
     return text.toString();
   }
 }
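
For reference, the way the BatchRunner change above picks the output file
extension from the -f and -co options can be sketched in isolation. This
is a minimal illustration only: the class OutputExtensionSketch and the
method outputExtension are hypothetical names, and the sketch assumes the
format string has already been validated as either "xml" or "finf", as the
option parsing in the diff does.

```java
/** Hypothetical helper mirroring the extension choice made for gcp-direct. */
class OutputExtensionSketch {

  /**
   * @param outFormat        "xml" or "finf" (assumed to be validated already)
   * @param compressedOutput true if the -co option was given
   * @return the extension handed to the naming strategy as fileExtension
   */
  static String outputExtension(String outFormat, boolean compressedOutput) {
    // finf is the default format, xml the only alternative
    String outExt = outFormat.equals("xml") ? ".xml" : ".finf";
    // gzip-compressed output gets .gz appended to the format extension
    if (compressedOutput) {
      outExt = outExt + ".gz";
    }
    return outExt;
  }

  public static void main(String[] args) {
    System.out.println(outputExtension("finf", false));
    System.out.println(outputExtension("xml", true));
  }
}
```

Because replaceExtension is set to "true" for the output handler, the
resulting extension replaces the input file's extension instead of being
appended to it.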
