Revision: 17362
http://sourceforge.net/p/gate/code/17362
Author: ian_roberts
Date: 2014-02-20 12:12:57 +0000 (Thu, 20 Feb 2014)
Log Message:
-----------
Re-instate the stripping of protocol, query and fragment which slipped quietly
away when we introduced support for remote ARC records with HTTP Range
requests.
Modified Paths:
--------------
gcp/trunk/src/gate/cloud/io/arc/ARCDocumentNamingStrategy.java
Modified: gcp/trunk/src/gate/cloud/io/arc/ARCDocumentNamingStrategy.java
===================================================================
--- gcp/trunk/src/gate/cloud/io/arc/ARCDocumentNamingStrategy.java 2014-02-20 12:02:24 UTC (rev 17361)
+++ gcp/trunk/src/gate/cloud/io/arc/ARCDocumentNamingStrategy.java 2014-02-20 12:12:57 UTC (rev 17362)
@@ -25,7 +25,7 @@
import gate.util.GateException;
/**
- * A naming strategy to convert document IDs suitable for use with
+ * <p>A naming strategy to convert document IDs suitable for use with
* an {@link ArchiveInputHandler} to file paths suitable for saving the
* results of their processing. It assumes that the document IDs
* use the record URL as the id text (see {@link DocumentID#getIdText()}), and
@@ -36,16 +36,17 @@
* directories constructed by padding the document sequence number to the left
* with zeros and creating intermediate directories according to a configurable
* pattern. The default pattern is '3/3', which pads the numbers to a minimum
- * of 6 digits and then splits them up into groups of three. The remainder of
- * the ID after the number is cleaned up to remove any URL protocol like
- * http:// and any query string or fragment. Any sequences of non-ASCII
- * characters are removed and any remaining slashes or colons are replaced
- * with underscores. For example with the default pattern, the document
- * ID '0001_http://example.org/file.html?param=value' maps to the file
+ * of 6 digits and then splits them up into groups of three. The ID text
+ * is cleaned up to remove any URL protocol like http:// and any query string
+ * or fragment. Any sequences of non-ASCII characters are removed and any
+ * remaining slashes or colons are replaced with underscores.</p>
+ *
+ * <p>For example with the default pattern, the document
+ * ID with <code>recordPosition="1"</code> and URL 'http://example.org/file.html?param=value' maps to the file
* 000/001_example.org_file.html (with any additional configured file
- * extension appended). If the leading number has more digits than the
+ * extension appended). If the numeric part has more digits than the
* pattern allows then additional digits are used in the first place, so
- * the ID 1234567 maps to 1234/567 rather than 123/4567.
+ * the ID 1234567 maps to 1234/567 rather than 123/4567.</p>
* @author ian
*
*/
@@ -68,6 +69,12 @@
*/
protected static final Pattern NON_FILENAME_PATTERN = Pattern.compile(
"[/:\\\\]");
+
+ /**
+ * Pattern to strip the protocol, query string and fragment from a URL.
+ */
+ protected static final Pattern STRIP_PROTOCOL_QUERY_FRAGMENT =
+ Pattern.compile("^(?:.*?://)?(.*?)(?:\\?.*)?(?:#.*)?$");
public void config(boolean isOutput, Map<String, String> configData)
throws IOException, GateException {
@@ -147,6 +154,13 @@
// the rest of the output file path is constructed from the record URL
String remaining = id.getIdText();
if(remaining != null && remaining.length() > 0) {
+ // strip the protocol, query and fragment
+ Matcher stripQueryMatcher = STRIP_PROTOCOL_QUERY_FRAGMENT.matcher(remaining);
+ if(stripQueryMatcher.find()) {
+ // this matcher should never fail as every string can match the (.*) part,
+ // but be conservative anyway
+ remaining = stripQueryMatcher.group(1);
+ }
// append an underscore and the cleaned-up remaining part of the name
pathBuilder.append("_");
Matcher nonAsciiMatcher = NON_ASCII_PATTERN.matcher(remaining);
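
The mapping described in the Javadoc above can be tried out in isolation with a small sketch like the one below. It is not the committed code: STRIP_PROTOCOL_QUERY_FRAGMENT and NON_FILENAME_PATTERN are copied from the diff, but the class ArcNamingSketch, the helpers cleanUrl, numericPath and toPath, the NON_ASCII expression and the hard-coded '3/3' pattern are all assumptions made for illustration, and the order of the clean-up steps is a guess based on the Javadoc.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not part of GCP.
class ArcNamingSketch {
  // Copied from the diff: optional protocol, lazy body, optional query/fragment.
  static final Pattern STRIP_PROTOCOL_QUERY_FRAGMENT =
      Pattern.compile("^(?:.*?://)?(.*?)(?:\\?.*)?(?:#.*)?$");

  // Copied from the diff context: a slash, colon or backslash.
  static final Pattern NON_FILENAME_PATTERN = Pattern.compile("[/:\\\\]");

  // Assumption: the real NON_ASCII_PATTERN is not shown in this diff.
  static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]+");

  // Strip protocol, query and fragment, then make the rest filename-safe.
  static String cleanUrl(String url) {
    Matcher m = STRIP_PROTOCOL_QUERY_FRAGMENT.matcher(url);
    String remaining = m.find() ? m.group(1) : url;
    remaining = NON_ASCII.matcher(remaining).replaceAll("");
    return NON_FILENAME_PATTERN.matcher(remaining).replaceAll("_");
  }

  // Assumption: fixed '3/3' pattern. Pad to at least 6 digits and keep the
  // last 3 as the file prefix; extra digits go into the directory part.
  static String numericPath(long position) {
    String digits = String.format(Locale.ROOT, "%06d", position);
    int split = digits.length() - 3;
    return digits.substring(0, split) + "/" + digits.substring(split);
  }

  static String toPath(long position, String url) {
    return numericPath(position) + "_" + cleanUrl(url);
  }

  public static void main(String[] args) {
    // prints 000/001_example.org_file.html
    System.out.println(toPath(1, "http://example.org/file.html?param=value"));
    // prints 1234/567
    System.out.println(numericPath(1234567));
  }
}
```

Because the body group (.*?) is lazy and the query and fragment groups are optional, the matcher extends the body only as far as needed for the rest of the expression to reach the end of the string, which is why everything from the first '?' or '#' onwards is dropped.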