Hi:

I'm using Nutch 0.9 as the base for a project. When I test with local files
on a Windows XP SP2 system, the protocol-file plugin breaks.

For example:

        String url = "file:///C:/cygwin/home/data/train/cv/Brendan%20O'Leary%20CV%20html.html";
        try {
            ProtocolOutput output = new ProtocolFactory(conf).getProtocol(url)
                    .getProtocolOutput(new Text(url), new CrawlDatum());
            Content content = output.getContent();
            return new ParseUtil(conf).parse(content);
        } catch (Exception e) {
            e.printStackTrace();
        }

I get this exception:

org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:100)
    at org.apache.nutch.scoring.keyword.ScoringUtilTest.parse(ScoringUtilTest.java:59)
    at org.apache.nutch.scoring.keyword.ScoringUtilTest.testSectionClassification(ScoringUtilTest.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

I looked into the protocol-file plugin source and found this in FileResponse:

       // url.toURI() is only in j2se 1.5.0
      //java.io.File f = new java.io.File(url.toURI());
      java.io.File f = new java.io.File(path);

So I switched to the commented-out line, which converts the URL to a URI
before building the File. This time it works.
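
To show why the two lines behave differently, here is a minimal sketch. It
assumes that `path` in FileResponse is taken from URL.getPath() (an
assumption; the snippet above does not show where `path` comes from). The
class and method names are made up for illustration. URL.getPath() leaves
percent-escapes such as %20 in place, so java.io.File looks for a file whose
name literally contains "%20", while File(URI) decodes them first.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class FileUrlPathDemo {

    // Path exactly as it appears in the URL; escapes like %20 are NOT decoded.
    static String rawPath(String spec) {
        return toUrl(spec).getPath();
    }

    // Path after URL.toURI() and File(URI); the escapes are decoded here.
    static String decodedPath(String spec) {
        try {
            return new File(toUrl(spec).toURI()).getPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Hypothetical helper: builds a java.net.URL from a string.
    private static URL toUrl(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String spec =
            "file:///C:/cygwin/home/data/train/cv/Brendan%20O'Leary%20CV%20html.html";
        System.out.println("raw:     " + rawPath(spec));
        System.out.println("decoded: " + decodedPath(spec));
    }
}
```

The exact form of the decoded path depends on the platform (separator and
drive letter handling), but in both cases the file name part contains real
spaces instead of %20.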

So why not just use url.toURI() to get it right? Is it for JDK 1.4
compatibility? If that's the case, I think a better solution is to detect
the JDK version and use url.toURI() where it exists, falling back to manual
path translation only on JDK 1.4.
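
A rough sketch of that fallback idea, not a patch against the plugin: rather
than parsing the version string, it checks by reflection whether
URL.toURI() exists, and decodes the path by hand when it does not. The class
FileUrlResolver and its methods are hypothetical names, and the manual
branch assumes UTF-8 escapes. URLDecoder is meant for form data and turns
'+' into a space, so the sketch protects '+' before decoding.

```java
import java.io.File;
import java.io.UnsupportedEncodingException;
import java.lang.reflect.Method;
import java.net.URI;
import java.net.URL;
import java.net.URLDecoder;

public class FileUrlResolver {

    // Looked up once; null when URL.toURI() is missing (pre-1.5 runtimes).
    private static final Method TO_URI = findToUri();

    private static Method findToUri() {
        try {
            return URL.class.getMethod("toURI", new Class[0]);
        } catch (NoSuchMethodException e) {
            return null;
        }
    }

    // Prefer url.toURI(); fall back to manual decoding of url.getPath().
    public static File toFile(URL url) {
        if (TO_URI != null) {
            try {
                return new File((URI) TO_URI.invoke(url, new Object[0]));
            } catch (Exception e) {
                // toURI() or File(URI) rejected the URL: use the manual path below.
            }
        }
        return new File(manualDecode(url.getPath()));
    }

    // Manual translation: decode %XX escapes but keep a literal '+'.
    static String manualDecode(String path) {
        try {
            return URLDecoder.decode(path.replaceAll("\\+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

With this, FileResponse could call something like
FileUrlResolver.toFile(url) in place of new java.io.File(path), assuming it
has the URL object at hand at that point.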

Sorry, maybe this should have been posted to the dev mailing list.
