Antwort: Re: Why does TestNodeWalker keep failing?

2009-06-16 Thread marcel . schnippe
Hi All, 

According to W3C's Excessive DTD Traffic post, we should not download any
DTD, because http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd is meant as
an identifier, not a resource to download, although it looks and works
like a URL.

  A while ago we put a system in place to monitor our servers for abusive
  request patterns and send 503 Service Unavailable responses with custom
  text depending on the nature of the abuse. Our hope was that the authors
  of misbehaving software and the administrators of sites who deployed it
  would notice these errors and make the necessary fixes to the software
  responsible.

  To read the DTD, one might be able to use an alternate URL based on the
  public identifier. Unfortunately, catalogs are not in widespread use, and
  W3C does nothing to promote them.

--
Best regards,
Marcel Schnippe
Changemanager PER
Provinzial Rheinland
Die Versicherung der Sparkassen
40195 Düsseldorf




Doğacan Güney doga...@gmail.com
13.06.2009 10:26
Please reply to: nutch-dev@lucene.apache.org
To: nutch-dev@lucene.apache.org
Subject: Re: Why does TestNodeWalker keep failing?






On Fri, Jun 12, 2009 at 15:12, Andrzej Bialecki a...@getopt.org wrote:
Doğacan Güney wrote:
Hi all,

Does anyone know why TestNodeWalker has been failing
for the last couple of days?

I can reproduce the error on my computer; the test log looks like
this:

Testsuite: org.apache.nutch.util.TestNodeWalker
Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 1.101 sec
------------- Standard Error -----------------
java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1241)
   at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
   at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
   at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)
   at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)
   at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
   at org.apache.nutch.util.TestNodeWalker.testSkipChildren(TestNodeWalker.java:63)

Hmm, error 503 is Service Unavailable. Either this is a genuine problem
at www.w3.org, or the site is not reachable from the machine that runs
the tests. I believe we should do something similar to what we did for
generating the web docs, i.e. use our own catalog or DTDs instead of
downloading DTDs from the net.

The DTD is declared like this (in TestNodeWalker.java):

private final static String WEBPAGE =
  "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
  // ... rest of the webpage

How can we switch to a local copy of that DTD? Perhaps we should just
remove that line; I don't know whether it does anything there.
 


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




-- 
Doğacan Güney


Re: Antwort: Re: Why does TestNodeWalker keep failing?

2009-06-16 Thread Andrzej Bialecki

marcel.schni...@provinzial.com wrote:


Hi All,

According to W3C's Excessive DTD Traffic post
http://www.w3.org/2005/06/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
we should not download any DTD, because
http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd is meant as an
identifier, not a resource to download, although it looks and works like
a URL.

  A while ago we put a system in place to monitor our servers for abusive
  request patterns and send 503 Service Unavailable responses with custom
  text depending on the nature of the abuse. Our hope was that the authors
  of misbehaving software and the administrators of sites who deployed it
  would notice these errors and make the necessary fixes to the software
  responsible.

  To read the DTD, one might be able to use an alternate URL based on the
  public identifier. Unfortunately, catalogs are not in widespread use, and
  W3C does nothing to promote them.


Thanks Marcel, this confirms my suspicion.

The proper fix is to use a local copy of DTDs, and set an 
XMLCatalogResolver on every XML parser to access these local copies. An 
interim workaround for TestNodeWalker is to turn off validation and turn 
off loading of external entities - I verified that the test passes then.
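
For reference, a minimal sketch of that interim workaround. It goes
through the JAXP DocumentBuilderFactory rather than the Xerces DOMParser
the test uses, so the class and method names (OfflineDomParsing,
parseWithoutDtdFetch) are made up for illustration. The feature IDs are
the standard Xerces/SAX ones; the same IDs should be usable with
DOMParser.setFeature, but that is an assumption I have not checked
against the Xerces version bundled with Nutch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: parses XML without touching the network.
class OfflineDomParsing {

    // Xerces feature: do not load the external DTD subset when not validating.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";
    // Standard SAX features controlling external entity resolution.
    static final String EXTERNAL_GENERAL_ENTITIES =
        "http://xml.org/sax/features/external-general-entities";
    static final String EXTERNAL_PARAMETER_ENTITIES =
        "http://xml.org/sax/features/external-parameter-entities";

    static Document parseWithoutDtdFetch(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);                       // turn off validation
        factory.setFeature(LOAD_EXTERNAL_DTD, false);       // never fetch the DTD
        factory.setFeature(EXTERNAL_GENERAL_ENTITIES, false);
        factory.setFeature(EXTERNAL_PARAMETER_ENTITIES, false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String page =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
            + "<html><body><ul><li>crawl</li></ul></body></html>";
        Document doc = parseWithoutDtdFetch(page);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The DOCTYPE line can stay in the test page this way. The catch is that
any entities the DTD would have declared are no longer available, so
this only suits documents that do not rely on them.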



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Why does TestNodeWalker keep failing?

2009-06-12 Thread Doğacan Güney
Hi all,

Does anyone know why TestNodeWalker has been failing
for the last couple of days?

I can reproduce the error on my computer; the test log looks like
this:

Testsuite: org.apache.nutch.util.TestNodeWalker
Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 1.101 sec
------------- Standard Error -----------------
java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1241)
   at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
   at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
   at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)
   at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)
   at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
   at org.apache.nutch.util.TestNodeWalker.testSkipChildren(TestNodeWalker.java:63)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at junit.framework.TestCase.runTest(TestCase.java:154)
   at junit.framework.TestCase.runBare(TestCase.java:127)
   at junit.framework.TestResult$1.protect(TestResult.java:106)
   at junit.framework.TestResult.runProtected(TestResult.java:124)
   at junit.framework.TestResult.run(TestResult.java:109)
   at junit.framework.TestCase.run(TestCase.java:118)
   at junit.framework.TestSuite.runTest(TestSuite.java:208)
   at junit.framework.TestSuite.run(TestSuite.java:203)
   at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
   at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
   at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)
------------- ---------------- ---------------

Testcase: testSkipChildren took 1.095 sec
FAILED
UL Content can NOT be found in the node
junit.framework.AssertionFailedError: UL Content can NOT be found in the node
   at org.apache.nutch.util.TestNodeWalker.testSkipChildren(TestNodeWalker.java:79)

I have no idea why we get a 503 there.

-- 
Doğacan Güney


Re: Why does TestNodeWalker keep failing?

2009-06-12 Thread Andrzej Bialecki

Doğacan Güney wrote:

Hi all,

Does anyone know why TestNodeWalker has been failing
for the last couple of days?

I can reproduce the error on my computer; the test log looks like
this:

Testsuite: org.apache.nutch.util.TestNodeWalker
Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 1.101 sec
------------- Standard Error -----------------
java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1241)
   at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
   at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
   at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)
   at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)
   at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
   at org.apache.nutch.util.TestNodeWalker.testSkipChildren(TestNodeWalker.java:63)


Hmm, error 503 is Service Unavailable. Either this is a genuine
problem at www.w3.org, or the site is not reachable from the machine
that runs the tests. I believe we should do something similar to what
we did for generating the web docs, i.e. use our own catalog or DTDs
instead of downloading DTDs from the net.
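
To make the "own DTDs" idea concrete, here is a rough sketch using a
plain SAX EntityResolver keyed on the public identifier. This is not a
real OASIS catalog and not what we did for the web docs; the class
LocalDtdResolver and its register/parse methods are invented for
illustration, and the one-line DTD in main is a stand-in for a real local
copy of xhtml1-strict.dtd (the real DTD pulls in further entity files
that would need registering too).

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver, not part of Nutch or Xerces: serves DTD text
// from memory so the parser never opens a connection to www.w3.org.
class LocalDtdResolver implements EntityResolver {

    private final Map<String, String> dtdByPublicId = new HashMap<>();

    void register(String publicId, String dtdText) {
        dtdByPublicId.put(publicId, dtdText);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtd = dtdByPublicId.get(publicId);
        if (dtd == null) {
            // Unknown identifier: return an empty entity instead of
            // letting the parser fall back to a network fetch.
            dtd = "";
        }
        InputSource source = new InputSource(new StringReader(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        LocalDtdResolver resolver = new LocalDtdResolver();
        // Stand-in DTD text; a real setup would load the local DTD file.
        resolver.register("-//W3C//DTD XHTML 1.0 Strict//EN",
            "<!ENTITY origin \"local-copy\">");
        String page =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
            + "<html><body>&origin;</body></html>";
        Document doc = parse(page, resolver);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The entity in the stand-in DTD is only there to show that the parser
really read the local text. Keying on the public identifier means the
DOCTYPE in existing pages does not have to change.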



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com