[ 
https://issues.apache.org/jira/browse/ANY23-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428460#comment-17428460
 ] 

Sebastian Nagel commented on ANY23-504:
---------------------------------------

Hi [~lewismc], thanks so far. In case it helps: I've tried to narrow the 
problem down and found that it's reproducible with Nutch (using NUTCH-2892) by running

{noformat}
strace -f bin/nutch parsechecker -Dparser.timeout=120 \
  -Dany23.extractors=rdf-xml -Dplugin.includes='protocol-file|parse-html|any23' \
  file:/path/to/BBC_News_Scotland.html
{noformat}

"strace -f" logs all system calls of the process and its children. The log 
output includes the following lines, which show that a DNS lookup for 
"www.w3.org" happened and that a connection was opened to the resulting IP address 
"128.30.52.100":
{noformat}
[pid 284098] socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) 
= 363
[pid 284098] setsockopt(363, SOL_IP, IP_RECVERR, [1], 4) = 0
[pid 284098] connect(363, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("127.0.0.53")}, 16) = 0
[pid 284098] poll([{fd=363, events=POLLOUT}], 1, 0) = 1 ([{fd=363, 
revents=POLLOUT}])
[pid 284098] sendmmsg(363, [{msg_hdr={msg_name=NULL, msg_namelen=0, 
msg_iov=[{iov_base="71\1 \0\1\0\0\0\0\0\1\3www\2w3\3org\0\0\1\0\1\0\0)\4"..., 
iov_len=39}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=39}, 
{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0036\1 
\0\1\0\0\0\0\0\1\3www\2w3\3org\0\0\34\0\1\0\0)\4"..., iov_len=39}], 
msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=39}], 2, MSG_NOSIGNAL) = 2
[pid 284098] poll([{fd=363, events=POLLIN}], 1, 5000) = 1 ([{fd=363, 
revents=POLLIN}])
[pid 284098] ioctl(363, FIONREAD, [55]) = 0
[pid 284098] recvfrom(363, 
"71\201\200\0\1\0\1\0\0\0\1\3www\2w3\3org\0\0\1\0\1\300\f\0\1"..., 2048, 0, 
{sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, 
[28->16]) = 55
[pid 284098] poll([{fd=363, events=POLLIN}], 1, 4999 <unfinished ...>
[pid 284098] <... poll resumed>)        = 1 ([{fd=363, revents=POLLIN}])
[pid 284098] ioctl(363, FIONREAD, [39]) = 0
[pid 284098] recvfrom(363, 
"\0036\201\200\0\1\0\0\0\0\0\1\3www\2w3\3org\0\0\34\0\1\0\0)\377"..., 65536, 0, 
{sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, 
[28->16]) = 39
[pid 284098] close(363)                 = 0
[pid 284098] socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 363
[pid 284098] setsockopt(363, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
[pid 284098] connect(363, {sa_family=AF_INET6, sin6_port=htons(80), 
sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:128.30.52.100", 
&sin6_addr), sin6_scope_id=0}, 28 <unfinished ...>
{noformat}

This way you can reproduce the issue even if the connection does not hang. 
The hang I originally saw was apparently just bad luck. Or good luck?

There are more hints in the strace log:
{noformat}
[pid 284098] sendto(363, "GET /MarkUp/DTD/xhtml-rdfa-1.dtd"..., 175, 0, NULL, 
0) = 175
{noformat}

So the request is evidently for the DTD referenced in the first line of the document:
{code:xml}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" 
"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
{code}

There's also an error message (it does not appear to make it to stderr or into 
any log file):
{noformat}
[pid 284133] write(2, "[Fatal Error] :1:104: External D"..., 179[Fatal Error] 
:1:104: External DTD: Failed to read external DTD 'xhtml-rdfa-1.dtd', because 
'http' access is not allowed due to restriction set by the accessExternalDTD 
property.
 <unfinished ...>
{noformat}
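
For illustration only, here is a minimal sketch of one generic way to keep a SAX parser from fetching an external DTD over HTTP: install an EntityResolver that answers every external entity with an empty stream. This is an assumption about a possible approach, not how Any23 or RDF4J configure their parsers. The class name OfflineDtdParse and the method countElements are hypothetical names made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Any23, RDF4J or Nutch.
public class OfflineDtdParse {

    /** Parses the given XML without network access and returns the number of elements seen. */
    static int countElements(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        XMLReader reader = parser.getXMLReader();

        // Answer every external entity (including the external DTD subset)
        // with an empty document, so the parser never opens a URL itself.
        reader.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));

        final int[] count = {0};
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });

        reader.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML+RDFa 1.0//EN\" "
            + "\"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd\">"
            + "<html><head/><body/></html>";
        System.out.println(countElements(xml));
    }
}
```

One trade-off of this sketch: because the DTD is replaced by an empty one, named entities that the real DTD would declare (for example &nbsp;) are undefined and cause a parse error if the document uses them.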

Hope this helps. Thanks!

> Optionally disable remote HTTP connections when resolving XML entities
> ----------------------------------------------------------------------
>
>                 Key: ANY23-504
>                 URL: https://issues.apache.org/jira/browse/ANY23-504
>             Project: Apache Any23
>          Issue Type: Improvement
>            Reporter: Sebastian Nagel
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 2.6
>
>
> The Any23 parser should optionally avoid opening HTTP connections when 
> parsing XML.
> While testing Nutch's Any23 plugin with 2.5 (NUTCH-2892) on the file 
> "BBC_News_Scotland.htm", the parser hung for about two minutes with an 
> open HTTP connection to "hans-moleman.w3.org" and the following stack:
> {noformat}
> "parse-0" #19 daemon prio=5 os_prio=0 cpu=1432.93ms elapsed=15.85s 
> tid=0x00007efc713bd800 nid=0x16ff4 runnable  [0x00007efc29f2d000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0(java.base@11.0.11/Native 
> Method)
>         at 
> java.net.SocketInputStream.socketRead(java.base@11.0.11/SocketInputStream.java:115)
>         at 
> java.net.SocketInputStream.read(java.base@11.0.11/SocketInputStream.java:168)
>         at 
> java.net.SocketInputStream.read(java.base@11.0.11/SocketInputStream.java:140)
>         at 
> java.io.BufferedInputStream.fill(java.base@11.0.11/BufferedInputStream.java:252)
>         at 
> java.io.BufferedInputStream.read1(java.base@11.0.11/BufferedInputStream.java:292)
>         at 
> java.io.BufferedInputStream.read(java.base@11.0.11/BufferedInputStream.java:351)
>         - locked <0x000000071be1bb68> (a java.io.BufferedInputStream)
>         at 
> sun.net.www.http.HttpClient.parseHTTPHeader(java.base@11.0.11/HttpClient.java:754)
>         at 
> sun.net.www.http.HttpClient.parseHTTP(java.base@11.0.11/HttpClient.java:689)
>         at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(java.base@11.0.11/HttpURLConnection.java:1615)
>         - locked <0x000000071be11040> (a 
> sun.net.www.protocol.http.HttpURLConnection)
>         at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(java.base@11.0.11/HttpURLConnection.java:1520)
>         - locked <0x000000071be11040> (a 
> sun.net.www.protocol.http.HttpURLConnection)
>         at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown 
> Source)
>         at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>         at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown 
> Source)
>         at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown 
> Source)
>         at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown 
> Source)
>         at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown 
> Source)
>         at 
> org.eclipse.rdf4j.common.xml.SimpleSAXParser.parse(SimpleSAXParser.java:197)
>         - locked <0x000000071bfe6f28> (a 
> org.eclipse.rdf4j.common.xml.SimpleSAXParser)
>         at org.eclipse.rdf4j.rio.trix.TriXParser.parse(TriXParser.java:177)
>         at org.eclipse.rdf4j.rio.trix.TriXParser.parse(TriXParser.java:134)
>         at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:86)
>         at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:39)
>         at 
> org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:523)
>         at 
> org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:265)
>         at org.apache.any23.Any23.extract(Any23.java:315)
>         at org.apache.any23.Any23.extract(Any23.java:483)
>         at org.apache.any23.Any23.extract(Any23.java:345)
>         at 
> org.apache.nutch.any23.Any23ParseFilter$Any23Parser.parse(Any23ParseFilter.java:106)
>         at 
> org.apache.nutch.any23.Any23ParseFilter$Any23Parser.<init>(Any23ParseFilter.java:81)
>         at 
> org.apache.nutch.any23.Any23ParseFilter.filter(Any23ParseFilter.java:153)
>         at 
> org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:55)
>         at 
> org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:257)
>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
>         at 
> java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
> {noformat}



