[ https://issues.apache.org/jira/browse/ANY23-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236410#comment-17236410 ]
Lewis John McGibbney edited comment on ANY23-457 at 11/20/20, 7:55 PM:
-----------------------------------------------------------------------

Using rover in master branch I cannot replicate this... After a few hours of debugging and writing local unit tests I am a bit puzzled. The [following code|https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/html/TagSoupParsingConfiguration.java#L62] definitely skips over the DOCTYPE declaration:

*TagSoupParsingConfiguration.java*
{code:java}
private static Document convert(org.jsoup.nodes.Document document) {
    Document w3cDoc = new org.apache.html.dom.HTMLDocumentImpl();
    org.jsoup.nodes.Element rootEl = document.children().first(); // SKIPS DOCTYPE
    if (rootEl != null) {
        NodeTraversor.traverse(new DocumentConverter(w3cDoc), rootEl);
    }
    return w3cDoc;
}
{code}

... however, I am not able to reproduce the bug above now. Closing off until I experience this again.

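For reference, the DOCTYPE repair proposed in the issue description (appending an empty system identifier when only a public identifier is present) could look roughly like the sketch below. This is only an illustration using plain java.util.regex; it is not wired into Any23's rule/fix machinery, and the names DoctypeFixer and addEmptySystemId are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Any23: a standalone sketch of the proposed repair.
final class DoctypeFixer {

    // Matches a DOCTYPE that has a PUBLIC identifier but no system identifier,
    // e.g. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    // Group 1 captures everything up to and including the quoted public identifier.
    private static final Pattern PUBLIC_ONLY_DOCTYPE = Pattern.compile(
            "(<!DOCTYPE\\s+[^\\s>]+\\s+PUBLIC\\s+(?:\"[^\"]*\"|'[^']*'))\\s*>",
            Pattern.CASE_INSENSITIVE);

    private DoctypeFixer() {
    }

    // Returns the input with an empty system identifier ("") appended to the first
    // PUBLIC-only DOCTYPE found; input without such a DOCTYPE is returned unchanged.
    static String addEmptySystemId(String html) {
        Matcher m = PUBLIC_ONLY_DOCTYPE.matcher(html);
        if (!m.find()) {
            return html;
        }
        // Plain concatenation avoids any special handling of '$' or '\' in replacement strings.
        return html.substring(0, m.start()) + m.group(1) + " \"\">" + html.substring(m.end());
    }
}
```

Calling DoctypeFixer.addEmptySystemId(...) on the header quoted in the issue should turn the PUBLIC-only DOCTYPE into the form with the trailing "", while a DOCTYPE that already carries a system identifier (or has no PUBLIC identifier at all) is left alone. Whether this alone would silence the "White spaces are required between publicId and systemId" error is an assumption I have not verified against the parser.
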
> Fix error: White spaces are required between publicId and systemId
> ------------------------------------------------------------------
>
>                 Key: ANY23-457
>                 URL: https://issues.apache.org/jira/browse/ANY23-457
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: fix, rule
>    Affects Versions: 2.4
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 2.5
>
>
> This problem is encountered when we attempt to parse the following HTML
> https://www-robotics.jpl.nasa.gov/links/index.cfm
> https://www-robotics.jpl.nasa.gov/patents/index.cfm
> ERROR rdf.BaseRDFExtractor - Error while parsing RDF document.
> White spaces are required between publicId and systemId
> If one looks at the HTML source you will see the following
> {code:html}
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> {code}
> Reading [this article|https://stackoverflow.com/a/9225499], it looks like we may be able to create a rule and 'fix' which would create the following
> {code:html}
> <!-- Notice the addition of "" -->
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">
> <html>
> <head>
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)