[ 
https://issues.apache.org/jira/browse/ANY23-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236410#comment-17236410
 ] 

Lewis John McGibbney edited comment on ANY23-457 at 11/20/20, 7:55 PM:
-----------------------------------------------------------------------

Using rover on the master branch, I cannot replicate this. After a few hours of 
debugging and writing local unit tests, I am a bit puzzled.
The [following code|https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/html/TagSoupParsingConfiguration.java#L62]
definitely skips over the DOCTYPE declaration:

*TagSoupParsingConfiguration.java*
{code:java}
        private static Document convert(org.jsoup.nodes.Document document) {
            Document w3cDoc = new org.apache.html.dom.HTMLDocumentImpl();

            org.jsoup.nodes.Element rootEl = document.children().first();   // SKIPS DOCTYPE
            if (rootEl != null) {
                NodeTraversor.traverse(new DocumentConverter(w3cDoc), rootEl);
            }

            return w3cDoc;
        }
{code}
... however, I am no longer able to reproduce the bug above. Closing this off 
until I encounter it again. 
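
For anyone who wants to see how the reported message can arise outside of Any23, here is a minimal sketch that feeds the DOCTYPE from the issue description straight to the JDK's built-in XML parser. This is not Any23 code: the class name {{DoctypeReproSketch}} and the helper {{parseError}} are hypothetical, and the exact wording of the error depends on the JDK's bundled parser and locale. It only illustrates that a PUBLIC identifier without a system identifier is rejected by an XML parser, which would explain the error if the DOCTYPE ever reached one.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Any23.
final class DoctypeReproSketch {

    /** Parses the markup as XML; returns the fatal error message, or null if it parsed cleanly. */
    static String parseError(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Same DOCTYPE as in the issue description: PUBLIC identifier, no system identifier.
        String markup = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
                + "<html><head/></html>";
        System.out.println(parseError(markup));
    }
}
```

Running {{main}} prints whatever message the parser reports for that DOCTYPE; the same markup without the DOCTYPE line parses without error.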



> Fix error: White spaces are required between publicId and systemId
> ------------------------------------------------------------------
>
>                 Key: ANY23-457
>                 URL: https://issues.apache.org/jira/browse/ANY23-457
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: fix, rule
>    Affects Versions: 2.4
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 2.5
>
>
> This problem is encountered when we attempt to parse the HTML at the following URLs:
> https://www-robotics.jpl.nasa.gov/links/index.cfm
> https://www-robotics.jpl.nasa.gov/patents/index.cfm
> ERROR rdf.BaseRDFExtractor - Error while parsing RDF document.
> White spaces are required between publicId and systemId
> Looking at the HTML source, one sees the following:
> {code:html}
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> {code}
> Reading [this article|https://stackoverflow.com/a/9225499], it looks like we 
> may be able to create a rule and 'fix' which would produce the following:
> {code:html}
> <!-- Notice the addition of "" -->
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">
> <html>
> <head>
> {code}
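
The rewrite proposed above could be sketched roughly as follows. This is a plain string-level illustration, not an implementation against Any23's rule/fix interfaces: the class {{DoctypeFixSketch}} and the method {{addEmptySystemId}} are hypothetical names, and the regular expression assumes a single DOCTYPE with a quoted PUBLIC identifier.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Any23.
final class DoctypeFixSketch {

    // Matches <!DOCTYPE name PUBLIC "publicId"> only when no system identifier follows.
    private static final Pattern PUBLIC_WITHOUT_SYSTEM = Pattern.compile(
            "(<!DOCTYPE\\s+[^\\s>]+\\s+PUBLIC\\s+(?:\"[^\"]*\"|'[^']*'))\\s*>",
            Pattern.CASE_INSENSITIVE);

    /** Appends an empty system identifier ("") to the first DOCTYPE that lacks one. */
    static String addEmptySystemId(String html) {
        // replaceFirst returns the input unchanged when the pattern does not match.
        return PUBLIC_WITHOUT_SYSTEM.matcher(html).replaceFirst("$1 \"\">");
    }

    public static void main(String[] args) {
        String html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n<html>\n<head>";
        System.out.println(addEmptySystemId(html));
    }
}
```

A DOCTYPE that already carries a system identifier does not match the pattern and is left untouched, so applying the rewrite twice is harmless.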



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
