[ https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338229#comment-16338229 ]

ASF GitHub Bot commented on ANY23-326:
--------------------------------------

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/any23/pull/59#discussion_r163679147
  
    --- Diff: core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
    @@ -105,7 +109,24 @@ public void run(
                 parser.getParserConfig().addNonFatalError(BasicParserSettings.NORMALIZE_DATATYPE_VALUES);
                 //ByteBuffer seems to represent incorrect content. Need to make sure it is the content
                 //of the <script> node and not anything else!
    -            parser.parse(in, extractionContext.getDocumentIRI().stringValue());
    +            RDFFormat format = parser.getRDFFormat();
    +            String iri = extractionContext.getDocumentIRI().stringValue();
    +
    +            if (format.hasFileExtension("xhtml")) {
    --- End diff --
    
    What happens when the file suffix is just .html?
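
To make the question concrete: the guard in the diff asks the parser's format whether "xhtml" is one of its registered extensions, so the outcome depends on the format's extension list rather than on the suffix of the document being parsed. A minimal sketch of that behaviour follows. The class `ExtensionGuardSketch`, its helper method, and the three extension lists are hypothetical stand-ins rather than the library's own code, and the case-insensitive match is an assumption.

```java
import java.util.List;

public class ExtensionGuardSketch {

    // Hypothetical stand-in for a format-level extension check.
    // Assumption: matching is case-insensitive against the format's own list.
    public static boolean hasFileExtension(List<String> formatExtensions, String extension) {
        for (String registered : formatExtensions) {
            if (registered.equalsIgnoreCase(extension)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative extension lists only; the real formats may declare others.
        List<String> xhtmlOnly = List.of("xhtml");
        List<String> htmlOnly = List.of("html");
        List<String> both = List.of("xhtml", "html");

        // Mirrors the guard in the diff: format.hasFileExtension("xhtml")
        System.out.println("xhtml only: " + hasFileExtension(xhtmlOnly, "xhtml"));
        System.out.println("html only:  " + hasFileExtension(htmlOnly, "xhtml"));
        System.out.println("both:       " + hasFileExtension(both, "xhtml"));
    }
}
```

Under these assumptions, a format that registers only "html" would skip the new branch, while one that registers both extensions would take it even for a document whose suffix is .html.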


> parsing unclosed meta and input tags fails
> ------------------------------------------
>
>                 Key: ANY23-326
>                 URL: https://issues.apache.org/jira/browse/ANY23-326
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: CLI
>    Affects Versions: 2.1
>         Environment: ubuntu 17.04
>            Reporter: Ben Roberts
>            Priority: Major
>             Fix For: 2.2
>
>
> Parsing fails as soon as it hits an unclosed input or meta tag. As an example, try:
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the matching end-tag "</input>".
>  
> As best I could tell, the issue might be that this is using a very old version of jsoup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
