[ 
https://issues.apache.org/jira/browse/ANY23-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-443.
-------------------------------
      Assignee: Hans Brende
    Resolution: Fixed

> Improve efficiency of RDFa Extractor
> ------------------------------------
>
>                 Key: ANY23-443
>                 URL: https://issues.apache.org/jira/browse/ANY23-443
>             Project: Apache Any23
>          Issue Type: Improvement
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.4
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Our RDFa Extractor is terribly inefficient:
> 1st, we parse the HTML "tag soup" input stream into a DOM using Jsoup.
> 2nd, we serialize the DOM back into an input stream containing strictly 
> valid XML, to avoid errors in the underlying semargl parser.
> 3rd, the underlying semargl parser re-parses this input stream as XML and 
> hands off XML streaming events to its underlying XmlSink.
> 4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn 
> hands them back to Any23.
> I propose cutting out all these intermediate steps by simply walking the 
> original jsoup DOM and handing our own XML events directly to semargl's 
> XmlSink, which we will configure to give RDF events directly back to Any23. 
> This will also allow us to get rid of most (or possibly all) of the various 
> HTML-to-XML "fixups" we had to implement to prevent extraction failures.
> ----
> *TL;DR:*
>  
> {{Jsoup → InputStream → RDF4J → XMLReader → RdfaParser → RDF4J → Any23}} 
> *becomes*
> {{Jsoup → RdfaParser → Any23}} 
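
To illustrate the proposal above, here is a minimal sketch of the general technique: walk a DOM that is already in memory and fire streaming events directly at a handler, instead of serializing the DOM and parsing it a second time. It is not the Any23 implementation. To stay self-contained it uses only JDK classes as stand-ins: org.w3c.dom takes the place of the jsoup DOM, and a SAX ContentHandler takes the place of semargl's XmlSink. The names DomWalkSketch, RecordingHandler, walk and eventsFor are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: JDK DOM and SAX types stand in for jsoup and semargl's XmlSink.
final class DomWalkSketch {

    /** Stand-in sink: records each streaming event it receives as a string. */
    static final class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            StringBuilder sb = new StringBuilder("start:").append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                sb.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
            }
            events.add(sb.toString());
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text:" + text);
            }
        }
    }

    /** Walks the in-memory tree and fires events on the handler; nothing is re-serialized. */
    static void walk(Node node, ContentHandler handler) throws SAXException {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: {
                handler.startDocument();
                walkChildren(node, handler);
                handler.endDocument();
                break;
            }
            case Node.ELEMENT_NODE: {
                // Copy the DOM attributes into a SAX attribute list.
                AttributesImpl attrs = new AttributesImpl();
                NamedNodeMap map = node.getAttributes();
                for (int i = 0; i < map.getLength(); i++) {
                    Node a = map.item(i);
                    attrs.addAttribute("", a.getNodeName(), a.getNodeName(), "CDATA", a.getNodeValue());
                }
                String name = node.getNodeName();
                handler.startElement("", name, name, attrs);
                walkChildren(node, handler);
                handler.endElement("", name, name);
                break;
            }
            case Node.TEXT_NODE: {
                char[] chars = node.getNodeValue().toCharArray();
                handler.characters(chars, 0, chars.length);
                break;
            }
            default:
                break; // comments, processing instructions, etc. are skipped in this sketch
        }
    }

    private static void walkChildren(Node parent, ContentHandler handler) throws SAXException {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, handler);
        }
    }

    /** Builds a DOM once (the stand-in for the jsoup parse), then walks it directly. */
    static List<String> eventsFor(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            RecordingHandler handler = new RecordingHandler();
            walk(doc, handler);
            return handler.events;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        eventsFor("<p property=\"dc:title\">Hello</p>").forEach(System.out::println);
    }
}
```

In this sketch the only parse is the one that builds the DOM; every later step is a method call on the handler, which is the shape of the shortened pipeline in the TL;DR.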



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
