[
https://issues.apache.org/jira/browse/ANY23-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hans Brende resolved ANY23-443.
-------------------------------
Assignee: Hans Brende
Resolution: Fixed
> Improve efficiency of RDFa Extractor
> ------------------------------------
>
> Key: ANY23-443
> URL: https://issues.apache.org/jira/browse/ANY23-443
> Project: Apache Any23
> Issue Type: Improvement
> Reporter: Hans Brende
> Assignee: Hans Brende
> Priority: Major
> Fix For: 2.4
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> Our RDFa Extractor is terribly inefficient.
> 1st, we parse the HTML "tag soup" input stream into a DOM using Jsoup.
> 2nd, we serialize the DOM back into an input stream containing strictly
> valid XML, to avoid errors in the underlying semargl parser.
> 3rd, the underlying semargl parser re-parses this input stream as XML and
> hands off XML streaming events to its underlying XmlSink.
> 4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn
> hands them back to Any23.
> I propose cutting out all these intermediate steps by simply walking the
> original jsoup DOM and handing our own XML events directly to semargl's
> XmlSink, which we will configure to give RDF events directly back to Any23.
> This will also allow us to get rid of most (or possibly all) of the various
> HTML-to-XML "fixups" we had to implement to prevent extraction failures.
> ----
> *TL;DR:*
>
> {{Jsoup → InputStream → RDF4J → XMLReader → RdfaParser → RDF4J → Any23}}
> *becomes*
> {{Jsoup → RdfaParser → Any23}}
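
The core of the proposal above is to walk an already-parsed DOM and emit
SAX-style events directly, instead of serializing to XML and parsing it again.
A minimal sketch of that idea follows. It uses only JDK classes: the Node
class is a hypothetical stand-in for a jsoup element, and the recording
handler is a stand-in for semargl's XmlSink, which is assumed here to accept
org.xml.sax.ContentHandler callbacks. It is not the code from the actual fix.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class DomToSaxSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Hypothetical stand-in for a parsed HTML element (e.g. a jsoup Element). */
    static final class Node {
        final String tag;
        final Map<String, String> attrs = new LinkedHashMap<>();
        final List<Object> children = new ArrayList<>(); // each child is a Node or a String (text)

        Node(String tag) { this.tag = tag; }
        Node attr(String name, String value) { attrs.put(name, value); return this; }
        Node add(Object child) { children.add(child); return this; }
    }

    /** Depth-first walk that emits one start/end element pair per node. */
    static void walk(Node node, ContentHandler sink) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        for (Map.Entry<String, String> e : node.attrs.entrySet()) {
            atts.addAttribute("", e.getKey(), e.getKey(), "CDATA", e.getValue());
        }
        sink.startElement(XHTML_NS, node.tag, node.tag, atts);
        for (Object child : node.children) {
            if (child instanceof Node) {
                walk((Node) child, sink);
            } else {
                char[] text = child.toString().toCharArray();
                sink.characters(text, 0, text.length);
            }
        }
        sink.endElement(XHTML_NS, node.tag, node.tag);
    }

    /** Streams a whole tree to the sink; no intermediate XML text is produced. */
    static void stream(Node root, ContentHandler sink) throws SAXException {
        sink.startDocument();
        walk(root, sink);
        sink.endDocument();
    }

    /** Records the element and text events a sink would receive, for illustration. */
    static List<String> record(Node root) {
        List<String> events = new ArrayList<>();
        try {
            stream(root, new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes a) {
                    events.add("start:" + local);
                }
                @Override
                public void characters(char[] ch, int start, int length) {
                    events.add("text:" + new String(ch, start, length));
                }
                @Override
                public void endElement(String uri, String local, String qName) {
                    events.add("end:" + local);
                }
            });
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        Node root = new Node("div").attr("vocab", "http://schema.org/")
                .add(new Node("span").attr("property", "name").add("Any23"));
        // prints [start:div, start:span, text:Any23, end:span, end:div]
        System.out.println(record(root));
    }
}
```

Because the events come straight from the tree, the sink never sees malformed
markup, which is why the HTML-to-XML "fixups" mentioned above could become
unnecessary.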