[ 
https://issues.apache.org/jira/browse/COR-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jan iversen updated COR-20:
---------------------------
    Component/s: DocFormats - platform
                 DocFormats - core

> Write an XML/HTML parser
> ------------------------
>
>                 Key: COR-20
>                 URL: https://issues.apache.org/jira/browse/COR-20
>             Project: Corinthia
>          Issue Type: Improvement
>          Components: DocFormats - core, DocFormats - platform
>            Reporter: Peter Kelly
>             Fix For: 0.5
>
>
> Currently we rely on libxml2 and HTML Tidy for parsing XML and HTML, 
> respectively. In both cases we are only using the parsing functions of 
> these libraries, not other features such as the DOM tree.
> Parsing XML is not very difficult. HTML is slightly harder, because of all 
> the ambiguities that arise from the poorly-defined parsing rules in earlier 
> versions of the spec ("make a best effort" became "replicate what Internet 
> Explorer does" because almost every site violated the rules). However, the 
> HTML5 spec now defines a proper parsing algorithm that deals with these 
> ambiguities. We'll also need to take into account which tags must have a 
> corresponding close tag and which tags do not require one.
> Having our own parser will simplify dependencies a lot, particularly with the 
> somewhat awkward HTML Tidy.
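
As a rough illustration of the close-tag point in the description, here is a minimal sketch of how a parser might classify elements. It is written in Python for brevity and is not taken from the Corinthia code base. The function name close_tag_rule and the three labels are hypothetical, and the element sets follow the HTML specification's void and optional-end-tag lists as I understand them; they may not match every revision of the spec.

```python
# Hypothetical helper, not part of Corinthia: classify an HTML element
# by how its close tag is treated.

# Void elements never have a close tag (assumed list, may vary by spec revision).
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Elements whose close tag may be omitted in certain contexts (assumed list).
OPTIONAL_CLOSE_ELEMENTS = frozenset({
    "html", "head", "body", "p", "li", "dt", "dd", "rt", "rp",
    "optgroup", "option", "colgroup", "caption",
    "thead", "tbody", "tfoot", "tr", "td", "th",
})


def close_tag_rule(tag_name: str) -> str:
    """Return "void", "optional" or "required" for the given tag name."""
    name = tag_name.lower()  # HTML tag names are case-insensitive
    if name in VOID_ELEMENTS:
        return "void"
    if name in OPTIONAL_CLOSE_ELEMENTS:
        return "optional"
    return "required"
```

A real parser would also need the context rules that decide when an optional close tag is implied, which this lookup does not attempt to cover.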



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
