[ https://issues.apache.org/jira/browse/COR-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jan iversen updated COR-20:
---------------------------
    Component/s: DocFormats - platform
                 DocFormats - core

> Write an XML/HTML parser
> ------------------------
>
>                 Key: COR-20
>                 URL: https://issues.apache.org/jira/browse/COR-20
>             Project: Corinthia
>          Issue Type: Improvement
>          Components: DocFormats - core, DocFormats - platform
>            Reporter: Peter Kelly
>             Fix For: 0.5
>
>
> Currently we rely on libxml2 and HTML Tidy for parsing XML and HTML,
> respectively. In both cases we use only the parsing functions of these
> libraries, not other features such as the DOM tree.
> Parsing XML is not very difficult. HTML is slightly harder, because of all
> the ambiguities that arise from the poorly-defined parsing rules in earlier
> versions of the spec ("make a best effort" became "replicate what Internet
> Explorer does" because almost every site violated the rules). However, the
> HTML5 spec now defines a proper parsing algorithm that deals with said
> ambiguities. We'll also need to take into account which tags must have a
> corresponding closing tag and which tags do not require one.
> Having our own parser will simplify dependencies a lot, particularly with the
> somewhat awkward HTML Tidy.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)