[ https://issues.apache.org/jira/browse/TIKA-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated TIKA-172:
-------------------------------

    Attachment: TIKA-172.patch

patch for ODF support

> New Open Document Parser that emits structured XHTML content.
> --------------------------------------------------------------
>
>                 Key: TIKA-172
>                 URL: https://issues.apache.org/jira/browse/TIKA-172
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.2-incubating
>            Reporter: Uwe Schindler
>         Attachments: TIKA-172.patch
>
>
> The current Open Document parser is very simplistic. It creates only a
> single paragraph containing the whole text content of an ODF document. A
> further problem is that all whitespace is stripped.
>
> The attached patch is a new SAX-based (and therefore low-memory) parser for
> ODF that uses no external libraries. The structure of ODF content.xml
> files is very clean (and identical for all types of documents) and maps very
> well to XHTML. Paragraphs can be mapped to <p> tags and headings to <hX>
> tags. Tables (and therefore spreadsheets) follow the same rules as in HTML.
>
> The idea behind this parser is a simple tag mapping approach. A new
> ContentHandlerDecorator in the o.a.t.sax package can simply map element
> names and attributes using a Map<javax.xml.namespace.QName,...>. For each
> element mapping, a second map,
> Map<javax.xml.namespace.QName,javax.xml.namespace.QName>, is available that
> maps the attributes. All attributes without a mapping are thrown away. Tag
> names not in the mapping are also not reported to the delegate.
>
> With this new decorator, it is possible to map all ODF content.xml names to
> XHTML using a static map in the parser class. In addition, the SAX handler
> that receives the parsing events (it extends ElementMappingContentHandler)
> does some extra handling for special cases in ODF:
> a) Only direct content of tags from the text: namespace is reported to
> characters(); this excludes style tags and so on.
> b) Some tags and *all* their content are left out (templates for TOC,
> additional cells for col/rowspan handling).
> c) Mapping of <text:h> to HTML <hX> is done by using the heading level (in
> ODF, an attribute of <text:h>).
>
> As there are still some OpenOffice version 1.0 documents around (.sxw files)
> that use old namespace declarations in meta.xml and content.xml (the current
> parser fails to parse metadata and content of such documents), an additional
> ContentHandlerDecorator is used that maps all old namespaces beginning with
> "http://openoffice.org/2000/" to the "urn:oasis..." ones.
> If support for such old document types is not needed, we could simply leave
> out this additional decorator.
>
> This is a very clean and well-working approach for ODF files. In my opinion,
> this could also be done in a similar way for OpenXML files for MS Office
> 2007. I looked into the new POI version, which has text extraction support
> for OpenXML, but it uses a lot of additional XML parser libraries and DOM
> trees, does not use SAX, and is memory intensive. I will read the specs from
> Microsoft over the next few days, and maybe I will create the same
> infrastructure for OpenXML, too. As POI is for the OLE2 document format, it
> should only be used for that and not for the XML-based OpenXML.
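For readers who want to see the tag mapping idea in concrete form, here is a minimal sketch of such a decorator. It is not the code from the attached patch: the class name ElementMappingSketch, the nested Target type, and the sample mapping of text:style-name to class are all assumptions made for illustration. It shows only the basic behaviour described in the issue: mapped elements are renamed, unmapped elements are not reported to the delegate, and unmapped attributes are dropped. The special cases a) to c) are left out, so character data is simply passed through.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only; all names here are hypothetical, not from the patch.
class ElementMappingSketch extends DefaultHandler {

    /** Target element name plus the attribute renames allowed on it. */
    static final class Target {
        final QName name;
        final Map<QName, QName> attributes;
        Target(QName name, Map<QName, QName> attributes) {
            this.name = name;
            this.attributes = attributes;
        }
    }

    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private final ContentHandler delegate;
    private final Map<QName, Target> mappings;

    ElementMappingSketch(ContentHandler delegate, Map<QName, Target> mappings) {
        this.delegate = delegate;
        this.mappings = mappings;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        Target target = mappings.get(new QName(uri, local));
        if (target == null) {
            return; // element not in the mapping: not reported to the delegate
        }
        AttributesImpl mapped = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            QName to = target.attributes.get(new QName(atts.getURI(i), atts.getLocalName(i)));
            if (to != null) { // attributes without a mapping are thrown away
                mapped.addAttribute(to.getNamespaceURI(), to.getLocalPart(),
                        to.getLocalPart(), atts.getType(i), atts.getValue(i));
            }
        }
        delegate.startElement(target.name.getNamespaceURI(), target.name.getLocalPart(),
                target.name.getLocalPart(), mapped);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        Target target = mappings.get(new QName(uri, local));
        if (target != null) {
            delegate.endElement(target.name.getNamespaceURI(), target.name.getLocalPart(),
                    target.name.getLocalPart());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length); // no text filtering in this sketch
    }

    /** Parses a tiny ODF-like fragment and returns the mapped events as a string. */
    static String demo() {
        final StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(l);
                for (int i = 0; i < a.getLength(); i++) {
                    out.append(' ').append(a.getLocalName(i))
                       .append("=\"").append(a.getValue(i)).append('"');
                }
                out.append('>');
            }
            @Override
            public void endElement(String u, String l, String q) {
                out.append("</").append(l).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        Map<QName, Target> mappings = new HashMap<>();
        // Assumed sample mapping: text:p -> XHTML p, text:style-name -> class.
        mappings.put(new QName(TEXT_NS, "p"), new Target(new QName(XHTML_NS, "p"),
                Collections.singletonMap(new QName(TEXT_NS, "style-name"), new QName("class"))));
        String xml = "<text:p xmlns:text=\"" + TEXT_NS + "\" text:style-name=\"P1\""
                + " text:id=\"x\">Hello <text:span>world</text:span></text:p>";
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so uri/localName are populated
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new ElementMappingSketch(recorder, mappings));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the unmapped <text:span> and the unmapped text:id attribute disappear, while the text inside the span still reaches the delegate. A handler that extends the decorator, as the issue describes, would be the place to filter character data and skip whole subtrees.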