[
https://issues.apache.org/jira/browse/TIKA-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jukka Zitting resolved TIKA-172.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.3
Assignee: Jukka Zitting
Good stuff, thanks! Applied the patch in revision 722663.
I replaced all tabs with spaces and updated the code to better match the Sun
Java coding conventions used in Tika.
> New Open Document Parser that emits structured XHTML content.
> -------------------------------------------------------------
>
> Key: TIKA-172
> URL: https://issues.apache.org/jira/browse/TIKA-172
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 0.2
> Reporter: Uwe Schindler
> Assignee: Jukka Zitting
> Fix For: 0.3
>
> Attachments: TIKA-172.patch
>
>
> The current Open Document parser is very simplistic. It only creates a
> single paragraph containing the whole text content of an ODF document, and
> it also strips all whitespace.
> The attached patch is a new SAX-based (and therefore low-memory) parser
> that needs no external ODF libraries. The structure of ODF content.xml
> files is very clean (and identical for all document types) and maps very
> well to XHTML. Paragraphs can be mapped to <p> tags and headings to
> <hX> tags. Tables (and therefore spreadsheets) follow the same structure
> as HTML tables.
> The idea behind this parser is a simple tag mapping approach. A new
> ContentHandlerDecorator in the o.a.t.sax package maps element names
> and attributes using a Map<javax.xml.namespace.QName,...>. For
> each element mapping, a second map,
> Map<javax.xml.namespace.QName,javax.xml.namespace.QName>, maps the
> attributes. Attributes without a mapping are dropped, and elements
> whose names are not in the mapping are not reported to the delegate.
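The mapping idea described above could be sketched roughly as follows. This is a hypothetical illustration, not the code from TIKA-172.patch: the class name MappingHandlerSketch, the Target holder and the demo() helper are invented for this example, and it uses only the standard org.xml.sax and javax.xml.namespace APIs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the code from TIKA-172.patch.
class MappingHandlerSketch extends DefaultHandler {

    // Target element name plus the attribute renames allowed on that element.
    static final class Target {
        final QName name;
        final Map<QName, QName> attributes;

        Target(QName name, Map<QName, QName> attributes) {
            this.name = name;
            this.attributes = attributes;
        }
    }

    private final ContentHandler delegate;
    private final Map<QName, Target> mappings;

    MappingHandlerSketch(ContentHandler delegate, Map<QName, Target> mappings) {
        this.delegate = delegate;
        this.mappings = mappings;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        Target target = mappings.get(new QName(uri, local));
        if (target == null) {
            return; // unmapped element: the delegate never sees it
        }
        AttributesImpl mapped = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            QName to = target.attributes.get(new QName(atts.getURI(i), atts.getLocalName(i)));
            if (to != null) { // attributes without a mapping are dropped
                mapped.addAttribute(to.getNamespaceURI(), to.getLocalPart(),
                        to.getLocalPart(), atts.getType(i), atts.getValue(i));
            }
        }
        delegate.startElement(target.name.getNamespaceURI(), target.name.getLocalPart(),
                target.name.getLocalPart(), mapped);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        Target target = mappings.get(new QName(uri, local));
        if (target != null) {
            delegate.endElement(target.name.getNamespaceURI(), target.name.getLocalPart(),
                    target.name.getLocalPart());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length); // text passes through unchanged
    }

    // Feeds a few hand-made events through the decorator and records what arrives.
    static String demo() {
        final String text = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
        final String xhtml = "http://www.w3.org/1999/xhtml";
        Map<QName, Target> map = new HashMap<QName, Target>();
        map.put(new QName(text, "p"),
                new Target(new QName(xhtml, "p"), Collections.<QName, QName>emptyMap()));
        final StringBuilder out = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(l).append('>');
            }

            @Override
            public void endElement(String u, String l, String q) {
                out.append("</").append(l).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        MappingHandlerSketch handler = new MappingHandlerSketch(recorder, map);
        try {
            handler.startElement(text, "p", "text:p", new AttributesImpl());
            handler.characters("Hi ".toCharArray(), 0, 3);
            handler.startElement(text, "span", "text:span", new AttributesImpl());
            handler.characters("there".toCharArray(), 0, 5);
            handler.endElement(text, "span", "text:span");
            handler.endElement(text, "p", "text:p");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

In this sketch only text:p is mapped, so the text:span start and end events are dropped while its text still reaches the delegate. A static map in the parser class would simply hold many such entries.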
> With this new decorator, all ODF content.xml names can be mapped to
> XHTML using a static map in the parser class. In addition, the SAX handler
> that receives the parsing events (it extends ElementMappingContentHandler)
> does some extra handling for special cases in ODF:
> a) Only the direct content of tags from the text: namespace is reported to
> characters(); this excludes style tags and so on.
> b) Some tags and *all* their content are left out (templates for the TOC,
> additional cells for col/rowspan handling).
> c) <text:h> is mapped to HTML <hX> using the heading level (in
> ODF, an attribute of <text:h>).
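Cases b) and c) could look roughly like the sketch below. It is an illustration under assumptions, not the patch itself: the names OdfSpecialCasesSketch, headingTag and SkipTracker are made up, the heading level is assumed to come from the text:outline-level attribute, and the skipped element in the usage note (table:covered-table-cell) is only an example of what "additional cells for col/rowspan handling" might mean.

```java
import java.util.Set;

// Hypothetical sketch of two of the ODF special cases; not the code from the patch.
class OdfSpecialCasesSketch {

    // (c) Turns the heading level of <text:h> into an XHTML tag name.
    // Assumption: the level is read from the text:outline-level attribute.
    // Missing or unparsable values fall back to h1; values are clamped to h1..h6.
    static String headingTag(String outlineLevel) {
        int level = 1;
        if (outlineLevel != null) {
            try {
                level = Integer.parseInt(outlineLevel.trim());
            } catch (NumberFormatException e) {
                level = 1;
            }
        }
        level = Math.max(1, Math.min(6, level));
        return "h" + level;
    }

    // (b) Tracks whether the current event lies inside a subtree that should be
    // left out entirely. Elements are identified by local name for brevity.
    static final class SkipTracker {
        private final Set<String> skipped;
        private int depth = 0; // > 0 while inside a skipped subtree

        SkipTracker(Set<String> skipped) {
            this.skipped = skipped;
        }

        // Call on startElement; returns true if the element should be reported.
        boolean start(String localName) {
            if (depth > 0 || skipped.contains(localName)) {
                depth++;
            }
            return depth == 0;
        }

        // Call on endElement; returns true if the element should be reported.
        boolean end(String localName) {
            boolean report = depth == 0;
            if (depth > 0) {
                depth--;
            }
            return report;
        }

        // Call on characters(); true while text should be suppressed.
        boolean skipping() {
            return depth > 0;
        }
    }
}
```

A handler would create the tracker once, for example with Collections.singleton("covered-table-cell"), call start() and end() from its own startElement and endElement, and forward events to the mapping layer only when they return true.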
> As there are still some OpenOffice version 1.0 documents around (.sxw files)
> that use old namespace declarations in meta.xml and content.xml (the current
> parser fails to parse the metadata and content of such documents), an
> additional ContentHandlerDecorator maps all old namespaces beginning with
> "http://openoffice.org/2000/" to the "urn:oasis..." ones.
> If support for such old document types is not needed, we could simply leave
> out this additional decorator.
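The namespace rewrite itself can be a small string function, sketched below. The issue text only says "urn:oasis..."; the full target prefix and the ":1.0" suffix used here are assumptions based on the OpenDocument 1.0 namespace names (for example urn:oasis:names:tc:opendocument:xmlns:text:1.0), and the class and method names are hypothetical.

```java
// Hypothetical sketch of the old-to-new namespace rewrite; not the code from the patch.
class OldNamespaceSketch {

    static final String OLD_PREFIX = "http://openoffice.org/2000/";

    // Assumption: the issue only writes "urn:oasis...", this is the ODF 1.0 prefix.
    static final String NEW_PREFIX = "urn:oasis:names:tc:opendocument:xmlns:";

    // Rewrites an OpenOffice 1.0 namespace URI; any other URI is returned unchanged.
    static String normalize(String uri) {
        if (uri != null && uri.startsWith(OLD_PREFIX)) {
            // Assumption: the part after the old prefix carries over, plus ":1.0".
            return NEW_PREFIX + uri.substring(OLD_PREFIX.length()) + ":1.0";
        }
        return uri;
    }
}
```

A decorator built on this would presumably apply normalize() to the namespace URI of every element, attribute and prefix mapping before passing the event on, so the rest of the parser only ever sees the new names.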
> This is a very clean and well-working approach for ODF files. In my opinion,
> the same could be done in a similar way for the OpenXML files of MS Office
> 2007. I looked into the new POI version, which has text extraction support
> for OpenXML, but it uses a lot of additional XML parser libraries and DOM
> trees instead of SAX, and is memory intensive. I will read the specs from
> Microsoft over the next few days, and maybe I will create the same
> infrastructure for OpenXML, too. As POI is meant for the OLE2 document
> format, it should only be used for that and not for the XML-based OpenXML.