RE: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.

Uwe Schindler Thu, 04 Dec 2008 02:01:03 -0800

Just one question: Is there interest in applying the same tag-mapping
approach to OpenXML (MS Office 2007) files? In my opinion, it is much more
resource-friendly (because it only extracts text from an XML file) than the
POI approach of building DOM trees and megabytes of DOM-tree mappings of the
OpenXML schema, with additional external dependencies. If so, I would start
reading the OpenXML specs and take a similar approach for OpenXML.


-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [EMAIL PROTECTED]

> From: Jukka Zitting (JIRA) [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, December 03, 2008 1:15 AM
> To: [email protected]
> Subject: [jira] Resolved: (TIKA-172) New Open Document Parser that emits
> structured XHTML content.
> 
> 
>      [ https://issues.apache.org/jira/browse/TIKA-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Jukka Zitting resolved TIKA-172.
> --------------------------------
> 
>        Resolution: Fixed
>     Fix Version/s: 0.3
>          Assignee: Jukka Zitting
> 
> Good stuff, thanks! Applied the patch in revision 722663.
> 
> I replaced all tabs with spaces and updated the code to better match the
> Sun Java coding conventions used in Tika.
> 
> > New Open Document Parser that emits structured XHTML content.
> > -------------------------------------------------------------
> >
> >                 Key: TIKA-172
> >                 URL: https://issues.apache.org/jira/browse/TIKA-172
> >             Project: Tika
> >          Issue Type: Improvement
> >          Components: parser
> >    Affects Versions: 0.2
> >            Reporter: Uwe Schindler
> >            Assignee: Jukka Zitting
> >             Fix For: 0.3
> >
> >         Attachments: TIKA-172.patch
> >
> >
> > The current Open Document parser is very simplistic. It only creates a
> single paragraph containing the whole text content of an ODF document.
> Another problem is that all whitespace is stripped.
> > The attached patch is a new SAX-based (and therefore low-memory) parser
> that uses no external libraries for ODF. The structure of ODF content.xml
> files is very clean (and identical for all types of documents) and maps
> very well to XHTML. Paragraphs can be mapped to <p> tags and headings to
> <hX> tags. Tables (and therefore spreadsheets) also follow the same rules
> as in HTML.
> > The idea behind this parser is a simple tag-mapping approach. A new
> ContentHandlerDecorator in the o.a.t.sax package can map element names
> and attributes using a Map<javax.xml.namespace.QName,...>. For each
> element mapping, a second mapping,
> Map<javax.xml.namespace.QName,javax.xml.namespace.QName>, is available
> that maps the attributes. Attributes that cannot be mapped are discarded.
> Tag names not in the mapping are also not reported to the delegate.
> > With this new decorator, it is possible to map all ODF content.xml names
> to XHTML using a static map in the parser class. In addition, the SAX
> handler that receives the parsing events (and extends
> ElementMappingContentHandler) does some extra handling for special cases
> in ODF:
> > a) Only the direct content of tags from the text: namespace is reported
> to characters(); this excludes style tags and so on.
> > b) Some tags and *all* of their content are left out (templates for the
> TOC, additional cells for col/rowspan handling).
> > c) The mapping of <text:h> to HTML <hX> uses the heading level (in ODF,
> an attribute of <text:h>).
> > As there are still some OpenOffice version 1.0 documents around (.sxw
> files) that use old namespace declarations in meta.xml and content.xml
> (the current parser fails to parse the metadata and content of such
> documents), an additional ContentHandlerDecorator is used that maps all
> old namespaces beginning with "http://openoffice.org/2000/" to the
> "urn:oasis..." ones.
> > If support for such old document types is not needed, we could simply
> leave out this additional decorator.
> > This is a very clean approach that works well for ODF files. In my
> opinion, the same could be done in a similar way for the OpenXML files of
> MS Office 2007. I looked into the new POI version, which has text
> extraction support for OpenXML, but it uses a lot of additional XML parser
> libraries and DOM trees, does not use SAX, and is memory-intensive. I will
> read the specs from Microsoft in the next few days, and maybe I will
> create the same infrastructure for OpenXML, too. As POI is meant for the
> OLE2 document format, it should be used only for that and not for the
> XML-based OpenXML.
> 
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
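
The tag-mapping idea described in the quoted issue can be sketched roughly
as follows. This is only an illustration, not the code from TIKA-172.patch:
the class name ElementMappingSketch, the MappingHandler class and the
convert() helper are hypothetical, the static map contains just two example
entries, and the output is written to a string instead of being forwarded to
a delegate ContentHandler. For brevity it drops all attributes rather than
mapping them, and it assumes the ODF heading level is carried in the
text:outline-level attribute.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of SAX-based element mapping from ODF to XHTML.
class ElementMappingSketch {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Static map from ODF element names to XHTML element names (example entries only).
    static final Map<QName, QName> MAPPINGS = new HashMap<>();
    static {
        MAPPINGS.put(new QName(TEXT_NS, "p"), new QName(XHTML_NS, "p"));
        MAPPINGS.put(new QName(TEXT_NS, "span"), new QName(XHTML_NS, "span"));
    }

    static class MappingHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        // One entry per open input element: the emitted XHTML name, or "" if dropped.
        private final Deque<String> open = new ArrayDeque<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            String name = "";
            if (TEXT_NS.equals(uri) && "h".equals(local)) {
                // Special case: <text:h> becomes <hX>, X taken from the heading level.
                name = "h" + headingLevel(atts.getValue(TEXT_NS, "outline-level"));
            } else {
                QName target = MAPPINGS.get(new QName(uri, local));
                if (target != null) {
                    name = target.getLocalPart();
                }
            }
            if (!name.isEmpty()) {
                out.append('<').append(name).append('>');
            }
            open.push(name);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            String name = open.pop();
            if (!name.isEmpty()) {
                out.append("</").append(name).append('>');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Report only text whose direct parent is a mapped element.
            String parent = open.peek();
            if (parent != null && !parent.isEmpty()) {
                out.append(escape(new String(ch, start, length)));
            }
        }
    }

    // Clamp the heading level to 1..6; fall back to 1 if missing or invalid.
    static int headingLevel(String value) {
        if (value == null) {
            return 1;
        }
        try {
            return Math.max(1, Math.min(6, Integer.parseInt(value.trim())));
        } catch (NumberFormatException e) {
            return 1;
        }
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Parse the given XML with a namespace-aware SAX parser and return the mapped markup.
    static String convert(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            MappingHandler handler = new MappingHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }
}
```

Because the handler only keeps a stack of open element names, memory use
grows with nesting depth rather than with document size, which is the
point of the SAX approach compared with building a DOM tree.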
