Hello, I am really interested in helping with Tika development. I like Tika's really good design based on SAX events!
> Hi,
>
> Check out the new ODF toolkit project [1]. Especially the ODFDOM
> library [2] seems like something we could use in Tika to better
> extract stuff from OpenDocument files.
>
> [1] http://odftoolkit.org/
> [2] http://odftoolkit.org/projects/odftoolkit/pages/ODFDOM
>
> BR,
>
> Jukka Zitting

I have seen this project, too. The problem with it is that it only provides mappings of the object definitions to customized DOM objects, which does not really help when importing the text. Tika's big advantage is that it can use SAX events when importing XML formats.

I am currently working on a patch for the ODF importer that maps the tags in content.xml to XHTML tags. This can be done very simply with a new SAX filter: TagMappingContentHandler.

I plan to post two patches to Tika's issue management system:

a) Import ODF documents as structured XHTML items, as mentioned above.
b) A better conversion of XHTML SAX streams to plain text (better than reading only the characters() events). The problem here is the difference between HTML block and span elements: just reading the element contents creates whitespace issues.

The same technique could be used for Open XML (Office 2007) items. Using the new POI classes is a pain (the same problem: thousands of new objects from a really big JAR file that contains only DOM mappings, not SAX mappings, for Open XML objects). A clean SAX solution would be preferable.

Just give me two more days to finish my patches!

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [EMAIL PROTECTED]
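
For illustration only, the two ideas in the message above (a tag-mapping SAX filter, and block-aware conversion of the resulting XHTML stream to plain text) might be sketched roughly as follows. This is not the actual patch. The class name TagMappingContentHandler comes from the message, but its constructor, the "{namespace}local" mapping keys, the BlockAwareTextHandler and OdfSaxSketch classes, the chosen ODF-to-XHTML pairs, and the small set of block elements are all assumptions made for this example. Only the JDK's SAX API is used.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Renames mapped elements to XHTML names; unmapped tags are dropped, text passes through. */
class TagMappingContentHandler extends DefaultHandler {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    private final ContentHandler delegate;
    private final Map<String, String> mapping; // hypothetical key format: "{namespace}local"

    TagMappingContentHandler(ContentHandler delegate, Map<String, String> mapping) {
        this.delegate = delegate;
        this.mapping = mapping;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        String target = mapping.get("{" + uri + "}" + local);
        if (target != null) {
            // Source attributes are discarded in this sketch.
            delegate.startElement(XHTML_NS, target, target, new AttributesImpl());
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        String target = mapping.get("{" + uri + "}" + local);
        if (target != null) {
            delegate.endElement(XHTML_NS, target, target);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }
}

/** Collects text and ends the line after each block element; inline elements add nothing. */
class BlockAwareTextHandler extends DefaultHandler {
    // Assumed, deliberately incomplete list of XHTML block elements.
    private static final Set<String> BLOCKS =
            new HashSet<>(Arrays.asList("p", "h1", "li", "div"));
    private final StringBuilder text = new StringBuilder();

    @Override
    public void endElement(String uri, String local, String qName) {
        if (BLOCKS.contains(local)) {
            text.append('\n');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }
}

class OdfSaxSketch {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Parses content.xml-like input and returns plain text via the two handlers above. */
    static String toPlainText(String contentXml) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("{" + TEXT_NS + "}p", "p");
        mapping.put("{" + TEXT_NS + "}h", "h1"); // simplification: ignores the outline level
        mapping.put("{" + TEXT_NS + "}span", "span");
        BlockAwareTextHandler sink = new BlockAwareTextHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(contentXml)),
                    new TagMappingContentHandler(sink, mapping));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return sink.getText();
    }

    public static void main(String[] args) {
        String xml = "<office:text"
                + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
                + " xmlns:text=\"" + TEXT_NS + "\">"
                + "<text:h>Title</text:h>"
                + "<text:p>Hello <text:span>wor</text:span>ld</text:p>"
                + "</office:text>";
        System.out.print(toPlainText(xml));
    }
}
```

With the sample input in main, reading only characters() events would run the heading and paragraph together as "TitleHello world". The block-aware handler instead puts "Title" and "Hello world" on separate lines, while the inline span inside "world" introduces no break. Because everything is event-driven, no DOM tree is built at any stage.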