Thank you for your answers. I created the related JIRA entry https://issues.apache.org/jira/browse/TIKA-727
Pablo.

-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Thursday, 22 September 2011 11:55
To: [email protected]
Subject: Re: HSLFExtractor & POI : Looking for better XHTML

On Thu, 22 Sep 2011, Pablo Queixalos wrote:
> Based on the PowerPointExtractor implementation, I rewrote the
> HSLFExtractor parser. This new impl produces a better XHTML but uses
> the org.apache.poi.hslf POI model.

If you wouldn't mind, please create a new JIRA entry for this, and upload your patch.

> - What is the philosophy of Tika parser implementations regarding their
> dependencies? I mean, must the HSLFExtractor implement the strict
> minimal code to integrate the top POI API, or is it ok to do it the
> way I did?

It's fine to use other parts of the API as needed. If you look at some of the other office parsers, you'll see that they all do that too.

> - Are there conventions for the XHTML produced by the parsers: global
> formatting (i.e., a <div> per page, <h1> for headers) and related CSS
> classes?

We try to keep the XHTML simple and clean, with sensible tags, and we try to keep it similar between different formats of the same type. Ideally, if you take the same presentation and save it as .ppt, .pptx and .odp, then Tika will give you quite similar XHTML back again. (We don't try to re-create the exact layout and formatting of the original document, however.)

Nick
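
To make the XHTML question above concrete, here is a minimal sketch of the "one <div> per page, <h1> for headers" shape Pablo asks about. It is an illustration only, not code from Tika or from the TIKA-727 patch: the class name SlideXhtmlSketch, the render method and the "slide" CSS class are all hypothetical, and each slide is modelled as a plain String array (title first, then its text runs) instead of the POI HSLF model. A real Tika parser would emit SAX events to a content handler rather than build a string.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; none of these names come from Tika or POI.
public class SlideXhtmlSketch {

    // Escape the characters that are significant in XHTML text content.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Render one <div> per slide: element 0 is the title, the rest are paragraphs.
    // The "slide" class name is an assumption made for this sketch.
    static String render(List<String[]> slides) {
        StringBuilder out = new StringBuilder("<body>\n");
        for (String[] slide : slides) {
            out.append("<div class=\"slide\">\n");
            out.append("<h1>").append(escape(slide[0])).append("</h1>\n");
            for (int i = 1; i < slide.length; i++) {
                out.append("<p>").append(escape(slide[i])).append("</p>\n");
            }
            out.append("</div>\n");
        }
        return out.append("</body>").toString();
    }

    public static void main(String[] args) {
        List<String[]> slides = Arrays.asList(
            new String[] {"Agenda", "Parsers & dependencies", "XHTML conventions"},
            new String[] {"Questions"});
        System.out.println(render(slides));
    }
}
```

Keeping the structure this plain is what lets the same presentation saved as .ppt, .pptx and .odp come out looking alike: the output carries slide boundaries, titles and text, and no layout detail.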
