RE: HSLFExtractor & POI : Looking for better XHTML

Pablo Queixalos Thu, 22 Sep 2011 03:24:14 -0700

Thank you for your answers.

I created the related JIRA entry: https://issues.apache.org/jira/browse/TIKA-727

Pablo.

-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Thursday, 22 September 2011 11:55
To: [email protected]
Subject: Re: HSLFExtractor & POI : Looking for better XHTML

On Thu, 22 Sep 2011, Pablo Queixalos wrote:
> Based on the PowerPointExtractor implementation, I rewrote the 
> HSLFExtractor parser. This new implementation produces better XHTML 
> but uses the org.apache.poi.hslf POI model.

If you wouldn't mind, please create a new JIRA entry for this, and upload your 
patch.

> - What is the philosophy of Tika parser implementations regarding 
> their dependencies? I mean, must the HSLFExtractor contain only the 
> strict minimum of code needed to integrate the top-level POI API, or 
> is it OK to do it the way I did?

It's fine to use other parts of the API as needed. If you look at some of the 
other office parsers you'll see that they all do that too.

> - Are there conventions for the XHTML produced by the parsers: global 
> formatting (i.e., a <div> per page, <h1> for headers) and related CSS 
> classes?

We try to keep the XHTML simple and clean, with sensible tags, and we try to 
keep it similar between different formats of the same type. Ideally if you take 
the same presentation and save it as .ppt, .pptx and .odp, then Tika will give 
you quite similar XHTML back again.

(We don't try to re-create the exact layout and formatting of the original 
document, however.)
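
To make the "similar XHTML across formats" idea concrete, here is a minimal 
sketch in plain Java. It is an illustration only, not Tika's actual parser 
code: the class name SlideXhtmlSketch, the method names, and the "slide" CSS 
class are all hypothetical, and a real parser would emit SAX events rather than 
build a string. The point is that once slide text has been extracted, the same 
simple markup (one <div> per slide, an <h1> for the title, a <p> per paragraph) 
can be produced whether the source was .ppt, .pptx or .odp.

```java
import java.util.List;

public class SlideXhtmlSketch {

    /** Escapes the characters that are significant in XHTML text nodes. */
    static String escape(String text) {
        // '&' must be replaced first so the entities added below are not re-escaped.
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /**
     * Renders one slide as a <div>, independent of the source file format.
     * The "slide" class name is an assumption made for this example.
     */
    static String toXhtml(String title, List<String> paragraphs) {
        StringBuilder out = new StringBuilder();
        out.append("<div class=\"slide\">");
        out.append("<h1>").append(escape(title)).append("</h1>");
        for (String paragraph : paragraphs) {
            out.append("<p>").append(escape(paragraph)).append("</p>");
        }
        out.append("</div>");
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical slide content, standing in for text pulled out by a parser.
        System.out.println(toXhtml("Intro", List.of("Tika & POI")));
    }
}
```

Keeping the markup this plain is what lets the output stay comparable between 
formats; layout and styling from the original presentation are deliberately 
left out.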

Nick
