Hi,
The XHTML output of the HSLFExtractor parser is not real XHTML: it only inserts the full text into a single P[aragraph] tag (including carriage returns, which mean nothing in HTML). This behavior comes from the limited capabilities of the POI PowerPointExtractor.

Based on the PowerPointExtractor implementation, I rewrote the HSLFExtractor parser. The new implementation produces better XHTML, but it uses the org.apache.poi.hslf POI model directly.

- What is the philosophy of Tika parser implementations regarding their dependencies? I mean, must the HSLFExtractor contain only the minimal code needed to integrate the top-level POI API, or is it OK to do it the way I did?
- Are there conventions for the XHTML produced by the parsers: global formatting (i.e. a <div> per page, <h1> for headers) and related CSS classes?

I am quite new to tika-parsers, so any feedback on the rewritten HSLFExtractor will be appreciated.

Thanks,
Pablo QUEIXALOS
Developer
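
P.S. To make the second question concrete, here is a minimal sketch of the kind of output I have in mind: one <div> per slide and one <p> per paragraph, with the carriage returns used as paragraph boundaries instead of being copied into the text. It is not the rewritten HSLFExtractor and it uses neither Tika nor POI, only the SAX classes shipped with the JDK. The class name SlideXhtmlSketch, the method toXhtml and the "slide" CSS class are made up for illustration, and splitting on \r and \n is an assumption about how the raw slide text is delimited.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration only: not Tika code, not the actual parser.
final class SlideXhtmlSketch {

    /** Emits one div per slide and one p per non-empty line of slide text. */
    static String toXhtml(List<String> slideTexts) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        // Force XML serialization and skip the XML declaration for brevity.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl none = new AttributesImpl();
        AttributesImpl slideClass = new AttributesImpl();
        slideClass.addAttribute("", "class", "class", "CDATA", "slide");

        handler.startDocument();
        handler.startElement("", "html", "html", none);
        handler.startElement("", "body", "body", none);
        for (String slideText : slideTexts) {
            handler.startElement("", "div", "div", slideClass);
            // Assumption: paragraphs are separated by \r, \n or \r\n.
            for (String paragraph : slideText.split("\r\n|\r|\n")) {
                String text = paragraph.trim();
                if (text.isEmpty()) {
                    continue; // do not emit empty paragraphs
                }
                handler.startElement("", "p", "p", none);
                handler.characters(text.toCharArray(), 0, text.length());
                handler.endElement("", "p", "p");
            }
            handler.endElement("", "div", "div");
        }
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXhtml(Arrays.asList("Title\rFirst bullet", "Second slide")));
    }
}
```

If a convention like this already exists (element names, class names), I would rather follow it than invent my own.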
