HSLFExtractor & POI : Looking for better XHTML

Pablo Queixalos Thu, 22 Sep 2011 02:34:51 -0700

Hi,


 

The XHTML output of the HSLFExtractor parser is not properly structured XHTML: 
it only inserts the full text into a single <p> tag (including raw carriage 
returns, which carry no meaning in HTML). This behavior comes from the limited 
capabilities that the POI PowerPointExtractor offers.

 

Based on the PowerPointExtractor implementation, I rewrote the HSLFExtractor 
parser. The new implementation produces better XHTML, but it works directly 
against the org.apache.poi.hslf POI model.

 

-          What is the philosophy of Tika parser implementations regarding 
their dependencies? I mean, must the HSLFExtractor contain only the strict 
minimum of code needed to integrate the top-level POI API, or is it OK to do 
it the way I did?

 

-          Are there conventions for the XHTML produced by the parsers: global 
formatting (e.g., a <div> per page, <h1> for headers) and related CSS classes?
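
To make the second question concrete, here is a minimal sketch of the kind of 
structure meant by "a <div> per page, <h1> for headers". It is plain Java with 
no POI or Tika dependency. The class SlideXhtmlSketch, the Slide type, and the 
CSS class name "slide" are all hypothetical names made up for illustration; 
this is not the output of any existing Tika parser. It also assumes that 
paragraph breaks in the slide text arrive as carriage returns, line feeds, or 
vertical tabs.

```java
import java.util.List;

// Hypothetical illustration only: not a Tika or POI class.
final class SlideXhtmlSketch {

    /** Hypothetical stand-in for one slide: a title plus raw body text. */
    static final class Slide {
        final String title;
        final String body;

        Slide(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    /** Escapes the characters that are significant in XHTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders one div per slide, an h1 for the title, and one p per text line. */
    static String render(List<Slide> slides) {
        StringBuilder out = new StringBuilder();
        for (Slide slide : slides) {
            out.append("<div class=\"slide\">\n");
            if (slide.title != null && !slide.title.isEmpty()) {
                out.append("  <h1>").append(escape(slide.title)).append("</h1>\n");
            }
            // Assumption: CR, LF, CRLF, or vertical tab separate lines of text.
            String body = slide.body == null ? "" : slide.body;
            for (String line : body.split("\\r\\n|[\\r\\n\\x0B]")) {
                if (!line.trim().isEmpty()) {
                    out.append("  <p>").append(escape(line.trim())).append("</p>\n");
                }
            }
            out.append("</div>\n");
        }
        return out.toString();
    }
}
```

With this sketch, a slide titled "Agenda" whose body contains two lines 
separated by a carriage return would come out as one <div class="slide"> 
holding an <h1> and two <p> elements, instead of a single <p> with embedded 
carriage returns.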

 

 

I am quite new to tika-parsers, so any feedback on the rewritten HSLFExtractor 
will be appreciated.

 

 

Thanks,

 

 

Pablo QUEIXALOS

Developer 

 

 

79, rue du Faubourg Poissonnière

75009 Paris - France
