Better HTML parsing ------------------- Key: TIKA-138 URL: https://issues.apache.org/jira/browse/TIKA-138 Project: Tika Issue Type: Improvement Components: parser Reporter: julien nioche
The current parser used for HTML leaves code in the extracted text. For instance in the page http://implicitweb.blogspot.com/ the CSS section <style id='page-skin-1' type='text/css'><!-- /* * Blogger Template Style * * Sand Dollar * by Jason Sutter * Updated by Blogger Team *//* Variable definitions ==================== <Variable name="textcolor" description="Text Color" type="color" default="#000"><Variable name="bgcolor" description="Page Background Color" type="color" default="#f6f6f6"><Variable name="pagetitlecolor" description="Blog Title Color" type="color" default="#F5DEB3"><Variable name="pagetitlebgcolor" description="Blog Title Background Color" type="color" default="#DE7008"><Variable name="descriptionColor" description="Blog Description Color" type="color" default="#9E5205" /><Variable name="descbgcolor" description="Description Background Color" type="color" default="#F5E39e"><Variable name="titlecolor" description="Post Title Color" type="color" default="#9E5205"><Variable name="datecolor" description="Date Header Color" type="color" default="#777777"><Variable name="footercolor" description="Post Footer Color" .... is found in the extracted text. This is not the case when saving the same page as txt from Firefox or OpenOffice. J. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.