[ https://issues.apache.org/jira/browse/TIKA-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting updated TIKA-138:
-------------------------------

    Assignee: Jukka Zitting
     Summary: Ignore HTML style and script content  (was: Better HTML parsing)

Good point. As discussed recently on the mailing list, there are probably some cases where style and script content is useful for a Tika client, but by default the extracted text should match what is normally shown by a browser.

> Ignore HTML style and script content
> ------------------------------------
>
>                 Key: TIKA-138
>                 URL: https://issues.apache.org/jira/browse/TIKA-138
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: julien nioche
>            Assignee: Jukka Zitting
>
> The current parser used for HTML leaves code in the extracted text.
> For instance in the page http://implicitweb.blogspot.com/ the CSS section
> <style id='page-skin-1' type='text/css'><!--
> /*
> * Blogger Template Style
> *
> * Sand Dollar
> * by Jason Sutter
> * Updated by Blogger Team
> *//* Variable definitions
> ====================
> <Variable name="textcolor" description="Text Color"
> type="color" default="#000"><Variable name="bgcolor" description="Page
> Background Color"
> type="color" default="#f6f6f6"><Variable name="pagetitlecolor"
> description="Blog Title Color"
> type="color" default="#F5DEB3"><Variable name="pagetitlebgcolor"
> description="Blog Title Background Color"
> type="color" default="#DE7008"><Variable name="descriptionColor"
> description="Blog Description Color"
> type="color" default="#9E5205" /><Variable name="descbgcolor"
> description="Description Background Color"
> type="color" default="#F5E39e"><Variable name="titlecolor" description="Post
> Title Color"
> type="color" default="#9E5205"><Variable name="datecolor" description="Date
> Header Color"
> type="color" default="#777777"><Variable name="footercolor" description="Post
> Footer Color"
> ....
> is found in the extracted text. This is not the case when saving the same
> page as txt from Firefox or OpenOffice.
> J.
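
To illustrate the requested default behaviour (extracted text that matches what a browser shows), here is a minimal sketch using only Python's standard `html.parser` module. It is not Tika's implementation and makes no claim about how the Tika HTML parser works; the names `VisibleTextExtractor` and `extract_visible_text` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <style> or <script>.

    Hypothetical helper for illustration only; not part of Tika.
    """

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside <style>/<script>.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_visible_text(html):
    """Return the text of `html` with style and script content dropped."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_visible_text` on a page like the one in the report would drop the Blogger template CSS and keep the surrounding text. A client that does want style or script content would need a separate opt-in path, which this sketch does not attempt.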