[ https://issues.apache.org/jira/browse/TIKA-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche closed TIKA-138.
------------------------------


Thanks! That's great

> Ignore HTML style and script content
> ------------------------------------
>
>                 Key: TIKA-138
>                 URL: https://issues.apache.org/jira/browse/TIKA-138
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: julien nioche
>            Assignee: Jukka Zitting
>             Fix For: 0.2-incubating
>
>
> The current HTML parser leaves style and script code in the extracted text.
> For instance, in the page http://implicitweb.blogspot.com/, the CSS section
> <style id='page-skin-1' type='text/css'><!--
> /*
> * Blogger Template Style
> *
> * Sand Dollar
> * by Jason Sutter
> * Updated by Blogger Team
> *//* Variable definitions
> ====================
> <Variable name="textcolor" description="Text Color"
> type="color" default="#000"><Variable name="bgcolor" description="Page Background Color"
> type="color" default="#f6f6f6"><Variable name="pagetitlecolor" description="Blog Title Color"
> type="color" default="#F5DEB3"><Variable name="pagetitlebgcolor" description="Blog Title Background Color"
> type="color" default="#DE7008"><Variable name="descriptionColor" description="Blog Description Color"
> type="color" default="#9E5205" /><Variable name="descbgcolor" description="Description Background Color"
> type="color" default="#F5E39e"><Variable name="titlecolor" description="Post Title Color"
> type="color" default="#9E5205"><Variable name="datecolor" description="Date Header Color"
> type="color" default="#777777"><Variable name="footercolor" description="Post Footer Color"
> ....
> is found in the extracted text. This is not the case when saving the same 
> page as txt from Firefox or OpenOffice.
> J.
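
For readers of the archive, the behaviour requested above can be sketched
roughly as follows. This is only an illustration of the general idea (drop
character data while inside <style> or <script>), written in Python with the
standard html.parser module. It is not Tika's code and says nothing about how
the fix was actually implemented. The names VisibleTextExtractor and
extract_text are made up for this example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping <style> and <script> content."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep character data only when we are outside style/script.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the text of an HTML string without style or script content."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Feeding it a page whose <head> contains a CSS block like the one quoted above
should yield only the body text, which is roughly what saving the page as txt
from a browser gives.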

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
