Better HTML parsing
-------------------

                 Key: TIKA-138
                 URL: https://issues.apache.org/jira/browse/TIKA-138
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: julien nioche


The current HTML parser leaves non-content code, such as embedded CSS, in the extracted text.

For instance, for the page http://implicitweb.blogspot.com/, the CSS section

<style id='page-skin-1' type='text/css'><!--
/*
* Blogger Template Style
*
* Sand Dollar
* by Jason Sutter
* Updated by Blogger Team
*//* Variable definitions
====================
<Variable name="textcolor" description="Text Color"
type="color" default="#000"><Variable name="bgcolor" description="Page 
Background Color"
type="color" default="#f6f6f6"><Variable name="pagetitlecolor" 
description="Blog Title Color"
type="color" default="#F5DEB3"><Variable name="pagetitlebgcolor" 
description="Blog Title Background Color"
type="color" default="#DE7008"><Variable name="descriptionColor" 
description="Blog Description Color"
type="color" default="#9E5205" /><Variable name="descbgcolor" 
description="Description Background Color"
type="color" default="#F5E39e"><Variable name="titlecolor" description="Post 
Title Color"
type="color" default="#9E5205"><Variable name="datecolor" description="Date 
Header Color"
type="color" default="#777777"><Variable name="footercolor" description="Post 
Footer Color"
....

is found in the extracted text. This is not the case when the same page is
saved as plain text from Firefox or OpenOffice.
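
To illustrate the expected behaviour, here is a minimal sketch of a text extractor that drops the contents of <style> and <script> elements. It is not Tika's parser and not a proposed patch: it uses Python's standard html.parser module only to show the idea, and the names VisibleTextExtractor and extract_text are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects character data, ignoring anything inside <style> or <script>.

    Illustrative only: the class name and the set of skipped elements are
    assumptions, not part of Tika.
    """

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments kept for the output

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string without CSS or script bodies."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With something along these lines, a stylesheet wrapped in an HTML comment, as in the Blogger template above, would be left out of the output, while headings and paragraph text would be kept.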

J.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
