Edwin Yeo Zheng Lin created TIKA-2814:
-----------------------------------------

             Summary: Extracted content of EML file contains words like 
"FONT-SIZE: 9pt; FONT-FAMILY: arial"
                 Key: TIKA-2814
                 URL: https://issues.apache.org/jira/browse/TIKA-2814
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.18, 1.17
         Environment: Source code in MailContentHandler.java, 
handleInlineBodyPart() function
            Reporter: Edwin Yeo Zheng Lin


When we index an EML file, Tika gives priority to the text/html body part. 
However, the extracted content contains a lot of strings like 
"*FONT-SIZE: 9pt; FONT-FAMILY: arial*", and Tika does not remove any of them, 
which makes the content very cluttered and unreadable.

 
This is the content that Tika outputs after extraction:
{{ "content":" font-size: 14pt; font-family: book antiqua, palatino, serif; Hi 
There, <br><br> font-size: 14pt; font-family: book antiqua, palatino, serif; My 
client owns the domain name “ font-size: 14pt; color: #0000ff; font-family: 
arial black, sans-serif; TravelInsuranceEurope.com font-size: 14pt; 
font-family: book antiqua, palatino, serif; ” and is considering putting it in 
market. It is keyword rich domain with good search volume,adword bidding and 
type-in-traffic. <br><br> font-size: 14pt; font-family: book antiqua, palatino, 
serif; Based on our extensive study, we strongly feel that you should consider 
buying this domain name to improve the SEO, Online visibility, brand image, 
authority and type-in-traffic for your business. We also do provide free 1 year 
hosting and unlimited emails along with domain name. <br><br> font-size: 14pt; 
font-family: book antiqua, palatino, serif; Besides this, if you need any other 
domain name, web and app designing services and digital marketing services 
(SEO, PPC and SMO) at reasonable charges, feel free to contact us. <br><br> 
font-size: 14pt; font-family: book antiqua, palatino, serif; Best Regards, 
<br><br> font-size: 14pt; font-family: book antiqua, palatino, serif; Josh 
<br><br>"}}

 

In MailContentHandler.java, the handleInlineBodyPart() function uses 
HtmlParser.class for MediaType.TEXT_HTML. However, this parser does not remove 
strings like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", so they all end up in the 
content. We should fix HtmlParser so that it removes these style declarations 
and the content is readable after extraction.
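
To make the expected result concrete, here is a minimal sketch of extracting 
only the text nodes of an HTML body, so that inline style attributes never 
reach the output. It uses the JDK's built-in Swing HTML parser as a stand-in 
and is not Tika's HtmlParser or a proposed patch; the class name StyleFreeText 
and the sample markup are hypothetical, and I am assuming the CSS arrives as 
style attributes on the elements.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical illustration only: shows the desired output, not Tika internals.
public class StyleFreeText {

    // Collects text nodes and ignores every tag and attribute,
    // including style="font-size: ...; font-family: ...".
    public static String extract(String html) throws IOException {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (out.length() > 0) {
                    out.append(' '); // separate adjacent text runs
                }
                out.append(data);
            }
        };
        // true = ignore any charset declared inside the document
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample markup assumed for illustration, modeled on the output above.
        String html = "<html><body>"
                + "<p style=\"font-size: 14pt; font-family: arial\">Hi There,</p>"
                + "<p style=\"font-size: 14pt\">Best Regards,</p>"
                + "</body></html>";
        System.out.println(extract(html));
    }
}
```

With a parser that treats the style information as attributes, the extracted 
content should contain the greeting and the sign-off but none of the 
font-size or font-family declarations.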



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
