[ 
https://issues.apache.org/jira/browse/TIKA-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743564#comment-16743564
 ] 

Edwin Yeo Zheng Lin commented on TIKA-2814:
-------------------------------------------

I have uploaded a sample EML file here: 
[https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing]

> Extracted content of EML file contains words like "FONT-SIZE: 9pt; 
> FONT-FAMILY: arial"
> --------------------------------------------------------------------------------------
>
>                 Key: TIKA-2814
>                 URL: https://issues.apache.org/jira/browse/TIKA-2814
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17, 1.18
>         Environment: Source code in MailContentHandler.java, 
> handleInlineBodyPart() function
>            Reporter: Edwin Yeo Zheng Lin
>            Priority: Major
>              Labels: eml, extraction, parser
>
> When we index an EML file, Tika's priority setting selects the text/html 
> part. However, the extracted content contains a lot of text like 
> "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", which Tika does not remove, and 
> this makes the content very cluttered and unreadable.
>  
>  This is the content output after extraction by Tika:
> {{ \{{ "content":" font-size: 14pt; font-family: book antiqua, palatino, 
> serif; Hi There, <br><br> font-size: 14pt; font-family: book antiqua, 
> palatino, serif; My client owns the domain name “ font-size: 14pt; color: 
> #0000ff; font-family: arial black, sans-serif; TravelInsuranceEurope.com 
> font-size: 14pt; font-family: book antiqua, palatino, serif; ” and is 
> considering putting it in market. It is keyword rich domain with good search 
> volume,adword bidding and type-in-traffic. <br><br> font-size: 14pt; 
> font-family: book antiqua, palatino, serif; Based on our extensive study, we 
> strongly feel that you should consider buying this domain name to improve the 
> SEO, Online visibility, brand image, authority and type-in-traffic for your 
> business. We also do provide free 1 year hosting and unlimited emails along 
> with domain name. <br><br> font-size: 14pt; font-family: book antiqua, 
> palatino, serif; Besides this, if you need any other domain name, web and app 
> designing services and digital marketing services (SEO, PPC and SMO) at 
> reasonable charges, feel free to contact us. <br><br> font-size: 14pt; 
> font-family: book antiqua, palatino, serif; Best Regards, <br><br> font-size: 
> 14pt; font-family: book antiqua, palatino, serif; Josh <br><br>"}}}}
>  
> In the MailContentHandler.java code, the function handleInlineBodyPart() 
> uses HtmlParser.class for MediaType.TEXT_HTML. However, this parser does 
> not remove text such as "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", so all of 
> it is output to the content. We should fix HtmlParser so that it removes 
> this style text and the content is readable after extraction.
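
Until the parser itself is fixed, the leaked style text could be stripped from the extracted string as a post-processing step. The following is only a minimal sketch of that idea, not Tika's API and not a proposed patch: the class name StyleTextStripper and its strip() method are hypothetical, and the pattern assumes the leaked text is limited to the font-size, font-family and color declarations seen in the sample output, each terminated by a semicolon.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tika): removes inline CSS declarations
// that leaked into already-extracted text.
final class StyleTextStripper {

    // Assumption: only these three properties leak, and each declaration ends
    // with ';'. The lookbehind avoids matching the tail of longer property
    // names such as "background-color". Excluding '<' and '>' keeps a match
    // from running across markup like <br>.
    private static final Pattern STYLE_DECLARATION = Pattern.compile(
        "(?i)(?<![\\w-])(?:font-size|font-family|color)\\s*:\\s*[^;<>]*;\\s*");

    private StyleTextStripper() {}

    // Returns the content with every matching declaration removed.
    static String strip(String content) {
        return STYLE_DECLARATION.matcher(content).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String extracted = "font-size: 14pt; font-family: book antiqua, palatino, serif; "
            + "Hi There, <br><br> font-size: 14pt; color: #0000ff; "
            + "font-family: arial black, sans-serif; TravelInsuranceEurope.com";
        System.out.println(strip(extracted));
        // prints: Hi There, <br><br> TravelInsuranceEurope.com
    }
}
```

A regex of this kind can also delete legitimate body text that happens to look like a declaration (for example "color: red;" in a sentence), so it is a stopgap at best. The proper fix remains in the HTML parsing step.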



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
