[ https://issues.apache.org/jira/browse/TIKA-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743564#comment-16743564 ]
Edwin Yeo Zheng Lin commented on TIKA-2814:
-------------------------------------------

I have uploaded a sample EML file here: [https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing]

> Extracted content of EML file contains words like "FONT-SIZE: 9pt; FONT-FAMILY: arial"
> --------------------------------------------------------------------------------------
>
>                 Key: TIKA-2814
>                 URL: https://issues.apache.org/jira/browse/TIKA-2814
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17, 1.18
>         Environment: Source code in MailContentHandler.java, handleInlineBodyPart() function
>            Reporter: Edwin Yeo Zheng Lin
>            Priority: Major
>              Labels: eml, extraction, parser
>
> When we index an EML file, Tika gives priority to the text/html part.
> However, the extracted content contains a lot of CSS declarations such as
> "*FONT-SIZE: 9pt; FONT-FAMILY: arial*". Tika does not remove any of them,
> which makes the content very cluttered and unreadable.
>
> This is what is output in the content after being extracted by Tika:
> {{ \{{ "content":" font-size: 14pt; font-family: book antiqua, palatino,
> serif; Hi There, <br><br> font-size: 14pt; font-family: book antiqua,
> palatino, serif; My client owns the domain name “ font-size: 14pt; color:
> #0000ff; font-family: arial black, sans-serif; TravelInsuranceEurope.com
> font-size: 14pt; font-family: book antiqua, palatino, serif; ” and is
> considering putting it in market. It is keyword rich domain with good search
> volume,adword bidding and type-in-traffic. <br><br> font-size: 14pt;
> font-family: book antiqua, palatino, serif; Based on our extensive study, we
> strongly feel that you should consider buying this domain name to improve the
> SEO, Online visibility, brand image, authority and type-in-traffic for your
> business. We also do provide free 1 year hosting and unlimited emails along
> with domain name. <br><br> font-size: 14pt; font-family: book antiqua,
> palatino, serif; Besides this, if you need any other domain name, web and app
> designing services and digital marketing services (SEO, PPC and SMO) at
> reasonable charges, feel free to contact us. <br><br> font-size: 14pt;
> font-family: book antiqua, palatino, serif; Best Regards, <br><br> font-size:
> 14pt; font-family: book antiqua, palatino, serif; Josh <br><br>"}}}}
>
> In MailContentHandler.java, the handleInlineBodyPart() function uses
> HtmlParser.class for MediaType.TEXT_HTML. However, this parser does not
> remove "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", so all of these declarations
> end up in the content. We should fix HtmlParser so that it removes this
> styling text and the content is readable after extraction.
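
Until the parser itself is fixed, one possible stopgap is to clean the extracted string after Tika returns it. The following is only a minimal sketch using the Java standard library. It is not Tika code and not the proposed fix. The class name `StyleResidueCleaner`, the method `clean`, and the short list of CSS property names are all assumptions made for illustration, based on the properties visible in the sample output above. Because it is a regex heuristic, it can also remove legitimate text that happens to look like a declaration (for example "color: red;" in prose).

```java
import java.util.regex.Pattern;

/** Heuristic cleanup of CSS residue in already-extracted text (hypothetical helper, not part of Tika). */
final class StyleResidueCleaner {

    // Matches one leaked declaration such as "font-size: 14pt; " for a small,
    // assumed set of property names. The value is capped at 80 characters so a
    // stray "color:" in prose cannot swallow a whole paragraph.
    private static final Pattern CSS_DECLARATION = Pattern.compile(
            "(?i)\\b(?:font-size|font-family|color)\\s*:\\s*[^;<>\\n]{1,80};\\s*");

    // Matches a literal <br> or <br/> together with the whitespace around it.
    private static final Pattern LINE_BREAK = Pattern.compile("(?i)\\s*<br\\s*/?>\\s*");

    private StyleResidueCleaner() {}

    /** Removes leaked CSS declarations, then turns literal <br> markers into newlines. */
    static String clean(String extracted) {
        String withoutCss = CSS_DECLARATION.matcher(extracted).replaceAll("");
        String withNewlines = LINE_BREAK.matcher(withoutCss).replaceAll("\n");
        return withNewlines.trim();
    }

    public static void main(String[] args) {
        // Shortened excerpt of the extracted content quoted in the issue.
        String sample = "font-size: 14pt; font-family: book antiqua, palatino, serif; Hi There, <br><br> "
                + "font-size: 14pt; font-family: book antiqua, palatino, serif; Best Regards, <br><br>";
        System.out.println(clean(sample));
    }
}
```

The cleaner would be applied to the value of the "content" field after extraction. It only hides the symptom, so the underlying HtmlParser behavior would still need to be addressed.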