Zheng Lin Edwin Yeo Sun, 13 Jan 2019 21:21:12 -0800

Hi,

I have uploaded a sample EML file here:
https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing


This is what is indexed in the content:

        "content":"  font-size: 14pt; font-family: book antiqua,
palatino, serif;  Hi There,   <br><br> font-size: 14pt; font-family:
book antiqua, palatino, serif;  My client owns the domain name “
font-size: 14pt; color: #0000ff; font-family: arial black, sans-serif;
 TravelInsuranceEurope.com   font-size: 14pt; font-family: book
antiqua, palatino, serif;  ” and is considering putting it in market.
It is keyword rich domain with good search volume,adword bidding and
type-in-traffic.   <br><br> font-size: 14pt; font-family: book
antiqua, palatino, serif;  Based on our extensive study, we strongly
feel that you should consider buying this domain name to improve the
SEO, Online visibility, brand image, authority and type-in-traffic for
your business. We also do provide free 1 year hosting and unlimited
emails along with domain name.   <br><br> font-size: 14pt;
font-family: book antiqua, palatino, serif;  Besides this, if you need
any other domain name, web and app designing services and digital
marketing services (SEO, PPC and SMO) at reasonable charges, feel free
to contact us.   <br><br> font-size: 14pt; font-family: book antiqua,
palatino, serif;  Best Regards,   <br><br> font-size: 14pt;
font-family: book antiqua, palatino, serif;  Josh   <br><br>",


As you can see, this is taken from the Content-Type: text/html part.
However, the Content-Type: text/plain part looks clean, and that is the
part we want indexed.

How can we configure Tika in Solr to prioritize the content from the
Content-Type: text/plain part over the Content-Type: text/html part?
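
To illustrate the behaviour we are after, here is a minimal sketch of a
pre-processing step done outside Tika, using only Python's standard email
module. It is not a Tika or Solr setting. It assumes the EML can be read
before it is sent to Solr, and the function name and the sample message
are made up for illustration.

```python
from email import message_from_bytes, policy


def preferred_text(raw_eml: bytes) -> str:
    """Return the text/plain body of an EML message, else the text/html body."""
    # policy.default makes the parser return an EmailMessage, which has get_body().
    msg = message_from_bytes(raw_eml, policy=policy.default)
    # get_body() walks multipart/alternative and picks the best match in
    # preference order; plain text is listed first, HTML is the fallback.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    return part.get_content()


# Hypothetical two-part message standing in for the uploaded sample EML.
SAMPLE = (
    b"From: sender@example.com\r\n"
    b"Subject: Domain name\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"\r\n"
    b"Hi There,\r\n"
    b"--b1\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"\r\n"
    b'<span style="font-size: 14pt;">Hi There,</span>\r\n'
    b"--b1--\r\n"
)

if __name__ == "__main__":
    print(preferred_text(SAMPLE))
```

The returned string could then be indexed as the content field in place of
the Tika-extracted HTML text.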

On Mon, 14 Jan 2019 at 11:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Hi,
>
> I am using Solr 7.5.0 with Tika 1.18.
>
> Currently I am facing a situation during the indexing of EML files,
> whereby the content is being extracted from the Content-type=text/html
> instead of Content-type=text/plain.
>
> The problem with Content-type=text/html is that it contains a lot of words
> like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and all of
> these get indexed in Solr as well, which makes the content very cluttered.
> It also affects search: when we search for words like "font", all
> the content gets returned because of this.
>
> I would like to enquire about the following:
> 1. Why didn't Tika get the text part (text/plain)? Is there any way to
> configure Tika in Solr to change the priority so that it gets the text
> part (text/plain) instead of the HTML part (text/html)?
> 2. If that is not possible, how can we get clean content when Tika
> extracts the text? As you can see, the current content is not clean.
>
> Regards,
> Edwin
>

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain
