
Zheng Lin Edwin Yeo Sun, 13 Jan 2019 19:19:39 -0800

Hi,

I am using Solr 7.5.0 with Tika 1.18.


Currently, when indexing EML files, the content is being extracted from the
Content-type=text/html part instead of the Content-type=text/plain part.

The problem with Content-type=text/html is that the extracted content
contains a lot of strings like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", and
all of these get indexed in Solr as well. This makes the content very
cluttered and also affects search: when we search for a word like "font",
all of these documents are returned because of it.

I would like to ask the following:
1. Why doesn't Tika pick the text part (text/plain)? Is there any way to
configure Tika in Solr to prefer the text part (text/plain) over the HTML
part (text/html)?
2. If that is not possible, how can we get clean content when Tika extracts
the text? As it stands, the content extracted from the HTML part is not clean.
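
To illustrate the behaviour I am hoping for, here is a rough sketch in
Python using only the standard library. It is not Solr or Tika
configuration, and it is not how Tika works internally. The names
extract_body and html_to_text are made up for this example. It prefers the
text/plain part of the message and, only if there is none, falls back to
the text/html part with the tags and <style>/<script> bodies removed.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects visible text, skipping the bodies of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # > 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        # Inline style="..." attributes arrive in attrs and are simply ignored.
        if tag in ("style", "script"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: reduce an HTML string to its visible text."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def extract_body(raw_eml):
    """Hypothetical helper: return text/plain if present, else cleaned HTML."""
    msg = message_from_bytes(raw_eml, policy=policy.default)
    # Ask for the plain part first; HTML is only the fallback.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    text = body.get_content()
    if body.get_content_type() == "text/html":
        text = html_to_text(text)
    return text.strip()


if __name__ == "__main__":
    # Build a small multipart/alternative message like the EML files above.
    sample = EmailMessage()
    sample["Subject"] = "demo"
    sample.set_content("Hello from the plain part")
    sample.add_alternative(
        '<html><body><p style="FONT-SIZE: 9pt; FONT-FAMILY: arial">'
        "Hello from the HTML part</p></body></html>",
        subtype="html",
    )
    print(extract_body(sample.as_bytes()))
```

If something equivalent can be configured in Tika or in the Solr extract
handler, that would solve both questions above.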

Regards,
Edwin

Content from EML files indexing from text/html (which is not clean) instead of text/plain
