
Zheng Lin Edwin Yeo Wed, 16 Jan 2019 23:15:52 -0800

Based on the discussion on the Tika mailing list and in Jira (TIKA-2814),
the issue appears to be with Solr's ExtractingRequestHandler: either the
HTMLParser is not being applied, or it is somehow not stripping the content
of <span/> elements. The standalone Tika app handles the same file
correctly.
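
For anyone who ends up doing the parsing outside Solr, here is a minimal
sketch of the "prefer text/plain" idea. It uses Python's standard email
library rather than Tika, so it only illustrates the part-selection logic and
is not what Tika or the ExtractingRequestHandler does internally. The function
names and the sample message are made up for illustration, and sending the
extracted text to Solr is left out.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_preferred_text(raw_eml: bytes) -> str:
    """Return the text/plain body if there is one, else the text/html body."""
    # policy.default makes the parser return an EmailMessage, which has get_body().
    msg = message_from_bytes(raw_eml, policy=policy.default)
    # get_body() looks through multipart/alternative and returns the first
    # part that matches the preference order given here.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    return body.get_content()


def build_sample_eml() -> bytes:
    """Build a small two-part message (hypothetical data, for the demo only)."""
    msg = EmailMessage()
    msg["Subject"] = "sample"
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    msg.set_content("Hi There,\nBest Regards,\nJosh\n")
    msg.add_alternative(
        '<span style="font-size: 14pt;">Hi There,</span><br><br>',
        subtype="html",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    print(extract_preferred_text(build_sample_eml()))
```

If a message only has a text/html part, this falls back to the raw HTML, which
would still need to be cleaned before indexing.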


Regards,
Edwin

On Tue, 15 Jan 2019 at 10:56, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Hi Alex,
>
> Thanks for the suggestions.
> Yes, I have posted it in the Tika mailing list too.
>
> Regards,
> Edwin
>
> On Mon, 14 Jan 2019 at 21:16, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
>> I think asking this question on the Tika mailing list may give you
>> better answers. Then, if the conclusion is that the behavior is
>> configurable, you can see how to do it in Solr. It may be, however, that
>> you need to do the parsing outside of Solr with standalone Tika. Running
>> Tika standalone is the recommended setup for production anyway.
>>
>> I would suggest the title be something like "How to prefer the
>> text/plain part of an email message when parsing .eml files".
>>
>> Regards,
>>   Alex.
>>
>> On Mon, 14 Jan 2019 at 00:20, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> wrote:
>> >
>> > Hi,
>> >
>> > I have uploaded a sample EML file here:
>> >
>> > https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
>> >
>> > This is what is indexed in the content:
>> >
>> >         "content":"  font-size: 14pt; font-family: book antiqua,
>> > palatino, serif;  Hi There,   <br><br> font-size: 14pt; font-family:
>> > book antiqua, palatino, serif;  My client owns the domain name “
>> > font-size: 14pt; color: #0000ff; font-family: arial black, sans-serif;
>> >  TravelInsuranceEurope.com   font-size: 14pt; font-family: book
>> > antiqua, palatino, serif;  ” and is considering putting it in market.
>> > It is keyword rich domain with good search volume,adword bidding and
>> > type-in-traffic.   <br><br> font-size: 14pt; font-family: book
>> > antiqua, palatino, serif;  Based on our extensive study, we strongly
>> > feel that you should consider buying this domain name to improve the
>> > SEO, Online visibility, brand image, authority and type-in-traffic for
>> > your business. We also do provide free 1 year hosting and unlimited
>> > emails along with domain name.   <br><br> font-size: 14pt;
>> > font-family: book antiqua, palatino, serif;  Besides this, if you need
>> > any other domain name, web and app designing services and digital
>> > marketing services (SEO, PPC and SMO) at reasonable charges, feel free
>> > to contact us.   <br><br> font-size: 14pt; font-family: book antiqua,
>> > palatino, serif;  Best Regards,   <br><br> font-size: 14pt;
>> > font-family: book antiqua, palatino, serif;  Josh   <br><br>",
>> >
>> >
>> > As you can see, this is taken from the Content-Type: text/html part.
>> > However, the Content-Type: text/plain part looks clean, and that is
>> > what we want to be indexed.
>> >
>> > How can we configure Tika in Solr to prefer the content from
>> > Content-Type: text/plain instead of Content-Type: text/html?
>> >
>> > On Mon, 14 Jan 2019 at 11:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > I am using Solr 7.5.0 with Tika 1.18.
>> > >
>> > > Currently I am facing a situation during the indexing of EML files,
>> > > whereby the content is being extracted from the Content-type=text/html
>> > > instead of Content-type=text/plain.
>> > >
>> > > The problem with Content-type=text/html is that it contains a lot of
>> > > words like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and
>> > > all of these get indexed in Solr as well. This makes the content very
>> > > cluttered and also affects search: when we search for words like
>> > > "font", all the documents are returned because of this.
>> > >
>> > > I would like to ask the following:
>> > > 1. Why didn't Tika pick the text part (text/plain)? Is there any way
>> > > to configure Tika in Solr to prefer the text part (text/plain) over
>> > > the HTML part (text/html)?
>> > > 2. If that is not possible, the extracted content is, as you can see,
>> > > not clean. How can we get clean content when Tika is extracting text?
>> > >
>> > > Regards,
>> > > Edwin
>> > >
>>
>

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain
