Thanks for your reply.
What I have found is that the EML file contains two Content-Type parts: one is
text/html, and the other is text/plain.
The text/html part contains strings like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in its
content, but the text/plain part has no such strings, and its content is
clean
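
For anyone who wants to check this outside of Solr, here is a minimal sketch using Python's standard email package. It lists the leaf parts of a message and pulls out only the text/plain alternative. The sample message, the addresses, and the helper names are made up for illustration; a real EML file may be nested differently or have no text/plain part at all.

```python
from email import message_from_string, policy

# Hypothetical two-part message, mimicking the structure described above.
SAMPLE_EML = """\
From: sender@example.com
To: receiver@example.com
Subject: Sample
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="utf-8"

Hello, this is the clean text.
--BOUNDARY
Content-Type: text/html; charset="utf-8"

<html><body><span style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Hello, this is the clean text.</span></body></html>
--BOUNDARY--
"""


def leaf_content_types(raw_eml: str) -> list:
    """Return the content type of every non-container part."""
    msg = message_from_string(raw_eml, policy=policy.default)
    return [p.get_content_type() for p in msg.walk() if not p.is_multipart()]


def plain_text_body(raw_eml: str) -> str:
    """Return the text/plain alternative, or '' if the message has none."""
    msg = message_from_string(raw_eml, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content().strip() if part is not None else ""


if __name__ == "__main__":
    print(leaf_content_types(SAMPLE_EML))
    print(plain_text_body(SAMPLE_EML))
```

If only the text/plain part is sent for indexing, the style strings never reach the index. Whether that is acceptable depends on whether all of your emails carry a usable plain-text alternative.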
Although Vincenzo and Alexandre's suggestions may be helpful in the right
circumstances, there is a continuum of answers to the original question
here. This continuum is mostly relevant if indexing and querying are likely
to happen simultaneously or the data volume is large enough relative to the
se
Perhaps https://royvanrijn.com/blog/2016/03/java-mail-message-as-download/
may be helpful? Though I see the date on it and am now unsure. -- H
On Mon, 31 Dec 2018 at 17:51, Zheng Lin Edwin Yeo
wrote:
> Hi Alex,
>
> I have tried with a file that is HTML formatted, with those tags like
> , , , etc
Hi Alex,
I have tried with a file that is HTML formatted, with those tags like
, , , etc., and those get removed during indexing.
For strings like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", I found that in the
EML file, there are two different content types, text/html and text/plain.
Could it be due to
EML is for emails, so there are probably some HTML-formatted emails
that you are getting. Probably with the alternative text-part. Outlook
would render HTML and/or use text part. I think you can just open EML
in an editor to check it out.
As to URP, are you absolutely sure it is being used? It is
These texts are likely from the original EML file data, but they are not
visible in the content when the EML file is opened in Microsoft Outlook.
I have already applied the HTMLStripFieldUpdateProcessorFactory in
solrconfig.xml, but these texts are still showing up in the index. Below is
my config
Specifically, a custom Update Request Processor chain can be used before
indexing, probably with HTMLStripFieldUpdateProcessorFactory.
Regards,
Alex
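
A custom chain only runs if it is marked as the default chain or if the request selects it, for example through the update.chain request parameter. A small sketch of building such a request URL follows. The core name, chain name, document id, and the use of the /update/extract handler are all assumptions for illustration, not taken from the poster's setup.

```python
from urllib.parse import urlencode


def extract_url(base_url: str, core: str, chain: str, doc_id: str) -> str:
    """Build an extract-handler URL that names the processor chain explicitly."""
    params = {
        "update.chain": chain,   # hypothetical chain name from solrconfig.xml
        "literal.id": doc_id,    # hypothetical document id
        "commit": "true",
    }
    return f"{base_url}/solr/{core}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # All four argument values are placeholders.
    print(extract_url("http://localhost:8983", "mycore", "strip-html", "mail-1"))
```

If the index looks the same with and without the parameter, that suggests the chain is either already the default or not doing what is expected on that field.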
On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore wrote:
> Hi,
>
> I think this kind of text manipulation should be done before indexing, if
> you have fon
Hi,
I think this kind of text manipulation should be done before indexing. If you
have font-size and font-family in your text, very likely you’re indexing HTML
with CSS.
If I’m right, you’re heading into a mess of words that will have to be removed
from your text.
On the other hand, if you have
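
As a rough illustration of doing that cleanup before indexing, here is a minimal sketch using Python's standard html.parser. It drops the contents of style and script elements along with all tags and attributes, so CSS declarations never reach the text. The class and function names are made up for this example, and it assumes the CSS lives in style elements or style attributes; it is not a full HTML sanitizer.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, ignoring tags and <style>/<script> contents."""

    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Attributes (including inline style="...") are simply never stored.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><style>p { FONT-SIZE: 9pt; FONT-FAMILY: arial }</style></head>"
        '<body><p style="FONT-SIZE: 9pt">Hello <b>world</b></p></body></html>'
    )
    print(html_to_text(sample))
```

Joining chunks with spaces keeps adjacent block elements from running together, at the cost of occasionally splitting a word that has markup in the middle of it.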
Hi,
I noticed that during the indexing of EML files, words like
"*FONT-SIZE: 9pt; FONT-FAMILY: arial*" are being indexed into the content as well.
How can we remove those words during indexing?
I am using Solr 7.5.0
Regards,
Edwin