Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2019-01-11 Thread Zheng Lin Edwin Yeo
Thanks for your reply. What I have found is that in the EML file, there are 2 Content-Type parts: one is text/html, and the other is text/plain. The text/html part has words like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, but the text/plain part has no such words, and the content is clean
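
A minimal sketch of how the clean text/plain alternative could be pulled out of an EML file before the content is sent for indexing, using only Python's standard `email` package. The function name `plain_text_body` and the sample message are hypothetical and not from the thread; this is a client-side preprocessing step, not something Solr does on its own.

```python
from email import message_from_string, policy

# Hypothetical sample: a multipart/alternative message with both parts,
# mirroring the text/plain + text/html layout described above.
SAMPLE_EML = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "Hello world\n"
    "--XYZ\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "\n"
    '<p style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Hello world</p>\n'
    "--XYZ--\n"
)


def plain_text_body(raw_eml: str) -> str:
    """Return the text/plain body of an EML message, or "" if there is none."""
    # policy.default yields an EmailMessage, which provides get_body().
    msg = message_from_string(raw_eml, policy=policy.default)
    # Only accept the plain-text alternative; ignore the HTML part.
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


if __name__ == "__main__":
    print(plain_text_body(SAMPLE_EML))
```

If a message has no text/plain part, this sketch returns an empty string, so a caller would still need a fallback for HTML-only emails.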

Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2019-01-01 Thread Gus Heck
Although Vincenzo and Alexandre's suggestions may be helpful in the right circumstances, there is a continuum of answers to the original question here. This continuum is mostly relevant if indexing and querying are likely to happen simultaneously or the data volume is large enough relative to the se

Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-31 Thread Hasan Diwan
Perhaps https://royvanrijn.com/blog/2016/03/java-mail-message-as-download/ may be helpful? Though I see the date on it and am now unsure. -- H On Mon, 31 Dec 2018 at 17:51, Zheng Lin Edwin Yeo wrote: > Hi Alex, > > I have tried with a file that is HTML formatted, with those tags like > , , , etc

Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-31 Thread Zheng Lin Edwin Yeo
Hi Alex, I have tried with a file that is HTML formatted, with those tags like , , , etc, and those get removed during indexing. For tags like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", I found that in the EML file, there are two different content types, text/html and text/plain. Could it be due to

Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-31 Thread Alexandre Rafalovitch
EML is for emails, so there are probably some HTML-formatted emails that you are getting. Probably with the alternative text-part. Outlook would render HTML and/or use text part. I think you can just open EML in an editor to check it out. As to URP, are you absolutely sure it is being used? It is

Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-30 Thread Zheng Lin Edwin Yeo
These texts are likely from the original EML file data, but they are not visible in the content when the EML file is opened in Microsoft Outlook. I have already applied the HTMLStripFieldUpdateProcessorFactory in solrconfig.xml, but these texts are still showing up in the index. Below is my config

Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-30 Thread Alexandre Rafalovitch
Specifically, a custom Update Request Processor chain can be used before indexing. Probably with HTMLStripFieldUpdateProcessorFactory. Regards, Alex On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore Hi, > > I think this kind of text manipulation should be done before indexing, if > you have fon

Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-30 Thread Vincenzo D'Amore
Hi, I think this kind of text manipulation should be done before indexing. If you have font-size and font-family in your text, very likely you’re indexing HTML with CSS. If I’m right, you’re just entering a hell of words that should be removed from your text. On the other hand, if you have
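
As a rough illustration of doing this text manipulation before indexing, the sketch below strips tags, inline `style` attributes, and the contents of `<style>`/`<script>` blocks using Python's standard `html.parser`. The names `TextExtractor` and `strip_html` are hypothetical, and this is one possible client-side approach under those assumptions, not the method anyone in the thread used.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <style> and <script>."""

    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments found outside skipped elements

    def handle_starttag(self, tag, attrs):
        # Attributes (including style="FONT-SIZE: ...") are simply ignored.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join fragments with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML string with CSS and tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><style>p { FONT-SIZE: 9pt; FONT-FAMILY: arial }</style></head>"
        '<body><p style="FONT-SIZE: 9pt">Hello <b>world</b></p></body></html>'
    )
    print(strip_html(sample))
```

Joining fragments with a space keeps adjacent paragraphs from running together, at the cost of splitting a word that has markup in the middle of it.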

Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-30 Thread Zheng Lin Edwin Yeo
Hi, I noticed that during the indexing of EML files, words like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" are being indexed into the content as well. I would like to check how we can remove those words during indexing. I am using Solr 7.5.0. Regards, Edwin