EML is an email format, so you are probably getting some HTML-formatted
emails, likely with an alternative plain-text part. Outlook would render
the HTML and/or use the text part, so the styling would not show up
there. I think you can just open the EML file in a text editor to check.
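
If an editor is awkward for many files, here is a minimal Python sketch
of the same check, using only the standard library `email` module. The
sample message and the `list_parts` helper are made up for illustration;
your real EML files may be structured differently.

```python
from email import message_from_bytes, policy

# Hypothetical sample: a multipart/alternative email with a plain-text
# part and an HTML part carrying inline CSS.
SAMPLE = (
    b"From: a@example.com\r\n"
    b"To: b@example.com\r\n"
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Hello there\r\n"
    b"--XX\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b'<p style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Hello there</p>\r\n'
    b"--XX--\r\n"
)


def list_parts(raw_bytes):
    """Return (content_type, start_of_text) for every text/* MIME part."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    parts = []
    for part in msg.walk():
        # Skip containers and non-text attachments; only text parts matter here.
        if part.is_multipart() or part.get_content_maintype() != "text":
            continue
        parts.append((part.get_content_type(), part.get_content()[:80]))
    return parts


if __name__ == "__main__":
    for content_type, preview in list_parts(SAMPLE):
        print(content_type, "|", preview.strip())
```

To run it on a real file, read the file in binary mode and pass the
bytes to `list_parts`. If a text/html part shows the FONT-SIZE styling,
that is most likely where the indexed words come from.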

As for the URP chain, are you absolutely sure it is being used? It is
not declared as the default, so you need to call it explicitly. Try
setting a field in there, or some other clear flag, to confirm that a
record has been processed.
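
Calling the chain explicitly means passing the `update.chain` request
parameter on the update request. Below is a rough Python sketch that
only builds such a URL and sends nothing. The base URL, the core name
"mycore", the document id, and the use of the /update/extract handler
are all my assumptions; substitute whatever you actually use.

```python
from urllib.parse import urlencode


def update_url(base, core, chain, doc_id, handler="/update/extract"):
    """Build an update URL that names the processor chain explicitly.

    `handler` defaults to /update/extract only as an assumption about
    how the EML files are being sent to Solr.
    """
    params = {
        "update.chain": chain,  # selects the named updateRequestProcessorChain
        "literal.id": doc_id,   # id for the extracted document (extract handler)
        "commit": "true",
    }
    return f"{base}/{core}{handler}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, core, and id, for illustration only.
    print(update_url("http://localhost:8983/solr", "mycore",
                     "html-strip-content", "msg-1"))
```

The alternative is to add default="true" to the
updateRequestProcessorChain element, so that requests which do not name
a chain use it.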

Regards,
    Alex.

On Sun, 30 Dec 2018 at 22:46, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
> These texts are likely from the original EML file data, but they are not
> visible in the content when the EML file is opened in Microsoft Outlook.
>
> I have already applied the HTMLStripFieldUpdateProcessorFactory in
> solrconfig.xml, but these texts are still showing up in the index. Below is
> my configuration.
>
> <updateRequestProcessorChain name="html-strip-content">
>   <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
>     <str name="fieldName">content_tcs</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
>
> Regards,
> Edwin
>
> On Mon, 31 Dec 2018 at 11:29, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
> > Specifically, a custom Update Request Processor chain can be used before
> > indexing, probably with HTMLStripFieldUpdateProcessorFactory.
> > Regards,
> >      Alex
> >
> > On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore <v.dam...@gmail.com wrote:
> >
> > > Hi,
> > >
> > > > I think this kind of text manipulation should be done before indexing.
> > > > If you have font-size and font-family in your text, you are very
> > > > likely indexing HTML with CSS.
> > > > If I’m right, you are heading into a hell of words that will have to
> > > > be removed from your text.
> > >
> > > > On the other hand, if you have to do this at index time, a quick and
> > > > dirty solution is to use the pattern-replace filter.
> > >
> > >
> > >
> > https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> > >
> > > Ciao,
> > > Vincenzo
> > >
> > > --
> > > mobile: 3498513251
> > > skype: free.dev
> > >
> > > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > > I noticed that during the indexing of EML files, there are words like
> > > > > "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" that are being indexed into the
> > > > > content as well.
> > > >
> > > > > I would like to check: how can we remove those words during
> > > > > indexing?
> > > >
> > > > I am using Solr 7.5.0
> > > >
> > > > Regards,
> > > > Edwin
> > >
> >
