Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2019-01-11 Thread Zheng Lin Edwin Yeo
Thanks for your reply.

What I have found is that the EML file contains two parts with different
Content-Type values: one is text/html and the other is text/plain.

The text/html part contains words like "FONT-SIZE: 9pt; FONT-FAMILY: arial"
in its content, but the text/plain part has no such words; its content is
clean (just what is in the email).

As such, I believe that the indexing is done on the text/html part. Is
there any way to change the settings so that the indexing is done on the
text/plain part instead?

Regards,
Edwin
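
One way to act on this observation outside of Solr is to select the
text/plain part before the document is sent for indexing. The following is
only a minimal sketch, assuming the EML file is a multipart/alternative
message as described above. It uses Python's standard email package; the
function name plain_text_body and the SAMPLE_EML message are hypothetical,
and it does not change which part Tika or Solr pick on their own.

```python
from email import message_from_string, policy

# Hypothetical sample: a two-part message like the one described above.
SAMPLE_EML = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Hello world\n"
    "--XYZ\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    '<p style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Hello world</p>\n'
    "--XYZ--\n"
)


def plain_text_body(eml_source: str) -> str:
    """Return the text/plain body of an EML message, or "" if there is none."""
    # policy.default makes the parser return an EmailMessage, which has get_body().
    msg = message_from_string(eml_source, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


if __name__ == "__main__":
    print(plain_text_body(SAMPLE_EML).strip())
```

The returned text could then be indexed as the content field in place of
the HTML-derived text.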



Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2019-01-01 Thread Gus Heck
Although Vincenzo's and Alexandre's suggestions may be helpful in the right
circumstances, there is a continuum of answers to the original question
here. This continuum is mostly relevant if indexing and querying are likely
to happen simultaneously, or if the data volume is large enough relative to
the server to make you wish indexing would finish faster. Otherwise,
maintainability, local talent and time investment concerns probably
dominate, with the caveat that in many cases initial success may lead to a
future with large data volumes, or one where querying and indexing do
become simultaneous.

1) Vincenzo's answer would be suitable for a single field or a few small
fields with a very narrow set of possible HTML-like tags. If the number of
patterns that need to be matched is high, or the text to be matched is
long, I would expect this solution to begin to hurt performance.

2) Alexandre's suggestion is much better where there is a moderate amount
of text and the input could be generalized HTML. But as the amount of text
that needs HTML stripped grows, the performance of the server will degrade
faster than necessary under increased indexing load.

3) If the SolrCloud cluster you are indexing into must simultaneously
provide good response times for queries, and you are not able to supply
it with an overabundance of hardware relative to the query/indexing load,
then you should consider pre-processing the documents in an external
ingestion system such as JesterJ, Fusion, or one of the other solutions
out there. As the indexing and query load goes up, the best practice is to
move as much pre-processing work out of Solr as possible so that Solr can
continue to do what it does well and return query results quickly.

In the end, like most engineering decisions, it's a cost trade-off: which
costs more, investing in setting up external processing or investing in
server hardware? If it's a small amount of data loaded batch-style prior to
querying, you are in a good place and any of these will work; just do
whatever is fastest or easiest to implement. If you need to support a high
volume of data being loaded into Solr in a timely manner, or you require
minimal impact on query latency from indexing, you want some variation of
3.

-Gus
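
To make option 3 a little more concrete, here is a rough sketch of
stripping HTML in a pre-processing step that runs outside Solr. It is not
how JesterJ or Fusion work; it only uses Python's standard html.parser
module, and the names TextExtractor and html_to_text are hypothetical.
Inline style attributes are dropped along with the tags, and the bodies of
style and script elements are skipped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <style> and <script>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside <style> or <script>
        self._chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        # Attributes such as style="FONT-SIZE: 9pt" are simply ignored.
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = '<p style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Hello <b>world</b></p>'
    print(html_to_text(sample))
```

Because this runs before the document reaches Solr, its cost is paid by
the ingestion process and not by the nodes that serve queries.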





Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-31 Thread Hasan Diwan
Perhaps https://royvanrijn.com/blog/2016/03/java-mail-message-as-download/
may be helpful? Though I see the date on it and am now unsure. -- H





Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-31 Thread Zheng Lin Edwin Yeo
Hi Alex,

I have tried with an HTML-formatted file containing ordinary HTML tags,
and those get removed during indexing.

For text like "FONT-SIZE: 9pt; FONT-FAMILY: arial", I found that in the
EML file there are two different content types, text/html and text/plain.
Could it be due to Tika getting the content from the text/html part
instead of the text/plain part?

Regards,
Edwin



Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-31 Thread Alexandre Rafalovitch
EML is for emails, so you are probably getting some HTML-formatted emails,
probably with an alternative text part. Outlook would render the HTML
and/or use the text part. I think you can just open the EML file in a text
editor to check.

As for the URP (Update Request Processor) chain, are you absolutely sure
it is being used? It is not declared as default, so you need to call it
explicitly. Try setting a field in there, or some other clear flag, to
show that a record has been processed.

Regards,
Alex.
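
As a sketch of what "calling it explicitly" can look like: Solr selects a
non-default chain through the update.chain request parameter. The helper
below only builds such a request URL. The base URL, the collection name
"emails", the chain name "html_strip", the document id, and the use of the
/update/extract handler are all assumptions for illustration, not details
taken from the configuration in this thread.

```python
from urllib.parse import urlencode


def extract_url(base_url: str, collection: str, chain: str, doc_id: str) -> str:
    """Build an update URL that names the processor chain explicitly."""
    params = {
        "update.chain": chain,   # selects the named updateRequestProcessorChain
        "literal.id": doc_id,    # assumed: the id is supplied as a literal field
        "commit": "true",
    }
    return f"{base_url}/{collection}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # All values here are hypothetical placeholders.
    print(extract_url("http://localhost:8983/solr", "emails", "html_strip", "msg-1"))
```

If documents indexed through such a URL still contain the unwanted text,
the chain is running and the problem lies elsewhere.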



Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-30 Thread Zheng Lin Edwin Yeo
These texts are likely from the original EML file data, but they are not
visible in the content when the EML file is opened in Microsoft Outlook.

I have already applied the HTMLStripFieldUpdateProcessorFactory in
solrconfig.xml, but these texts are still showing up in the index. Below is
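my configuration.

<updateRequestProcessorChain name="...">
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">content_tcs</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>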

Regards,
Edwin



Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-30 Thread Alexandre Rafalovitch
Specifically, a custom Update Request Processor chain can be used before
indexing, probably with HTMLStripFieldUpdateProcessorFactory.
Regards,
 Alex



Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-30 Thread Vincenzo D'Amore
Hi,

I think this kind of text manipulation should be done before indexing. If
you have font-size and font-family in your text, you are very likely
indexing HTML with CSS.
If I’m right, you are entering a hell of words that will need to be removed
from your text.

On the other hand, if you have to do this at index time, a quick and dirty
solution is to use the pattern-replace filter.

https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter

Ciao,
Vincenzo
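
To show the kind of substitution a pattern-replace step performs, here is
a small Python sketch. It is not Solr configuration; the regular
expression and the function name strip_css_declarations are hypothetical
and deliberately narrow, covering only the two CSS properties named in
this thread.

```python
import re

# Hypothetical, narrow pattern: "font-size: <value>" or "font-family: <value>",
# case-insensitive, with an optional trailing semicolon and whitespace.
CSS_DECLARATION = re.compile(
    r"\bfont-(?:size|family)\s*:\s*[^;\n]*;?\s*",
    re.IGNORECASE,
)


def strip_css_declarations(text: str) -> str:
    """Remove the matched CSS declarations and trim surrounding whitespace."""
    return CSS_DECLARATION.sub("", text).strip()


if __name__ == "__main__":
    print(strip_css_declarations("FONT-SIZE: 9pt; FONT-FAMILY: arial\nHello"))
```

A value that is not terminated by a semicolon is matched up to the end of
its line, so ordinary words on the same line would be removed as well.
That fragility is why this is only a quick and dirty option.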




Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-30 Thread Zheng Lin Edwin Yeo
Hi,

I noticed that during the indexing of EML files, words like
"FONT-SIZE: 9pt; FONT-FAMILY: arial" are being indexed into the content as
well.

How can we remove those words during indexing?

I am using Solr 7.5.0.

Regards,
Edwin