Re: Profanity

2018-01-08 Thread John Blythe
Gladly. Good luck!

On Mon, Jan 8, 2018 at 8:27 PM Sadiki Latty  wrote:

> Thanks for the feedback John,
>
> This is a genius idea if I don’t want to create my own processor. I could
> simply check that field for data for my reports. Either the field will have
> data or it won’t.
>
> Thanks
>
> Sid
>
> Sent from my iPhone
>
> > On Jan 8, 2018, at 4:38 PM, John Blythe  wrote:
> >
> > you could use the keepwords functionality. have a field that only keeps
> > profanity and then you can query against that field having its default
> > value vs. profane text
> >
> > --
> > John Blythe
> >
> >> On Mon, Jan 8, 2018 at 3:12 PM, Sadiki Latty  wrote:
> >>
> >> Hey
> >>
> >> I would like to find a solution to flag (at index-time) profanity.
> >> Optimally, it would be good if it function similar to stopwords in the
> >> sense that I can have a predefined list that is read and if token is on the
> >> list that document is 'flagged' in a different field. Does anyone know of
> >> solution (outside of configuring my own). If none exists and I end up
> >> configuring my own would I be doing this in the updateprcoessor phase. I am
> >> still fairly new to Solr, but from what I've read, that seems to be the
> >> best place to look.
> >>
> >>
> >> Thanks,
> >>
> >> Sid
> >>
>
-- 
John Blythe


Re: Profanity

2018-01-08 Thread Sadiki Latty
Thanks for the feedback John,

This is a genius idea if I don’t want to create my own processor. I could 
simply check that field for data for my reports. Either the field will have 
data or it won’t. 

Thanks

Sid

Sent from my iPhone

> On Jan 8, 2018, at 4:38 PM, John Blythe  wrote:
> 
> you could use the keepwords functionality. have a field that only keeps
> profanity and then you can query against that field having its default
> value vs. profane text
> 
> --
> John Blythe
> 
>> On Mon, Jan 8, 2018 at 3:12 PM, Sadiki Latty  wrote:
>> 
>> Hey
>> 
>> I would like to find a solution to flag (at index-time) profanity.
>> Optimally, it would be good if it function similar to stopwords in the
>> sense that I can have a predefined list that is read and if token is on the
>> list that document is 'flagged' in a different field. Does anyone know of
>> solution (outside of configuring my own). If none exists and I end up
>> configuring my own would I be doing this in the updateprcoessor phase. I am
>> still fairly new to Solr, but from what I've read, that seems to be the
>> best place to look.
>> 
>> 
>> Thanks,
>> 
>> Sid
>> 


Re: Profanity

2018-01-08 Thread Sadiki Latty
Thanks a lot guys. Multilingual will also be a hurdle tbh. The data will only 
be coming from two languages, French and English, so “merde” will be making 
that list, but it could still prove challenging. This requirement is itself an 
edge case for my project, so ML may be overkill, which is why I was thinking 
of the list. The data being inserted is from sources that we have “control” 
over. This requirement is simply for the worst-case scenario that we miss 
something. We might also want to allow this profanity, which is why we need to 
flag it rather than strip it altogether. 

This provides me with great direction.

Sent from my iPhone

> On Jan 8, 2018, at 5:17 PM, Markus Jelsma  wrote:
> 
> Indeed, hence the small suggestion to use ML for this instead of a dumb set 
> of terms, which is useless in almost any real solution. We have had very good 
> results with SVM's for text processing, although in the end it depends on 
> your input data, and the care for selecting edge cases.
> 
> Regards,
> Markus
> 
> -Original message-
>> From:Davis, Daniel (NIH/NLM) [C] 
>> Sent: Monday 8th January 2018 23:12
>> To: solr-user@lucene.apache.org
>> Subject: RE: Profanity
>> 
>> Fun topic.   Same complicated issues as normal search:
>> 
>> Multilingual support?  Is "Merde" profanity too, or just in French.
>> Multi-word synonyms?   Does "God Damn" becomes "goddamn", or do you 
>> treat "Damn" and "God damn" the same because you drop "God"
>> "Merde Alors" is same as "Merde" or again multi-word synonyms
>> 
>> -Original Message-
>> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
>> Sent: Monday, January 8, 2018 4:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Profanity
>> 
>> Yes, an UpdateRequestProcessor is the API to implement for these sorts of 
>> requirements. In the URP you have access to a SolrDocument object that 
>> carries the input data. You can inspect the fields, and add, remove or 
>> modify fields if you want, or discard the input altogether.
>> 
>> So, check your text input field for 'profanity' and set another boolean 
>> field if it matches or doesn't. If you are using a list of words - or an SVM 
>> or another machine learning algorithm - to detect provanity is up to you.
>> 
>> Cheers,
>> Markus
>>   
>> -Original message-
>>> From:Sadiki Latty 
>>> Sent: Monday 8th January 2018 22:12
>>> To: solr-user@lucene.apache.org
>>> Subject: Profanity
>>> 
>>> Hey
>>> 
>>> I would like to find a solution to flag (at index-time) profanity. 
>>> Optimally, it would be good if it function similar to stopwords in the 
>>> sense that I can have a predefined list that is read and if token is on the 
>>> list that document is 'flagged' in a different field. Does anyone know of 
>>> solution (outside of configuring my own). If none exists and I end up 
>>> configuring my own would I be doing this in the updateprcoessor phase. I am 
>>> still fairly new to Solr, but from what I've read, that seems to be the 
>>> best place to look.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Sid
>>> 
>> 


RE: Profanity

2018-01-08 Thread Markus Jelsma
Indeed, hence the small suggestion to use ML for this instead of a dumb set of 
terms, which is useless in almost any real solution. We have had very good 
results with SVMs for text processing, although in the end it depends on your 
input data and on the care taken in selecting edge cases.

Regards,
Markus
 
-Original message-
> From:Davis, Daniel (NIH/NLM) [C] 
> Sent: Monday 8th January 2018 23:12
> To: solr-user@lucene.apache.org
> Subject: RE: Profanity
> 
> Fun topic.   Same complicated issues as normal search:
> 
> Multilingual support?  Is "Merde" profanity too, or just in French.
> Multi-word synonyms?   Does "God Damn" becomes "goddamn", or do you treat 
> "Damn" and "God damn" the same because you drop "God"
> "Merde Alors" is same as "Merde" or again multi-word synonyms
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Monday, January 8, 2018 4:42 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Profanity
> 
> Yes, an UpdateRequestProcessor is the API to implement for these sorts of 
> requirements. In the URP you have access to a SolrDocument object that 
> carries the input data. You can inspect the fields, and add, remove or modify 
> fields if you want, or discard the input altogether.
> 
> So, check your text input field for 'profanity' and set another boolean field 
> if it matches or doesn't. If you are using a list of words - or an SVM or 
> another machine learning algorithm - to detect provanity is up to you.
> 
> Cheers,
> Markus
>  
> -Original message-
> > From:Sadiki Latty 
> > Sent: Monday 8th January 2018 22:12
> > To: solr-user@lucene.apache.org
> > Subject: Profanity
> > 
> > Hey
> > 
> > I would like to find a solution to flag (at index-time) profanity. 
> > Optimally, it would be good if it function similar to stopwords in the 
> > sense that I can have a predefined list that is read and if token is on the 
> > list that document is 'flagged' in a different field. Does anyone know of 
> > solution (outside of configuring my own). If none exists and I end up 
> > configuring my own would I be doing this in the updateprcoessor phase. I am 
> > still fairly new to Solr, but from what I've read, that seems to be the 
> > best place to look.
> > 
> > 
> > Thanks,
> > 
> > Sid
> > 
> 


RE: Profanity

2018-01-08 Thread Davis, Daniel (NIH/NLM) [C]
Fun topic.   Same complicated issues as normal search:

Multilingual support?  Is "Merde" profanity too, or just in French?
Multi-word synonyms?   Does "God Damn" become "goddamn", or do you treat 
"Damn" and "God damn" the same because you drop "God"?
Is "Merde Alors" the same as "Merde", or is that again a case of 
multi-word synonyms?

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, January 8, 2018 4:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Profanity

Yes, an UpdateRequestProcessor is the API to implement for these sorts of 
requirements. In the URP you have access to a SolrDocument object that carries 
the input data. You can inspect the fields, and add, remove or modify fields if 
you want, or discard the input altogether.

So, check your text input field for 'profanity' and set another boolean field 
if it matches or doesn't. If you are using a list of words - or an SVM or 
another machine learning algorithm - to detect provanity is up to you.

Cheers,
Markus
 
-Original message-
> From:Sadiki Latty 
> Sent: Monday 8th January 2018 22:12
> To: solr-user@lucene.apache.org
> Subject: Profanity
> 
> Hey
> 
> I would like to find a solution to flag (at index-time) profanity. Optimally, 
> it would be good if it function similar to stopwords in the sense that I can 
> have a predefined list that is read and if token is on the list that document 
> is 'flagged' in a different field. Does anyone know of solution (outside of 
> configuring my own). If none exists and I end up configuring my own would I 
> be doing this in the updateprcoessor phase. I am still fairly new to Solr, 
> but from what I've read, that seems to be the best place to look.
> 
> 
> Thanks,
> 
> Sid
> 


RE: Profanity

2018-01-08 Thread Markus Jelsma
Yes, an UpdateRequestProcessor is the API to implement for these sorts of 
requirements. In the URP you have access to a SolrInputDocument object that 
carries the input data. You can inspect the fields, and add, remove or modify 
fields if you want, or discard the input altogether.

So, check your text input field for profanity and set another boolean field 
according to whether it matches. Whether you use a list of words, an SVM, or 
another machine learning algorithm to detect profanity is up to you.
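
As a rough illustration of the check described above, here is a minimal 
plain-Java sketch of the word-list variant. It is only a model of the logic: 
the document is a plain Map rather than a Solr object, and the field names 
"text" and "has_profanity" and the two-word list are assumptions made up for 
the example. In a real URP the same check would presumably sit in processAdd.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Plain-Java model of the flagging step; not a real UpdateRequestProcessor. */
final class ProfanityFlagSketch {

    // Hypothetical word list; in practice it would be loaded from a file.
    static final Set<String> WORDS = Set.of("damn", "merde");

    /** Returns true if any token of the text is on the word list. */
    static boolean containsListedWord(String text) {
        // Split on runs of non-letters and compare in lower case.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (WORDS.contains(token)) {
                return true;
            }
        }
        return false;
    }

    /** Copies the document and adds a boolean flag field next to the text field. */
    static Map<String, Object> flag(Map<String, Object> doc) {
        Map<String, Object> out = new HashMap<>(doc);
        Object text = doc.get("text"); // assumed name of the input field
        out.put("has_profanity", text != null && containsListedWord(text.toString()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "1");
        doc.put("text", "Merde alors, it broke again");
        System.out.println(flag(doc).get("has_profanity"));
    }
}
```

Because the document is flagged rather than altered, the original text stays 
searchable and the flag can be used as a filter when needed.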

Cheers,
Markus
 
-Original message-
> From:Sadiki Latty 
> Sent: Monday 8th January 2018 22:12
> To: solr-user@lucene.apache.org
> Subject: Profanity
> 
> Hey
> 
> I would like to find a solution to flag (at index-time) profanity. Optimally, 
> it would be good if it function similar to stopwords in the sense that I can 
> have a predefined list that is read and if token is on the list that document 
> is 'flagged' in a different field. Does anyone know of solution (outside of 
> configuring my own). If none exists and I end up configuring my own would I 
> be doing this in the updateprcoessor phase. I am still fairly new to Solr, 
> but from what I've read, that seems to be the best place to look.
> 
> 
> Thanks,
> 
> Sid
> 


Re: Profanity

2018-01-08 Thread John Blythe
you could use the keepwords functionality. have a field that only keeps
profanity, and then you can query against that field having its default
value (no profane terms) vs. containing profane text

--
John Blythe

On Mon, Jan 8, 2018 at 3:12 PM, Sadiki Latty  wrote:

> Hey
>
> I would like to find a solution to flag (at index-time) profanity.
> Optimally, it would be good if it function similar to stopwords in the
> sense that I can have a predefined list that is read and if token is on the
> list that document is 'flagged' in a different field. Does anyone know of
> solution (outside of configuring my own). If none exists and I end up
> configuring my own would I be doing this in the updateprcoessor phase. I am
> still fairly new to Solr, but from what I've read, that seems to be the
> best place to look.
>
>
> Thanks,
>
> Sid
>


Profanity

2018-01-08 Thread Sadiki Latty
Hey

I would like to find a solution to flag profanity at index time. Ideally, it 
would work like stopwords, in the sense that I can have a predefined list 
that is read, and if a token is on the list, the document is 'flagged' in a 
different field. Does anyone know of a solution (other than building my own)? 
If none exists and I end up building my own, would I be doing this in the 
update processor phase? I am still fairly new to Solr, but from what I've 
read, that seems to be the best place to look.


Thanks,

Sid


Re: implementing profanity detector

2010-02-16 Thread Lance Norskog
A problem is that your profanity list will not stop growing, and with
each new word you will want to rescrub the index.

We had a thousand-word NOT clause in every query (a filter query would
be true for 99% of the index) until we switched to another
arrangement.

Another small problem was that I knew of many more perversions than my
co-workers, but did not wish to display my vast erudition in the
seamier side of life :)

On Fri, Feb 12, 2010 at 4:26 PM, Chris Hostetter
 wrote:
>
> : Otherwise, I'd do it via copy fields.  Your first field is your main
> : field and is analyzed as before.  Your second field does the profanity
> : detection and simply outputs a single token at the end, safe/unsafe.
>
> you don't even need custom code for this ... copyFiled all your text into
> a 'has_profanity' field where you use a suitable Tokenizer followed by the
> KeepWordsTokenFilter that only keeps profane words and then a
> PatternReplaceTokenFilter that matches .* and replaces it with "HELL_YEA"
> ... now a search for "is_profane:HELL_YEA" finds all profane docs, with
> the added bonus that the scores are based on how many profane words occur
> in the doc.
>
> it could be used as a filter query (probably negated) as needed.
>
>
>
> -Hoss
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: implementing profanity detector

2010-02-12 Thread Chris Hostetter

: Otherwise, I'd do it via copy fields.  Your first field is your main 
: field and is analyzed as before.  Your second field does the profanity 
: detection and simply outputs a single token at the end, safe/unsafe.

you don't even need custom code for this ... copyField all your text into 
a 'has_profanity' field where you use a suitable Tokenizer followed by the 
KeepWordFilter that only keeps profane words and then a 
PatternReplaceFilter that matches .* and replaces it with "HELL_YEA" 
... now a search for "has_profanity:HELL_YEA" finds all profane docs, with 
the added bonus that the scores are based on how many profane words occur 
in the doc.

it could be used as a filter query (probably negated) as needed.
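
To make the chain above concrete, here is a small plain-Java model of what 
the copy field would end up holding. It is not Solr configuration and uses no 
Lucene classes; the keep-words list and the letter-based tokenisation are 
assumptions chosen for the example. Each profane token becomes one marker 
token, so the number of markers is what the scoring bonus would be based on.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Plain-Java model of the copy-field analysis chain; not Solr configuration. */
final class KeepWordsChainSketch {

    static final Set<String> KEEP = Set.of("damn", "merde"); // hypothetical keep-words list
    static final String MARKER = "HELL_YEA";
    static final Pattern NON_LETTERS = Pattern.compile("\\P{L}+");

    /** Tokenise, keep only listed words, then rewrite every survivor to the marker. */
    static List<String> analyze(String text) {
        return NON_LETTERS.splitAsStream(text.toLowerCase(Locale.ROOT))
                .filter(KEEP::contains)   // keep-words step: drop everything not on the list
                .map(token -> MARKER)     // replace step: any kept token becomes the marker
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("Damn, merde alors!"));      // [HELL_YEA, HELL_YEA]
        System.out.println(analyze("A perfectly polite page")); // []
    }
}
```

A document with an empty result has no marker term at all, which is why the 
negated filter query works for the "safe" case.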



-Hoss



Re: implementing profanity detector

2010-02-12 Thread Mike Perham
On Thu, Feb 11, 2010 at 10:49 AM, Grant Ingersoll  wrote:
>
> Otherwise, I'd do it via copy fields.  Your first field is your main field 
> and is analyzed as before.  Your second field does the profanity detection 
> and simply outputs a single token at the end, safe/unsafe.
>
> How long are your documents?  The extra copy field is extra work, but in this 
> case it should be fast as you should be able to create a pretty streamlined 
> analyzer chain for the second task.
>

The documents are web page text, so they shouldn't be more than 10-20k
generally.  Would something like this do the trick?

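  // Clear this again in reset() if the stream is reused.
  private boolean done = false;

  @Override
  public boolean incrementToken() throws IOException {
    if (done) {
      return false;  // the single y/n token has already been emitted
    }
    done = true;
    // Scan the whole input; emit "y" at the first profane token, otherwise "n".
    while (input.incrementToken()) {
      if (profanities.contains(termAtt.termBuffer(), 0, termAtt.termLength())) {
        termAtt.setTermBuffer("y", 0, 1);
        return true;  // returning false here would drop the token
      }
    }
    termAtt.setTermBuffer("n", 0, 1);
    return true;
  }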

mike


Re: implementing profanity detector

2010-02-11 Thread Grant Ingersoll

On Jan 28, 2010, at 4:46 PM, Mike Perham wrote:

> We'd like to implement a profanity detector for documents during indexing.
> That is, given a file of profane words, we'd like to be able to mark a
> document as safe or not safe if it contains any of those words so that we
> can have something similar to google's safe search.
> 
> I'm trying to figure out how best to implement this with Solr 1.4:
> 
> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> boolean field but requires me to pull out the content, tokenize it and run
> each token through my set of profanities, essentially running the analysis
> pipeline again.  That's a lot of overheard AFAIK.
> 
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
> 
> Any suggestions on how to best implement this?
> 


TeeSinkTokenFilter (Lucene only) would do the trick if you're up for some 
hardcoding, because it isn't supported all that well in Solr (patch welcome). 
A one-off solution shouldn't be too hard to wedge in, but it will involve 
hardcoding some field names in your analyzer, I think.  

Otherwise, I'd do it via copy fields.  Your first field is your main field and 
is analyzed as before.  Your second field does the profanity detection and 
simply outputs a single token at the end, safe/unsafe.

How long are your documents?  The extra copy field is extra work, but in this 
case it should be fast as you should be able to create a pretty streamlined 
analyzer chain for the second task.

Short term, I'd do the copy field approach while maybe, depending on its 
importance to you, working on the first approach.

-Grant




Re: implementing profanity detector

2010-02-11 Thread Alexey Serba
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
https://issues.apache.org/jira/browse/SOLR-1536

On Fri, Jan 29, 2010 at 12:46 AM, Mike Perham  wrote:
> We'd like to implement a profanity detector for documents during indexing.
>  That is, given a file of profane words, we'd like to be able to mark a
> document as safe or not safe if it contains any of those words so that we
> can have something similar to google's safe search.
>
> I'm trying to figure out how best to implement this with Solr 1.4:
>
> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> boolean field but requires me to pull out the content, tokenize it and run
> each token through my set of profanities, essentially running the analysis
> pipeline again.  That's a lot of overheard AFAIK.
>
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
>
> Any suggestions on how to best implement this?
>
> Thanks in advance,
> mike
>


implementing profanity detector

2010-02-10 Thread Mike Perham
FYI this does not work.  It appears that the update runs on a
different thread from the analysis, perhaps because the update is done
when the commit happens?  I'm sending the document XML with
commitWithin="6".

I would appreciate any other ideas.  I'm drawing a blank on how to
implement this efficiently with Lucene/Solr.

mike

On Thu, Jan 28, 2010 at 4:31 PM, Otis Gospodnetic
 wrote:
>
> How about this crazy idea - a custom TokenFilter that stores the safe flag in 
> ThreadLocal?
>
>
>
> - Original Message 
> > From: Mike Perham 
> > To: solr-user@lucene.apache.org
> > Sent: Thu, January 28, 2010 4:46:54 PM
> > Subject: implementing profanity detector
> >
> > We'd like to implement a profanity detector for documents during indexing.
> > That is, given a file of profane words, we'd like to be able to mark a
> > document as safe or not safe if it contains any of those words so that we
> > can have something similar to google's safe search.
> >
> > I'm trying to figure out how best to implement this with Solr 1.4:
> >
> > - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> > boolean field but requires me to pull out the content, tokenize it and run
> > each token through my set of profanities, essentially running the analysis
> > pipeline again.  That's a lot of overheard AFAIK.
> >
> > - A TokenFilter would allow me to tap into the existing analysis pipeline so
> > I get the tokens for free but I can't access the document.
> >
> > Any suggestions on how to best implement this?
> >
> > Thanks in advance,
> > mike
>


Re: implementing profanity detector

2010-01-28 Thread Lance Norskog
You could have a synonym file that, for each dirty word, changes the
word into an "impossible word": for example, xyzzy. Then, a search for
clean contents is:

(user search) AND NOT xyzzy

A synonym filter that included payloads would be cool.
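
For what it's worth, the sentinel idea can be modelled in a few lines of 
plain Java. This is only a sketch of the behaviour, not a Solr synonym file: 
the word list and the tokenisation are assumptions, and "xyzzy" is the 
sentinel from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java model of the sentinel-synonym idea; not a Solr synonym filter. */
final class SentinelSynonymSketch {

    static final Set<String> DIRTY = Set.of("damn", "merde"); // hypothetical list
    static final String SENTINEL = "xyzzy";

    /** Index-time step: every listed word is rewritten to the sentinel token. */
    static List<String> indexTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (!t.isEmpty()) {
                tokens.add(DIRTY.contains(t) ? SENTINEL : t);
            }
        }
        return tokens;
    }

    /** Query-time step, modelling: (user term) AND NOT xyzzy. */
    static boolean matchesClean(List<String> tokens, String userTerm) {
        return tokens.contains(userTerm.toLowerCase(Locale.ROOT))
                && !tokens.contains(SENTINEL);
    }

    public static void main(String[] args) {
        List<String> dirtyDoc = indexTokens("Damn, the search broke");
        List<String> cleanDoc = indexTokens("The search works");
        System.out.println(matchesClean(dirtyDoc, "search")); // false
        System.out.println(matchesClean(cleanDoc, "search")); // true
    }
}
```

Note that the rewrite happens at index time, so a change to the word list 
only takes effect for documents indexed afterwards.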

On Thu, Jan 28, 2010 at 2:31 PM, Otis Gospodnetic
 wrote:
> How about this crazy idea - a custom TokenFilter that stores the safe flag in 
> ThreadLocal?
>
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>
>
> - Original Message 
>> From: Mike Perham 
>> To: solr-user@lucene.apache.org
>> Sent: Thu, January 28, 2010 4:46:54 PM
>> Subject: implementing profanity detector
>>
>> We'd like to implement a profanity detector for documents during indexing.
>> That is, given a file of profane words, we'd like to be able to mark a
>> document as safe or not safe if it contains any of those words so that we
>> can have something similar to google's safe search.
>>
>> I'm trying to figure out how best to implement this with Solr 1.4:
>>
>> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
>> boolean field but requires me to pull out the content, tokenize it and run
>> each token through my set of profanities, essentially running the analysis
>> pipeline again.  That's a lot of overheard AFAIK.
>>
>> - A TokenFilter would allow me to tap into the existing analysis pipeline so
>> I get the tokens for free but I can't access the document.
>>
>> Any suggestions on how to best implement this?
>>
>> Thanks in advance,
>> mike
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: implementing profanity detector

2010-01-28 Thread Otis Gospodnetic
How about this crazy idea - a custom TokenFilter that stores the safe flag in 
ThreadLocal?


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
> From: Mike Perham 
> To: solr-user@lucene.apache.org
> Sent: Thu, January 28, 2010 4:46:54 PM
> Subject: implementing profanity detector
> 
> We'd like to implement a profanity detector for documents during indexing.
> That is, given a file of profane words, we'd like to be able to mark a
> document as safe or not safe if it contains any of those words so that we
> can have something similar to google's safe search.
> 
> I'm trying to figure out how best to implement this with Solr 1.4:
> 
> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> boolean field but requires me to pull out the content, tokenize it and run
> each token through my set of profanities, essentially running the analysis
> pipeline again.  That's a lot of overheard AFAIK.
> 
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
> 
> Any suggestions on how to best implement this?
> 
> Thanks in advance,
> mike



implementing profanity detector

2010-01-28 Thread Mike Perham
We'd like to implement a profanity detector for documents during indexing.
That is, given a file of profane words, we'd like to be able to mark a
document as safe or not safe if it contains any of those words so that we
can have something similar to Google's safe search.

I'm trying to figure out how best to implement this with Solr 1.4:

- An UpdateRequestProcessor would allow me to dynamically populate a "safe"
boolean field but requires me to pull out the content, tokenize it and run
each token through my set of profanities, essentially running the analysis
pipeline again.  That's a lot of overhead AFAIK.

- A TokenFilter would allow me to tap into the existing analysis pipeline so
I get the tokens for free but I can't access the document.

Any suggestions on how to best implement this?

Thanks in advance,
mike