Re: Percolate feature?

2013-10-01 Thread Charlie Hull

On 01/10/2013 04:12, Otis Gospodnetic wrote:

Just came across this ancient thread.  Charlie, did this end up
happening?  I suspect Wolfgang may be interested, but that's just a
wild guess.


Hi Otis & all,

Yes we're actually planning to talk about it at Lucene Revolution in 
November and open source it around then - it's called 'Luwak' and we're 
working on a live customer implementation based on it currently.


I was curious about your feeling that what you were open-sourcing
might be a lot faster and more flexible than ES's percolator - can you
share more about why do you have that feeling and whether you've
confirmed this?


Difficult to say at present - we've not done a direct comparative test 
yet and obviously we like our own implementation! It works very well for 
our clients' use case.


Cheers

Charlie



Thanks,
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Aug 5, 2013 at 6:34 AM, Charlie Hull char...@flax.co.uk wrote:

On 03/08/2013 00:50, Mark wrote:


We have a set number of known terms we want to match against.

In Index:
term one
term two
term three

I know how to match all terms of a user query against the index but we
would like to know how/if we can match a user's query against all the terms
in the index?

Search Queries:
my search term = 0 matches
my term search one = 1 match  (term one)
some prefix term two = 1 match (term two)
one two three = 0 matches

I can only explain this is almost a reverse search???

I came across the following from ElasticSearch
(http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds
like this may accomplish the above but haven't tested. I was wondering if
Solr had something similar or an alternative way of accomplishing this?

Thanks



Hi Mark,

We've built something that implements this kind of reverse search for our
clients in the media monitoring sector - we're working on releasing the core
of this as open source very soon, hopefully in a month or two. It's based on
Lucene.

Just for reference it's able to apply tens of thousands of stored queries to
a document per second (our clients often have very large and complex Boolean
strings representing their clients' interests and may monitor hundreds of
thousands of news stories every day). It also records the positions of every
match. We suspect it's a lot faster and more flexible than Elasticsearch's
Percolate feature.

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Percolate feature?

2013-09-30 Thread Otis Gospodnetic
Just came across this ancient thread.  Charlie, did this end up
happening?  I suspect Wolfgang may be interested, but that's just a
wild guess.

I was curious about your feeling that what you were open-sourcing
might be a lot faster and more flexible than ES's percolator - can you
share more about why do you have that feeling and whether you've
confirmed this?

Thanks,
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Aug 5, 2013 at 6:34 AM, Charlie Hull char...@flax.co.uk wrote:
 On 03/08/2013 00:50, Mark wrote:

 We have a set number of known terms we want to match against.

 In Index:
 term one
 term two
 term three

 I know how to match all terms of a user query against the index but we
 would like to know how/if we can match a user's query against all the terms
 in the index?

 Search Queries:
 my search term = 0 matches
 my term search one = 1 match  (term one)
 some prefix term two = 1 match (term two)
 one two three = 0 matches

 I can only explain this is almost a reverse search???

 I came across the following from ElasticSearch
 (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds
 like this may accomplish the above but haven't tested. I was wondering if
 Solr had something similar or an alternative way of accomplishing this?

 Thanks


 Hi Mark,

 We've built something that implements this kind of reverse search for our
 clients in the media monitoring sector - we're working on releasing the core
 of this as open source very soon, hopefully in a month or two. It's based on
 Lucene.

 Just for reference it's able to apply tens of thousands of stored queries to
 a document per second (our clients often have very large and complex Boolean
 strings representing their clients' interests and may monitor hundreds of
 thousands of news stories every day). It also records the positions of every
 match. We suspect it's a lot faster and more flexible than Elasticsearch's
 Percolate feature.

 Cheers

 Charlie

 --
 Charlie Hull
 Flax - Open Source Enterprise Search

 tel/fax: +44 (0)8700 118334
 mobile:  +44 (0)7767 825828
 web: www.flax.co.uk


Re: Percolate feature?

2013-08-19 Thread Chris Hostetter
: Let's talk about the real use case. We are a marketplace that sells
: products that users have listed. For certain popular, high-risk or
: restricted keywords we charge the seller an extra fee/ban the listing.
: We now have sellers purposely misspelling their listings to circumvent
: this fee. They will start adding suffixes to their product listings such
: as "Sonies", knowing that it gets indexed down to "Sony" and thus
: matches a user's query for "Sony". Or they will munge together numbers and
: products… "2013Sony". Same thing goes for adding crazy non-ASCII
: characters to the front of the keyword: "ΒSony". This is obviously a
: problem because we aren't charging for these keywords and, more
: importantly, it makes our search results look like shit.

: 1) Detect when a certain keyword is in a product title at listing time
: so we may charge the seller. This was my idea of a "reverse search",
: although it sounds like I may have caused too much confusion with that term.

Ok ... with the concrete specifics of your situation in mind, i can think 
of 2 completely different approaches -- depending on how precise you need 
to be about your definition of a "match" and how you want to deal with 
ongoing maintenance as your system evolves...

## Approach #1 - NRT index & searching w/custom plugin

Even if you have 1000-5000 of these special queries you need to check, 
a custom component to execute those 1000-5000 queries should be very fast 
against a small index where most of the queries won't match anything -- 
especially if you write a custom component that pre-parses them into Query 
objects and hangs onto them in memory.

(As a sample data point: With the 32 sample docs from Solr 4.x, I 
configured a request handler with 5000 unique facet.query defaults using 
the {!field} qparser.  Most of these facet queries didn't match anything, 
but a handful matched one of the same documents.  With completely 
cold caches, these 5000 facet queries had a QTime of 502ms on my laptop -- 
and that includes parsing all 5000 query strings)

So imagine if you wrote a custom SearchComponent that could read your X 
special queries from some remote database on init (and re-load them on 
command) and parse them into Queries which it then holds on to in some kind 
of data structure that also tracks why you care about them (ie: charge 10% 
more, banned, etc...).  At query time, your custom component would filter 
the main result set of docs against these queries to look for matches that 
should be reported (along with the metadata about the queries that match), 
and could also inspect the results of any query that matches and generate 
highlighting for each query+doc that matches.  You would then register this 
custom search component in a special "validation" solr core that is 
otherwise configured exactly the same as your regular production index.

When a client says "here's my Y products i want to add" you would...

 1) index those Y products into your validation solr core using 
softCommit=true&openSearcher=true
 2) execute a query using your special search component, filtered to just 
the list of Y unique ids of the products the client just gave you (that 
way you can handle concurrent requests from different clients w/o false 
positives)
 3) use the results of that query to tell your client things like "product 
#123 matches 'Sony' so we are charging you more; and product #456 matches 
'Porn' so we are rejecting it"
 4) only when done, would you re-index those products into your real 
index.
 5) help keep your validation index small by also doing a deleteById on 
all of that batch of Y docs when you are done validating.


The upside of this approach is that it helps you ensure the validation 
logic you apply to products when you get them from clients *exactly* 
matches your real queries, even if your schema & analysis evolve over 
time.  The downside is it's a decent amount of custom plugin code you need 
to write upfront, and it will get slower if/when the number of special 
validation queries increases.
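
As a rough illustration of the in-memory structure described above (special queries pre-parsed once, each tagged with the reason it matters), here is a minimal Python sketch. It does not use Solr at all: `Rule`, `RuleSet`, `analyze` and the action labels are hypothetical names, and `analyze` is only a lowercase/whitespace stand-in for the real analysis chain, so it will not catch stemmed or munged variants the way the validation core would.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    keyword: str  # text of the special query
    action: str   # why we care about it, e.g. "surcharge" or "ban" (hypothetical labels)


def analyze(text):
    """Assumption: lowercase + whitespace split stands in for the schema's analysis."""
    return set(text.lower().split())


class RuleSet:
    def __init__(self, rules):
        # Pre-parse every rule once and hold the result in memory.
        self._parsed = [(rule, analyze(rule.keyword)) for rule in rules]

    def validate(self, products):
        """Map product id -> rules whose terms all occur in that product's title."""
        report = {}
        for product_id, title in products.items():
            title_terms = analyze(title)
            matched = [rule for rule, terms in self._parsed if terms <= title_terms]
            if matched:
                report[product_id] = matched
        return report


if __name__ == "__main__":
    rules = RuleSet([Rule("Sony", "surcharge"), Rule("Porn", "ban")])
    batch = {"123": "Sony Playstation 3", "456": "porn dvd", "789": "Wooden chair"}
    for product_id, matched in rules.validate(batch).items():
        print(product_id, [(r.keyword, r.action) for r in matched])
```

Keeping the reason next to each pre-parsed rule is what lets a single pass over a batch produce both the match and the action to report back to the client.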

## Approach #2 - Approximate things with a reverse search

Build a small index where each document contains the text of one 
of your special queries copied into multiple fields with a variety of 
analysis options configured (in particular: i suspect using shingles would 
be fruitful here).  Set up a query structure that uses functions to combine 
together the scores of many queries against each of those fields -- this 
might be simple addition, or you might want it to be conditional, ie: 
maybe you multiply the sum of the scores of some queries against simple 
fields with the score of a query against a really simple field to 
eliminate false positives.
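
A toy sketch of that scoring idea, written in plain Python and not as Solr function queries. The assumptions are mine: word shingles of size 1 and 2, a word-overlap score plus a shingle-overlap score as the "sum", and a strict all-words-present check as the "really simple field" that zeroes out false positives. The function names are hypothetical and the keyword is assumed to be non-empty.

```python
def shingles(text, max_size=2):
    """Return the set of word n-grams of length 1..max_size."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + size])
        for size in range(1, max_size + 1)
        for i in range(len(tokens) - size + 1)
    }


def combined_score(keyword, title):
    """Sum two overlap scores, then multiply by a strict 0/1 gate."""
    key_words = set(keyword.lower().split())
    title_words = set(title.lower().split())

    # Score 1: fraction of the keyword's words present in the title.
    word_score = len(key_words & title_words) / len(key_words)

    # Score 2: fraction of the keyword's shingles present in the title
    # (rewards words that appear adjacent and in the same order).
    key_shingles = shingles(keyword)
    shingle_score = len(key_shingles & shingles(title)) / len(key_shingles)

    # Gate: 1.0 only if every keyword word is present, else 0.0.
    gate = 1.0 if key_words <= title_words else 0.0
    return gate * (word_score + shingle_score)


if __name__ == "__main__":
    for title in ["Samsung Galaxy S4", "Galaxy Samsung S4", "Samsung 52inch LC"]:
        print(title, "->", combined_score("Samsung Galaxy", title))
```

The in-order title gets the highest score, the reordered title scores lower but still above zero, and the partial match is eliminated by the gate.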

Experiment a bit to see what kinds of inputs get you what kinds of scores, 
and maybe associate a threshold with each document which you index as a 
numeric field on those docs and then fold that threshold value into your 
calculation using the {!frange} parser to make sure you only count matches 

Re: Percolate feature?

2013-08-13 Thread Mark
Any ideas?

On Aug 10, 2013, at 6:28 PM, Mark static.void@gmail.com wrote:

 Our schema is pretty basic.. nothing fancy going on here
 
 <fieldType name="text" class="solr.TextField" omitNorms="false">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
             catenateAll="0" preserveOriginal="1"/>
     <filter class="solr.KStemFilterFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
             catenateAll="1" preserveOriginal="1"/>
     <filter class="solr.KStemFilterFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>
 
 
 On Aug 10, 2013, at 3:40 PM, Jack Krupansky j...@basetechnology.com wrote:
 
 Now we're getting somewhere!
 
 To (over-simplify), you simply want to know if a given listing would match 
 a high-value pattern, either in a clean manner (obvious keywords) or in an 
 unclean manner (e.g., fuzzy keyword matching, stemming, n-grams.)
 
 To a large degree this also depends on how rich and powerful your end-user query 
 support is. So, if the user searches for "sony", "samsung", or "apple", will 
 it match some oddball listing that fuzzily matches those terms?
 
 So... tell us how rich your query interface is. I mean, do you support 
 wildcard, fuzzy query, ngrams (e.g., can they type "son" or "sam" or "app", 
 or... will "sony" match "sonblah-blah")?
 
 Reverse-search may in fact be what you need in this case since you literally 
 do mean "if I index this document, will it match any of these queries" (but 
 doesn't score a hit on your direct check for whether it is a clean keyword 
 match.)
 
 In your previous examples you only gave clean product titles, not examples 
 of circumventions of simple keyword matches.
 
 -- Jack Krupansky
 
 -Original Message- From: Mark
 Sent: Saturday, August 10, 2013 6:24 PM
 To: solr-user@lucene.apache.org
 Cc: Chris Hostetter
 Subject: Re: Percolate feature?
 
 So to reiterate your examples from before, but change the labels a
 bit and add some more converse examples (and ignore the highlighting
 aspect for a moment)...
 
 doc1 = Sony
 doc2 = Samsung Galaxy
 doc3 = Sony Playstation
 
 queryA = Sony Experia   ... matches only doc1
 queryB = Sony Playstation 3 ... matches doc3 and doc1
 queryC = Samsung 52inch LC  ... doesn't match anything
 queryD = Samsung Galaxy S4  ... matches doc2
 queryE = Galaxy Samsung S4  ... matches doc2
 
 
 ...do i still have that correct?
 
 Yes
 
 2) if you *do* care about using non-trivial analysis, then you can't use
 the simple termfreq() function, which deals with raw terms -- instead
 you have to use the query() function to ensure that the input is parsed
 appropriately -- but then you have to wrap that function in something that
 will normalize the scores - so in place of termfreq('words','Galaxy')
 you'd want something like...
 
 
 Yes, we will be using non-trivial analysis. Now here's another twist… what if 
 we don't care about scoring?
 
 
 Let's talk about the real use case. We are a marketplace that sells products 
 that users have listed. For certain popular, high-risk or restricted 
 keywords we charge the seller an extra fee/ban the listing. We now have 
 sellers purposely misspelling their listings to circumvent this fee. They 
 will start adding suffixes to their product listings such as "Sonies", 
 knowing that it gets indexed down to "Sony" and thus matches a user's query 
 for "Sony". Or they will munge together numbers and products… "2013Sony". Same 
 thing goes for adding crazy non-ASCII characters to the front of the keyword: 
 "ΒSony". This is obviously a problem because we aren't charging for these 
 keywords and, more importantly, it makes our search results look like shit.
 
 We would like to:
 
 1) Detect when a certain keyword is in a product title at listing time so we 
 may charge the seller. This was my idea of a "reverse search", although it 
 sounds like I may have caused too much confusion with that term.
 2) Attempt to autocorrect these titles, hence the need for highlighting so we 
 can try and replace the terms… this of course would be done outside of Solr 
 via an external service.
 
 Since we do some stemming (KStemmer) and filtering 
 (WordDelimiterFilterFactory) this makes conventional approaches such as 
 regex quite troublesome. Regex is also quite

Re: Percolate feature?

2013-08-10 Thread Mark
 So to reiterate your examples from before, but change the labels a 
 bit and add some more converse examples (and ignore the highlighting 
 aspect for a moment)...
 
 doc1 = Sony
 doc2 = Samsung Galaxy
 doc3 = Sony Playstation
 
 queryA = Sony Experia   ... matches only doc1
 queryB = Sony Playstation 3 ... matches doc3 and doc1
 queryC = Samsung 52inch LC  ... doesn't match anything
 queryD = Samsung Galaxy S4  ... matches doc2
 queryE = Galaxy Samsung S4  ... matches doc2
 
 
 ...do i still have that correct?

Yes

 2) if you *do* care about using non-trivial analysis, then you can't use 
 the simple termfreq() function, which deals with raw terms -- instead 
 you have to use the query() function to ensure that the input is parsed 
 appropriately -- but then you have to wrap that function in something that 
 will normalize the scores - so in place of termfreq('words','Galaxy') 
 you'd want something like...


Yes, we will be using non-trivial analysis. Now here's another twist… what if we 
don't care about scoring?


Let's talk about the real use case. We are a marketplace that sells products that 
users have listed. For certain popular, high-risk or restricted keywords we 
charge the seller an extra fee/ban the listing. We now have sellers purposely 
misspelling their listings to circumvent this fee. They will start adding 
suffixes to their product listings such as "Sonies", knowing that it gets 
indexed down to "Sony" and thus matches a user's query for "Sony". Or they will 
munge together numbers and products… "2013Sony". Same thing goes for adding 
crazy non-ASCII characters to the front of the keyword: "ΒSony". This is 
obviously a problem because we aren't charging for these keywords and, more 
importantly, it makes our search results look like shit.

We would like to:

1) Detect when a certain keyword is in a product title at listing time so we 
may charge the seller. This was my idea of a "reverse search", although it sounds 
like I may have caused too much confusion with that term.
2) Attempt to autocorrect these titles, hence the need for highlighting so we 
can try and replace the terms… this of course would be done outside of Solr via 
an external service.

Since we do some stemming (KStemmer) and filtering (WordDelimiterFilterFactory) 
this makes conventional approaches such as regex quite troublesome. Regex is 
also quite slow and scales horribly and always needs to be in lockstep with 
schema changes.

Now knowing this, is there a good way to approach this?

Thanks


On Aug 9, 2013, at 11:56 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

 
 : I'll look into this. Thanks for the concrete example as I don't even 
 : know which classes to start to look at to implement such a feature.
 
 Either roman isn't understanding what you are asking for, or i'm not -- 
 but i don't think what roman described will work for you...
 
 :  so if your query contains no duplicates and all terms must match, you can
 :  be sure that you are collecting docs only when the number of terms matches
 :  number of clauses in the query
 
 several of the examples you gave did not match what Roman is describing, 
 as i understand it.  Most people on this thread seem to be getting 
 confused by having their perceptions flipped about what your "data known 
 in advance" is vs the "data you get at request time".
 
 You described this...
 
 :  Product keyword:  Sony
 :  Product keyword:  Samsung Galaxy
 :  
 :  We would like to be able to detect given a product title whether or not
 :  it matches any known keywords. For a keyword to be matched all of its
 :  terms must be present in the product title given.
 :  
 :  Product Title: Sony Experia
 :  Matches and returns a highlight: <em>Sony</em> Experia
 
 ...suggesting that what you call "product keywords" are the data you know 
 about in advance and "product titles" are the data you get at request 
 time.
 
 So your example of the request time input (ie: query) "Sony Experia" 
 matching data known in advance (ie: indexed document) "Sony" would not 
 work with Roman's example.
 
 To rephrase (what i think i understand is) your goal...
 
 * you have many (10*3+) documents known in advance
 * any document D contains a set of words W(D) of varying sizes
 * any request Q contains a set of words W(Q) of varying sizes
 * you want a given request Q to match a document D if and only if:
   - W(D) is a subset of W(Q)
   - ie: no item exists in W(D) that does not exist in W(Q)
   - ie: any number of items may exist in W(Q) that are not in W(D)
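
The subset rule above can be sketched in a few lines of Python. This is only an illustration of the matching semantics, not a Solr query: `words` is an assumed lowercase/whitespace stand-in for real analysis, and the doc ids and texts are taken from the examples used in this thread.

```python
def words(text):
    """Assumption: lowercase + whitespace split stands in for real analysis."""
    return set(text.lower().split())


# Data known in advance (the "documents"): the product keywords.
DOCS = {
    "doc1": "Sony",
    "doc2": "Samsung Galaxy",
    "doc3": "Sony Playstation",
}


def matching_docs(query, docs=DOCS):
    """Return the ids of every document D where W(D) is a subset of W(Q)."""
    query_words = words(query)
    return sorted(doc_id for doc_id, text in docs.items()
                  if words(text) <= query_words)


if __name__ == "__main__":
    for query in ["Sony Experia", "Sony Playstation 3", "Samsung 52inch LC",
                  "Samsung Galaxy S4", "Galaxy Samsung S4"]:
        print(query, "->", matching_docs(query))
```

A linear scan like this is fine for a few thousand keywords; it says nothing about how to get the same semantics out of an inverted index, which is the harder part of the question.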
 
 So to reiterate your examples from before, but change the labels a 
 bit and add some more converse examples (and ignore the highlighting 
 aspect for a moment)...
 
 doc1 = Sony
 doc2 = Samsung Galaxy
 doc3 = Sony Playstation
 
 queryA = Sony Experia   ... matches only doc1
 queryB = Sony Playstation 3 ... matches doc3 and doc1
 queryC = Samsung 52inch LC  ... doesn't match anything
 queryD = Samsung Galaxy S4  ... matches doc2
 queryE = Galaxy Samsung S4  ... matches 

Re: Percolate feature?

2013-08-10 Thread Jack Krupansky

Now we're getting somewhere!

To (over-simplify), you simply want to know if a given listing would match 
a high-value pattern, either in a clean manner (obvious keywords) or in an 
unclean manner (e.g., fuzzy keyword matching, stemming, n-grams.)


To a large degree this also depends on how rich and powerful your end-user query 
support is. So, if the user searches for "sony", "samsung", or "apple", will 
it match some oddball listing that fuzzily matches those terms?


So... tell us how rich your query interface is. I mean, do you support 
wildcard, fuzzy query, ngrams (e.g., can they type "son" or "sam" or "app", 
or... will "sony" match "sonblah-blah")?


Reverse-search may in fact be what you need in this case since you literally 
do mean "if I index this document, will it match any of these queries" (but 
doesn't score a hit on your direct check for whether it is a clean keyword 
match.)


In your previous examples you only gave clean product titles, not examples 
of circumventions of simple keyword matches.


-- Jack Krupansky

-Original Message- 
From: Mark

Sent: Saturday, August 10, 2013 6:24 PM
To: solr-user@lucene.apache.org
Cc: Chris Hostetter
Subject: Re: Percolate feature?


So to reiterate your examples from before, but change the labels a
bit and add some more converse examples (and ignore the highlighting
aspect for a moment)...

doc1 = Sony
doc2 = Samsung Galaxy
doc3 = Sony Playstation

queryA = Sony Experia   ... matches only doc1
queryB = Sony Playstation 3 ... matches doc3 and doc1
queryC = Samsung 52inch LC  ... doesn't match anything
queryD = Samsung Galaxy S4  ... matches doc2
queryE = Galaxy Samsung S4  ... matches doc2


...do i still have that correct?


Yes


2) if you *do* care about using non-trivial analysis, then you can't use
the simple termfreq() function, which deals with raw terms -- instead
you have to use the query() function to ensure that the input is parsed
appropriately -- but then you have to wrap that function in something that
will normalize the scores - so in place of termfreq('words','Galaxy')
you'd want something like...



Yes, we will be using non-trivial analysis. Now here's another twist… what if 
we don't care about scoring?



Let's talk about the real use case. We are a marketplace that sells products 
that users have listed. For certain popular, high-risk or restricted 
keywords we charge the seller an extra fee/ban the listing. We now have 
sellers purposely misspelling their listings to circumvent this fee. They 
will start adding suffixes to their product listings such as "Sonies", 
knowing that it gets indexed down to "Sony" and thus matches a user's query 
for "Sony". Or they will munge together numbers and products… "2013Sony". Same 
thing goes for adding crazy non-ASCII characters to the front of the keyword: 
"ΒSony". This is obviously a problem because we aren't charging for these 
keywords and, more importantly, it makes our search results look like shit.


We would like to:

1) Detect when a certain keyword is in a product title at listing time so we 
may charge the seller. This was my idea of a "reverse search", although it 
sounds like I may have caused too much confusion with that term.
2) Attempt to autocorrect these titles, hence the need for highlighting so we 
can try and replace the terms… this of course would be done outside of Solr 
via an external service.


Since we do some stemming (KStemmer) and filtering 
(WordDelimiterFilterFactory) this makes conventional approaches such as 
regex quite troublesome. Regex is also quite slow and scales horribly and 
always needs to be in lockstep with schema changes.


Now knowing this, is there a good way to approach this?

Thanks


On Aug 9, 2013, at 11:56 AM, Chris Hostetter hossman_luc...@fucit.org 
wrote:




: I'll look into this. Thanks for the concrete example as I don't even
: know which classes to start to look at to implement such a feature.

Either roman isn't understanding what you are asking for, or i'm not -- 
but i don't think what roman described will work for you...

:  so if your query contains no duplicates and all terms must match, you can
:  be sure that you are collecting docs only when the number of terms matches
:  number of clauses in the query

several of the examples you gave did not match what Roman is describing,
as i understand it.  Most people on this thread seem to be getting
confused by having their perceptions flipped about what your "data known
in advance" is vs the "data you get at request time".

You described this...

:  Product keyword:  Sony
:  Product keyword:  Samsung Galaxy
: 
:  We would like to be able to detect given a product title whether or not
:  it matches any known keywords. For a keyword to be matched all of its
:  terms must be present in the product title given.
: 
:  Product Title: Sony Experia
:  Matches and returns a highlight: <em>Sony</em> Experia

...suggesting that what you call product keywords are the data you know
about in advance and product titles

Re: Percolate feature?

2013-08-10 Thread Mark
Our schema is pretty basic.. nothing fancy going on here

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


On Aug 10, 2013, at 3:40 PM, Jack Krupansky j...@basetechnology.com wrote:

 Now we're getting somewhere!
 
 To (over-simplify), you simply want to know if a given listing would match 
 a high-value pattern, either in a clean manner (obvious keywords) or in an 
 unclean manner (e.g., fuzzy keyword matching, stemming, n-grams.)
 
 To a large degree this also depends on how rich and powerful your end-user query 
 support is. So, if the user searches for "sony", "samsung", or "apple", will 
 it match some oddball listing that fuzzily matches those terms?
 
 So... tell us how rich your query interface is. I mean, do you support 
 wildcard, fuzzy query, ngrams (e.g., can they type "son" or "sam" or "app", 
 or... will "sony" match "sonblah-blah")?
 
 Reverse-search may in fact be what you need in this case since you literally 
 do mean "if I index this document, will it match any of these queries" (but 
 doesn't score a hit on your direct check for whether it is a clean keyword 
 match.)
 
 In your previous examples you only gave clean product titles, not examples of 
 circumventions of simple keyword matches.
 
 -- Jack Krupansky
 
 -Original Message- From: Mark
 Sent: Saturday, August 10, 2013 6:24 PM
 To: solr-user@lucene.apache.org
 Cc: Chris Hostetter
 Subject: Re: Percolate feature?
 
 So to reiterate your examples from before, but change the labels a
 bit and add some more converse examples (and ignore the highlighting
 aspect for a moment)...
 
 doc1 = Sony
 doc2 = Samsung Galaxy
 doc3 = Sony Playstation
 
 queryA = Sony Experia   ... matches only doc1
 queryB = Sony Playstation 3 ... matches doc3 and doc1
 queryC = Samsung 52inch LC  ... doesn't match anything
 queryD = Samsung Galaxy S4  ... matches doc2
 queryE = Galaxy Samsung S4  ... matches doc2
 
 
 ...do i still have that correct?
 
 Yes
 
 2) if you *do* care about using non-trivial analysis, then you can't use
 the simple termfreq() function, which deals with raw terms -- instead
 you have to use the query() function to ensure that the input is parsed
 appropriately -- but then you have to wrap that function in something that
 will normalize the scores - so in place of termfreq('words','Galaxy')
 you'd want something like...
 
 
 Yes, we will be using non-trivial analysis. Now here's another twist… what if 
 we don't care about scoring?
 
 
 Let's talk about the real use case. We are a marketplace that sells products 
 that users have listed. For certain popular, high-risk or restricted keywords 
 we charge the seller an extra fee/ban the listing. We now have sellers 
 purposely misspelling their listings to circumvent this fee. They will start 
 adding suffixes to their product listings such as "Sonies", knowing that it 
 gets indexed down to "Sony" and thus matches a user's query for "Sony". Or they 
 will munge together numbers and products… "2013Sony". Same thing goes for 
 adding crazy non-ASCII characters to the front of the keyword: "ΒSony". This 
 is obviously a problem because we aren't charging for these keywords and, more 
 importantly, it makes our search results look like shit.
 
 We would like to:
 
 1) Detect when a certain keyword is in a product title at listing time so we 
 may charge the seller. This was my idea of a "reverse search", although it sounds 
 like I may have caused too much confusion with that term.
 2) Attempt to autocorrect these titles, hence the need for highlighting so we 
 can try and replace the terms… this of course would be done outside of Solr via 
 an external service.
 
 Since we do some stemming (KStemmer) and filtering 
 (WordDelimiterFilterFactory) this makes conventional approaches such as regex 
 quite troublesome. Regex is also quite slow and scales horribly and always 
 needs to be in lockstep with schema changes.
 
 Now

Re: Percolate feature?

2013-08-09 Thread Erick Erickson
This _looks_ like simple phrase matching (no slop) and highlighting...

But whenever I think the answer is really simple, it usually means
that I'm missing something

Best
Erick


On Thu, Aug 8, 2013 at 11:18 PM, Mark static.void@gmail.com wrote:

 Ok forget the mention of percolate.

 We have a large list of known keywords we would like to match against.

 Product keyword:  Sony
 Product keyword:  Samsung Galaxy

 We would like to be able to detect given a product title whether or not it
 matches any known keywords. For a keyword to be matched all of its terms
 must be present in the product title given.

 Product Title: Sony Experia
 Matches and returns a highlight: <em>Sony</em> Experia

 Product Title: Samsung 52inch LC
 Does not match

 Product Title: Samsung Galaxy S4
 Matches and returns a highlight: <em>Samsung Galaxy</em>

 Product Title: Galaxy Samsung S4
 Matches and returns a highlight: <em>Galaxy Samsung</em>

 What would be the best way to approach this?




 On Aug 5, 2013, at 7:02 PM, Chris Hostetter hossman_luc...@fucit.org
 wrote:

 
  : Subject: Percolate feature?
 
  can you give a more concrete, realistic example of what you are trying to
  do? your synthetic hypothetical example is kind of hard to make sense of.
 
  your Subject line and comment that the "percolate" feature of elastic
  search sounds like what you want seems to have led some people down a
  path of assuming you want to run these types of queries as documents are
  indexed -- but that isn't at all clear to me from the way you worded your
  question other than that.
 
  it's also not clear what aspect of the results you really care about --
  are you only looking for the *number* of documents that match according
  to your concept of matching, or are you looking for a list of matches?
  what if multiple documents have all of their terms in the query string --
  how should they score relative to each other?  what if a document contains
  the same term multiple times, do you expect it to be a match of a query
  only if that term appears in the query multiple times as well?  do you
  care about the ordering of the terms in the query? the ordering of the
  terms in the document?
 
  Ideally: describe for us what you want to do, w/o assuming
  solr/elasticsearch/anything specific about the implementation -- just
  describe your actual use case for us, with several real document/query
  examples.
 
 
 
  https://people.apache.org/~hossman/#xyproblem
  XY Problem
 
  Your question appears to be an XY Problem ... that is: you are dealing
  with X, you are assuming Y will help you, and you are asking about
 Y
  without giving more details about the X so that we can understand the
  full issue.  Perhaps the best solution doesn't involve Y at all?
  See Also: http://www.perlmonks.org/index.pl?node_id=542341
 
 
 
 
 
 
  -Hoss




Re: Percolate feature?

2013-08-09 Thread Yonik Seeley
*All* of the terms in the field must be matched by the query, not vice-versa.
And no, we don't have a query for that out of the box.  To implement,
it seems like it would require the total number of terms indexed for a
field (for each document).
I guess you could also index start and end tokens and then use query
expansion to all possible combinations... messy though.

-Yonik
http://lucidworks.com

On Fri, Aug 9, 2013 at 8:19 AM, Erick Erickson erickerick...@gmail.com wrote:
 This _looks_ like simple phrase matching (no slop) and highlighting...

 But whenever I think the answer is really simple, it usually means
 that I'm missing something

 Best
 Erick






Re: Percolate feature?

2013-08-09 Thread Mark
 *All* of the terms in the field must be matched by the query, not
 vice-versa.

Exactly. This is why I was trying to explain it as a reverse search.

I just realized I described it as a *large* list of known keywords when really
it's small; no more than 1000. Forgetting about performance, how hard do you
think this would be to implement? How should I even start?

Thanks for the input

On Aug 9, 2013, at 6:56 AM, Yonik Seeley yo...@lucidworks.com wrote:

 *All* of the terms in the field must be matched by the query, not
 vice-versa.
 And no, we don't have a query for that out of the box.  To implement,
 it seems like it would require the total number of terms indexed for a
 field (for each document).
 I guess you could also index start and end tokens and then use query
 expansion to all possible combinations... messy though.
 
 -Yonik
 http://lucidworks.com
 
 
 



Re: Percolate feature?

2013-08-09 Thread Jack Krupansky
Starting with the presumption that Solr is a search engine for user 
queries, what exactly would a user query look like?


Are you really requiring your users to enter long, carefully constructed, 
full length product titles??


What kind of application would force its users to do such a thing?

Put another way, if the user has entered what they consider important terms 
in their query, why are you being so ready to ignore a lot of those terms?


Or, is this simply a case where some old software had a feature that for 
reasons unknown behaved this way and you are merely trying to replicate that 
feature merely in the name of compatibility without thinking about whether 
the feature actually makes sense in a modern software environment? (Or, 
maybe your manager or marketing invented this feature and you're just 
trying to implement it as stated without trying to decide whether it makes 
sense?) The point is that you are making us try to guess what the actual use 
case is, rather than simply telling us what it is!


Please clarify what your use case really is. If you would explain the use 
case (not some proposed solution), maybe we could offer suggestions for 
solutions.


Put another way, what exactly do you perceive to be wrong with normal, 
traditional, simple query matching that causes you to go to such great 
lengths to avoid using normal, traditional, simple query matching?


IOW, why are you trying to re-invent and re-imagine a wheel that doesn't 
appear to need to be re-invented or re-imagined?


I'm sure you must have some reason for doing that, but why not disclose that 
reason so that we can utilize it in understanding what you are trying to do?


-- Jack Krupansky

-Original Message- 
From: Mark

Sent: Friday, August 09, 2013 11:29 AM
To: solr-user@lucene.apache.org
Subject: Re: Percolate feature?

*All* of the terms in the field must be matched by the query, not 
vice-versa.


Exactly. This is why I was trying to explain it as a reverse search.

I just realized I described it as a *large* list of known keywords when really 
it's small; no more than 1000. Forgetting about performance, how hard do you 
think this would be to implement? How should I even start?


Thanks for the input


Re: Percolate feature?

2013-08-09 Thread Roman Chyla
On Fri, Aug 9, 2013 at 11:29 AM, Mark static.void@gmail.com wrote:

  *All* of the terms in the field must be matched by the query, not
 vice-versa.

 Exactly. This is why I was trying to explain it as a reverse search.

 I just realized I described it as a *large* list of known keywords when
 really it's small; no more than 1000. Forgetting about performance, how hard
 do you think this would be to implement? How should I even start?


not hard, index all terms into a field - make sure there are no duplicates,
as you want to count them - then I can imagine at least two options: save
the number of terms as a payload together with the terms, or in a second step
(in a collector, for example), load the document and count the terms in
the field - if the count matches the query size, you are done

a trivial, naive implementation (as you say 'forget performance') could be:

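// sketch only: assumes Lucene 4.x, a BooleanQuery whose clauses are all
// required, and a stored multi-valued field with one value per term
searcher.search(query, null, new Collector() {
  ...
  public void collect(int i) throws IOException {
    Document d = reader.document(i, fieldsToLoad);
    // keep the hit only if the doc has exactly as many terms as the query
    if (d.getValues(fieldToLoad).length == query.clauses().size()) {
      queue.add(new ScoreDoc(i + docBase, score));
    }
  }
});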

so if your query contains no duplicates and all terms must match, you can
be sure that you are collecting docs only when the number of terms matches
number of clauses in the query
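
To see what the count check above accepts, here is a minimal in-memory sketch in plain Java, with no Lucene involved. The class and method names are made up for illustration, and lowercasing plus whitespace splitting stands in for whatever analysis the field would really use. A keyword document is accepted only when every query term occurs in it and its distinct term count equals the number of distinct query terms:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class CountCheck {
    // Hypothetical tokenizer: lowercase, split on whitespace, drop duplicates.
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+")));
    }

    // All query terms must occur in the doc AND the doc must hold no extra terms.
    static boolean accepts(String doc, String query) {
        Set<String> d = terms(doc);
        Set<String> q = terms(query);
        return d.containsAll(q) && d.size() == q.size();
    }

    public static void main(String[] args) {
        // same terms in a different order
        System.out.println(accepts("Samsung Galaxy", "Galaxy Samsung"));
        // the doc has a term the query lacks
        System.out.println(accepts("Samsung Galaxy", "Samsung"));
        // the query has a term the doc lacks
        System.out.println(accepts("Samsung Galaxy", "Samsung Galaxy S4"));
    }
}
```

Under these assumptions the check amounts to set equality between the query terms and the indexed terms.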

roman


 Thanks for the input

 
 




Re: Percolate feature?

2013-08-09 Thread Mark
I'll look into this. Thanks for the concrete example as I don't even know which 
classes to start to look at to implement such a feature.


Re: Percolate feature?

2013-08-09 Thread Walter Underwood
All of the query words must match, right? So this is a phrase query in edismax 
with mm=100%.

We have suggestions for exactly matching a whole field, but you need "samsung 
galaxy" to match the document "samsung galaxy s4". That means you do not need 
an exact match on the field.

If you do need that, I have a suggestion, but I don't want to confuse things 
further.

wunder

On Aug 9, 2013, at 10:01 AM, Mark wrote:

 I'll look into this. Thanks for the concrete example as I don't even know 
 which classes to start to look at to implement such a feature.
 
Re: Percolate feature?

2013-08-09 Thread Chris Hostetter

: I'll look into this. Thanks for the concrete example as I don't even 
: know which classes to start to look at to implement such a feature.

Either roman isn't understanding what you are asking for, or i'm not -- 
but i don't think what roman described will work for you...

:  so if your query contains no duplicates and all terms must match, you can
:  be sure that you are collecting docs only when the number of terms matches
:  number of clauses in the query

several of the examples you gave did not match what Roman is describing, 
as i understand it.  Most people on this thread seem to be getting 
confused by having their perceptions flipped about what your data known 
in advance is vs the data you get at request time.

You described this...

:  Product keyword:  Sony
:  Product keyword:  Samsung Galaxy
:  
:  We would like to be able to detect given a product title whether or
:  not it matches any known keywords. For a keyword to be matched all of
:  its terms must be present in the product title given.
:  
:  Product Title: Sony Experia
:  Matches and returns a highlight: <em>Sony</em> Experia

...suggesting that what you call product keywords are the data you know 
about in advance and product titles are the data you get at request 
time.

So your example of the request time input (ie: query) "Sony Experia" 
matching data known in advance (ie: indexed document) "Sony" would not 
work with Roman's example.

To rephrase (what i think i understand is) your goal...

 * you have many (10^3+) documents known in advance
 * any document D contains a set of words W(D) of varying sizes
 * any request Q contains a set of words W(Q) of varying sizes
 * you want a given request Q to match a document D if and only if:
   - W(D) is a subset of W(Q)
   - ie: no item exists in W(D) that does not exist in W(Q)
   - ie: any number of items may exist in W(Q) that are not in W(D)

So to reiterate your examples from before, but change the labels a 
bit and add some more converse examples (and ignore the highlighting 
aspect for a moment)...

doc1 = Sony
doc2 = Samsung Galaxy
doc3 = Sony Playstation

queryA = Sony Experia   ... matches only doc1
queryB = Sony Playstation 3 ... matches doc3 and doc1
queryC = Samsung 52inch LC  ... doesn't match anything
queryD = Samsung Galaxy S4  ... matches doc2
queryE = Galaxy Samsung S4  ... matches doc2


...do i still have that correct?
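
The subset rule and the doc/query examples above can be checked with a small in-memory sketch in plain Java. Nothing here is Solr API: the class, the method names and the whitespace/lowercase tokenizer are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class SubsetMatch {
    // Hypothetical analysis: lowercase and split on whitespace.
    static Set<String> words(String text) {
        return new HashSet<>(Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+")));
    }

    // Returns the ids of all docs D where W(D) is a subset of W(Q).
    static List<String> matches(Map<String, String> docs, String query) {
        Set<String> q = words(query);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            if (q.containsAll(words(e.getValue()))) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // The three example documents from the message, in insertion order.
    static Map<String, String> exampleDocs() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "Sony");
        docs.put("doc2", "Samsung Galaxy");
        docs.put("doc3", "Sony Playstation");
        return docs;
    }

    public static void main(String[] args) {
        Map<String, String> docs = exampleDocs();
        for (String q : Arrays.asList("Sony Experia", "Sony Playstation 3",
                "Samsung 52inch LC", "Samsung Galaxy S4", "Galaxy Samsung S4")) {
            System.out.println(q + " -> " + matches(docs, q));
        }
    }
}
```

With only ~1000 keyword documents, a linear scan like this may even be a workable baseline outside the index, though that is a judgment call and not something the thread establishes.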


A similar question came up in the past, but i can't find my response now 
so i'll try to recreate it ...


1) if you don't care about using non-trivial analysis (ie: you don't need 
stemming, or synonyms, etc..), you can do this with some 
really simple function queries -- assuming you index a field containing 
the number of words in each document, in addition to the words 
themselves.  Assuming your words are in a field named "words" and the 
number of words is in a field named "words_count" a request for something 
like "Galaxy Samsung S4" can be represented as...

  q={!frange l=0 u=0}sub(words_count,
                         sum(termfreq('words','Galaxy'),
                             termfreq('words','Samsung'),
                             termfreq('words','S4')))

...ie: you want to compute the sum of the term frequencies for each of 
the words requested, and then you want to subtract that sum from the 
number of terms in the document -- and then you only want to match 
documents where the result of that subtraction is 0.
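
Since the query string has to be assembled per request, a client-side helper along these lines might do it. This is only a sketch: the class and method names are hypothetical, the field names are passed in by the caller, the words are inserted as raw terms with no analysis, and words containing a single quote are rejected instead of escaped.

```java
import java.util.Arrays;
import java.util.List;

final class FrangeQuery {
    // Builds "{!frange l=0 u=0}sub(countField,sum(termfreq(...),...))" for the given raw terms.
    static String build(String field, String countField, List<String> words) {
        if (words.isEmpty()) {
            throw new IllegalArgumentException("need at least one word");
        }
        StringBuilder parts = new StringBuilder();
        for (String w : words) {
            if (w.indexOf('\'') >= 0) {
                throw new IllegalArgumentException("quote in word not supported: " + w);
            }
            if (parts.length() > 0) {
                parts.append(',');
            }
            parts.append("termfreq('").append(field).append("','").append(w).append("')");
        }
        // Only wrap in sum() when there is more than one term to add up.
        String total = words.size() == 1 ? parts.toString() : "sum(" + parts + ")";
        return "{!frange l=0 u=0}sub(" + countField + "," + total + ")";
    }

    public static void main(String[] args) {
        System.out.println(build("words", "words_count", Arrays.asList("Galaxy", "Samsung", "S4")));
    }
}
```

The resulting string would be sent as the q parameter, URL-encoded by whatever HTTP client or SolrJ call is in use.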

one complexity that comes up, is that you haven't specified:
  
  * can the list of words in your documents contain duplicates?
  * can the list of words in your query contain duplicates?
  * should a document with duplicate words match only if the query also 
    contains the same word duplicated?

...the answers to those questions make the math more complicated (and are 
left as an exercise for the reader)
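
To make those three questions concrete, here is a small plain-Java sketch of the two obvious answers: ignore duplicates entirely, or require the query to repeat a word at least as often as the document does. Both predicates and the tokenizer are hypothetical illustrations, not anything Solr provides.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class DuplicateSemantics {
    // Hypothetical analysis: lowercase, split on whitespace, count occurrences.
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> c = new HashMap<>();
        for (String w : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            c.merge(w, 1, Integer::sum);
        }
        return c;
    }

    // Set semantics: a duplicated word counts once on both sides.
    static boolean matchesIgnoringDuplicates(String doc, String query) {
        return counts(query).keySet().containsAll(counts(doc).keySet());
    }

    // Multiset semantics: the query must contain each doc word at least as often as the doc.
    static boolean matchesCountingDuplicates(String doc, String query) {
        Map<String, Integer> q = counts(query);
        for (Map.Entry<String, Integer> e : counts(doc).entrySet()) {
            if (q.getOrDefault(e.getKey(), 0) < e.getValue()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesIgnoringDuplicates("talk talk", "talk radio"));
        System.out.println(matchesCountingDuplicates("talk talk", "talk radio"));
        System.out.println(matchesCountingDuplicates("talk talk", "talk talk live"));
    }
}
```

As a worked case against the function query above: if a document holds "talk talk" with a word count of 2, a query of "talk" gives 2 - 2 = 0 and matches, while a query of "talk talk" gives 2 - (2 + 2) = -2 and does not, so the formula as written follows neither of these two semantics once duplicates appear.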


2) if you *do* care about using non-trivial analysis, then you can't use 
the simple termfreq() function, which deals with raw terms -- in stead 
you have to use the query() function to ensure that the input is parsed 
appropriately -- but then you have to wrap that function in something that 
will normalize the scores - so in place of termfreq('words','Galaxy') 
you'd want something like...

if(query({!field f=words v='Galaxy'}),1,0)

...but again the math gets much harder if you make things more complex 
with duplicate words in the document or duplicate words in the query -- you'd 
probably have to use a custom similarity to get the scores returned by the 
query() function to be usable as is in the match equation (and drop the 
if() function)
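
A small Python sketch of the same idea, with analysis in the picture. The `analyze` function is a made-up stand-in for whatever analysis chain the field really uses (here lowercasing plus a crude plural stripper), and `indicator` only mimics the 1-or-0 result of the if(query(...),1,0) wrapper.

```python
def analyze(text):
    """Hypothetical analysis chain: lowercase plus a crude plural stemmer."""
    tokens = []
    for tok in text.lower().split():
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]
        tokens.append(tok)
    return tokens

def indicator(doc_tokens, term):
    """Stand-in for if(query({!field f=words v=...}),1,0): 1 on any hit, else 0."""
    return 1 if term in doc_tokens else 0

def matches(doc, request):
    doc_tokens = analyze(doc)
    # De-duplicate after analysis so "Phone" and "phones" are only counted once.
    request_terms = set(analyze(request))
    return len(doc_tokens) - sum(indicator(doc_tokens, t) for t in request_terms) == 0

print(matches("Samsung Phones", "samsung phone S4"))   # True
print(matches("Samsung Phones", "Samsung Galaxy S4"))  # False
```

Note that the words_count value has to be computed the same way as the analyzed tokens, otherwise the subtraction will not come out to 0.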


As for the highlighting part of the problem -- that becomes much easier -- 
independent of the queries you use to *match* the documents, you can then 
specify a hl.q param to specify a much simpler query just containing the 
basic list of words (as a simple boolean query, all clauses optional) and 
let it highlight them in your list of words.
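
Putting the matching query and the highlighting query together, a request could be assembled roughly like this in Python. The helper name `build_params` is made up, the field names are the ones assumed above, and the sketch does no escaping of quotes or query syntax characters in the words. It also assumes the words field is stored, so the highlighter has text to work with.

```python
from urllib.parse import urlencode

def build_params(request_words, field="words", count_field="words_count"):
    """Build request params: frange query for matching, hl.q for highlighting."""
    termfreqs = ",".join(f"termfreq('{field}','{w}')" for w in request_words)
    return {
        "q": f"{{!frange l=0 u=0}}sub({count_field},sum({termfreqs}))",
        "hl": "true",
        "hl.fl": field,
        # Plain boolean query, all clauses optional, used only by the highlighter.
        "hl.q": f"{field}:({' '.join(request_words)})",
    }

params = build_params(["Galaxy", "Samsung", "S4"])
print(params["q"])
print(params["hl.q"])     # words:(Galaxy Samsung S4)
print(urlencode(params))  # query string to append to the select URL
```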







-Hoss


Re: Percolate feature?

2013-08-09 Thread Jack Krupansky

I thought about that suggested doc/query model, but...

Do you really want a query of "Sony xbox" or "Sony ipad" or even "Sony 
Samsung" to match document "Sony"? Seems quite odd.


-- Jack Krupansky

-Original Message- 
From: Chris Hostetter

Sent: Friday, August 09, 2013 2:56 PM
To: solr-user@lucene.apache.org
Subject: Re: Percolate feature?


: I'll look into this. Thanks for the concrete example as I don't even
: know which classes to start to look at to implement such a feature.

Either roman isn't understanding what you are asking for, or i'm not -- 
but i don't think what roman described will work for you...


:  so if your query contains no duplicates and all terms must match, you can
:  be sure that you are collecting docs only when the number of terms matches
:  number of clauses in the query

several of the examples you gave did not match what Roman is describing,
as i understand it.  Most people on this thread seem to be getting
confused by having their perceptions flipped about what your data known
in advance is vs the data you get at request time.

You described this...

:  Product keyword:  Sony
:  Product keyword:  Samsung Galaxy
: 
:  We would like to be able to detect given a product title whether or
:  not it matches any known keywords. For a keyword to be matched all of
:  its terms must be present in the product title given.
: 
:  Product Title: Sony Experia
:  Matches and returns a highlight: <em>Sony</em> Experia

...suggesting that what you call product keywords are the data you know
about in advance and product titles are the data you get at request
time.

So your example of the request time input (ie: query) Sony Experia
matching data known in advance (ie: indexed document) Sony would not
work with Roman's example.

To rephrase (what i think i understand is) your goal...

* you have many (10*3+) documents known in advance
* any document D contains a set of words W(D) of varying sizes
* any request Q contains a set of words W(Q) of varying sizes
* you want a given request Q to match a document D if and only if:
  - W(D) is a subset of W(Q)
  - ie: no item exists in W(D) that does not exist in W(Q)
  - ie: any number of items may exist in W(Q) that are not in W(D)

So to reiterate your examples from before, but change the labels a
bit and add some more converse examples (and ignore the highlighting
aspect for a moment)...

doc1 = Sony
doc2 = Samsung Galaxy
doc3 = Sony Playstation

queryA = Sony Experia   ... matches only doc1
queryB = Sony Playstation 3 ... matches doc3 and doc1
queryC = Samsung 52inch LC  ... doesn't match anything
queryD = Samsung Galaxy S4  ... matches doc2
queryE = Galaxy Samsung S4  ... matches doc2


...do i still have that correct?


A similar question came up in the past, but i can't find my response now
so i'll try to recreate it ...


1) if you don't care about using non-trivial analysis (ie: you don't need
stemming, or synonyms, etc..), you can do this with some
really simple function queries -- assuming you index a field containing
the number of words in each document, in addition to the words
themselves.  Assuming your words are in a field named "words" and the
number of words is in a field named "words_count" a request for something
like "Galaxy Samsung S4" can be represented as...

 q={!frange l=0 u=0}sub(words_count,
                        sum(termfreq('words','Galaxy'),
                            termfreq('words','Samsung'),
                            termfreq('words','S4')))

...ie: you want to compute the sum of the term frequencies for each of
the words requested, and then you want to subtract that sum from the
number of terms in the document -- and then you only want to match
documents where the result of that subtraction is 0.

one complexity that comes up, is that you haven't specified:

 * can the list of words in your documents contain duplicates?
 * can the list of words in your query contain duplicates?
 * should a document with duplicate words match only if the query also
   contains the same word duplicated?

...the answers to those questions make the math more complicated (and are
left as an exercise for the reader)


2) if you *do* care about using non-trivial analysis, then you can't use
the simple termfreq() function, which deals with raw terms -- instead
you have to use the query() function to ensure that the input is parsed
appropriately -- but then you have to wrap that function in something that
will normalize the scores - so in place of termfreq('words','Galaxy')
you'd want something like...

   if(query({!field f=words v='Galaxy'}),1,0)

...but again the math gets much harder if you make things more complex
with duplicate words in the document or duplicate words in the query --
you'd probably have to use a custom similarity to get the scores returned
by the query() function to be usable as is in the match equation (and drop
the if() function)


As for the highlighting part of the problem -- that becomes much

Re: Percolate feature?

2013-08-09 Thread Roman Chyla
On Fri, Aug 9, 2013 at 2:56 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : I'll look into this. Thanks for the concrete example as I don't even
 : know which classes to start to look at to implement such a feature.

 Either roman isn't understanding what you are asking for, or i'm not --
 but i don't think what roman described will work for you...

 :  so if your query contains no duplicates and all terms must match, you can
 :  be sure that you are collecting docs only when the number of terms matches
 :  number of clauses in the query

 several of the examples you gave did not match what Roman is describing,
 as i understand it.  Most people on this thread seem to be getting
 confused by having their perceptions flipped about what your data known
 in advance is vs the data you get at request time.

 You described this...

 :  Product keyword:  Sony
 :  Product keyword:  Samsung Galaxy
 : 
 :  We would like to be able to detect given a product title whether or
 :  not it matches any known keywords. For a keyword to be matched all of
 :  its terms must be present in the product title given.
 : 
 :  Product Title: Sony Experia
 :  Matches and returns a highlight: <em>Sony</em> Experia

 ...suggesting that what you call product keywords are the data you know
 about in advance and product titles are the data you get at request
 time.

 So your example of the request time input (ie: query) Sony Experia
 matching data known in advance (ie: indexed document) Sony would not
 work with Roman's example.

 To rephrase (what i think i understand is) your goal...

  * you have many (10*3+) documents known in advance
  * any document D contains a set of words W(D) of varying sizes
  * any request Q contains a set of words W(Q) of varying sizes
  * you want a given request Q to match a document D if and only if:
    - W(D) is a subset of W(Q)


aha! this was not what i was understanding! i was assuming W(Q) is a subset
of W(D) - or rather, W(Q) === W(D)

so now i finally see the reasoning behind it and the use case, which is a
VERY interesting one.

roman



- ie: no item exists in W(D) that does not exist in W(Q)
- ie: any number of items may exist in W(Q) that are not in W(D)





 So to reiterate your examples from before, but change the labels a
 bit and add some more converse examples (and ignore the highlighting
 aspect for a moment)...

 doc1 = Sony
 doc2 = Samsung Galaxy
 doc3 = Sony Playstation

 queryA = Sony Experia   ... matches only doc1
 queryB = Sony Playstation 3 ... matches doc3 and doc1
 queryC = Samsung 52inch LC  ... doesn't match anything
 queryD = Samsung Galaxy S4  ... matches doc2
 queryE = Galaxy Samsung S4  ... matches doc2


 ...do i still have that correct?


 A similar question came up in the past, but i can't find my response now
 so i'll try to recreate it ...


 1) if you don't care about using non-trivial analysis (ie: you don't need
 stemming, or synonyms, etc..), you can do this with some
 really simple function queries -- assuming you index a field containing
 the number of words in each document, in addition to the words
 themselves.  Assuming your words are in a field named "words" and the
 number of words is in a field named "words_count" a request for something
 like "Galaxy Samsung S4" can be represented as...

   q={!frange l=0 u=0}sub(words_count,
                          sum(termfreq('words','Galaxy'),
                              termfreq('words','Samsung'),
                              termfreq('words','S4')))

 ...ie: you want to compute the sum of the term frequencies for each of
 the words requested, and then you want to subtract that sum from the
 number of terms in the document -- and then you only want to match
 documents where the result of that subtraction is 0.

 one complexity that comes up, is that you haven't specified:

   * can the list of words in your documents contain duplicates?
   * can the list of words in your query contain duplicates?
   * should a document with duplicate words match only if the query also
     contains the same word duplicated?

 ...the answers to those questions make the math more complicated (and are
 left as an exercise for the reader)


 2) if you *do* care about using non-trivial analysis, then you can't use
 the simple termfreq() function, which deals with raw terms -- instead
 you have to use the query() function to ensure that the input is parsed
 appropriately -- but then you have to wrap that function in something that
 will normalize the scores - so in place of termfreq('words','Galaxy')
 you'd want something like...

 if(query({!field f=words v='Galaxy'}),1,0)

 ...but again the math gets much harder if you make things more complex
 with duplicate words in the document or duplicate words in the query --
 you'd probably have to use a custom similarity to get the scores returned
 by the query() function to be usable as is in the match equation (and drop
 the if() function)


 As for the 

Re: Percolate feature?

2013-08-08 Thread Mark
Ok forget the mention of percolate. 

We have a large list of known keywords we would like to match against. 

Product keyword:  Sony
Product keyword:  Samsung Galaxy

We would like to be able to detect given a product title whether or not it 
matches any known keywords. For a keyword to be matched all of its terms must 
be present in the product title given. 

Product Title: Sony Experia
Matches and returns a highlight: <em>Sony</em> Experia

Product Title: Samsung 52inch LC
Does not match

Product Title: Samsung Galaxy S4
Matches and returns a highlight: <em>Samsung Galaxy</em>

Product Title: Galaxy Samsung S4
Matches and returns a highlight: <em>Galaxy Samsung</em>

What would be the best way to approach this? 
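
For reference, the requirement as stated can be written down as a few lines of plain Python, independent of any search engine. This is only a sketch of the desired behaviour: `KEYWORDS` and `match_title` are illustrative names, tokenization is a simple whitespace split, and the highlighting wraps each matched word separately rather than the whole keyword as in the examples above.

```python
KEYWORDS = ["Sony", "Samsung Galaxy"]  # the known product keywords from the post

def match_title(title):
    """Return (matched keywords, title with matched words wrapped in <em>)."""
    title_words = title.split()
    matched = [k for k in KEYWORDS if set(k.split()) <= set(title_words)]
    hit_words = {w for k in matched for w in k.split()}
    highlighted = " ".join(f"<em>{w}</em>" if w in hit_words else w for w in title_words)
    return matched, highlighted

for title in ["Sony Experia", "Samsung 52inch LC", "Samsung Galaxy S4", "Galaxy Samsung S4"]:
    print(title, "->", match_title(title))
```

A loop like this scans every keyword for every title, so it only shows what a match means; it says nothing about how to do it efficiently over a large keyword list.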




On Aug 5, 2013, at 7:02 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 
 : Subject: Percolate feature?
 
 can you give a more concrete, realistic example of what you are trying to 
 do? your synthetic hypothetical example is kind of hard to make sense of.
 
 your Subject line and comment that the percolate feature of elastic 
 search sounds like what you want seems to have led some people down a 
 path of assuming you want to run these types of queries as documents are 
 indexed -- but that isn't at all clear to me from the way you worded your 
 question other than that.
 
 it's also not clear what aspect of the results you really care about -- 
 are you only looking for the *number* of documents that match according 
 to your concept of matching, or are you looking for a list of matches?  
 what if multiple documents have all of their terms in the query string -- how 
 should they score relative to each other?  what if a document contains the 
 same term multiple times, do you expect it to be a match of a query only 
 if that term appears in the query multiple times as well?  do you care 
 about the ordering of the terms in the query? the ordering of the terms in 
 the document?
 
 Ideally: describe for us what you want to do, w/o assuming 
 solr/elasticsearch/anything specific about the implementation -- just 
 describe your actual use case for us, with several real document/query 
 examples.
 
 
 
 https://people.apache.org/~hossman/#xyproblem
 XY Problem
 
 Your question appears to be an XY Problem ... that is: you are dealing
 with X, you are assuming Y will help you, and you are asking about Y
 without giving more details about the X so that we can understand the
 full issue.  Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341
 
 
 
 
 
 
 -Hoss



Re: Percolate feature?

2013-08-05 Thread Charlie Hull

On 03/08/2013 00:50, Mark wrote:

We have a set number of known terms we want to match against.

In Index:
term one
term two
term three

I know how to match all terms of a user query against the index but we would 
like to know how/if we can match a user's query against all the terms in the 
index?

Search Queries:
my search term = 0 matches
my term search one = 1 match  (term one)
some prefix term two = 1 match (term two)
one two three = 0 matches

I can only explain this is almost a reverse search???

I came across the following from ElasticSearch 
(http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds 
like this may accomplish the above but haven't tested. I was wondering if Solr 
had something similar or an alternative way of accomplishing this?

Thanks



Hi Mark,

We've built something that implements this kind of reverse search for 
our clients in the media monitoring sector - we're working on releasing 
the core of this as open source very soon, hopefully in a month or two. 
It's based on Lucene.


Just for reference it's able to apply tens of thousands of stored 
queries to a document per second (our clients often have very large and 
complex Boolean strings representing their clients' interests and may 
monitor hundreds of thousands of news stories every day). It also 
records the positions of every match. We suspect it's a lot faster and 
more flexible than Elasticsearch's Percolate feature.


Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Percolate feature?

2013-08-05 Thread Mark
 can match a user's query against all the terms in the index - that's 
 exactly what Lucene and Solr have done since Day One, for all queries. 
 Percolate actually does the opposite - matches an input document against a 
 registered set of queries - and doesn't match against indexed documents.
 
 Solr does support Lucene's min should match feature so that you can 
 specify, say, four query terms  and return if at least two match. This is the 
 mm parameter.


I don't think you understand me.

Say I only have one document indexed and its contents are "Foo Bar". I want 
this document returned if and only if the query has the words "Foo" and "Bar" 
in it. If I use a mm of 100% for "Foo Bar Bazz" this document will not be 
returned because the full user query didn't match. If I use a 0% mm and search 
"Foo Baz" the document will be returned even though it shouldn't.
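
To make the difference concrete, here is a simplified Python model of the two conditions. The `mm_match` function is only an approximation of min-should-match for a query of all-optional terms (percentage rounded down, at least one term required); it is not how Solr implements it, and both function names are made up.

```python
import math

def mm_match(doc, query, mm_percent):
    """Simplified mm: at least mm_percent of the QUERY terms occur in the doc."""
    doc_words, query_words = set(doc.split()), query.split()
    required = max(1, math.floor(len(query_words) * mm_percent / 100))
    return sum(w in doc_words for w in query_words) >= required

def wanted_match(doc, query):
    """What is being asked for: 100% of the DOCUMENT terms occur in the query."""
    return set(doc.split()) <= set(query.split())

print(mm_match("Foo Bar", "Foo Bar Bazz", 100))  # False
print(mm_match("Foo Bar", "Foo Baz", 0))         # True
print(wanted_match("Foo Bar", "Foo Bar Bazz"))   # True
print(wanted_match("Foo Bar", "Foo Baz"))        # False
```

In other words, mm is a threshold over the query's terms, while the condition wanted here is a threshold over each document's terms.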

On Aug 2, 2013, at 5:09 PM, Jack Krupansky j...@basetechnology.com wrote:

 You seem to be mixing a couple of different concepts here. Prospective 
 search or reverse search, (sometimes called alerts) is a logistics matter, 
 but how to match terms is completely different.
 
 Solr does not have the exact percolate feature of ES, but your examples 
 don't indicate a need for what percolate actually does.
 
 can match a user's query against all the terms in the index - that's 
 exactly what Lucene and Solr have done since Day One, for all queries. 
 Percolate actually does the opposite - matches an input document against a 
 registered set of queries - and doesn't match against indexed documents.
 
 Solr does support Lucene's min should match feature so that you can 
 specify, say, four query terms  and return if at least two match. This is the 
 mm parameter.
 
 See:
 http://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29
 
 Try to clarify your requirements... or maybe min-should-match was all you 
 needed?
 
 -- Jack Krupansky
 
 -Original Message- From: Mark
 Sent: Friday, August 02, 2013 7:50 PM
 To: solr-user@lucene.apache.org
 Subject: Percolate feature?
 
 We have a set number of known terms we want to match against.
 
 In Index:
 term one
 term two
 term three
 
 I know how to match all terms of a user query against the index but we would 
 like to know how/if we can match a user's query against all the terms in the 
 index?
 
 Search Queries:
 my search term = 0 matches
 my term search one = 1 match  (term one)
 some prefix term two = 1 match (term two)
 one two three = 0 matches
 
 I can only explain this is almost a reverse search???
 
 I came across the following from ElasticSearch 
 (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds 
 like this may accomplish the above but haven't tested. I was wondering if 
 Solr had something similar or an alternative way of accomplishing this?
 
 Thanks
 



Re: Percolate feature?

2013-08-05 Thread Jack Krupansky

Fine, then write the query that way:  +foo +bar baz

But it still doesn't sound as if any of this relates to prospective 
search/percolate.


-- Jack Krupansky

-Original Message- 
From: Mark

Sent: Monday, August 05, 2013 2:11 PM
To: solr-user@lucene.apache.org
Subject: Re: Percolate feature?

can match a user's query against all the terms in the index - that's 
exactly what Lucene and Solr have done since Day One, for all queries. 
Percolate actually does the opposite - matches an input document against a 
registered set of queries - and doesn't match against indexed documents.


Solr does support Lucene's min should match feature so that you can 
specify, say, four query terms  and return if at least two match. This is 
the mm parameter.



I don't think you understand me.

Say I only have one document indexed and its contents are "Foo Bar". I want 
this document returned if and only if the query has the words "Foo" and 
"Bar" in it. If I use a mm of 100% for "Foo Bar Bazz" this document will not 
be returned because the full user query didn't match. If I use a 0% mm and 
search "Foo Baz" the document will be returned even though it shouldn't.


On Aug 2, 2013, at 5:09 PM, Jack Krupansky j...@basetechnology.com wrote:

You seem to be mixing a couple of different concepts here. Prospective 
search or reverse search, (sometimes called alerts) is a logistics 
matter, but how to match terms is completely different.


Solr does not have the exact percolate feature of ES, but your examples 
don't indicate a need for what percolate actually does.


can match a user's query against all the terms in the index - that's 
exactly what Lucene and Solr have done since Day One, for all queries. 
Percolate actually does the opposite - matches an input document against a 
registered set of queries - and doesn't match against indexed documents.


Solr does support Lucene's min should match feature so that you can 
specify, say, four query terms  and return if at least two match. This is 
the mm parameter.


See:
http://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29

Try to clarify your requirements... or maybe min-should-match was all you 
needed?


-- Jack Krupansky

-Original Message- From: Mark
Sent: Friday, August 02, 2013 7:50 PM
To: solr-user@lucene.apache.org
Subject: Percolate feature?

We have a set number of known terms we want to match against.

In Index:
term one
term two
term three

I know how to match all terms of a user query against the index but we 
would like to know how/if we can match a user's query against all the 
terms in the index?


Search Queries:
my search term = 0 matches
my term search one = 1 match  (term one)
some prefix term two = 1 match (term two)
one two three = 0 matches

I can only explain this is almost a reverse search???

I came across the following from ElasticSearch 
(http://www.elasticsearch.org/guide/reference/api/percolate/) and it 
sounds like this may accomplish the above but haven't tested. I was 
wondering if Solr had something similar or an alternative way of 
accomplishing this?


Thanks



Re: Percolate feature?

2013-08-05 Thread Mark
Still not understanding. How do I know which words to require while searching? 
I want to search across all documents and return ones that have all of their 
terms matched.


 I came across the following from ElasticSearch 
 (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds 
 like this may accomplish the above but haven't tested. I was wondering if 
 Solr had something similar or an alternative way of accomplishing this?

Also never said this was Percolate, just looked similar

On Aug 5, 2013, at 11:43 AM, Jack Krupansky j...@basetechnology.com wrote:

 Fine, then write the query that way:  +foo +bar baz
 
 But it still doesn't sound as if any of this relates to prospective 
 search/percolate.
 
 -- Jack Krupansky
 
 -Original Message- From: Mark
 Sent: Monday, August 05, 2013 2:11 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Percolate feature?
 
 can match a user's query against all the terms in the index - that's 
 exactly what Lucene and Solr have done since Day One, for all queries. 
 Percolate actually does the opposite - matches an input document against a 
 registered set of queries - and doesn't match against indexed documents.
 
 Solr does support Lucene's min should match feature so that you can 
 specify, say, four query terms  and return if at least two match. This is 
 the mm parameter.
 
 
 I don't think you understand me.
 
 Say I only have one document indexed and its contents are "Foo Bar". I want 
 this document returned if and only if the query has the words "Foo" and 
 "Bar" in it. If I use a mm of 100% for "Foo Bar Bazz" this document will not 
 be returned because the full user query didn't match. If I use a 0% mm and 
 search "Foo Baz" the document will be returned even though it shouldn't.
 
 On Aug 2, 2013, at 5:09 PM, Jack Krupansky j...@basetechnology.com wrote:
 
 You seem to be mixing a couple of different concepts here. Prospective 
 search or reverse search, (sometimes called alerts) is a logistics matter, 
 but how to match terms is completely different.
 
 Solr does not have the exact percolate feature of ES, but your examples 
 don't indicate a need for what percolate actually does.
 
 can match a user's query against all the terms in the index - that's 
 exactly what Lucene and Solr have done since Day One, for all queries. 
 Percolate actually does the opposite - matches an input document against a 
 registered set of queries - and doesn't match against indexed documents.
 
 Solr does support Lucene's min should match feature so that you can 
 specify, say, four query terms  and return if at least two match. This is 
 the mm parameter.
 
 See:
 http://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29
 
 Try to clarify your requirements... or maybe min-should-match was all you 
 needed?
 
 -- Jack Krupansky
 
 -Original Message- From: Mark
 Sent: Friday, August 02, 2013 7:50 PM
 To: solr-user@lucene.apache.org
 Subject: Percolate feature?
 
 We have a set number of known terms we want to match against.
 
 In Index:
 term one
 term two
 term three
 
 I know how to match all terms of a user query against the index but we would 
 like to know how/if we can match a user's query against all the terms in the 
 index?
 
 Search Queries:
 my search term = 0 matches
 my term search one = 1 match  (term one)
 some prefix term two = 1 match (term two)
 one two three = 0 matches
 
 I can only explain this is almost a reverse search???
 
 I came across the following from ElasticSearch 
 (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds 
 like this may accomplish the above but haven't tested. I was wondering if 
 Solr had something similar or an alternative way of accomplishing this?
 
 Thanks



Re: Percolate feature?

2013-08-05 Thread Jack Krupansky
Percolate does not search across documents, it searches across registered 
queries for a single input document. As such, it still seems irrelevant to 
your desire to search across all documents.


You still haven't explained how you can't do what you want using basic, 
plain Lucene search.


Now, if all you really want is the ES percolate feature, as said, Solr 
doesn't have that - if you are sure that percolate really is what you need.


But your use case still isn't clearly elaborated to the point where we can 
at least guess what you really need.


For reference:
http://www.elasticsearch.org/guide/reference/api/percolate/

The percolator allows to register queries against an index, and then send 
percolate requests which include a doc, and getting back the queries that 
match on that doc out of the set of registered queries.


Think of it as the reverse operation of indexing and then searching. Instead 
of sending docs, indexing them, and then running queries. One sends queries, 
registers them, and then sends docs and finds out which queries match that 
doc.


But that's rather different from what you asked, wanting to match queries 
against all terms in the index.
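
The register-then-percolate flow quoted above can be sketched in a few lines of Python. This is a toy model, not the Elasticsearch API: the class `ToyPercolator` and its methods are invented for illustration, and a "query" here is just a set of terms that must all be present in the document.

```python
class ToyPercolator:
    """Toy prospective search: store queries first, then match incoming docs."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def register(self, query_id, required_terms):
        self.queries[query_id] = {t.lower() for t in required_terms}

    def percolate(self, doc_text):
        """Return ids of registered queries whose required terms all occur in the doc."""
        doc_terms = set(doc_text.lower().split())
        return sorted(qid for qid, terms in self.queries.items() if terms <= doc_terms)

p = ToyPercolator()
p.register("sony", ["Sony"])
p.register("galaxy", ["Samsung", "Galaxy"])
print(p.percolate("Samsung Galaxy S4"))   # ['galaxy']
print(p.percolate("Sony Playstation 3"))  # ['sony']
print(p.percolate("Samsung 52inch LC"))   # []
```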


-- Jack Krupansky

-Original Message- 
From: Mark

Sent: Monday, August 05, 2013 3:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Percolate feature?

Still not understanding. How do I know which words to require while 
searching? I want to search across all documents and return ones that have 
all of their terms matched.



I came across the following from ElasticSearch 
(http://www.elasticsearch.org/guide/reference/api/percolate/) and it 
sounds like this may accomplish the above but haven't tested. I was 
wondering if Solr had something similar or an alternative way of 
accomplishing this?


Also never said this was Percolate, just looked similar

On Aug 5, 2013, at 11:43 AM, Jack Krupansky j...@basetechnology.com 
wrote:



Fine, then write the query that way:  +foo +bar baz

But it still doesn't sound as if any of this relates to prospective 
search/percolate.


-- Jack Krupansky

-Original Message- From: Mark
Sent: Monday, August 05, 2013 2:11 PM
To: solr-user@lucene.apache.org
Subject: Re: Percolate feature?

can match a user's query against all the terms in the index - that's 
exactly what Lucene and Solr have done since Day One, for all queries. 
Percolate actually does the opposite - matches an input document against 
a registered set of queries - and doesn't match against indexed 
documents.


Solr does support Lucene's min should match feature so that you can 
specify, say, four query terms  and return if at least two match. This is 
the mm parameter.



I don't think you understand me.

Say I only have one document indexed and its contents are "Foo Bar". I 
want this document returned if and only if the query has the words "Foo" 
and "Bar" in it. If I use a mm of 100% for "Foo Bar Bazz" this document 
will not be returned because the full user query didn't match. If I use a 
0% mm and search "Foo Baz" the document will be returned even though it 
shouldn't.


On Aug 2, 2013, at 5:09 PM, Jack Krupansky j...@basetechnology.com 
wrote:


You seem to be mixing a couple of different concepts here. Prospective 
search or reverse search, (sometimes called alerts) is a logistics 
matter, but how to match terms is completely different.


Solr does not have the exact percolate feature of ES, but your examples 
don't indicate a need for what percolate actually does.


can match a user's query against all the terms in the index - that's 
exactly what Lucene and Solr have done since Day One, for all queries. 
Percolate actually does the opposite - matches an input document against 
a registered set of queries - and doesn't match against indexed 
documents.


Solr does support Lucene's min should match feature so that you can 
specify, say, four query terms  and return if at least two match. This is 
the mm parameter.


See:
http://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29

Try to clarify your requirements... or maybe min-should-match was all you 
needed?


-- Jack Krupansky

-Original Message- From: Mark
Sent: Friday, August 02, 2013 7:50 PM
To: solr-user@lucene.apache.org
Subject: Percolate feature?

We have a set number of known terms we want to match against.

In Index:
term one
term two
term three

I know how to match all terms of a user query against the index but we 
would like to know how/if we can match a user's query against all the 
terms in the index?


Search Queries:
my search term = 0 matches
my term search one = 1 match  (term one)
some prefix term two = 1 match (term two)
one two three = 0 matches

I can only explain this is almost a reverse search???

I came across the following from ElasticSearch 
(http://www.elasticsearch.org/guide/reference/api/percolate/) and it 
sounds like this may accomplish the above but haven't tested. I was 
wondering if Solr had something

Re: Percolate feature?

2013-08-05 Thread Chris Hostetter

: Subject: Percolate feature?

can you give a more concrete, realistic example of what you are trying to 
do? your synthetic hypothetical example is kind of hard to make sense of.

your Subject line and comment that the percolate feature of elastic 
search sounds like what you want seems to have led some people down a 
path of assuming you want to run these types of queries as documents are 
indexed -- but that isn't at all clear to me from the way you worded your 
question other than that.

it's also not clear what aspect of the results you really care about -- 
are you only looking for the *number* of documents that match according 
to your concept of matching, or are you looking for a list of matches?  
what if multiple documents have all of their terms in the query string -- how 
should they score relative to each other?  what if a document contains the 
same term multiple times, do you expect it to be a match of a query only 
if that term appears in the query multiple times as well?  do you care 
about the ordering of the terms in the query? the ordering of the terms in 
the document?

Ideally: describe for us what you want to do, w/o assuming 
solr/elasticsearch/anything specific about the implementation -- just 
describe your actual use case for us, with several real document/query 
examples.



https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about the X so that we can understand the
full issue.  Perhaps the best solution doesn't involve Y at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341






-Hoss


Re: Percolate feature?

2013-08-05 Thread Lance Norskog

Cool!

On 08/05/2013 03:34 AM, Charlie Hull wrote:

On 03/08/2013 00:50, Mark wrote:

We have a set number of known terms we want to match against.

In Index:
term one
term two
term three

I know how to match all terms of a user query against the index but 
we would like to know how/if we can match a user's query against all 
the terms in the index?


Search Queries:
my search term = 0 matches
my term search one = 1 match  (term one)
some prefix term two = 1 match (term two)
one two three = 0 matches

I can only explain this as almost a reverse search???

I came across the following from ElasticSearch 
(http://www.elasticsearch.org/guide/reference/api/percolate/) and it 
sounds like this may accomplish the above but haven't tested. I was 
wondering if Solr had something similar or an alternative way of 
accomplishing this?


Thanks



Hi Mark,

We've built something that implements this kind of reverse search for 
our clients in the media monitoring sector - we're working on 
releasing the core of this as open source very soon, hopefully in a 
month or two. It's based on Lucene.


Just for reference it's able to apply tens of thousands of stored 
queries to a document per second (our clients often have very large 
and complex Boolean strings representing their clients' interests and 
may monitor hundreds of thousands of news stories every day). It also 
records the positions of every match. We suspect it's a lot faster and 
more flexible than Elasticsearch's Percolate feature.


Cheers

Charlie





Re: Percolate feature?

2013-08-03 Thread Alexandre Rafalovitch
How difficult would it be to write percolate as an UpdateRequestProcessor?

Is there a magic hook to parse and run a query against a single doc?

Regards,
 Alex
On 2 Aug 2013 20:10, Jack Krupansky j...@basetechnology.com wrote:

 You seem to be mixing a couple of different concepts here. Prospective
 search or reverse search, (sometimes called alerts) is a logistics matter,
 but how to match terms is completely different.

 Solr does not have the exact percolate feature of ES, but your examples
 don't indicate a need for what percolate actually does.

 "can match a user's query against all the terms in the index" - that's
 exactly what Lucene and Solr have done since Day One, for all queries.
 Percolate actually does the opposite - matches an input document against a
 registered set of queries - and doesn't match against indexed documents.

 Solr does support Lucene's min should match feature so that you can
 specify, say, four query terms and return a document if at least two
 match. This is the mm parameter.

 See:
 http://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29

 Try to clarify your requirements... or maybe min-should-match was all you
 needed?

 -- Jack Krupansky

 -Original Message- From: Mark
 Sent: Friday, August 02, 2013 7:50 PM
 To: solr-user@lucene.apache.org
 Subject: Percolate feature?

 We have a set number of known terms we want to match against.

 In Index:
 term one
 term two
 term three

 I know how to match all terms of a user query against the index but we
 would like to know how/if we can match a user's query against all the terms
 in the index?

 Search Queries:
 my search term = 0 matches
 my term search one = 1 match  (term one)
 some prefix term two = 1 match (term two)
 one two three = 0 matches

 I can only explain this as almost a reverse search???

 I came across the following from ElasticSearch
 (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds
 like this may accomplish the above but haven't tested.
 I was wondering if Solr had something similar or an alternative way of
 accomplishing this?

 Thanks




Percolate feature?

2013-08-02 Thread Mark
We have a set number of known terms we want to match against.

In Index:
term one
term two
term three

I know how to match all terms of a user query against the index but we would 
like to know how/if we can match a user's query against all the terms in the 
index?

Search Queries:
my search term = 0 matches
my term search one = 1 match  (term one)
some prefix term two = 1 match (term two)
one two three = 0 matches
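
The rule implied by these four examples appears to be that an indexed entry matches when all of its terms occur in the user's query, in any order. A minimal sketch of that reading in plain Python (not Solr), assuming lowercasing and whitespace tokenization; `reverse_match` and `INDEX` are hypothetical names used only for illustration:

```python
def reverse_match(indexed_entries, query):
    """Return the indexed entries whose terms all occur in the query."""
    query_terms = set(query.lower().split())
    return [
        entry
        for entry in indexed_entries
        if set(entry.lower().split()) <= query_terms
    ]


INDEX = ["term one", "term two", "term three"]

for q in [
    "my search term",
    "my term search one",
    "some prefix term two",
    "one two three",
]:
    print(q, "=>", reverse_match(INDEX, q))
```

Under this reading "one two three" matches nothing because no query term is "term", which agrees with the expected counts listed above.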

I can only explain this as almost a reverse search???

I came across the following from ElasticSearch 
(http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds 
like this may accomplish the above but haven't tested. I was wondering if Solr 
had something similar or an alternative way of accomplishing this?

Thanks



Re: Percolate feature?

2013-08-02 Thread Jack Krupansky
You seem to be mixing a couple of different concepts here. Prospective 
search or reverse search, (sometimes called alerts) is a logistics matter, 
but how to match terms is completely different.


Solr does not have the exact percolate feature of ES, but your examples 
don't indicate a need for what percolate actually does.


"can match a user's query against all the terms in the index" - that's 
exactly what Lucene and Solr have done since Day One, for all queries. 
Percolate actually does the opposite - matches an input document against a 
registered set of queries - and doesn't match against indexed documents.
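
The difference in direction can be sketched in a few lines of Python. This is only an illustration of the concept, assuming each query is a plain list of required terms and documents are whitespace-tokenized; it does not reflect how Elasticsearch or Lucene implement either operation, and all names are hypothetical:

```python
def matches(query_terms, doc_text):
    """True if every query term occurs in the document text."""
    return set(query_terms) <= set(doc_text.lower().split())


def search(documents, query_terms):
    """Ordinary search: one query is run against many indexed documents."""
    return [doc_id for doc_id, text in documents.items()
            if matches(query_terms, text)]


def percolate(registered_queries, doc_text):
    """Percolation: one incoming document is run against many stored queries."""
    return [query_id for query_id, terms in registered_queries.items()
            if matches(terms, doc_text)]


documents = {"d1": "solr adds faceting", "d2": "lucene release notes"}
registered_queries = {"q-solr": ["solr"], "q-lucene-release": ["lucene", "release"]}

print(search(documents, ["lucene"]))
print(percolate(registered_queries, "new lucene release announced"))
```

Note that `percolate` never consults `documents`: the incoming text is checked against the stored queries only.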


Solr does support Lucene's min should match feature so that you can 
specify, say, four query terms and return a document if at least two 
match. This is the mm parameter.


See:
http://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29
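
As a rough illustration of the four-terms, at-least-two case, the request below is built in Python. The host, core name (`collection1`), field name (`text`) and query terms are assumptions made for the example; `defType`, `q`, `qf` and `mm` are the request parameters described on the wiki page above:

```python
from urllib.parse import urlencode


def build_mm_query(base_url, terms, min_match, field):
    """Build a select URL that requires at least `min_match` of `terms` to match."""
    params = {
        "defType": "edismax",   # mm is a (e)dismax query parser parameter
        "q": " ".join(terms),
        "qf": field,            # assumed field name to search
        "mm": str(min_match),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr instance and core.
url = build_mm_query(
    "http://localhost:8983/solr/collection1/select",
    ["alpha", "beta", "gamma", "delta"],
    2,
    "text",
)
print(url)
```

With `mm=2`, a document containing any two of the four terms is returned, so this loosens an ordinary search rather than reversing its direction.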

Try to clarify your requirements... or maybe min-should-match was all you 
needed?


-- Jack Krupansky

-Original Message- 
From: Mark

Sent: Friday, August 02, 2013 7:50 PM
To: solr-user@lucene.apache.org
Subject: Percolate feature?

We have a set number of known terms we want to match against.

In Index:
term one
term two
term three

I know how to match all terms of a user query against the index but we would 
like to know how/if we can match a user's query against all the terms in the 
index?


Search Queries:
my search term = 0 matches
my term search one = 1 match  (term one)
some prefix term two = 1 match (term two)
one two three = 0 matches

I can only explain this as almost a reverse search???

I came across the following from ElasticSearch 
(http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds 
like this may accomplish the above but haven't tested. I was wondering if 
Solr had something similar or an alternative way of accomplishing this?


Thanks