Make phrases into single tokens at indexing and query time. Let the engine do
the rest of the work.

For example, “subunits of the army” can become “subunitsofthearmy” or
“subunits_of_the_army”. We used patterns to choose phrases, so “word word”,
“word glue word”, or “word glue glue word” could become phrases.
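
A minimal sketch of the pattern idea, not the Infoseek implementation. The
glue-word list, the function name phrase_tokens, and the separator argument
are all assumptions made for illustration; nothing in this thread specifies
them. It emits a phrase token whenever two content words are separated by
zero, one, or two glue words.

```python
import re

# Hypothetical glue-word list; the list actually used is not given in this thread.
GLUE = {"a", "an", "and", "for", "in", "of", "the", "to"}


def phrase_tokens(text, sep="_"):
    """Return phrase tokens for the patterns "word word",
    "word glue word", and "word glue glue word"."""
    words = re.findall(r"\w+", text.lower())
    phrases = []
    for i, word in enumerate(words):
        if word in GLUE:
            continue  # a phrase has to start on a content word
        # Step over at most two glue words following the start word.
        j = i + 1
        while j < len(words) and j - i <= 2 and words[j] in GLUE:
            j += 1
        # The phrase has to end on a content word.
        if j < len(words) and words[j] not in GLUE:
            phrases.append(sep.join(words[i:j + 1]))
    return phrases


print(phrase_tokens("subunits of the army"))
print(phrase_tokens("subunits of the army", sep=""))
```

The same function would have to run on both documents and queries so that the
phrase tokens line up. Whether the single words are indexed alongside the
phrase tokens is a separate choice that this sketch does not make.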

Nutch did something like this, but used it for filtering down the candidates
for matching, then used regular Lucene scoring for ranking.

The Infoseek Ultra index used these phrase terms but did not store positions.

The idea came from early DNA search engines.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 17, 2020, at 10:53 AM, David Hastings <hastings.recurs...@gmail.com> wrote:
> 
> Interesting, I can't seem to find anything on Phrase IDF. Don't suppose you
> have a link or two I could look at, by chance?
> 
> On Mon, Feb 17, 2020 at 1:48 PM Walter Underwood <wun...@wunderwood.org>
> wrote:
> 
>> At Infoseek, we used “glue words” to build phrase tokens. It was really
>> effective.
>> Phrase IDF is powerful stuff.
>> 
>> Luckily for you, the patent on that has expired. :-)
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 17, 2020, at 10:46 AM, David Hastings <hastings.recurs...@gmail.com> wrote:
>>> 
>>> I use stop words for building shingles into "interesting phrases" for my
>>> machine teacher/students, so I wouldn't say there's no reason; however, my
>>> use case is very specific. Otherwise yeah, they're gone for all practical
>>> reasons/search scenarios.
>>> 
>>> On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood <wun...@wunderwood.org>
>>> wrote:
>>> 
>>>> Why are you using stopwords? I would need a really, really good reason to
>>>> use those.
>>>> 
>>>> Stopwords are an obsolete technique from 16-bit processors. I’ve never
>>>> used them and I’ve been a search engineer since 1997.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>>> On Feb 17, 2020, at 7:31 AM, Thomas Corthals <tho...@klascement.net> wrote:
>>>>> 
>>>>> Hi
>>>>> 
>>>>> I've run into an issue with creating a Managed Stopwords list that has the
>>>>> same name as a previously deleted list. Going through the same flow with
>>>>> Managed Synonyms doesn't result in this unexpected behaviour. Am I missing
>>>>> something or did I discover a bug in Solr?
>>>>> 
>>>>> On a newly started solr with the techproducts core:
>>>>> 
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary \
>>>>>   '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> curl -X DELETE \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary \
>>>>>   '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> 
>>>>> The second PUT request results in a status 500 with error
>>>>> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
>>>>> 
>>>>> Similar requests for synonyms work fine, no matter how many times I repeat
>>>>> the CREATE/DELETE/RELOAD cycle:
>>>>> 
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary \
>>>>>   '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}' \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>>>> curl -X DELETE \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>>>> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary \
>>>>>   '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}' \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>>>> 
>>>>> Reloading after creating the Stopwords list but not after deleting it works
>>>>> without error too on a fresh techproducts core (you'll have to remove the
>>>>> directory from disk and create the core again after running the previous
>>>>> commands).
>>>>> 
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary \
>>>>>   '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>>>> curl -X DELETE \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary \
>>>>>   '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> 
>>>>> And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
>>>>> CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
>>>>> can be completed twice. (Again, on a freshly created techproducts core.)
>>>>> Only the third attempt to create a list results in an error. Synonyms can
>>>>> still be created and deleted repeatedly after this.
>>>>> 
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary \
>>>>>   '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> curl -X DELETE \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary \
>>>>>   '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}' \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>>>> curl -X DELETE \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>>>> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary \
>>>>>   '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> curl -X DELETE \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary \
>>>>>   '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}' \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>>>> curl -X DELETE \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>>>> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary \
>>>>>   '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' \
>>>>>   http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> 
>>>>> The same successes/errors occur when running each cycle against a different
>>>>> core if the cores share the same configset.
>>>>> 
>>>>> Any ideas on what might be going wrong?
>>>> 
>>>> 
>> 
>> 
