Thanks all for your ideas. It was very useful information.

On Fri, 21 Sep 2018 at 19:04, Jan Høydahl <jan....@cominvent.com> wrote:
> I have made a FieldType specially for this:
> https://github.com/cominvent/exactmatch/
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 21 Sep 2018, at 18:14, Steve Rowe <sar...@gmail.com> wrote:
> >
> > Link correction - wrong fragment identifier in ref #5 - should be:
> >
> > [5] https://lucene.apache.org/solr/guide/7_4/other-parsers.html#function-range-query-parser
> >
> > --
> > Steve
> > www.lucidworks.com
> >
> >> On Sep 21, 2018, at 12:04 PM, Steve Rowe <sar...@gmail.com> wrote:
> >>
> >> Hi Sergio,
> >>
> >> Chris “Hoss” Hostetter has a solution to this kind of problem here:
> >> https://lists.apache.org/thread.html/6b0f0cb864aa55f0a9eadfd92d27d374ab8deb16e8131ed2b7234463@%3Csolr-user.lucene.apache.org%3E
> >> See also the suggestions in comments on SOLR-12673[1], which include a
> >> version of Hoss’s solution.
> >>
> >> Hoss’s solution assumes a multivalued StrField with values counted
> >> using CountFieldValuesUpdateProcessorFactory, which doesn’t apply to you.
> >> You could instead count unique tokens in an analyzed field using the
> >> StatelessScriptUpdateProcessorFactory[2][3], e.g. see slides 10&11 of
> >> Erik Hatcher’s Lucene/Solr Revolution 2013 talk[4].
> >>
> >> Your script could look something like this (untested; replace
> >> "<field type>" with your field type and "<field>" with the name of the
> >> field whose tokens you want to count):
> >>
> >> =====
> >> // Count the distinct tokens the analyzer produces for one field value.
> >> function getUniqueTokenCount(analyzer, fieldName, fieldValue) {
> >>   var uniqueTokens = {};
> >>   var stream = analyzer.tokenStream(fieldName, fieldValue);
> >>   var termAttr = stream.getAttribute(Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute);
> >>   stream.reset();
> >>   while (stream.incrementToken()) { uniqueTokens[termAttr.toString()] = 1; }
> >>   stream.end();
> >>   stream.close();
> >>   return Object.keys(uniqueTokens).length;
> >> }
> >> function processAdd(cmd) {
> >>   var doc = cmd.solrDoc; // the SolrInputDocument being added
> >>   var content = doc.getFieldValue("<field>"); // assumed source field; adjust to your schema
> >>   var analyzer = req.getCore().getLatestSchema().getFieldTypeByName("<field type>").getIndexAnalyzer();
> >>   doc.setField("unique_token_count_i", getUniqueTokenCount(analyzer, null, content));
> >> }
> >> function processDelete(cmd) { }
> >> function processMergeIndexes(cmd) { }
> >> function processCommit(cmd) { }
> >> function processRollback(cmd) { }
> >> function finish() { }
> >> =====
> >>
> >> And your query could then look something like (replace "<field>" with
> >> your field name)[5][6]:
> >>
> >> =====
> >> fq={!frange l=0 h=0}sub(unique_token_count_i,sum(termfreq(<field>,'CENTURY'),termfreq(<field>,'BANCORP'),termfreq(<field>,'INC')))
> >> =====
> >>
> >> Note that to construct the query ^^ you’ll need to tokenize and
> >> uniquify terms on the client side - if tokenization is non-trivial, you
> >> could use Solr's Field Analysis API[8] to perform tokenization for you.
> >>
> >> [1] https://issues.apache.org/jira/browse/SOLR-12673
> >> [2] https://wiki.apache.org/solr/ScriptUpdateProcessor
> >> [3] https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
> >> [4] https://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks
> >> [5] https://lucene.apache.org/solr/guide/7_4/other-parsers.html#OtherParsers-FunctionRangeQueryParser
> >> [6] https://lucene.apache.org/solr/guide/7_4/function-queries.html#termfreq-function
> >> [7] https://lucene.apache.org/solr/guide/7_4/function-queries.html#sub-function
> >> [8] https://lucene.apache.org/solr/guide/7_4/implicit-requesthandlers.html#analysis-handlers
> >>
> >> --
> >> Steve
> >> www.lucidworks.com
> >>
> >>> On Sep 21, 2018, at 10:45 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >>>
> >>> A variant on Alexandre's approach is:
> >>> at index time, count the tokens that will be produced yourself (this
> >>> may be a little tricky, you shouldn't have WordDelimiterFilterFactory
> >>> in your analysis for instance).
> >>> Put the number of tokens in a separate field
> >>> At query time, you'd search q=+company_name:(+century +bancorp +inc)
> >>> +tokens_in_company_name_field:3
> >>>
> >>> You don't need phrase queries with this approach, order doesn't matter.
> >>>
> >>> It can get tricky though, should "CENTURY BANCORP, INC." and "CENTURY
> >>> BANCORP, INCORPORATED." match?
> >>>
> >>> Again, though, this means your indexing code has to do the same thing
> >>> as your analysis chain. Which isn't very hard if the analysis chain is
> >>> simple. I might use a char _filter_ factory to remove all
> >>> non-alphanumeric characters, then a whitespace tokenizer and
> >>> (probably) a lowercasefilter. That's pretty easy to replicate in order
> >>> to count tokens.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
> >>> <arafa...@gmail.com> wrote:
> >>>>
> >>>> I think you can match everything in the query to the field using either
> >>>> 1) disMax/eDisMax with mm=100%
> >>>> https://lucene.apache.org/solr/guide/7_4/the-dismax-query-parser.html#mm-minimum-should-match-parameter
> >>>> 2) Complex Phrase Query Parser with inOrder=false:
> >>>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
> >>>>
> >>>> The number of tokens though is hard. You only know what your tokens
> >>>> are at the end of the indexing pipeline. And during search, the tokens
> >>>> are looked up from their indexes and only then the documents are
> >>>> looked up.
> >>>>
> >>>> You may be able to do this with custom Postfilter that would run after
> >>>> everything else to just reject records with extra tokens. That would
> >>>> not be too expensive.
> >>>>
> >>>> Or (possibly simpler way) you could try to precalculate things, by
> >>>> writing a custom TokenFilter that takes a stream and returns token
> >>>> count to be used as a copyField target. Then you send your query to
> >>>> the same field with any full-query preserving syntax, either as a
> >>>> phrase or as a field query parser:
> >>>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
> >>>>
> >>>> I would love to know if any/all of this works for you.
> >>>>
> >>>> Regards,
> >>>> Alex.
> >>>>
> >>>> On 21 September 2018 at 09:00, marotosg <marot...@gmail.com> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I have to search for company names where my first requirement is to
> >>>>> find only exact matches on the company name.
> >>>>>
> >>>>> For instance if I search for "CENTURY BANCORP, INC." I shouldn't
> >>>>> find "NEW CENTURY BANCORP, INC."
> >>>>> because the result company has the extra keyword "NEW".
> >>>>>
> >>>>> I can't use exact match because the sequence of tokens may differ.
> >>>>> Basically I need to find results where the tokens are the same in
> >>>>> any order and the number of tokens matches.
> >>>>>
> >>>>> I have no idea whether it's possible to include the number of tokens
> >>>>> in the query and have a Solr field hold that count to match against.
> >>>>>
> >>>>> Thanks for your help
> >>>>> Sergio
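
For anyone finding this thread later, here is a rough client-side sketch of the "tokenize and uniquify" step Steve mentions, using the simple chain Erick describes (strip non-alphanumerics, split on whitespace, lowercase). It is untested against a real Solr instance. The function names uniqueTokens and buildExactTokenFilter and the field name company_name are made up for illustration; unique_token_count_i is the count field from Steve's script. It only gives correct results if your index analyzer really behaves like this chain, otherwise the Field Analysis API is the safer route.

```javascript
// Sketch only: mirrors an ASSUMED analysis chain (remove non-alphanumeric
// characters, split on whitespace, lowercase). Change it to match your schema.
function uniqueTokens(text) {
  const cleaned = text.replace(/[^A-Za-z0-9\s]/g, "");
  const tokens = cleaned
    .split(/\s+/)
    .filter(function (t) { return t.length > 0; })
    .map(function (t) { return t.toLowerCase(); });
  // A Set drops duplicates and keeps first-seen order.
  return Array.from(new Set(tokens));
}

// Builds the frange filter from the thread: the stored unique-token count
// minus the summed termfreq of the query terms must be exactly 0.
// `field` and `countField` are whatever names your schema uses (hypothetical here).
function buildExactTokenFilter(field, countField, text) {
  const terms = uniqueTokens(text);
  if (terms.length === 0) {
    throw new Error("no tokens left after analysis");
  }
  const freqs = terms.map(function (t) {
    return "termfreq(" + field + ",'" + t + "')";
  });
  // With a single term, skip sum() and use the termfreq call directly.
  const total = freqs.length === 1 ? freqs[0] : "sum(" + freqs.join(",") + ")";
  return "{!frange l=0 h=0}sub(" + countField + "," + total + ")";
}

console.log("fq=" + buildExactTokenFilter(
  "company_name", "unique_token_count_i", "CENTURY BANCORP, INC."));
```

The filter on its own only compares counts, so it probably needs to be paired with a main query that requires every term, along the lines of Erick's q=+company_name:(+century +bancorp +inc).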