Hi Sergio,

Chris “Hoss” Hostetter has a solution to this kind of problem here: https://lists.apache.org/thread.html/6b0f0cb864aa55f0a9eadfd92d27d374ab8deb16e8131ed2b7234463@%3Csolr-user.lucene.apache.org%3E . See also the suggestions in the comments on SOLR-12673[1], which include a version of Hoss’s solution.

Hoss’s solution assumes a multivalued StrField with values counted using CountFieldValuesUpdateProcessorFactory, which doesn’t apply to you. You could instead count unique tokens in an analyzed field using the StatelessScriptUpdateProcessorFactory[2][3], e.g. see slides 10 and 11 of Erik Hatcher’s Lucene/Solr Revolution 2013 talk[4].

Your script could look something like this (untested; replace "<field type>" with your field type and "<field>" with your field name):

=====
// Run the analyzer over the field value and count the distinct terms it emits.
function getUniqueTokenCount(analyzer, fieldName, fieldValue) {
  var uniqueTokens = {};
  var stream = analyzer.tokenStream(fieldName, fieldValue);
  var termAttr = stream.getAttribute(Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute);
  stream.reset();
  while (stream.incrementToken()) {
    uniqueTokens[termAttr.toString()] = 1;
  }
  stream.end();
  stream.close();
  return Object.keys(uniqueTokens).length;
}

function processAdd(cmd) {
  var doc = cmd.solrDoc; // the SolrInputDocument being added
  var content = doc.getFieldValue("<field>");
  if (content == null) return; // nothing to count
  var analyzer = req.getCore().getLatestSchema().getFieldTypeByName("<field type>").getIndexAnalyzer();
  doc.setField("unique_token_count_i", getUniqueTokenCount(analyzer, null, content.toString()));
}

// The remaining callbacks are not needed here and are intentionally empty.
function processDelete(cmd) { }
function processMergeIndexes(cmd) { }
function processCommit(cmd) { }
function processRollback(cmd) { }
function finish() { }
=====

And your query could then look something like this (replace "<field>" with your field name)[5][6][7]:

=====
fq={!frange l=0 h=0}sub(unique_token_count_i,sum(termfreq(<field>,'CENTURY'),termfreq(<field>,'BANCORP'),termfreq(<field>,'INC')))
=====

Note that to construct the query above you’ll need to tokenize and uniquify the terms on the client side. If tokenization is non-trivial, you could use Solr's Field Analysis API[8] to perform the tokenization for you.

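For the client-side step, here is a minimal sketch (also untested against a real Solr instance) of how the filter string could be assembled. The helper names uniqueTokens and buildExactTokenFilter are hypothetical, and the tokenizer assumes a very simple analysis chain: split on anything that is not an ASCII letter or digit, with no case folding. If your analysis chain differs, swap uniqueTokens for something that mirrors it, or for the output of the Field Analysis API.

```javascript
// Hypothetical helper: split text into unique tokens, preserving first-seen order.
// ASSUMPTION: the index-time analysis splits on non-alphanumeric characters
// and does not change case. Adjust to match your real analysis chain.
function uniqueTokens(text) {
  var seen = Object.create(null);
  var tokens = [];
  text.split(/[^A-Za-z0-9]+/).forEach(function (t) {
    if (t.length > 0 && !(t in seen)) {
      seen[t] = true;
      tokens.push(t);
    }
  });
  return tokens;
}

// Hypothetical helper: build the frange filter value (without the "fq=" prefix).
// field is the analyzed field, countField the field holding the unique token count.
function buildExactTokenFilter(field, countField, text) {
  var terms = uniqueTokens(text).map(function (t) {
    return "termfreq(" + field + ",'" + t + "')";
  });
  if (terms.length === 0) {
    throw new Error("no tokens found in: " + text);
  }
  // With a single term there is nothing to add up, so skip the sum() wrapper.
  var total = terms.length === 1 ? terms[0] : "sum(" + terms.join(",") + ")";
  return "{!frange l=0 h=0}sub(" + countField + "," + total + ")";
}
```

For example, buildExactTokenFilter("company_name", "unique_token_count_i", "CENTURY BANCORP, INC.") yields the same shape as the fq above, with company_name standing in for your field. Remember to URL-encode the value (e.g. with encodeURIComponent) before putting it in the request.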
[1] https://issues.apache.org/jira/browse/SOLR-12673
[2] https://wiki.apache.org/solr/ScriptUpdateProcessor
[3] https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
[4] https://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks
[5] https://lucene.apache.org/solr/guide/7_4/other-parsers.html#OtherParsers-FunctionRangeQueryParser
[6] https://lucene.apache.org/solr/guide/7_4/function-queries.html#termfreq-function
[7] https://lucene.apache.org/solr/guide/7_4/function-queries.html#sub-function
[8] https://lucene.apache.org/solr/guide/7_4/implicit-requesthandlers.html#analysis-handlers

--
Steve
www.lucidworks.com

> On Sep 21, 2018, at 10:45 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> A variant on Alexandre's approach is:
> at index time, count the tokens that will be produced yourself (this
> may be a little tricky, you shouldn't have WordDelimiterFilterFactory
> in your analysis for instance).
> Put the number of tokens in a separate field
> At query time, you'd search q=+company_name:(+century +bancorp +inc)
> +tokens_in_company_name_field:3
>
> You don't need phrase queries with this approach, order doesn't matter.
>
> It can get tricky though, should "CENTURY BANCORP, INC." and "CENTURY
> BANCORP, INCORPORATED." match?
>
> Again, though, this means your indexing code has to do the same thing
> as your analysis chain. Which isn't very hard if the analysis chain is
> simple. I might use a char _filter_ factory to remove all
> non-alphanumeric characters, then a whitespace tokenizer and
> (probably) a lowercasefilter. That's pretty easy to replicate in order
> to count tokens.
>
> Best,
> Erick
> On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
> <arafa...@gmail.com> wrote:
>>
>> I think you can match everything in the query to the field using either
>> 1) disMax/eDisMax with mm=100%
>> https://lucene.apache.org/solr/guide/7_4/the-dismax-query-parser.html#mm-minimum-should-match-parameter
>> 2) Complex Phrase Query Parser with inOrder=false:
>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>
>> The number of tokens though is hard. You only know what your tokens
>> are at the end of the indexing pipeline. And during search, the tokens
>> are looked up from their indexes and only then the documents are
>> looked up.
>>
>> You may be able to do this with custom Postfilter that would run after
>> everything else to just reject records with extra tokens. That would
>> not be too expensive.
>>
>> Or (possibly simpler way) you could try to precalculate things, by
>> writing a custom TokenFilter that takes a stream and returns token
>> count to be used as a copyField target. Then you send your query to
>> the same field with any full-query preserving syntax, either as a
>> phrase or as a field query parser:
>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>
>> I would love to know if any/all of this works for you.
>>
>> Regards,
>> Alex.
>>
>> On 21 September 2018 at 09:00, marotosg <marot...@gmail.com> wrote:
>>> Hi,
>>>
>>> I have to search for company names where my first requirement is to find
>>> only exact matches on the company name.
>>>
>>> For instance if I search for "CENTURY BANCORP, INC." I shouldn't find "NEW
>>> CENTURY BANCORP, INC."
>>> because the result company has the extra keyword "NEW".
>>>
>>> I can't use exact match because the sequence of tokens may differ. Basically
>>> I need to find results where the tokens are the same in any order and the
>>> number of tokens match.
>>>
>>> I have no idea if it's possible as include in the query the number of tokens
>>> and solr field has that info within to match it.
>>>
>>> Thanks for your help
>>> Sergio
>>>
>>>
>>>
>>> --
>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html