dsmiley commented on pull request #1740:
URL: https://github.com/apache/lucene-solr/pull/1740#issuecomment-673640254


   > I wonder if this is something that we should enforce somewhere. It seems 
that you only want the original term to come first so maybe we can mark it with 
a special flag ?
   
   Even if the test is changed to not involve the original term (and thus not 
expect it in the output either), it would still reveal a problem with the 
"catenateAll" token (which is different from "preserveOriginal").
   
   > Why does it matter that it appears first ? Is it to be compatible with 
other filters like the synonym filter ?
   
   Intuitively, it "looks right" to me that it comes first: among tokens that 
start at the same position, the longest come first.  That's not a great reason, 
I realize.  A better reason is that we should tie-break on something so that 
the order is not arbitrary.  Also, at my company I have a TokenFilter that 
delegates to WDGF and presumes this ordering, which is how I found this 
inconsistency.
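   
   A minimal sketch of an offset-only ordering of this kind: start offset 
ascending, then end offset descending, so the longest token at a given start 
comes first.  This is not WDGF's actual sorter; `Tok` and `OFFSET_ORDER` are 
hypothetical names, and the sketch deliberately ignores position increment and 
position length.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OffsetOrderSketch {
    /** Hypothetical stand-in for an emitted token; not a Lucene class. */
    static final class Tok {
        final String term;
        final int startOffset;
        final int endOffset;

        Tok(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Start offset ascending, then end offset descending (longest first). */
    static final Comparator<Tok> OFFSET_ORDER =
        Comparator.comparingInt((Tok t) -> t.startOffset)
            .thenComparing(Comparator.comparingInt((Tok t) -> t.endOffset).reversed());

    /** Sorts a copy of the tokens and returns just their terms, in order. */
    static List<String> sortedTerms(List<Tok> toks) {
        List<Tok> copy = new ArrayList<>(toks);
        copy.sort(OFFSET_ORDER); // List.sort is stable: full ties keep input order
        List<String> out = new ArrayList<>();
        for (Tok t : copy) {
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up tokens for illustration; offsets are assumptions, not WDGF output.
        List<Tok> toks = List.of(
            new Tok("wi", 0, 2),
            new Tok("fi", 3, 5),
            new Tok("wifi", 0, 5),
            new Tok("wi-fi", 0, 5));
        System.out.println(sortedTerms(toks)); // prints [wifi, wi-fi, wi, fi]
    }
}
```

   In this made-up input, "wifi" and "wi-fi" share both offsets, so their 
relative order falls back to input order.  A tie like that is where an explicit 
extra tie-break would be needed to avoid an arbitrary result.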
   
   I want to try a "fuzz test" of WDGF to see whether a pure start & end offset 
based ordering is sufficient, or whether it is truly necessary to also look at 
the position increment and position length.  My theory is that, given what 
WDGF/WDF does, the offsets alone are fine to sort on: I don't think any later 
sub-token ("later" by offset) would have an earlier token position, and 
likewise I expect longer position lengths to come first.  The fuzz test would 
use a small dictionary of small sub-tokens and recombine them randomly to see 
whether the graph WDGF/WDF produces is identical to that of an alternate 
WDGF/WDF with tweaked token sort rules.  I don't think this would be 
committable, because the "alternate" would be a temporary change to the 
sorter's compare implementation that looks at a system property.
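
   A rough sketch of the shape such a fuzz test could take.  Everything here is 
an assumption for illustration: the dictionary, the delimiters, and the two 
`Function` parameters, which stand in for "run the text through the baseline 
WDGF/WDF" and "run it through the alternate one".  A real version would compare 
full token attributes (term, offsets, position increment, position length) 
rather than plain strings.

```java
import java.util.List;
import java.util.Random;
import java.util.function.Function;

public class RecombineFuzzSketch {
    // Hypothetical small dictionary of sub-tokens and the delimiters joining them.
    static final String[] PARTS = {"a", "bb", "1", "22", "Cc"};
    static final String[] JOINS = {"", "-", "_"};

    /** Builds one random input from 1..maxParts dictionary entries. */
    static String randomInput(Random r, int maxParts) {
        int n = 1 + r.nextInt(maxParts);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(JOINS[r.nextInt(JOINS.length)]);
            }
            sb.append(PARTS[r.nextInt(PARTS.length)]);
        }
        return sb.toString();
    }

    /**
     * Feeds the same random inputs to both analyzers (placeholders here) and
     * returns the first input on which their outputs differ, or null if none.
     */
    static String firstMismatch(long seed, int iterations,
                                Function<String, List<String>> baseline,
                                Function<String, List<String>> alternate) {
        Random r = new Random(seed); // fixed seed so a failure can be reproduced
        for (int i = 0; i < iterations; i++) {
            String input = randomInput(r, 5);
            if (!baseline.apply(input).equals(alternate.apply(input))) {
                return input;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Placeholder analyzers: both split on the delimiters, so they always agree.
        Function<String, List<String>> split = s -> List.of(s.split("[-_]"));
        System.out.println(firstMismatch(42L, 1000, split, split)); // prints null
    }
}
```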

