[jira] [Commented] (LUCENE-5620) LowerCaseFilter.preserveOriginal

Robert Muir (JIRA) Sat, 19 Apr 2014 14:51:26 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974982#comment-13974982
 ]


Robert Muir commented on LUCENE-5620:
-------------------------------------

{quote}
It makes me wonder whether length normalization shouldn't use max position 
instead of term count when it is available.
{quote}

This is your choice: it's whatever your similarity uses. Currently most of the 
similarities shipped with Lucene have an option you can choose. The problem is 
that if you are synonym-heavy, it's bad. But I opened an issue and changed the 
default to exactly this way in Lucene 3.1 because so many users were injecting 
'fake' terms without thinking about the consequences.
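
To make the trade-off concrete, here is a toy sketch (plain Java, not Lucene 
code) of how counting or skipping stacked tokens changes the length norm. It 
assumes the option mentioned above is the one that ignores tokens with a 
position increment of 0 when computing field length (in Lucene this is, as far 
as I know, the discountOverlaps setting on the similarity). The class name, 
method names, and the 1/sqrt(length) formula are illustrative assumptions.

```java
// Toy model only: names and formula are assumptions for illustration.
class OverlapNormSketch {

    // Counts tokens in a field. A position increment of 0 marks a token
    // stacked on the previous position (injected synonym, preserved original).
    static int fieldLength(int[] positionIncrements, boolean discountOverlaps) {
        int length = 0;
        for (int inc : positionIncrements) {
            if (discountOverlaps && inc == 0) {
                continue; // stacked token does not lengthen the field
            }
            length++;
        }
        return length;
    }

    // Classic shape of a length norm: longer fields score lower.
    static double lengthNorm(int fieldLength) {
        return 1.0 / Math.sqrt(fieldLength);
    }

    public static void main(String[] args) {
        // Two real words, each followed by one stacked token.
        int[] increments = {1, 0, 1, 0};
        int counted = fieldLength(increments, false);
        int discounted = fieldLength(increments, true);
        System.out.println("all tokens: length=" + counted
                + " norm=" + lengthNorm(counted));
        System.out.println("discounted: length=" + discounted
                + " norm=" + lengthNorm(discounted));
    }
}
```

With stacked tokens counted, a document that preserves originals looks twice 
as long as it really is and is penalized accordingly; discounting them keeps 
the norm tied to the number of occupied positions.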

{quote}
The drawbacks of the field splitting were:
1) QParser flexibility - not being forced to use a dismax defType in order to 
query multiple fields in a single query.
2) "readability" - the developer / user could see in a single place all the 
terms a query could match in an indexed document via the admin UI, without 
being asked to understand a parsedQuery string or the qf param.
3) term position - enabling a phrase query that would match "originalTerm 
stemmedTerm". Enabling it in a split field would mean saving the original 
term (dictionary and posting) twice.
4) perf (more of an anecdote) - as the terms were generally suffix stemmed we 
had good chances of loading the same term block and posting list to memory as 
they should be sequential.
{quote}

Those all sound like Solr problems, not relevant to any decisions to be made 
here. In Lucene (even the queryparser) you can override phrase queries to use 
the unstemmed field, for example. And if you want to do it that way, just index 
documents and frequencies only on the stemmed field (no proximity necessary 
there, just ordinary scoring).

> LowerCaseFilter.preserveOriginal
> --------------------------------
>
>                 Key: LUCENE-5620
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5620
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>         Attachments: LUCENE-5620.patch
>
>
> Following closely the model of LUCENE-5437 (which worked on 
> ASCIIFoldingFilter), this patch adds the ability to preserve the original 
> token to LowerCaseFilter.  This is useful if you want an all-lowercase search 
> term to match without regard to case, while search terms with uppercase 
> letters match in a case-sensitive manner. 



