[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594154#comment-16594154 ]

Ivan Provalov commented on LUCENE-7321:
---------------------------------------

[~arafalov], the clean use case for this filter is to externalize the morphological modification rules. Most stemmers have hard-coded rules; with this one, the rules are expressed in flat mapping files and configuration. Originally, it was developed to extend a few cases for the languages listed here and a few other languages, as well as to visualize these rules, which helped the linguists involved in the project understand the modification rules in more complex scenarios. I added the Russian stemmer implementation as a general reference, just to show how one can configure an entire stemmer without hard-coded rules. We have not seen any performance issues with this so far. Hope this helps.

> Character Mapping
> -----------------
>
>                 Key: LUCENE-7321
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7321
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.6.1, 5.4.1, 6.0, 6.0.1
>            Reporter: Ivan Provalov
>            Priority: Minor
>              Labels: patch
>             Fix For: 6.0.1
>
>         Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>
>
> One of the challenges in search is recall of an item with a common typing variant. These cases can be as simple as lower/upper case in most languages, accented characters, or more complex morphological phenomena like prefix omission, or constructing a character with some combining mark. This component addresses the cases which are not covered by the ASCII folding component, or are too complex to design with other tools. The idea is that a linguist could provide the mappings in a tab-delimited file, which can then be used directly by Solr.
> The mappings are maintained in the tab-delimited file, which could be just a copy-paste from an Excel spreadsheet. This gives the linguists the opportunity to create the mappings, and the developer to include them in the Solr configuration. There are a few cases, when the mappings grow complex, where some additional debugging may be required. The mappings can map any sequence of characters to any other sequence of characters.
> Some of the cases I discuss in the detailed document are handling the voiced vowels for Japanese; common typing substitutions for Korean, Russian, and Polish; transliteration for Polish and Arabic; prefix removal for Arabic; and suffix folding for Japanese. In the appendix, I give an example of implementing a Russian lightweight stemmer using this component.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
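The mapping mechanism the description outlines (tab-delimited source/target pairs applied to token text, where any character sequence can map to any other) can be sketched in a few lines. This is a simplified, self-contained stand-in, not the code from LUCENE-7321.patch; the greedy longest-match strategy and the exact file format are assumptions for illustration:

```java
import java.util.*;

// Sketch of applying tab-delimited mapping rules to a token.
// Each rule maps a source character sequence to a target sequence;
// matching here is greedy, preferring the longest source at each position.
public class MappingSketch {
    private final Map<String, String> rules = new HashMap<>();
    private int maxSourceLen = 0;

    // Parse rules from tab-delimited lines, e.g. "ue\t\u00fc".
    public MappingSketch(List<String> tabDelimitedLines) {
        for (String line : tabDelimitedLines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                rules.put(parts[0], parts[1]);
                maxSourceLen = Math.max(maxSourceLen, parts[0].length());
            }
        }
    }

    public String apply(String token) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < token.length()) {
            String replacement = null;
            int matched = 0;
            // Try the longest possible source sequence first.
            for (int len = Math.min(maxSourceLen, token.length() - i); len >= 1; len--) {
                String candidate = token.substring(i, i + len);
                if (rules.containsKey(candidate)) {
                    replacement = rules.get(candidate);
                    matched = len;
                    break;
                }
            }
            if (replacement != null) {
                out.append(replacement);
                i += matched;
            } else {
                out.append(token.charAt(i)); // no rule: copy through unchanged
                i++;
            }
        }
        return out.toString();
    }
}
```

With rules "sch → s" and "ue → ü", a token like "schuetze" would come out as "sütze", which is the kind of typing-variant folding the description targets.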
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593976#comment-16593976 ]

Ivan Provalov commented on LUCENE-7321:
---------------------------------------

[~erickerickson], good questions:
1. I just ran the tests in the patch against master; they passed.
2. It allows you to configure/modify morphological analysis with externalized mapping files. I attached a description and a reference implementation of the Russian stemmer using this filter.

Thanks,
Ivan
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593779#comment-16593779 ]

Ivan Provalov commented on LUCENE-7321:
---------------------------------------

[~erike4...@yahoo.com], any progress on committing this patch?

Thanks,
Ivan
[jira] [Commented] (LUCENE-8131) Kuromoji User Dictionary Resources Not Closed
[ https://issues.apache.org/jira/browse/LUCENE-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326653#comment-16326653 ]

Ivan Provalov commented on LUCENE-8131:
---------------------------------------

Thanks, [~hossman]!

> Kuromoji User Dictionary Resources Not Closed
> ---------------------------------------------
>
>                 Key: LUCENE-8131
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8131
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.3
>            Reporter: Ivan Provalov
>            Priority: Major
>             Fix For: 5.5.4, 6.4, 7.0
>
>
> InputStream and Reader need to be closed in JapaneseTokenizerFactory.
[jira] [Created] (LUCENE-8131) Kuromoji User Dictionary Resources Not Closed
Ivan Provalov created LUCENE-8131:
-------------------------------------

             Summary: Kuromoji User Dictionary Resources Not Closed
                 Key: LUCENE-8131
                 URL: https://issues.apache.org/jira/browse/LUCENE-8131
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 6.3
            Reporter: Ivan Provalov

InputStream and Reader need to be closed in JapaneseTokenizerFactory.
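The fix for this class of leak is the standard try-with-resources idiom: declare both the InputStream and the Reader that wraps it in the try header, so close() is guaranteed even if parsing throws. A schematic illustration with generic stand-in names, not the actual JapaneseTokenizerFactory code:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Schematic of the resource-leak fix: the stream, the wrapping reader, and
// the buffered reader are all declared in try-with-resources, so each is
// closed (in reverse order) even if reading the dictionary throws.
public class UserDictLoader {
    public static String load(byte[] dictionaryBytes) {
        try (InputStream stream = new ByteArrayInputStream(dictionaryBytes);
             Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8);
             BufferedReader br = new BufferedReader(reader)) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Before Java 7 the same guarantee required nested try/finally blocks, which is exactly the kind of cleanup that tends to get missed, as it was here.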
[jira] [Updated] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query
[ https://issues.apache.org/jira/browse/SOLR-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Provalov updated SOLR-9942:
--------------------------------
    Attachment:     (was: solr_mlt_test.tar)

> MoreLikeThis Performance Degraded With Filtered Query
> -----------------------------------------------------
>
>                 Key: SOLR-9942
>                 URL: https://issues.apache.org/jira/browse/SOLR-9942
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: MoreLikeThis
>    Affects Versions: 5.5.2
>            Reporter: Ivan Provalov
>         Attachments: solr_mlt_test2.tar
>
>
> Without any filters, the MLT performs normally. With any added filters, the performance degrades compared to 4.6.1 (2.5-3.0X in our case). The issue goes away with the 6.0 upgrade. The hot method is Lucene's DisiPriorityQueue downHeap(), which takes 5X more calls in 5.5.2 compared to 6.0. I am guessing that some of the Solr filter refactoring fixed it for the 6.0 release.
> As a work-around, for now I just refactored the custom MLT handler to convert the filters into boolean clauses, which takes care of the issue.
> Our configuration:
> 1. mlt.maxqt=100
> 2. There is an additional filter passed as a parameter
> 3. multiValued="true" omitNorms="false" termVectors="true"/>
> 4. text_general is a pretty standard text fieldType.
> I have code to populate a test dataset and run a query in order to reproduce this.
[jira] [Updated] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query
[ https://issues.apache.org/jira/browse/SOLR-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Provalov updated SOLR-9942:
--------------------------------
    Attachment: solr_mlt_test2.tar

test case
[jira] [Updated] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query
[ https://issues.apache.org/jira/browse/SOLR-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Provalov updated SOLR-9942:
--------------------------------
    Description: 
Without any filters, the MLT performs normally. With any added filters, the performance degrades compared to 4.6.1 (2.5-3.0X in our case). The issue goes away with the 6.0 upgrade. The hot method is Lucene's DisiPriorityQueue downHeap(), which takes 5X more calls in 5.5.2 compared to 6.0. I am guessing that some of the Solr filter refactoring fixed it for the 6.0 release.

As a work-around, for now I just refactored the custom MLT handler to convert the filters into boolean clauses, which takes care of the issue.

Our configuration:
1. mlt.maxqt=100
2. There is an additional filter passed as a parameter
3.
4. text_general is a pretty standard text fieldType.

I have code to populate a test dataset and run a query in order to reproduce this.

  was:
Without any filters, the MLT performs normally. With any added filters, the performance degrades compared to 4.6.1 (2.5-3.0X in our case). The issue goes away with the 6.0 upgrade. The hot method is Lucene's DisiPriorityQueue downHeap(), which takes 5X more calls in 5.5.2 compared to 6.0. I am guessing that some of the Solr filter refactoring fixed it for the 6.0 release.

As a work-around, for now I just refactored the custom MLT handler to convert the filters into boolean clauses, which takes care of the issue.

Our configuration:
1. mlt.maxqt=100
2. There is an additional filter passed as a parameter
3.
4. text_en is a pretty standard text fieldType.

I have code to populate a test dataset and run a query in order to reproduce this.
[jira] [Updated] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query
[ https://issues.apache.org/jira/browse/SOLR-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Provalov updated SOLR-9942:
--------------------------------
    Description: 
Without any filters, the MLT performs normally. With any added filters, the performance degrades compared to 4.6.1 (2.5-3.0X in our case). The issue goes away with the 6.0 upgrade. The hot method is Lucene's DisiPriorityQueue downHeap(), which takes 5X more calls in 5.5.2 compared to 6.0. I am guessing that some of the Solr filter refactoring fixed it for the 6.0 release.

As a work-around, for now I just refactored the custom MLT handler to convert the filters into boolean clauses, which takes care of the issue.

Our configuration:
1. mlt.maxqt=100
2. There is an additional filter passed as a parameter
3.
4. text_en is a pretty standard text fieldType.

I have code to populate a test dataset and run a query in order to reproduce this.

  was:
Without any filters, the MLT performs normally. With any added filters, the performance degrades (2.5-3.0X in our case). The issue goes away with the 6.0 upgrade. The hot method is Lucene's DisiPriorityQueue downHeap(), which takes 5X more calls in 5.5.2 compared to 6.0. I am guessing that some of the Solr filter refactoring fixed it for the 6.0 release.

As a work-around, for now I just refactored the custom MLT handler to convert the filters into boolean clauses, which takes care of the issue.

Our configuration:
1. mlt.maxqt=100
2. There is an additional filter passed as a parameter
3.
4. text_en is a pretty standard text fieldType.

I have code to populate a test dataset and run a query in order to reproduce this.
[jira] [Updated] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query
[ https://issues.apache.org/jira/browse/SOLR-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Provalov updated SOLR-9942:
--------------------------------
    Attachment: solr_mlt_test.tar

test for mlt performance issue
[jira] [Created] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query
Ivan Provalov created SOLR-9942:
-----------------------------------

             Summary: MoreLikeThis Performance Degraded With Filtered Query
                 Key: SOLR-9942
                 URL: https://issues.apache.org/jira/browse/SOLR-9942
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: MoreLikeThis
    Affects Versions: 5.5.2
            Reporter: Ivan Provalov

Without any filters, the MLT performs normally. With any added filters, the performance degrades (2.5-3.0X in our case). The issue goes away with the 6.0 upgrade. The hot method is Lucene's DisiPriorityQueue downHeap(), which takes 5X more calls in 5.5.2 compared to 6.0. I am guessing that some of the Solr filter refactoring fixed it for the 6.0 release.

As a work-around, for now I just refactored the custom MLT handler to convert the filters into boolean clauses, which takes care of the issue.

Our configuration:
1. mlt.maxqt=100
2. There is an additional filter passed as a parameter
3.
4. text_en is a pretty standard text fieldType.

I have code to populate a test dataset and run a query in order to reproduce this.
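The work-around mentioned (folding the filters into the main query as boolean clauses instead of passing them separately) can be illustrated at the query-string level. This is a toy model only; the real handler would build a Lucene BooleanQuery, and none of the names below come from the actual custom MLT code:

```java
import java.util.List;

// Toy model of the work-around: rather than sending the filters as separate
// fq parameters (which hit the slow DisiPriorityQueue path in 5.5.2), fold
// each filter into the main query as a required clause. Plain query-string
// syntax keeps the sketch self-contained; real code would use
// BooleanQuery.Builder with Occur.FILTER clauses instead.
public class FilterFolding {
    public static String foldFilters(String mainQuery, List<String> filters) {
        StringBuilder sb = new StringBuilder("+(").append(mainQuery).append(')');
        for (String f : filters) {
            sb.append(" +(").append(f).append(')'); // each filter becomes a required clause
        }
        return sb.toString();
    }
}
```

So a request with q=title:foo and filters type:book, lang:en would be rewritten into the single query `+(title:foo) +(type:book) +(lang:en)` before execution.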
[jira] [Created] (SOLR-9730) IndexSchema Dynamic Field Definition Caching
Ivan Provalov created SOLR-9730:
-----------------------------------

             Summary: IndexSchema Dynamic Field Definition Caching
                 Key: SOLR-9730
                 URL: https://issues.apache.org/jira/browse/SOLR-9730
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Schema and Analysis
    Affects Versions: 6.2.1, 5.5.2
            Reporter: Ivan Provalov
            Priority: Minor

A small optimization suggestion for the IndexSchema class: cache the definitions of the dynamic fields:

{code}
private Map<String, SchemaField> cachedDynamicFields = new HashMap<>();

@Override
public SchemaField getFieldOrNull(String fieldName) {
  SchemaField f = fields.get(fieldName);
  if (f != null) return f;
  f = cachedDynamicFields.get(fieldName);
  if (f != null) return f;
  for (DynamicField df : dynamicFields) {
    if (df.matches(fieldName)) {
      f = df.makeSchemaField(fieldName);
      cachedDynamicFields.put(fieldName, f);
      return f;
    }
  }
  return f;
}
{code}

Are there any reasons not to do this?
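One consideration with the snippet above is concurrency: getFieldOrNull can be called by many request threads at once, and a plain HashMap is not safe for concurrent writes. A ConcurrentHashMap with computeIfAbsent gives the same caching without that hazard. The following is a sketch with simplified stand-in types, not the actual IndexSchema/SchemaField/DynamicField classes:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins for Solr's SchemaField / DynamicField.
class SchemaField {
    final String name;
    SchemaField(String name) { this.name = name; }
}

class DynamicField {
    private final String suffix; // e.g. "_s"; real dynamic fields use glob patterns
    DynamicField(String suffix) { this.suffix = suffix; }
    boolean matches(String fieldName) { return fieldName.endsWith(suffix); }
    SchemaField makeSchemaField(String fieldName) { return new SchemaField(fieldName); }
}

public class CachedSchema {
    private final Map<String, SchemaField> fields = new HashMap<>();
    private final List<DynamicField> dynamicFields = new ArrayList<>();
    // Thread-safe cache of resolved dynamic-field instances.
    private final ConcurrentHashMap<String, SchemaField> cachedDynamicFields =
            new ConcurrentHashMap<>();

    public CachedSchema(List<DynamicField> dynamicFields) {
        this.dynamicFields.addAll(dynamicFields);
    }

    public SchemaField getFieldOrNull(String fieldName) {
        SchemaField f = fields.get(fieldName);
        if (f != null) return f;
        // computeIfAbsent is atomic: two threads resolving the same name cannot
        // corrupt the map. Returning null records no entry, so misses are
        // simply re-scanned on the next call (no negative caching).
        return cachedDynamicFields.computeIfAbsent(fieldName, name -> {
            for (DynamicField df : dynamicFields) {
                if (df.matches(name)) return df.makeSchemaField(name);
            }
            return null;
        });
    }
}
```

The other usual caveat for this kind of cache is unbounded growth when field names are generated dynamically, which may be why a plain per-schema map was not already in place.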
[jira] [Commented] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores
[ https://issues.apache.org/jira/browse/LUCENE-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566543#comment-15566543 ]

Ivan Provalov commented on LUCENE-7486:
---------------------------------------

+1

> DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores
> ----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7486
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7486
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 5.5.2
>            Reporter: Ivan Provalov
>            Assignee: Uwe Schindler
>             Fix For: 6.x, master (7.0), 6.3
>
>         Attachments: LUCENE-7486.patch, LUCENE-7486.patch
>
>
> We are using a log of probability for scoring, which gives us negative scores.
> DisjunctionMaxScorer initializes scoreMax in the score(...) function to zero, preventing us from using negative scores. Is there a reason it couldn't be initialized to something like this:
> float scoreMax = Float.MAX_VALUE * -1;
[jira] [Commented] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores
[ https://issues.apache.org/jira/browse/LUCENE-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563360#comment-15563360 ]

Ivan Provalov commented on LUCENE-7486:
---------------------------------------

Thanks, Uwe!
[jira] [Issue Comment Deleted] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores
[ https://issues.apache.org/jira/browse/LUCENE-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Provalov updated LUCENE-7486:
----------------------------------
    Comment: was deleted

(was: Thanks, Uwe!)
[jira] [Commented] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores
[ https://issues.apache.org/jira/browse/LUCENE-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563361#comment-15563361 ]

Ivan Provalov commented on LUCENE-7486:
---------------------------------------

Thanks, Uwe!
[jira] [Commented] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores
[ https://issues.apache.org/jira/browse/LUCENE-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15562771#comment-15562771 ]

Ivan Provalov commented on LUCENE-7486:
---------------------------------------

Good point, Uwe. Is there a reason it shouldn't be done in the Lucene source?
[jira] [Created] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores
Ivan Provalov created LUCENE-7486:
-------------------------------------

             Summary: DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores
                 Key: LUCENE-7486
                 URL: https://issues.apache.org/jira/browse/LUCENE-7486
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/search
    Affects Versions: 5.5.2
            Reporter: Ivan Provalov

We are using a log of probability for scoring, which gives us negative scores.

DisjunctionMaxScorer initializes scoreMax in the score(...) function to zero, preventing us from using negative scores. Is there a reason it couldn't be initialized to something like this:

float scoreMax = Float.MAX_VALUE * -1;
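The problem generalizes to any max-reduction seeded with 0: when every input is negative, the seed wins and the true maximum is lost. A minimal illustration in plain Java (not the DisjunctionMaxScorer code itself):

```java
// Seeding a max-reduction with 0 silently clamps all-negative score sets,
// which is exactly what breaks log-probability scoring. Seeding with
// Float.NEGATIVE_INFINITY (or -Float.MAX_VALUE, equivalent to the
// Float.MAX_VALUE * -1 suggested above) is the identity element for max.
public class MaxSeedDemo {
    public static float maxSeededWithZero(float[] scores) {
        float max = 0f; // buggy seed: never smaller than any negative score
        for (float s : scores) max = Math.max(max, s);
        return max;
    }

    public static float maxSeededWithNegInf(float[] scores) {
        float max = Float.NEGATIVE_INFINITY; // correct seed
        for (float s : scores) max = Math.max(max, s);
        return max;
    }
}
```

For log-probability scores such as {-2.3, -0.7, -5.1}, the zero-seeded version reports 0 instead of the correct maximum -0.7.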
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321865#comment-15321865 ]

Ivan Provalov commented on LUCENE-7321:
---------------------------------------

Koji, this one works at the token level, allowing you to do things like prefix/suffix manipulation. The graph generator and collapser also make it user-friendly when dealing with a lot of mappings (please see the attached description file).
[jira] [Updated] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Provalov updated LUCENE-7321: -- Description: One of the challenges in search is recall of an item with a common typing variant. These cases can be as simple as lower/upper case in most languages, accented characters, or more complex morphological phenomena like prefix omitting, or constructing a character with some combining mark. This component addresses the cases, which are not covered by ASCII folding component, or more complex to design with other tools. The idea is that a linguist could provide the mappings in a tab-delimited file, which then can be directly used by Solr. The mappings are maintained in the tab-delimited file, which could be just a copy paste from Excel spreadsheet. This gives the linguists the opportunity to create the mappings, then for the developer to include them in Solr configuration. There are a few cases, when the mappings grow complex, where some additional debugging may be required. The mappings can contain any sequence of characters to any other sequence of characters. Some of the cases I discuss in detail document are handling the voiced vowels for Japanese; common typing substitutions for Korean, Russian, Polish; transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding for Japanese. In the appendix, I give an example of implementing a Russian light weight stemmer using this component. was: One of the challenges in search is recall of an item with a common typing variant. These cases can be as simple as lower/upper case in most languages, accented characters, or more complex morphological phenomena like prefix omitting, or constructing a character with some combining mark. This component addresses the cases, which are not covered by ASCII folding component, or more complex to design with other tools. 
The idea is that a linguist could provide the mappings in a tab-delimited file, which then can be directly used by Solr. The mappings are maintained in the tab-delimited file, which could be just a copy paste from Excel spreadsheet. This gives the linguists the opportunity to create the mappings, then for the developer to include them in Solr configuration. There are a few cases, when the mappings grow complex, where some additional debugging may be required. The mappings can contain any sequence of characters to any other sequence of characters. Some of the cases I discuss in detail document are handling the voiced vowels for Japanese; common typing substitutions for Korean, Russian, Polish; transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding for Japanese.
[jira] [Updated] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Provalov updated LUCENE-7321: -- Attachment: CharacterMappingComponent.pdf Detailed component description.
[jira] [Updated] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Provalov updated LUCENE-7321: -- Attachment: LUCENE-7321.patch Initial patch.
[jira] [Created] (LUCENE-7321) Character Mapping
Ivan Provalov created LUCENE-7321: - Summary: Character Mapping Key: LUCENE-7321 URL: https://issues.apache.org/jira/browse/LUCENE-7321 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 6.0.1, 5.4.1, 6.0, 4.6.1 Reporter: Ivan Provalov Priority: Minor Fix For: 6.0.1 One of the challenges in search is recall of an item with a common typing variant. These cases can be as simple as lower/upper case in most languages, accented characters, or more complex morphological phenomena like prefix omitting, or constructing a character with some combining mark. This component addresses the cases, which are not covered by ASCII folding component, or more complex to design with other tools. The idea is that a linguist could provide the mappings in a tab-delimited file, which then can be directly used by Solr. The mappings are maintained in the tab-delimited file, which could be just a copy paste from Excel spreadsheet. This gives the linguists the opportunity to create the mappings, then for the developer to include them in Solr configuration. There are a few cases, when the mappings grow complex, where some additional debugging may be required. The mappings can contain any sequence of characters to any other sequence of characters. Some of the cases I discuss in detail document are handling the voiced vowels for Japanese; common typing substitutions for Korean, Russian, Polish; transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding for Japanese.
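The tab-delimited mapping idea above can be illustrated outside Lucene. The sketch below is a rough approximation, not the component's actual implementation: the function names are hypothetical, and the greedy longest-match-first strategy is an assumption about how multi-character source sequences would be resolved.

```python
def load_mappings(text):
    # Each line: <source>\t<target> -- e.g. a copy-paste from a spreadsheet.
    # Blank lines and '#' comment lines are skipped (an assumed convention).
    mappings = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        src, _, dst = line.partition("\t")
        mappings[src] = dst
    return mappings

def apply_mappings(token, mappings):
    # Greedy longest-match-first: at each position, try the longest source
    # sequence that matches; characters with no mapping pass through as-is.
    max_len = max((len(s) for s in mappings), default=0)
    out, i = [], 0
    while i < len(token):
        for n in range(min(max_len, len(token) - i), 0, -1):
            chunk = token[i:i + n]
            if chunk in mappings:
                out.append(mappings[chunk])
                i += n
                break
        else:
            out.append(token[i])
            i += 1
    return "".join(out)
```

Because any sequence of characters can map to any other sequence, the same mechanism covers simple accent folding ("ä" → "a"), expansions ("ß" → "ss"), and deletions (a prefix mapped to the empty string).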
[jira] [Commented] (SOLR-3931) Turn off coord() factor for scoring
[ https://issues.apache.org/jira/browse/SOLR-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938785#comment-14938785 ] Ivan Provalov commented on SOLR-3931: - Ideally, I would like to plug in a similarity class at search time, given the norm encoding is compatible across the index time and search time. Right now, this requires a lot of custom extensions. > Turn off coord() factor for scoring > --- > > Key: SOLR-3931 > URL: https://issues.apache.org/jira/browse/SOLR-3931 > Project: Solr > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Bill Bell > > We would like to remove the coordination factor from scoring. > For small fields (like the name of a doctor), we do not want to score higher if the > same term is in the field more than once. Makes sense for books, not so much > for formal names. > /solr/select?q=*:*&coordFactor=false > Default is true. > (Note: we might want to make each of these optional - tf, idf, coord, > queryNorm) > coord(q,d) is a score factor based on how many of the query terms are found > in the specified document. Typically, a document that contains more of the > query's terms will receive a higher score than another document with fewer > query terms. This is a search time factor computed in coord(q,d) by the > Similarity in effect at search time.
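For context, the coord() factor being discussed multiplies the summed per-term scores by the fraction of query terms the document matched. The following is a minimal sketch of that arithmetic only (hypothetical function names; TF/IDF, norms, and queryNorm are omitted), not Solr's scoring code:

```python
def coord(overlap, max_overlap):
    # Fraction of the query's terms found in the document,
    # as described in the quoted coord(q,d) documentation.
    return overlap / max_overlap

def score(term_scores, num_query_terms, use_coord=True):
    # Sum the per-term scores; optionally damp by coord().
    # With use_coord=False, matching 2 of 4 terms is not penalized,
    # which is what the issue asks to make configurable.
    s = sum(term_scores)
    if use_coord:
        s *= coord(len(term_scores), num_query_terms)
    return s
```

With coord on, a document matching 2 of 4 query terms has its summed score halved; with coord off, the sum is used directly.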
[jira] Updated: (LUCENE-474) High Frequency Terms/Phrases at the Index level
[ https://issues.apache.org/jira/browse/LUCENE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Provalov updated LUCENE-474: - Attachment: (was: collocations.zip) > High Frequency Terms/Phrases at the Index level > --- > > Key: LUCENE-474 > URL: https://issues.apache.org/jira/browse/LUCENE-474 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 1.4 >Reporter: Suri Babu B >Assignee: Otis Gospodnetic >Priority: Minor > Attachments: colloc.zip, collocations.zip > > > We should be able to find all the high frequency terms/phrases (where > frequency is the search criteria / benchmark) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-474) High Frequency Terms/Phrases at the Index level
[ https://issues.apache.org/jira/browse/LUCENE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Provalov updated LUCENE-474: - Attachment: collocations.zip
[jira] Updated: (LUCENE-474) High Frequency Terms/Phrases at the Index level
[ https://issues.apache.org/jira/browse/LUCENE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Provalov updated LUCENE-474: - Attachment: (was: collocations.zip)
[jira] Updated: (LUCENE-474) High Frequency Terms/Phrases at the Index level
[ https://issues.apache.org/jira/browse/LUCENE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Provalov updated LUCENE-474: - Attachment: collocations.zip Included the scoring in the CollocationsSearcher, which now returns a LinkedHashMap of collocated terms and their scores relative to the query term. Did some minor refactoring and changed the test.
[jira] Updated: (LUCENE-474) High Frequency Terms/Phrases at the Index level
[ https://issues.apache.org/jira/browse/LUCENE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Provalov updated LUCENE-474: - Attachment: collocations.zip I saw some activity on term collocations in the Lucene user forum recently and decided to make a few changes to the colloc.zip package which Mark worked on. I used it before and it worked well for my project. I ended up doing some fixes and refactoring, adding a couple of unit tests, as well as a new class which will search the collocated terms if provided with a term. This version works with Lucene 3.0.2. Also, I changed package names, added the license verbiage, and added Maven and Ant for contrib packaging. If Mark is OK with these changes, it could be published as a contrib.
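The "collocated terms with scores relative to a query term, returned in order" idea can be sketched with a simple windowed co-occurrence count. This is an illustrative simplification, not the actual CollocationsSearcher scoring; the function name, the window parameter, and the normalized-count score are all assumptions for the sketch (an insertion-ordered dict plays the role of the LinkedHashMap):

```python
from collections import Counter

def collocations(docs, query_term, window=2):
    # Count terms co-occurring with query_term within +/- window positions,
    # then return {term: normalized score}, highest score first.
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for i, tok in enumerate(tokens):
            if tok != query_term:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    # most_common() yields descending counts; dict keeps that order.
    return {t: c / total for t, c in counts.most_common()}
```

A real implementation over a Lucene index would read term positions from the index rather than re-tokenizing document text, but the scoring shape is the same.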
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1285#action_1285 ] Ivan Provalov commented on LUCENE-2458: --- Robert has asked me to post our test results on the Chinese Collection. We used the following data collection from TREC: http://trec.nist.gov/data/qrels_noneng/index.html qrels.trec6.29-54.chinese.gz qrels.1-28.chinese.gz http://trec.nist.gov/data/topics_noneng TREC-6 Chinese topics (.gz) TREC-5 Chinese topics (.gz) Mandarin Data Collection http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T52

Analyzer Name     Plain analyzers   Added PositionFilter (only at query time)
ChineseAnalyzer   0.028             0.264
CJKAnalyzer       0.027             0.284
SmartChinese      0.027             0.265
IKAnalyzer        0.028             0.259

(Note: IKAnalyzer has its own IKQueryParser, which yields 0.084 for the average precision) Thanks, Ivan Provalov > queryparser shouldn't generate phrasequeries based on term count > > > Key: LUCENE-2458 > URL: https://issues.apache.org/jira/browse/LUCENE-2458 > Project: Lucene - Java > Issue Type: Bug > Components: QueryParser >Reporter: Robert Muir >Priority: Critical > > The current method in the queryparser to generate phrasequeries is wrong: > The Query Syntax documentation > (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states: > {noformat} > A Phrase is a group of words surrounded by double quotes such as "hello > dolly". > {noformat} > But as we know, this isn't actually true. > Instead the terms are first divided on whitespace, then the analyzer term > count is used as some sort of "heuristic" to determine if its a phrase query > or not. > This assumption is a disaster for languages that don't use whitespace > separation: CJK, compounding European languages like German, Finnish, etc. It > also > makes it difficult for people to use n-gram analysis techniques.
In these > cases you get bad relevance (MAP improves nearly *10x* if you use a > PositionFilter at query-time to "turn this off" for Chinese). > Even for English, this undocumented behavior is bad. Perhaps in some cases > it's being abused as some heuristic to "second guess" the tokenizer and piece > back things it shouldn't have split, but for large collections, doing things > like generating phrasequeries because StandardTokenizer split a compound on a > dash can cause serious performance problems. Instead people should analyze > their text with the appropriate methods, and QueryParser should only generate > phrase queries when the syntax asks for one. > The PositionFilter in contrib can be seen as a workaround, but it's pretty > obscure and people are not familiar with it. The result is we have bad > out-of-box behavior for many languages, and bad performance for others on > some inputs. > I propose instead that we change the grammar to actually look for double > quotes to determine when to generate a phrase query, consistent with the > documentation.
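The term-count heuristic Robert describes, and the proposed quotes-only behavior, can be mimicked in a few lines. This is an illustrative sketch, not the actual QueryParser grammar; the function names are hypothetical and the toy bigram analyzer stands in for CJK/n-gram analysis:

```python
def parse_clause(text, analyze, quoted):
    # Old behavior: after whitespace splitting, if the analyzer emits
    # more than one token for a chunk, silently build a phrase query --
    # even though the user typed no quotes.
    tokens = analyze(text)
    if quoted or len(tokens) > 1:
        return ("phrase", tokens)
    return ("term", tokens[0])

def parse_clause_fixed(text, analyze, quoted):
    # Proposed behavior: only explicit double quotes produce a phrase
    # query; multiple analyzer tokens become independent term clauses.
    tokens = analyze(text)
    if quoted:
        return ("phrase", tokens)
    return ("terms", tokens) if len(tokens) > 1 else ("term", tokens[0])

def bigrams(text):
    # Toy CJK-style analyzer: one unsegmented "word" -> many tokens,
    # which is exactly the case that trips the term-count heuristic.
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]
```

Under the old heuristic, an unquoted CJK query becomes an (unintended) exact-phrase query over its bigrams, which is what the PositionFilter workaround and the MAP numbers above are about.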