[jira] [Commented] (LUCENE-7321) Character Mapping

2018-08-27 Thread Ivan Provalov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594154#comment-16594154
 ] 

Ivan Provalov commented on LUCENE-7321:
---

[~arafalov], the clean use case for this filter is to externalize the 
morphological modification rules.  Most stemmers have hard-coded rules.  With 
this one, the rules are expressed in flat mapping files and configuration. 
Originally, it was developed to extend a few cases for some of the languages 
listed here and a few others, as well as to visualize these rules, which helps 
the linguists involved in the project understand the modification rules in 
more complex scenarios.  I added the Russian stemmer implementation as a 
general reference just to show how one can configure an entire stemmer 
without hard-coded rules.  We have not seen any performance issues with this 
so far.  Hope this helps.
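
For illustration only (the authoritative format is described in the attached 
CharacterMappingComponent.pdf): the mapping file is tab-delimited, one 
source-to-target rule per line, and a rule may map any character sequence to 
any other character sequence.  A hypothetical fragment with Polish 
transliteration rules and a common Russian typing substitution might look 
like this (the two columns are separated by a single tab):

{code}
ł	l
ż	z
ё	е
{code}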

> Character Mapping
> -
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.6.1, 5.4.1, 6.0, 6.0.1
>Reporter: Ivan Provalov
>Priority: Minor
>  Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>
>
> One of the challenges in search is recall of an item with a common typing 
> variant.  These cases can be as simple as lower/upper case in most languages 
> or accented characters, or as complex as morphological phenomena like prefix 
> omission or constructing a character with a combining mark.  This component 
> addresses cases that are not covered by the ASCII folding component or that 
> are harder to handle with other tools.  The idea is that a linguist provides 
> the mappings in a tab-delimited file, which can then be used directly by 
> Solr.
> The mappings are maintained in a tab-delimited file, which could be just a 
> copy/paste from an Excel spreadsheet.  This lets the linguists create the 
> mappings and the developer include them in the Solr configuration.  In a few 
> cases, when the mappings grow complex, some additional debugging may be 
> required.  A mapping can take any sequence of characters to any other 
> sequence of characters.
> Some of the cases I discuss in the detail document are handling the voiced 
> vowels for Japanese; common typing substitutions for Korean, Russian, and 
> Polish; transliteration for Polish and Arabic; prefix removal for Arabic; 
> and suffix folding for Japanese.  In the appendix, I give an example of 
> implementing a Russian lightweight stemmer using this component.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7321) Character Mapping

2018-08-27 Thread Ivan Provalov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593976#comment-16593976
 ] 

Ivan Provalov commented on LUCENE-7321:
---

[~erickerickson], 

Good questions: 

1. I just ran the tests in the patch against master; they passed. 

2. It allows you to configure/modify morphological analysis with externalized 
mapping files.  I attached a description and a reference implementation of the 
Russian stemmer using this filter.

Thanks,

Ivan

  

> Character Mapping
> -
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.6.1, 5.4.1, 6.0, 6.0.1
>Reporter: Ivan Provalov
>Priority: Minor
>  Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>






[jira] [Commented] (LUCENE-7321) Character Mapping

2018-08-27 Thread Ivan Provalov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593779#comment-16593779
 ] 

Ivan Provalov commented on LUCENE-7321:
---

[~erike4...@yahoo.com], any progress on committing this patch?

Thanks,

Ivan

> Character Mapping
> -
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.6.1, 5.4.1, 6.0, 6.0.1
>Reporter: Ivan Provalov
>Priority: Minor
>  Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>






[jira] [Commented] (LUCENE-8131) Kuromoji User Dictionary Resources Not Closed

2018-01-15 Thread Ivan Provalov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326653#comment-16326653
 ] 

Ivan Provalov commented on LUCENE-8131:
---

Thanks, [~hossman]!

> Kuromoji User Dictionary Resources Not Closed
> -
>
> Key: LUCENE-8131
> URL: https://issues.apache.org/jira/browse/LUCENE-8131
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 6.3
>Reporter: Ivan Provalov
>Priority: Major
> Fix For: 5.5.4, 6.4, 7.0
>
>
> InputStream and Reader need to be closed in JapaneseTokenizerFactory.  






[jira] [Created] (LUCENE-8131) Kuromoji User Dictionary Resources Not Closed

2018-01-15 Thread Ivan Provalov (JIRA)
Ivan Provalov created LUCENE-8131:
-

 Summary: Kuromoji User Dictionary Resources Not Closed
 Key: LUCENE-8131
 URL: https://issues.apache.org/jira/browse/LUCENE-8131
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 6.3
Reporter: Ivan Provalov


InputStream and Reader need to be closed in JapaneseTokenizerFactory.  
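
A minimal sketch of the fix pattern, not the committed patch (loader, 
userDictionaryPath, decoder, and userDictionary are assumed to be the 
factory's existing fields); try-with-resources closes both the stream and the 
reader once the dictionary has been parsed:

{code}
// Hedged sketch: close the dictionary resources after
// UserDictionary.open() has consumed them, instead of leaking them.
try (InputStream stream = loader.openResource(userDictionaryPath);
     Reader reader = new InputStreamReader(stream, decoder)) {
  userDictionary = UserDictionary.open(reader);
}
{code}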






[jira] [Updated] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query

2017-01-07 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated SOLR-9942:

Attachment: (was: solr_mlt_test.tar)

> MoreLikeThis Performance Degraded With Filtered Query
> -
>
> Key: SOLR-9942
> URL: https://issues.apache.org/jira/browse/SOLR-9942
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public (Default Security Level. Issues are Public) 
>  Components: MoreLikeThis
>Affects Versions: 5.5.2
>Reporter: Ivan Provalov
> Attachments: solr_mlt_test2.tar
>
>
> Without any filters, MLT performs normally.  With any added filter, the 
> performance degrades compared to 4.6.1 (2.5-3.0X in our case).  The issue 
> goes away with the 6.0 upgrade.  The hot method is Lucene's 
> DisiPriorityQueue.downHeap(), which is called 5X more often in 5.5.2 than in 
> 6.0.  I am guessing that some of the Solr filter refactoring fixed it for 
> the 6.0 release.
> As a workaround, for now I refactored the custom MLT handler to convert the 
> filters into boolean clauses, which takes care of the issue.
> Our configuration: 
> 1. mlt.maxqt=100
> 2. An additional filter is passed as a parameter
> 3. <field ... multiValued="true" omitNorms="false" termVectors="true"/>
> 4. text_general is a pretty standard text fieldType.
> I have code to populate a test dataset and run a query to reproduce this.






[jira] [Updated] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query

2017-01-07 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated SOLR-9942:

Description: 
Without any filters, MLT performs normally.  With any added filter, the 
performance degrades compared to 4.6.1 (2.5-3.0X in our case).  The issue goes 
away with the 6.0 upgrade.  The hot method is Lucene's 
DisiPriorityQueue.downHeap(), which is called 5X more often in 5.5.2 than in 
6.0.  I am guessing that some of the Solr filter refactoring fixed it for the 
6.0 release.

As a workaround, for now I refactored the custom MLT handler to convert the 
filters into boolean clauses, which takes care of the issue.

Our configuration: 
1. mlt.maxqt=100
2. An additional filter is passed as a parameter
3. <field ... multiValued="true" omitNorms="false" termVectors="true"/>
4. text_general is a pretty standard text fieldType.

I have code to populate a test dataset and run a query to reproduce this.

  was:
Without any filters, MLT performs normally.  With any added filter, the 
performance degrades compared to 4.6.1 (2.5-3.0X in our case).  The issue goes 
away with the 6.0 upgrade.  The hot method is Lucene's 
DisiPriorityQueue.downHeap(), which is called 5X more often in 5.5.2 than in 
6.0.  I am guessing that some of the Solr filter refactoring fixed it for the 
6.0 release.

As a workaround, for now I refactored the custom MLT handler to convert the 
filters into boolean clauses, which takes care of the issue.

Our configuration: 
1. mlt.maxqt=100
2. An additional filter is passed as a parameter
3. <field ... multiValued="true" omitNorms="false" termVectors="true"/>
4. text_en is a pretty standard text fieldType.

I have code to populate a test dataset and run a query to reproduce this.


> MoreLikeThis Performance Degraded With Filtered Query
> -
>
> Key: SOLR-9942
> URL: https://issues.apache.org/jira/browse/SOLR-9942
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public (Default Security Level. Issues are Public) 
>  Components: MoreLikeThis
>Affects Versions: 5.5.2
>Reporter: Ivan Provalov
> Attachments: solr_mlt_test2.tar
>






[jira] [Updated] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query

2017-01-07 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated SOLR-9942:

Attachment: solr_mlt_test2.tar

test case

> MoreLikeThis Performance Degraded With Filtered Query
> -
>
> Key: SOLR-9942
> URL: https://issues.apache.org/jira/browse/SOLR-9942
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public (Default Security Level. Issues are Public) 
>  Components: MoreLikeThis
>Affects Versions: 5.5.2
>Reporter: Ivan Provalov
> Attachments: solr_mlt_test2.tar
>






[jira] [Updated] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query

2017-01-07 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated SOLR-9942:

Description: 
Without any filters, MLT performs normally.  With any added filter, the 
performance degrades compared to 4.6.1 (2.5-3.0X in our case).  The issue goes 
away with the 6.0 upgrade.  The hot method is Lucene's 
DisiPriorityQueue.downHeap(), which is called 5X more often in 5.5.2 than in 
6.0.  I am guessing that some of the Solr filter refactoring fixed it for the 
6.0 release.

As a workaround, for now I refactored the custom MLT handler to convert the 
filters into boolean clauses, which takes care of the issue.

Our configuration: 
1. mlt.maxqt=100
2. An additional filter is passed as a parameter
3. <field ... multiValued="true" omitNorms="false" termVectors="true"/>
4. text_en is a pretty standard text fieldType.

I have code to populate a test dataset and run a query to reproduce this.

  was:
Without any filters, MLT performs normally.  With any added filter, the 
performance degrades (2.5-3.0X in our case).  The issue goes away with the 6.0 
upgrade.  The hot method is Lucene's DisiPriorityQueue.downHeap(), which is 
called 5X more often in 5.5.2 than in 6.0.  I am guessing that some of the 
Solr filter refactoring fixed it for the 6.0 release.

As a workaround, for now I refactored the custom MLT handler to convert the 
filters into boolean clauses, which takes care of the issue.

Our configuration: 
1. mlt.maxqt=100
2. An additional filter is passed as a parameter
3. <field ... multiValued="true" omitNorms="false" termVectors="true"/>
4. text_en is a pretty standard text fieldType.

I have code to populate a test dataset and run a query to reproduce this.


> MoreLikeThis Performance Degraded With Filtered Query
> -
>
> Key: SOLR-9942
> URL: https://issues.apache.org/jira/browse/SOLR-9942
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public (Default Security Level. Issues are Public) 
>  Components: MoreLikeThis
>Affects Versions: 5.5.2
>Reporter: Ivan Provalov
> Attachments: solr_mlt_test.tar
>






[jira] [Updated] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query

2017-01-07 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated SOLR-9942:

Attachment: solr_mlt_test.tar

test for mlt performance issue

> MoreLikeThis Performance Degraded With Filtered Query
> -
>
> Key: SOLR-9942
> URL: https://issues.apache.org/jira/browse/SOLR-9942
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public (Default Security Level. Issues are Public) 
>  Components: MoreLikeThis
>Affects Versions: 5.5.2
>Reporter: Ivan Provalov
> Attachments: solr_mlt_test.tar
>






[jira] [Created] (SOLR-9942) MoreLikeThis Performance Degraded With Filtered Query

2017-01-07 Thread Ivan Provalov (JIRA)
Ivan Provalov created SOLR-9942:
---

 Summary: MoreLikeThis Performance Degraded With Filtered Query
 Key: SOLR-9942
 URL: https://issues.apache.org/jira/browse/SOLR-9942
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: MoreLikeThis
Affects Versions: 5.5.2
Reporter: Ivan Provalov


Without any filters, MLT performs normally.  With any added filter, the 
performance degrades (2.5-3.0X in our case).  The issue goes away with the 6.0 
upgrade.  The hot method is Lucene's DisiPriorityQueue.downHeap(), which is 
called 5X more often in 5.5.2 than in 6.0.  I am guessing that some of the 
Solr filter refactoring fixed it for the 6.0 release.

As a workaround, for now I refactored the custom MLT handler to convert the 
filters into boolean clauses, which takes care of the issue.

Our configuration: 
1. mlt.maxqt=100
2. An additional filter is passed as a parameter
3. <field ... multiValued="true" omitNorms="false" termVectors="true"/>
4. text_en is a pretty standard text fieldType.

I have code to populate a test dataset and run a query to reproduce this.
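
A minimal sketch of the workaround (Lucene 5.x API; mltQuery, filterQueries, 
searcher, and rows are assumed to be in scope in the custom handler): fold the 
filters into the main query as non-scoring FILTER clauses instead of passing 
them separately:

{code}
// Hedged sketch: FILTER clauses restrict the matches without
// contributing to the score, so ranking is unchanged while the
// separate filter path is avoided.
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(mltQuery, BooleanClause.Occur.MUST);
for (Query fq : filterQueries) {
  builder.add(fq, BooleanClause.Occur.FILTER);
}
TopDocs hits = searcher.search(builder.build(), rows);
{code}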






[jira] [Created] (SOLR-9730) IndexSchema Dynamic Field Definition Caching

2016-11-04 Thread Ivan Provalov (JIRA)
Ivan Provalov created SOLR-9730:
---

 Summary: IndexSchema Dynamic Field Definition Caching
 Key: SOLR-9730
 URL: https://issues.apache.org/jira/browse/SOLR-9730
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Schema and Analysis
Affects Versions: 6.2.1, 5.5.2
Reporter: Ivan Provalov
Priority: Minor


A small optimization suggestion for the IndexSchema class: cache the 
definitions of the dynamic fields:

{code}
// Cache resolved dynamic-field definitions so the dynamicFields array
// is not re-scanned for every lookup.  (A plain HashMap assumes
// single-threaded access; a shared schema would need a concurrent map.)
private Map<String, SchemaField> cachedDynamicFields = new HashMap<>();

@Override
public SchemaField getFieldOrNull(String fieldName) {
  SchemaField f = fields.get(fieldName);
  if (f != null) return f;

  // second chance: a dynamic field already resolved for this name
  f = cachedDynamicFields.get(fieldName);
  if (f != null) return f;

  for (DynamicField df : dynamicFields) {
    if (df.matches(fieldName)) {
      f = df.makeSchemaField(fieldName);
      cachedDynamicFields.put(fieldName, f);
      return f;
    }
  }
  return f;
}
{code}

Are there any reasons not to do this?






[jira] [Commented] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores

2016-10-11 Thread Ivan Provalov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566543#comment-15566543
 ] 

Ivan Provalov commented on LUCENE-7486:
---

+1

> DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using 
> Negative Scores
> ---
>
> Key: LUCENE-7486
> URL: https://issues.apache.org/jira/browse/LUCENE-7486
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5.2
>Reporter: Ivan Provalov
>Assignee: Uwe Schindler
> Fix For: 6.x, master (7.0), 6.3
>
> Attachments: LUCENE-7486.patch, LUCENE-7486.patch
>
>
> We are using a log of probability for scoring, which gives us negative 
> scores.  DisjunctionMaxScorer initializes scoreMax in the score(...) 
> function to zero, preventing us from using negative scores.  Is there a 
> reason it couldn't be initialized to something like this:
> float scoreMax = Float.MAX_VALUE * -1;






[jira] [Commented] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores

2016-10-10 Thread Ivan Provalov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563360#comment-15563360
 ] 

Ivan Provalov commented on LUCENE-7486:
---

Thanks, Uwe!

> DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using 
> Negative Scores
> ---
>
> Key: LUCENE-7486
> URL: https://issues.apache.org/jira/browse/LUCENE-7486
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5.2
>Reporter: Ivan Provalov
>Assignee: Uwe Schindler
>






[jira] [Issue Comment Deleted] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores

2016-10-10 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated LUCENE-7486:
--
Comment: was deleted

(was: Thanks, Uwe!)

> DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using 
> Negative Scores
> ---
>
> Key: LUCENE-7486
> URL: https://issues.apache.org/jira/browse/LUCENE-7486
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5.2
>Reporter: Ivan Provalov
>Assignee: Uwe Schindler
>






[jira] [Commented] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores

2016-10-10 Thread Ivan Provalov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563361#comment-15563361
 ] 

Ivan Provalov commented on LUCENE-7486:
---

Thanks, Uwe!

> DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using 
> Negative Scores
> ---
>
> Key: LUCENE-7486
> URL: https://issues.apache.org/jira/browse/LUCENE-7486
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5.2
>Reporter: Ivan Provalov
>Assignee: Uwe Schindler
>






[jira] [Commented] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores

2016-10-10 Thread Ivan Provalov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15562771#comment-15562771
 ] 

Ivan Provalov commented on LUCENE-7486:
---

Good point, Uwe.  Is there a reason it shouldn't be done in Lucene source?

> DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using 
> Negative Scores
> ---
>
> Key: LUCENE-7486
> URL: https://issues.apache.org/jira/browse/LUCENE-7486
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5.2
>Reporter: Ivan Provalov
>






[jira] [Created] (LUCENE-7486) DisjunctionMaxScorer Initializes scoreMax to Zero Preventing From Using Negative Scores

2016-10-10 Thread Ivan Provalov (JIRA)
Ivan Provalov created LUCENE-7486:
-

 Summary: DisjunctionMaxScorer Initializes scoreMax to Zero 
Preventing From Using Negative Scores
 Key: LUCENE-7486
 URL: https://issues.apache.org/jira/browse/LUCENE-7486
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 5.5.2
Reporter: Ivan Provalov


We are using a log of probability for scoring, which gives us negative scores.

DisjunctionMaxScorer initializes scoreMax in the score(...) function to zero, 
preventing us from using negative scores.  Is there a reason it couldn't be 
initialized to something like this:

float scoreMax = Float.MAX_VALUE * -1;
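
For reference, the usual identity element for a running maximum is negative 
infinity, which would avoid both the zero floor and the Float.MAX_VALUE * -1 
spelling.  A hedged sketch of the intended initialization (the loop shape is 
illustrative, not the actual Lucene source):

{code}
// Negative infinity is the identity for max(): any real score,
// including a negative log-probability, is larger.
float scoreMax = Float.NEGATIVE_INFINITY;
for (float subScore : subScores) {
  scoreMax = Math.max(scoreMax, subScore);
}
{code}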






[jira] [Commented] (LUCENE-7321) Character Mapping

2016-06-08 Thread Ivan Provalov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321865#comment-15321865
 ] 

Ivan Provalov commented on LUCENE-7321:
---

Koji, this one works at the token level, allowing things like prefix/suffix 
manipulation.  The graph generator and collapser also make it user-friendly 
when dealing with a lot of mappings (please see the attached description 
file).

> Character Mapping
> -
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.6.1, 6.0, 5.4.1, 6.0.1
>Reporter: Ivan Provalov
>Priority: Minor
>  Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>






[jira] [Updated] (LUCENE-7321) Character Mapping

2016-06-08 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated LUCENE-7321:
--
Description: 
One of the challenges in search is recall of an item with a common typing 
variant.  These cases can be as simple as lower/upper case in most languages, 
accented characters, or more complex morphological phenomena like prefix 
omitting, or constructing a character with some combining mark.  This component 
addresses the cases, which are not covered by ASCII folding component, or more 
complex to design with other tools.  The idea is that a linguist could provide 
the mappings in a tab-delimited file, which then can be directly used by Solr.

The mappings are maintained in the tab-delimited file, which could be just a 
copy paste from Excel spreadsheet.  This gives the linguists the opportunity to 
create the mappings, then for the developer to include them in Solr 
configuration.  There are a few cases, when the mappings grow complex, where 
some additional debugging may be required.  The mappings can contain any 
sequence of characters to any other sequence of characters.

Some of the cases I discuss in detail document are handling the voiced vowels 
for Japanese; common typing substitutions for Korean, Russian, Polish; 
transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding 
for Japanese.  In the appendix, I give an example of implementing a Russian 
light weight stemmer using this component.

  was:
One of the challenges in search is recall of an item with a common typing 
variant.  These cases can be as simple as lower/upper case in most languages 
or accented characters, or as complex as morphological phenomena like prefix 
omission or constructing a character with a combining mark.  This component 
addresses cases that are not covered by the ASCII folding component or that 
are harder to handle with other tools.  The idea is that a linguist provides 
the mappings in a tab-delimited file, which can then be used directly by Solr.

The mappings are maintained in a tab-delimited file, which could be just a 
copy/paste from an Excel spreadsheet.  This lets the linguists create the 
mappings and the developer include them in the Solr configuration.  In a few 
cases, when the mappings grow complex, some additional debugging may be 
required.  A mapping can take any sequence of characters to any other 
sequence of characters.

Some of the cases I discuss in the detail document are handling the voiced 
vowels for Japanese; common typing substitutions for Korean, Russian, and 
Polish; transliteration for Polish and Arabic; prefix removal for Arabic; and 
suffix folding for Japanese.


> Character Mapping
> -
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.6.1, 6.0, 5.4.1, 6.0.1
>Reporter: Ivan Provalov
>Priority: Minor
>  Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>






[jira] [Updated] (LUCENE-7321) Character Mapping

2016-06-08 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated LUCENE-7321:
--
Attachment: CharacterMappingComponent.pdf

Detailed component description.

> Character Mapping
> -
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.6.1, 6.0, 5.4.1, 6.0.1
>Reporter: Ivan Provalov
>Priority: Minor
>  Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>






[jira] [Updated] (LUCENE-7321) Character Mapping

2016-06-08 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated LUCENE-7321:
--
Attachment: LUCENE-7321.patch

Initial patch.

> Character Mapping
> -
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.6.1, 6.0, 5.4.1, 6.0.1
>Reporter: Ivan Provalov
>Priority: Minor
>  Labels: patch
> Fix For: 6.0.1
>
> Attachments: LUCENE-7321.patch
>






[jira] [Created] (LUCENE-7321) Character Mapping

2016-06-07 Thread Ivan Provalov (JIRA)
Ivan Provalov created LUCENE-7321:
-

 Summary: Character Mapping
 Key: LUCENE-7321
 URL: https://issues.apache.org/jira/browse/LUCENE-7321
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 6.0.1, 5.4.1, 6.0, 4.6.1
Reporter: Ivan Provalov
Priority: Minor
 Fix For: 6.0.1


One of the challenges in search is recall of an item with a common typing 
variant.  These cases can be as simple as lower/upper case in most languages 
or accented characters, or as complex as morphological phenomena like prefix 
omission or constructing a character with a combining mark.  This component 
addresses cases that are not covered by the ASCII folding component or that 
are harder to handle with other tools.  The idea is that a linguist provides 
the mappings in a tab-delimited file, which can then be used directly by Solr.

The mappings are maintained in a tab-delimited file, which could be just a 
copy/paste from an Excel spreadsheet.  This lets the linguists create the 
mappings and the developer include them in the Solr configuration.  In a few 
cases, when the mappings grow complex, some additional debugging may be 
required.  A mapping can take any sequence of characters to any other 
sequence of characters.

Some of the cases I discuss in the detail document are handling the voiced 
vowels for Japanese; common typing substitutions for Korean, Russian, and 
Polish; transliteration for Polish and Arabic; prefix removal for Arabic; and 
suffix folding for Japanese.






[jira] [Commented] (SOLR-3931) Turn off coord() factor for scoring

2015-09-30 Thread Ivan Provalov (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938785#comment-14938785
 ] 

Ivan Provalov commented on SOLR-3931:
-

Ideally, I would like to plug in a similarity class at search time, provided 
the norm encoding is compatible between index time and search time.  Right 
now, this requires a lot of custom extensions.
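
A minimal sketch of the search-time side (Lucene 4.x-era API, assuming the 
index-time norm encoding is compatible): override Similarity.coord() to 
neutralize the coordination factor and install the similarity on the searcher:

{code}
// Hedged sketch: a DefaultSimilarity subclass with coord() disabled,
// plugged in at search time.
public class NoCoordSimilarity extends DefaultSimilarity {
  @Override
  public float coord(int overlap, int maxOverlap) {
    return 1.0f; // ignore how many of the query terms matched
  }
}

IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new NoCoordSimilarity());
{code}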

> Turn off coord() factor for scoring
> ---
>
> Key: SOLR-3931
> URL: https://issues.apache.org/jira/browse/SOLR-3931
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Bill Bell
>
> We would like to remove the coordination factor from scoring.
> For small fields (like the name of a doctor), we do not want to score higher 
> if the same term is in the field more than once.  Makes sense for books, not 
> so much for formal names.
> /solr/select?q=*:*=false
> Default is true.
> (Note: we might want to make each of these optional - tf, idf, coord, 
> queryNorm.)
> coord(q,d) is a score factor based on how many of the query terms are found 
> in the specified document.  Typically, a document that contains more of the 
> query's terms will receive a higher score than another document with fewer 
> query terms.  This is a search-time factor computed in coord(q,d) by the 
> Similarity in effect at search time. 






[jira] Updated: (LUCENE-474) High Frequency Terms/Phrases at the Index level

2010-08-22 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated LUCENE-474:
-

Attachment: collocations.zip

Included the scoring in the CollocationsSearcher, which now returns a 
LinkedHashMap of collocated terms and their scores relative to the query 
term.  Did some minor refactoring and changed the test.

 High Frequency Terms/Phrases at the Index level
 ---

 Key: LUCENE-474
 URL: https://issues.apache.org/jira/browse/LUCENE-474
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 1.4
Reporter: Suri Babu B
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: colloc.zip, collocations.zip


 We should be able to find all the high-frequency terms/phrases (where 
 frequency is the search criterion / benchmark)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-474) High Frequency Terms/Phrases at the Index level

2010-08-22 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated LUCENE-474:
-

Attachment: (was: collocations.zip)

 High Frequency Terms/Phrases at the Index level
 ---

 Key: LUCENE-474
 URL: https://issues.apache.org/jira/browse/LUCENE-474
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 1.4
Reporter: Suri Babu B
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: colloc.zip, collocations.zip






[jira] Updated: (LUCENE-474) High Frequency Terms/Phrases at the Index level

2010-08-22 Thread Ivan Provalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated LUCENE-474:
-

Attachment: (was: collocations.zip)

 High Frequency Terms/Phrases at the Index level
 ---

 Key: LUCENE-474
 URL: https://issues.apache.org/jira/browse/LUCENE-474
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 1.4
Reporter: Suri Babu B
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: colloc.zip, collocations.zip






[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Ivan Provalov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1285#action_1285
 ] 

Ivan Provalov commented on LUCENE-2458:
---

Robert has asked me to post our test results on the Chinese collection.  We 
used the following data collections from TREC:

http://trec.nist.gov/data/qrels_noneng/index.html
qrels.trec6.29-54.chinese.gz
qrels.1-28.chinese.gz

http://trec.nist.gov/data/topics_noneng
TREC-6 Chinese topics (.gz)
TREC-5 Chinese topics (.gz)

Mandarin Data Collection
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T52

Average precision (MAP):

Analyzer         Plain analyzer   With PositionFilter (query time only)
ChineseAnalyzer  0.028            0.264
CJKAnalyzer      0.027            0.284
SmartChinese     0.027            0.265
IKAnalyzer       0.028            0.259

(Note: IKAnalyzer has its own IKQueryParser, which yields 0.084 for the 
average precision.)

Thanks,

Ivan Provalov
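
A hedged sketch of the query-time setup behind the "With PositionFilter" 
column (Lucene 3.x contrib API; the wrapped analyzer is interchangeable): 
PositionFilter stacks all tokens at the same position, so the query parser 
builds a boolean query instead of a phrase query:

{code}
// Illustrative only: wrap an analyzer with PositionFilter for query
// parsing, so multi-token CJK text is not turned into a phrase query.
Analyzer queryAnalyzer = new Analyzer() {
  @Override
  public TokenStream tokenStream(String field, Reader reader) {
    Analyzer base = new CJKAnalyzer(Version.LUCENE_30);
    return new PositionFilter(base.tokenStream(field, reader));
  }
};
{code}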

 queryparser shouldn't generate phrasequeries based on term count
 

 Key: LUCENE-2458
 URL: https://issues.apache.org/jira/browse/LUCENE-2458
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Reporter: Robert Muir
Priority: Critical

 The current method in the queryparser to generate phrasequeries is wrong:
 The Query Syntax documentation 
 (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
 {noformat}
 A Phrase is a group of words surrounded by double quotes such as "hello 
 dolly".
 {noformat}
 But as we know, this isn't actually true.
 Instead the terms are first divided on whitespace, then the analyzer term 
 count is used as some sort of heuristic to determine if it's a phrase query 
 or not.
 This assumption is a disaster for languages that don't use whitespace 
 separation: CJK, compounding European languages like German, Finnish, etc. It 
 also
 makes it difficult for people to use n-gram analysis techniques. In these 
 cases you get bad relevance (MAP improves nearly *10x* if you use a 
 PositionFilter at query-time to turn this off for chinese).
 For even English, this undocumented behavior is bad.  Perhaps in some cases 
 it's being abused as some heuristic to second guess the tokenizer and piece 
 back things it shouldn't have split, but for large collections, doing things 
 like generating phrasequeries because StandardTokenizer split a compound on a 
 dash can cause serious performance problems. Instead people should analyze 
 their text with the appropriate methods, and QueryParser should only generate 
 phrase queries when the syntax asks for one.
 The PositionFilter in contrib can be seen as a workaround, but it's pretty 
 obscure and people are not familiar with it. The result is we have bad 
 out-of-box behavior for many languages, and bad performance for others on 
 some inputs.
 I propose instead that we change the grammar to actually look for double 
 quotes to determine when to generate a phrase query, consistent with the 
 documentation.
