[jira] Updated: (SOLR-572) Spell Checker as a Search Component

2008-07-03 Thread Bojan Smid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bojan Smid updated SOLR-572:


Attachment: solr-572.patch

I notice that the old pizza->plaza, golf->roof issue is still here. 

I created a patch against the latest trunk that deals with this; it is 
attached. I believe the fix should be committed (maybe it should be 
implemented differently, but that's open for discussion; I used the 
spellchecker.exist() method).
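
A minimal sketch of the idea behind the fix, not the attached patch itself: 
before asking for a suggestion, check whether the word is already in the 
dictionary and, if so, return it unchanged. The class ExistGuardSketch, its 
dictionary set, and its "nearest" table are hypothetical stand-ins; only the 
role of an exist()-style lookup comes from the comment above.

```java
import java.util.Map;
import java.util.Set;

/** Stand-in for a spell checker; this is NOT the Lucene SpellChecker API. */
final class ExistGuardSketch {
    private final Set<String> dictionary;      // words known to be spelled correctly
    private final Map<String, String> nearest; // hypothetical "closest other word" table

    ExistGuardSketch(Set<String> dictionary, Map<String, String> nearest) {
        this.dictionary = dictionary;
        this.nearest = nearest;
    }

    /** Plays the role of an exist()-style dictionary lookup. */
    boolean exist(String word) {
        return dictionary.contains(word);
    }

    /** Returns the word itself if it is in the dictionary, otherwise the nearest known word. */
    String suggest(String word) {
        if (exist(word)) {
            return word; // correct word: do not turn pizza into plaza
        }
        // Fall back to the input when no neighbour is known.
        return nearest.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        ExistGuardSketch sc = new ExistGuardSketch(
            Set.of("pizza", "plaza", "golf", "roof"),
            Map.of("pizza", "plaza", "golf", "roof", "piza", "pizza"));
        System.out.println(sc.suggest("pizza")); // stays as typed
        System.out.println(sc.suggest("piza"));  // misspelling gets a suggestion
    }
}
```

Without the exist() guard, suggest("pizza") would fall through to the 
nearest-word table and come back as plaza, which is the behaviour described 
above.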

> Spell Checker as a Search Component
> ---
>
> Key: SOLR-572
> URL: https://issues.apache.org/jira/browse/SOLR-572
> Project: Solr
>  Issue Type: New Feature
>  Components: spellchecker
>Affects Versions: 1.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.3
>
> Attachments: solr-572.patch, SOLR-572.patch, SOLR-572.patch, 
> SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, 
> SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, 
> SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, 
> SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, 
> SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, 
> SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, SOLR-572.patch
>
>
> http://wiki.apache.org/solr/SpellCheckComponent
> Expose the Lucene contrib SpellChecker as a Search Component. Provide the 
> following features:
> * Allow creating a spell index on a given field and make it possible to have 
> multiple spell indices -- one for each field
> * Give suggestions on a per-field basis
> * Given a multi-word query, give only one consistent suggestion
> * Process the query with the same analyzer specified for the source field and 
> process each token separately
> * Allow the user to specify minimum length for a token (optional)
> Consistency criteria for a multi-word query can consist of the following:
> * Preserve the correct words in the original query as it is
> * Never give duplicate words in a suggestion

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-236) Field collapsing

2008-06-21 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606986#action_12606986
 ] 

Bojan Smid commented on SOLR-236:
-

You can check the discussion of this same problem in the posts above (starting 
with 1 Feb 2008). It seems like a rather complex issue that could require 
some serious refactoring of the collapsing code.

> Field collapsing
> 
>
> Key: SOLR-236
> URL: https://issues.apache.org/jira/browse/SOLR-236
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.3
>Reporter: Emmanuel Keller
>Assignee: Otis Gospodnetic
> Attachments: field-collapsing-extended-592129.patch, 
> field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, solr-236.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)




[jira] Updated: (SOLR-236) Field collapsing

2008-06-07 Thread Bojan Smid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bojan Smid updated SOLR-236:


Attachment: solr-236.patch

I updated the patch so that it compiles against Solr trunk. Also, since 
CollapseComponent essentially copied QueryComponent's prepare() method (and it 
seems that it is supposed to be used instead of QueryComponent), I made it 
extend QueryComponent, with a collapsing-specific process() method and the 
prepare() method inherited from the superclass.
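
The shape of that change can be sketched as follows. The classes below are 
simplified stand-ins invented for illustration (the real Solr components take 
request/response arguments that are omitted here); they only show prepare() 
being inherited while process() is overridden.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-ins for illustration only; these are NOT the real Solr classes. */
class SketchQueryComponent {
    final List<String> log = new ArrayList<>();

    void prepare() { log.add("QueryComponent.prepare"); } // shared query set-up
    void process() { log.add("QueryComponent.process"); } // plain, uncollapsed search
}

class SketchCollapseComponent extends SketchQueryComponent {
    // prepare() is inherited unchanged, so the set-up logic is no longer copied.
    @Override
    void process() { log.add("CollapseComponent.process"); } // collapsing-specific search
}

class CollapseInheritanceDemo {
    public static void main(String[] args) {
        SketchQueryComponent component = new SketchCollapseComponent();
        component.prepare();
        component.process();
        System.out.println(component.log);
    }
}
```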




[jira] Commented: (SOLR-572) Spell Checker as a Search Component

2008-06-05 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602651#action_12602651
 ] 

Bojan Smid commented on SOLR-572:
-

A file-based spell checker would probably be used when the Solr index is too 
small or too young. In that case a user would compile a dictionary file (for 
instance, the UNIX words file) and use it as the dictionary.




[jira] Commented: (SOLR-236) Field collapsing

2008-05-25 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599714#action_12599714
 ] 

Bojan Smid commented on SOLR-236:
-

Hi Oleg. I'll look into this also. In case you have any working code, you can 
mail it to me, and I'll see what can be reused.




[jira] Commented: (SOLR-236) Field collapsing

2008-05-25 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599660#action_12599660
 ] 

Bojan Smid commented on SOLR-236:
-

I will try to bring this patch up to date. Currently I see two main problems:

1) The patch applies to trunk, but it doesn't compile. The problem occurs 
mainly because of changes in Search Components (for instance, some method 
signatures that CollapseComponent implements were changed). I have this fixed 
locally (more or less), but I have to test it before posting a new version of 
the patch.

2) It seems that CollapseComponent can't be chained with QueryComponent; it 
can only be used instead of it. CollapseComponent basically copies 
QueryComponent's querying logic and adds some of its own. I guess this isn't 
the right way to go. CollapseComponent should contain only collapsing logic 
and should be chainable with other components. Can anyone confirm whether I'm 
right here? Of course, there might be some fundamental reason why 
CollapseComponent had to be implemented this way.
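
To make point 2 concrete, here is a rough sketch of what a chainable design 
could look like. Everything in it is hypothetical (the Context class, the 
Component interface, and the "keep at most max results per field value" rule 
are simplifications, not the patch's algorithm): the query step only produces 
results, and the collapse step only trims what is already there.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChainSketch {
    /** Hypothetical shared state handed from one component to the next. */
    static final class Context {
        List<String[]> docs = new ArrayList<>(); // each doc is {id, site}
    }

    interface Component {
        void process(Context ctx);
    }

    /** Stand-in for the query step: it only produces results. */
    static final class QueryStep implements Component {
        public void process(Context ctx) {
            ctx.docs.add(new String[] {"1", "a.com"});
            ctx.docs.add(new String[] {"2", "a.com"});
            ctx.docs.add(new String[] {"3", "b.com"});
            ctx.docs.add(new String[] {"4", "a.com"});
        }
    }

    /** Collapse step: keeps at most max docs per site, in original order. */
    static final class CollapseStep implements Component {
        private final int max;

        CollapseStep(int max) {
            this.max = max;
        }

        public void process(Context ctx) {
            Map<String, Integer> seen = new HashMap<>();
            List<String[]> kept = new ArrayList<>();
            for (String[] doc : ctx.docs) {
                int count = seen.merge(doc[1], 1, Integer::sum);
                if (count <= max) {
                    kept.add(doc);
                }
            }
            ctx.docs = kept;
        }
    }

    /** Runs the chain in order and returns the ids of the surviving docs. */
    static List<String> run(List<Component> chain) {
        Context ctx = new Context();
        for (Component component : chain) {
            component.process(ctx);
        }
        List<String> ids = new ArrayList<>();
        for (String[] doc : ctx.docs) {
            ids.add(doc[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(new QueryStep(), new CollapseStep(1))));
    }
}
```

With this split, leaving CollapseStep out of the chain simply returns the 
uncollapsed results, which is the property the copied-logic design lacks.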

Does anyone else see any other issues with this component?




[jira] Issue Comment Edited: (SOLR-572) Spell Checker as a Search Component

2008-05-21 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598835#action_12598835
 ] 

bosmid edited comment on SOLR-572 at 5/21/08 3:42 PM:
--

Sure. A quick fix can be done easily, but it probably wouldn't cover all 
possibilities, hence my post...

  was (Author: bosmid):
Sure, a quick fix can be done easily, but it probably wouldn't cover all 
possibilities, hence my post...
  



[jira] Commented: (SOLR-572) Spell Checker as a Search Component

2008-05-21 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598835#action_12598835
 ] 

Bojan Smid commented on SOLR-572:
-

Sure, a quick fix can be done easily, but it probably wouldn't cover all 
possibilities, hence my post...




[jira] Commented: (SOLR-572) Spell Checker as a Search Component

2008-05-21 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598752#action_12598752
 ] 

Bojan Smid commented on SOLR-572:
-

I noticed that when searching for a suggestion for a word that exists in the 
dictionary, SC returns some similar word instead of returning that same word. 
The old SCRH had a field "exist" which returned true if the word exists in the 
dictionary (so the client can treat it as a correct word that doesn't need a 
suggestion). 

We can't have exactly the same functionality here (since "multi-word" queries 
should be supported), but we can make SC return a field "spellingCorrect" in 
case all words from the query exist in the dictionary. Otherwise, there is no 
way to know whether the spelling was correct or we should display a suggestion.

There is a method in Lucene's SC to check if a word exists in the index, so 
it's easy to check if a word is correct. However, I'm also thinking of 
situations where the query contains more than simple words. For instance, 
given "toyata AND miles:[1 to 1]", we want to check just toyata in the index 
and return the suggestion "toyota AND miles:[1 to 1]". Other query types which 
might pose a problem are:
- fuzzy query
- wildcard query
- prefix query
...
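
A rough sketch of the "spellingCorrect" idea, under stated assumptions: the 
dictionary is a plain set of words, boolean operators are the uppercase 
AND/OR/NOT, and fielded clauses such as miles:[1 to 1] are skipped with 
regular expressions instead of a real query parser or analyzer. The class and 
method names are invented for illustration, and fuzzy, wildcard, and prefix 
queries are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class SpellingCorrectSketch {
    /** Extracts plain terms, skipping boolean operators and fielded clauses. */
    static List<String> plainTerms(String query) {
        // Drop fielded range clauses first (field:[...]), then any other field:value clause.
        String stripped = query
            .replaceAll("\\w+:\\[[^\\]]*\\]", " ")
            .replaceAll("\\w+:\\S+", " ");
        List<String> terms = new ArrayList<>();
        for (String token : stripped.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens when the query had no plain terms at all
            }
            if (token.equals("AND") || token.equals("OR") || token.equals("NOT")) {
                continue; // boolean operators are not words to spell-check
            }
            terms.add(token);
        }
        return terms;
    }

    /** True only if every plain term of the query is in the dictionary. */
    static boolean spellingCorrect(String query, Set<String> dictionary) {
        for (String term : plainTerms(query)) {
            if (!dictionary.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("toyota");
        System.out.println(plainTerms("toyata AND miles:[1 to 1]"));
        System.out.println(spellingCorrect("toyata AND miles:[1 to 1]", dictionary));
        System.out.println(spellingCorrect("toyota AND miles:[1 to 1]", dictionary));
    }
}
```

A client could then show a suggestion only when spellingCorrect is false, 
replacing just the terms that plainTerms() returned and leaving the rest of 
the query untouched.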




[jira] Commented: (SOLR-572) Spell Checker as a Search Component

2008-05-21 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598738#action_12598738
 ] 

Bojan Smid commented on SOLR-572:
-

Oleg, that field is now called fieldType, so setting fieldType to word should 
work for you, as long as you have a fieldType named word defined in your 
schema.xml. Let me know if this works.




[jira] Commented: (SOLR-572) Spell Checker as a Search Component

2008-05-21 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598728#action_12598728
 ] 

Bojan Smid commented on SOLR-572:
-

I already found the same problem, made a fix, and sent it to Shalin; he will 
incorporate it into the next patch when it's ready. If you specify the "field 
type" field for that dictionary (and that field type can be found in the Solr 
schema), you'll avoid the problem for now.




[jira] Issue Comment Edited: (SOLR-572) Spell Checker as a Search Component

2008-05-19 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597930#action_12597930
 ] 

bosmid edited comment on SOLR-572 at 5/19/08 5:05 AM:
--

Character encodings for file-based dictionaries are now supported through the 
characterEncoding property. The configuration for such a dictionary would look 
like this:

{code:xml}

external
file
spellings.txt
UTF-8
c:\spellchecker

{code}

The new code needs the latest lucene-spellchecker-2.4*.jar from Lucene trunk.

Since the SolrResourceLoader.getLines method doesn't support configurable 
encodings (it treats everything as UTF-8), I wasn't sure how to add that 
support. I could have added an overloaded method to SolrResourceLoader, but 
there is a TODO comment there, so I decided to create a getLines() method 
inside the SpellCheckComponent class instead. What do you think of this?
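
For illustration, an encoding-aware getLines() could look roughly like the 
sketch below. It is not the code from the patch: the class name, the 
signature, and the choice to skip blank lines are assumptions; it only shows 
reading a dictionary file with a charset taken from a characterEncoding-style 
setting.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class EncodedLinesSketch {
    /** Reads the non-blank lines of a dictionary file using the named encoding. */
    static List<String> getLines(Path file, String characterEncoding) throws IOException {
        // Charset.forName throws an unchecked exception for unknown or illegal names.
        Charset charset = Charset.forName(characterEncoding);
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    lines.add(word); // one dictionary word per line
                }
            }
        }
        return lines;
    }
}
```

Passing the encoding explicitly avoids depending on either the platform 
default or a hard-coded UTF-8.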




[jira] Issue Comment Edited: (SOLR-572) Spell Checker as a Search Component

2008-05-19 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597930#action_12597930
 ] 

bosmid edited comment on SOLR-572 at 5/19/08 5:03 AM:
--

Character encodings for file-based dictionaries are now supported through the 
characterEncoding property. The configuration for such a dictionary would look 
like this:


{code:xml}

external
file
spellings.txt
UTF-8
c:\spellchecker

{code}

The new code needs the latest lucene-spellchecker-2.4*.jar from Lucene trunk.

Since the SolrResourceLoader.getLines method doesn't support configurable 
encodings (it treats everything as UTF-8), I wasn't sure how to add that 
support. I could have added an overloaded method to SolrResourceLoader, but 
there is a TODO comment there, so I decided to create a getLines() method 
inside the SpellCheckComponent class instead. What do you think of this?




[jira] Updated: (SOLR-572) Spell Checker as a Search Component

2008-05-19 Thread Bojan Smid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bojan Smid updated SOLR-572:


Attachment: SOLR-572.patch

Character encodings for file-based dictionaries are now supported through the 
characterEncoding property. The configuration for such a dictionary would look 
like this:

{code:xml}

external
file
spellings.txt
UTF-8
c:\spellchecker

{code}

The new code needs the latest lucene-spellchecker-2.4*.jar from Lucene trunk.

Since the SolrResourceLoader.getLines method doesn't support configurable 
encodings (it treats everything as UTF-8), I wasn't sure how to add that 
support. I could have added an overloaded method to SolrResourceLoader, but 
there is a TODO comment there, so I decided to create a getLines() method 
inside the SpellCheckComponent class instead. What do you think of this?




[jira] Commented: (SOLR-572) Spell Checker as a Search Component

2008-05-19 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597913#action_12597913
 ] 

Bojan Smid commented on SOLR-572:
-

I would like to add support for different character encodings in file-based 
dictionaries (the current implementation uses the system's default settings). 
I'm not sure how we'll synchronize your work with my fix. Can you let me know 
when and how I can start my work?




[jira] Commented: (SOLR-572) Spell Checker as a Search Component

2008-05-16 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597466#action_12597466
 ] 

Bojan Smid commented on SOLR-572:
-

The "field" attribute for the file-based dictionary is basically the same "field" 
attribute as in the default dictionary (in both cases it is used to obtain the 
query analyzer), which is why I used the same name. My question was whether it 
is OK for the default dictionary to use the same field both to build the 
dictionary from the Solr index and to obtain the query analyzer for extracting 
tokens.




[jira] Updated: (SOLR-572) Spell Checker as a Search Component

2008-05-16 Thread Bojan Smid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bojan Smid updated SOLR-572:


Attachment: SOLR-572.patch

I added support for file-based dictionaries (they are configured as described 
in Shalin's post) using Lucene's PlainTextDictionary.

However, I had to add a "field" property to the configuration for this 
dictionary in order to obtain an analyzer (which is passed to 
FieldSpellChecker). This analyzer is later used to extract tokens from the 
query.

I guess my current solution is not quite correct (since PlainTextDictionary 
doesn't really need an analyzer), but it also makes me wonder whether, for a 
dictionary built from the Solr index, the same analyzer should be used both 
when building the dictionary and when parsing query strings.
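
The question about sharing one analyzer can be illustrated with a small sketch. It uses a plain-Java stand-in for an analyzer (lowercase, split on non-alphanumerics), not a Lucene Analyzer, and all names are hypothetical. The point is only that dictionary entries and query tokens pass through the same normalization, so lookups compare like with like.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; a real setup would use the field's analyzer.
class SharedAnalysisSketch {

    // Stand-in for an analyzer: lowercases, then splits on anything
    // that is not a letter or a digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Builds the dictionary by running every entry through analyze().
    static Set<String> buildDictionary(List<String> entries) {
        Set<String> dictionary = new HashSet<>();
        for (String entry : entries) {
            dictionary.addAll(analyze(entry));
        }
        return dictionary;
    }

    // Returns the query tokens that are missing from the dictionary,
    // using the same analyze() step that built the dictionary.
    static List<String> unknownTokens(String query, Set<String> dictionary) {
        List<String> unknown = new ArrayList<>();
        for (String token : analyze(query)) {
            if (!dictionary.contains(token)) {
                unknown.add(token);
            }
        }
        return unknown;
    }
}
```

If the dictionary had stored "Pizza" unmodified while the query side lowercased its tokens, a correctly spelled "pizza" would be reported as unknown, which is the kind of mismatch the question is about.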




[jira] Updated: (SOLR-553) Highlighter does not match phrase queries correctly

2008-05-15 Thread Bojan Smid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bojan Smid updated SOLR-553:


Attachment: Solr-553.patch

Added a unit test for this fix to the patch.

> Highlighter does not match phrase queries correctly
> ---
>
> Key: SOLR-553
> URL: https://issues.apache.org/jira/browse/SOLR-553
> Project: Solr
>  Issue Type: New Feature
>  Components: highlighter
>Affects Versions: 1.2
> Environment: all
>Reporter: Brian Whitman
>Assignee: Otis Gospodnetic
> Attachments: highlighttest.xml, Solr-553.patch, Solr-553.patch
>
>
> http://www.nabble.com/highlighting-pt2%3A-returning-tokens-out-of-order-from-PhraseQuery-to16156718.html
> Say we search for the band "I Love You But I've Chosen Darkness"
> .../select?rows=100&q=%22I%20Love%20You%20But%20I\'ve%20Chosen%20Darkness%22&fq=type:html&hl=true&hl.fl=content&hl.fragsize=500&hl.snippets=5&hl.simple.pre=%3Cspan%3E&hl.simple.post=%3C/span%3E
> The highlight returns a snippet that does have the name altogether:
> Lights (Live) : I Love You But 
> I've Chosen Darkness :
> But also returns unrelated snips from the same page:
> Black Francis Shop "I Think I Love 
> You"
> A correct highlighter should not return snippets that do not match the phrase 
> exactly.
> LUCENE-794 (not yet committed, but seems to be ready) fixes up the problem 
> from the Lucene end. Solr should get it too.
> Related: SOLR-575 




[jira] Updated: (SOLR-553) Highlighter does not match phrase queries correctly

2008-05-14 Thread Bojan Smid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bojan Smid updated SOLR-553:


Attachment: Solr-553.patch

Patch for SOLR-553 (uses the LUCENE-794 highlighting fix)




[jira] Commented: (SOLR-553) Highlighter does not match phrase queries correctly

2008-05-14 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596793#action_12596793
 ] 

Bojan Smid commented on SOLR-553:
-

I made a fix and uploaded the patch. LUCENE-794 is now incorporated into the 
default Solr highlighter.

The old way of highlighting is retained and is used whenever requests to the 
Solr highlighter stay as they were (same request parameters). The new 
functionality is invoked by adding another request parameter to the URL, 
hl.usePhraseHighlighter=true.

So, for the URL:
http://localhost:8983/solr/select?q=features:%22ax%20bx%20cx%22&hl=on&hl.fl=features&hl.fragsize=20&hl.snippets=10

the results will be the same as before, but if you want to use this fix 
(and get correct phrase highlighting), the URL would look like this:

http://localhost:8983/solr/select?q=features:%22ax%20bx%20cx%22&hl=on&hl.fl=features&hl.fragsize=20&hl.snippets=10&hl.usePhraseHighlighter=true

This patch needs the latest lucene-highlighter-*.jar and lucene-memory-*.jar 
from trunk (since the LUCENE-794 fix is committed there).
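
For clients that assemble the request in code, the opt-in can be sketched as follows. This is a minimal illustration, assuming the parameter names from the URLs above. The class and method names are hypothetical, and the encoder writes spaces as "+" and the colon as "%3A", which differs cosmetically from the hand-written URLs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical client-side helper; not part of the patch.
class HighlightUrlSketch {

    // Builds a select URL; the phrase highlighter is only requested
    // when usePhraseHighlighter is true, so old requests stay unchanged.
    static String buildUrl(String baseUrl, String query, String field,
                           boolean usePhraseHighlighter) {
        StringBuilder url = new StringBuilder(baseUrl);
        url.append("?q=").append(encode(query));
        url.append("&hl=on");
        url.append("&hl.fl=").append(encode(field));
        url.append("&hl.fragsize=20&hl.snippets=10");
        if (usePhraseHighlighter) {
            url.append("&hl.usePhraseHighlighter=true");
        }
        return url.toString();
    }

    // URL-encodes a parameter value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Calling buildUrl("http://localhost:8983/solr/select", "features:\"ax bx cx\"", "features", true) yields the second form of the request; passing false yields the first.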




[jira] Issue Comment Edited: (SOLR-553) Highlighter does not match phrase queries correctly

2008-05-13 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596350#action_12596350
 ] 

bosmid edited comment on SOLR-553 at 5/13/08 4:06 AM:
--

I am playing around with LUCENE-794 integration into Solr. I have two options:

1) add the LUCENE-794 code to the current implementation in 
DefaultSolrHighlighter, where the client would provide a request parameter 
(say useSpanScorer) to use the new functionality. Without the parameter, the 
client would get the old functionality.

or

2) provide LUCENE-794 highlighting in a new SolrHighlighter, for instance in 
a class PhraseQuerySolrHighlighter

I would appreciate any comments on this.

Also, since I have already tested some of this code, I noticed that we still 
wouldn't get the exact behavior from the description. For instance, in the 
text  ax bx cx dx ax bx

for phrase query "ax bx cx"

the result is : <em>ax</em> <em>bx</em> <em>cx</em> dx ax bx

Which means that we got a fix for part of the problem (words from unrelated 
snippets are no longer highlighted), but we still wouldn't get the whole 
phrase highlighted inside a single tag.

  was (Author: bosmid):
I am playing around with LUCENE-794 integration into Solr. I have two 
options:

1) add LUCENE-794 code to current implementation in DefaultSolrHighlighter 
where client would provide request parameter (say useSpanScorer) if he wants to 
use new functionality. In case he didn't provide the parameter, he would get 
old functionality.

or

2) to provide LUCENE-794 highlighting in new SolrHighlighter, for instance in 
class PhraseQuerySolrHighlighter

I would appreciate any comments on this.

Also, since I already test some of this code, I noticed that we still wouldn't 
get exact behavior from description. For instance, in text  ax bx cx dx ax bx

for phrase query "ax bx cx"

the result is : <em>ax</em> <em>bx</em> <em>cx</em> dx ax bx

Which means that we got fix part of the problem (words from unrelated snippets 
are no longer highlighted), but we still wouldn't get whole phrase highlighted 
inside single tag.
  
> Highlighter does not match phrase queries correctly
> ---
>
> Key: SOLR-553
> URL: https://issues.apache.org/jira/browse/SOLR-553
> Project: Solr
>  Issue Type: New Feature
>  Components: highlighter
>Affects Versions: 1.2
> Environment: all
>Reporter: Brian Whitman
> Attachments: highlighttest.xml
>
>
> http://www.nabble.com/highlighting-pt2%3A-returning-tokens-out-of-order-from-PhraseQuery-to16156718.html
> Say we search for the band "I Love You But I've Chosen Darkness"
> .../select?rows=100&q=%22I%20Love%20You%20But%20I\'ve%20Chosen%20Darkness%22&fq=type:html&hl=true&hl.fl=content&hl.fragsize=500&hl.snippets=5&hl.simple.pre=%3Cspan%3E&hl.simple.post=%3C/span%3E
> The highlight returns a snippet that does have the name altogether:
> Lights (Live) : I Love You But 
> I've Chosen Darkness :
> But also returns unrelated snips from the same page:
> Black Francis Shop "I Think I Love 
> You"
> A correct highlighter should only return
> Lights (Live) : I Love You But I've Chosen Darkness
> And no snippets that do not match the phrase exactly.
> LUCENE-794 (not yet committed, but seems to be ready) fixes up the problem 
> from the Lucene end. Solr should get it too.




[jira] Commented: (SOLR-553) Highlighter does not match phrase queries correctly

2008-05-13 Thread Bojan Smid (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596350#action_12596350
 ] 

Bojan Smid commented on SOLR-553:
-

I am playing around with LUCENE-794 integration into Solr. I have two options:

1) add the LUCENE-794 code to the current implementation in 
DefaultSolrHighlighter, where the client would provide a request parameter 
(say useSpanScorer) to use the new functionality. Without the parameter, the 
client would get the old functionality.

or

2) provide LUCENE-794 highlighting in a new SolrHighlighter, for instance in 
a class PhraseQuerySolrHighlighter

I would appreciate any comments on this.

Also, since I have already tested some of this code, I noticed that we still 
wouldn't get the exact behavior from the description. For instance, in the 
text  ax bx cx dx ax bx

for phrase query "ax bx cx"

the result is : <em>ax</em> <em>bx</em> <em>cx</em> dx ax bx

Which means that we got a fix for part of the problem (words from unrelated 
snippets are no longer highlighted), but we still wouldn't get the whole 
phrase highlighted inside a single tag.
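
The difference between the old and new behavior on this example can be sketched in plain Java. This is only an illustration of the observed output, not the LUCENE-794 span-scoring implementation. The class name is hypothetical, and the <em> markers are an assumption standing in for whatever pre/post tags the highlighter is configured with.

```java
import java.util.List;

// Hypothetical illustration of term-based vs. phrase-aware marking.
class PhraseHighlightSketch {

    // Old behavior: mark every token that equals any term of the phrase,
    // wherever it occurs.
    static String highlightTerms(List<String> tokens, List<String> phrase) {
        boolean[] marked = new boolean[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            marked[i] = phrase.contains(tokens.get(i));
        }
        return render(tokens, marked);
    }

    // New behavior: mark only tokens at positions covered by a complete,
    // in-order occurrence of the phrase.
    static String highlightPhrase(List<String> tokens, List<String> phrase) {
        boolean[] marked = new boolean[tokens.size()];
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            if (tokens.subList(start, start + phrase.size()).equals(phrase)) {
                for (int i = start; i < start + phrase.size(); i++) {
                    marked[i] = true;
                }
            }
        }
        return render(tokens, marked);
    }

    // Wraps each marked token in its own tag, one tag per word.
    static String render(List<String> tokens, boolean[] marked) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            String token = tokens.get(i);
            out.append(marked[i] ? "<em>" + token + "</em>" : token);
        }
        return out.toString();
    }
}
```

For the tokens ax bx cx dx ax bx and the phrase ax bx cx, highlightTerms also marks the trailing ax and bx, while highlightPhrase leaves them alone. In both cases each word gets its own tag, which is the remaining gap noted above.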
