[jira] Updated: (SOLR-572) Spell Checker as a Search Component

2008-05-15 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-572:
---

Summary: Spell Checker as a Search Component  (was: Spell Checker as a 
Search Handler)

 Spell Checker as a Search Component
 ---

 Key: SOLR-572
 URL: https://issues.apache.org/jira/browse/SOLR-572
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Affects Versions: 1.3
Reporter: Shalin Shekhar Mangar
 Fix For: 1.3


 Expose the Lucene contrib SpellChecker as a Search Component. Provide the 
 following features:
 * Allow creating a spell index on a given field and make it possible to have 
 multiple spell indices -- one for each field
 * Give suggestions on a per-field basis
 * Given a multi-word query, give only one consistent suggestion
 * Process the query with the same analyzer specified for the source field and 
 process each token separately
 * Allow the user to specify minimum length for a token (optional)
 Consistency criteria for a multi-word query can consist of the following:
 * Preserve the correct words in the original query as it is
 * Never give duplicate words in a suggestion
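
The two consistency criteria above can be sketched roughly as follows. This is only an illustration, not code from the SOLR-572 patch: the class ConsistentSuggestion, the dictionary set and the per-token candidates map are hypothetical stand-ins for whatever the spell index actually returns.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: builds one suggestion for a multi-word query. */
final class ConsistentSuggestion {

    /**
     * @param tokens     query tokens, already analyzed
     * @param dictionary words considered correctly spelled (assumed input)
     * @param candidates ranked corrections per misspelled token (assumed input)
     */
    static List<String> suggest(List<String> tokens,
                                Set<String> dictionary,
                                Map<String, List<String>> candidates) {
        // Reserve the words that are already correct so that a correction
        // can never duplicate one of them.
        Set<String> used = new HashSet<>();
        for (String t : tokens) {
            if (dictionary.contains(t)) used.add(t);
        }

        List<String> suggestion = new ArrayList<>();
        for (String token : tokens) {
            if (dictionary.contains(token)) {
                suggestion.add(token);          // criterion 1: keep correct words as-is
                continue;
            }
            String chosen = token;              // fall back to the original token
            for (String c : candidates.getOrDefault(token, List.of())) {
                if (!used.contains(c)) {        // criterion 2: no duplicate words
                    chosen = c;
                    break;
                }
            }
            used.add(chosen);
            suggestion.add(chosen);
        }
        return suggestion;
    }
}
```

For the query "aple apple pie" with candidates "apple" and "ample" for "aple", this picks "ample", because "apple" is already present as a correct word.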

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-553) Highlighter does not match phrase queries correctly

2008-05-15 Thread Bojan Smid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bojan Smid updated SOLR-553:


Attachment: Solr-553.patch

Added unit test for this fix to the patch.

 Highlighter does not match phrase queries correctly
 ---

 Key: SOLR-553
 URL: https://issues.apache.org/jira/browse/SOLR-553
 Project: Solr
  Issue Type: New Feature
  Components: highlighter
Affects Versions: 1.2
 Environment: all
Reporter: Brian Whitman
Assignee: Otis Gospodnetic
 Attachments: highlighttest.xml, Solr-553.patch, Solr-553.patch


 http://www.nabble.com/highlighting-pt2%3A-returning-tokens-out-of-order-from-PhraseQuery-to16156718.html
 Say we search for the band I Love You But I've Chosen Darkness
 .../select?rows=100&q=%22I%20Love%20You%20But%20I\'ve%20Chosen%20Darkness%22&fq=type:html&hl=true&hl.fl=content&hl.fragsize=500&hl.snippets=5&hl.simple.pre=%3Cspan%3E&hl.simple.post=%3C/span%3E
 The highlight returns a snippet that does have the name altogether:
 Lights (Live) : <span>I</span> <span>Love</span> <span>You</span> But 
 <span>I've</span> <span>Chosen</span> <span>Darkness</span> :
 But also returns unrelated snips from the same page:
 Black Francis Shop <span>I</span> Think <span>I</span> <span>Love</span> 
 <span>You</span>
 A correct highlighter should not return snippets that do not match the phrase 
 exactly.
 LUCENE-794 (not yet committed, but seems to be ready) fixes up the problem 
 from the Lucene end. Solr should get it too.
 Related: SOLR-575 




[jira] Commented: (SOLR-379) KStem Token Filter

2008-05-15 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597185#action_12597185
 ] 

Otis Gospodnetic commented on SOLR-379:
---

It would be great to have this available in Solr.  Because of KStem's 
incompatible license, I don't know how we can handle this.  An incompatible 
license really just means we cannot distribute the KStem code (and cannot have 
it in the Lucene/Solr svn repository).  Usually, when incompatible licensing is 
a problem, the answer is to modify the build script to download the needed 
library on demand if it's not present locally.  This is what some of the Lucene 
contrib components do, for example.

However, looking at your ZIP file I see:

  -rw-r--r--  2836  15-Oct-2007  17:16:46  src/java/org/apache/solr/analysis/KStemFilterFactory.java
  -rw-r--r-- 4  15-Oct-2007  16:28:08  src/java/org/apache/lucene/analysis/KStemmer.java
  -rw-r--r--  4501  15-Oct-2007  17:08:38  src/java/org/apache/lucene/analysis/KStemFilter.java
  -rw-r--r-- 34259  15-Oct-2007  16:28:24  src/java/org/apache/lucene/analysis/KStemData8.java
  -rw-r--r-- 39918  15-Oct-2007  16:28:28  src/java/org/apache/lucene/analysis/KStemData7.java
  -rw-r--r-- 41412  15-Oct-2007  16:28:34  src/java/org/apache/lucene/analysis/KStemData6.java
  -rw-r--r-- 40457  15-Oct-2007  16:28:40  src/java/org/apache/lucene/analysis/KStemData5.java
  -rw-r--r-- 40823  15-Oct-2007  16:28:44  src/java/org/apache/lucene/analysis/KStemData4.java
  -rw-r--r-- 39808  15-Oct-2007  16:28:50  src/java/org/apache/lucene/analysis/KStemData3.java
  -rw-r--r-- 42696  15-Oct-2007  16:29:00  src/java/org/apache/lucene/analysis/KStemData2.java
  -rw-r--r-- 40020  15-Oct-2007  16:29:14  src/java/org/apache/lucene/analysis/KStemData1.java

But this is really just a duplicate of what's in 
http://ciir.cs.umass.edu/downloads/files/KStem.jar, plus the Solr-specific 
KStemFilterFactory.java.

So, could we simply download KStem.jar on demand?  And is 
KStemFilterFactory.java really copyright CIIR?  If we can change that to ASL, 
then we can include it in the repo, and with a modified build that downloads 
KStem.jar before compiling, this class would compile.


 KStem Token Filter
 --

 Key: SOLR-379
 URL: https://issues.apache.org/jira/browse/SOLR-379
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Pieter Berkel
Priority: Minor
 Attachments: KStemSolr.zip


 A Lucene / Solr implementation of the KStem stemmer.  Full credit goes to 
 Harry Wagner for adapting the Lucene version found here:
 http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi
 Background discussion to this stemmer (including licensing issues) can be 
 found in this thread:
 http://www.nabble.com/Embedded-about-50--faster-for-indexing-tf4325720.html#a12376295
 I've made some minor changes to KStemFilterFactory so that it compiles 
 cleanly against trunk:
 1) removed some unnecessary imports
 2) changed the init() method parameters introduced by SOLR-215
 3) moved KStemFilterFactory into package org.apache.solr.analysis
 Once compiled and included in your Solr war (or as a jar in your lib 
 directory), the KStem filter can be used in your schema very easily:
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" 
             words="stopwords.txt"/>
     <filter class="solr.StandardFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KStemFilterFactory" cacheSize="2"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>




Accessing IndexReader during core initialization hangs init

2008-05-15 Thread Shalin Shekhar Mangar
Hi,

While working on SOLR-572, I found that if I try to access the
IndexReader using SolrCore.getSearcher().get().getReader() within the
SolrCoreAware.inform method, the initialization process hangs.
Basically, the SolrCore.getSearcher halts at the searcherLock.wait()
call in the snippet below:

      // check to see if we can wait for someone else's searcher to be set
      if (onDeckSearchers>0 && !forceNew && _searcher==null) {
        try {
          searcherLock.wait();
        } catch (InterruptedException e) {
          log.info(SolrException.toStr(e));
        }
      }

Is this by design? Are SearchComponents not supposed to access the
IndexReader in this way? I needed access to the IndexReader so that I
can create the spell check index during core initialization. For now,
I've moved the index creation to the first query coming into
SpellCheckComponent (note to myself: review thread-safety in the init
code).
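
On the thread-safety note: one common way to make "build on first query" safe is a guarded one-time initialization, sketched below. This is not taken from the SOLR-572 patch; LazySpellIndex, buildIndex and ensureBuilt are hypothetical names, and the Runnable stands in for whatever actually creates the spell check index.

```java
/** Hypothetical holder that runs an expensive build step once, on first use. */
final class LazySpellIndex {
    private final Runnable buildIndex;        // the one-time work (assumed)
    private volatile boolean built = false;   // volatile: visible to all threads
    private final Object lock = new Object();

    LazySpellIndex(Runnable buildIndex) {
        this.buildIndex = buildIndex;
    }

    /** Call at the start of every request; only the first caller builds. */
    void ensureBuilt() {
        if (built) {
            return;                           // fast path, no locking
        }
        synchronized (lock) {
            if (!built) {                     // re-check while holding the lock
                buildIndex.run();
                built = true;                 // set only after a successful build
            }
        }
    }
}
```

Concurrent first queries block on the lock until the build finishes, so none of them sees a half-built index. If the build throws, the flag stays false and the next request retries.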

--
Regards,
Shalin Shekhar Mangar.


[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values

2008-05-15 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597226#action_12597226
 ] 

Mike Klaas commented on SOLR-556:
-

Thanks for the report, Lars.  I'll take a look at this shortly.

 Highlighting of multi-valued fields returns snippets which span multiple 
 different values
 -

 Key: SOLR-556
 URL: https://issues.apache.org/jira/browse/SOLR-556
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 1.3
 Environment: Tomcat 5.5
Reporter: Lars Kotthoff
Priority: Minor
 Attachments: solr-highlight-multivalued-example.xml, 
 solr-highlight-multivalued.patch


 When highlighting multi-valued fields, the highlighter sometimes returns 
 snippets which span multiple values, e.g. with values "foo" and "bar" and 
 search term "ba" the highlighter will create the snippet foo<em>ba</em>r. 
 Furthermore it sometimes returns smaller snippets than it should, e.g. with 
 value "foobar" and search term "oo" it will create the snippet <em>oo</em> 
 regardless of hl.fragsize.
 I have been unable to determine the real cause for this, or indeed what 
 actually goes on at all. To reproduce the problem, I've used the following 
 steps:
 * create an index with multi-valued fields, one document should have at least 
 3 values for these fields (in my case strings of length between 5 and 15 
 Japanese characters -- as far as I can tell plain old ASCII should produce 
 the same effect though)
 * search for part of a value in such a field with highlighting enabled, the 
 additional parameters I use are hl.fragsize=70, hl.requireFieldMatch=true, 
 hl.mergeContiguous=true (changing the parameters does not seem to have any 
 effect on the result though)
 * highlighted snippets should show effects described above
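
The expected behaviour can be illustrated with a small stand-alone sketch that treats each value of a multi-valued field separately, so a snippet can never run across a value boundary. It does not use the Solr or Lucene highlighter and says nothing about the attached patch; PerValueHighlighter and its methods are hypothetical names, and plain substring matching stands in for real term matching.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: highlight each field value on its own. */
final class PerValueHighlighter {

    /** Wraps every occurrence of term inside a single value in em tags. */
    static String highlight(String value, String term) {
        if (term.isEmpty()) {
            return value;
        }
        StringBuilder sb = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = value.indexOf(term, from)) >= 0) {
            sb.append(value, from, hit)       // text before the match
              .append("<em>").append(term).append("</em>");
            from = hit + term.length();
        }
        return sb.append(value.substring(from)).toString();
    }

    /** One snippet per matching value; values are never concatenated. */
    static List<String> snippets(List<String> values, String term) {
        List<String> out = new ArrayList<>();
        for (String v : values) {
            if (!term.isEmpty() && v.contains(term)) {
                out.add(highlight(v, term));
            }
        }
        return out;
    }
}
```

With values "foo" and "bar" and term "ba" this yields the single snippet <em>ba</em>r, and with value "foobar" and term "oo" it yields f<em>oo</em>bar, keeping the surrounding text of the value.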




[jira] Assigned: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values

2008-05-15 Thread Mike Klaas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Klaas reassigned SOLR-556:
---

Assignee: Mike Klaas




[jira] Created: (SOLR-576) Make DocSetHitCollector public

2008-05-15 Thread Jason Rutherglen (JIRA)
Make DocSetHitCollector public
--

 Key: SOLR-576
 URL: https://issues.apache.org/jira/browse/SOLR-576
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.3
Reporter: Jason Rutherglen
Priority: Minor


Make org.apache.solr.search.DocSetHitCollector public for use by other code




Re: svn commit: r656826 - in /lucene/solr/trunk/src: java/org/apache/solr/update/DirectUpdateHandler2.java test/org/apache/solr/update/AutoCommitTest.java

2008-05-15 Thread Yonik Seeley
On Thu, May 15, 2008 at 4:39 PM,  [EMAIL PROTECTED] wrote:
 remove last vestiges of maxPendingDeletes from DUH2

Oops, thanks - I guess I missed that (I previously did a quick grep
and didn't see anything).

-Yonik


[jira] Assigned: (SOLR-319) changes SynonymFilterFactory to Analyze synonyms file

2008-05-15 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned SOLR-319:
---

Assignee: Koji Sekiguchi

 changes SynonymFilterFactory to Analyze synonyms file
 --

 Key: SOLR-319
 URL: https://issues.apache.org/jira/browse/SOLR-319
 Project: Solr
  Issue Type: Improvement
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-319.patch, SOLR-319.patch, SOLR-319.patch


 WHAT:
 Currently, SynonymFilterFactory works very well with N-gram tokenizers 
 (CJKTokenizer, for example).
 But we have to take care with how rules are written in synonyms.txt.
 For example, if I use CJKTokenizer (which works as a bi-gram tokenizer for 
 CJK chars) and want C1C2C3 to map to C4C5C6,
 I have to write the rule as follows:
 C1C2 C2C3 => C4C5 C5C6
 But I want to write it as C1C2C3=>C4C5C6. This patch allows it. It is also 
 helpful for sharing synonyms.txt.
 HOW:
 A tokenFactory attribute is added to <filter 
 class="solr.SynonymFilterFactory"/>.
 If the attribute is specified, SynonymFilterFactory uses the TokenizerFactory 
 to create a Tokenizer.
 Then SynonymFilterFactory uses the Tokenizer to get tokens from the rules in 
 the synonyms.txt file.
 sample-1: CJKTokenizer
 <fieldtype name="text_cjk" class="solr.TextField" 
            positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.CJKTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" 
             synonyms="ngram_synonym_test_ja.txt"
             ignoreCase="true" expand="true" 
             tokenFactory="solr.CJKTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.CJKTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldtype>
 sample-2: NGramTokenizer
 <fieldtype name="text_ngram" class="solr.TextField" 
            positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" 
                maxGramSize="2"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" 
                maxGramSize="2"/>
     <filter class="solr.SynonymFilterFactory" 
             synonyms="ngram_synonym_test_ngram.txt"
             ignoreCase="true" expand="true"
             tokenFactory="solr.NGramTokenizerFactory" 
             minGramSize="2" maxGramSize="2"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldtype>
 backward compatibility:
 Yes. If you omit the tokenFactory attribute from the <filter 
 class="solr.SynonymFilterFactory"/> tag, it works as usual.
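
To see why running the rule text through the field's tokenizer helps, the sketch below splits each side of a "lhs=>rhs" rule into overlapping two-character grams, which is roughly what a bi-gram tokenizer produces for CJK text. It is an illustration only and does not use the Solr or Lucene APIs; BigramRuleExpander and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of bi-gram tokenization applied to a synonym rule. */
final class BigramRuleExpander {

    /**
     * Splits text into overlapping 2-character grams, e.g. "abc" gives [ab, bc].
     * Works on UTF-16 code units, so characters outside the BMP are not handled.
     */
    static List<String> bigrams(String text) {
        List<String> grams = new ArrayList<>();
        if (text.length() < 2) {              // too short to form a bigram
            if (!text.isEmpty()) {
                grams.add(text);
            }
            return grams;
        }
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }

    /** Rewrites "lhs=>rhs" so that both sides are space-separated bigrams. */
    static String expandRule(String rule) {
        String[] sides = rule.split("=>", 2);
        if (sides.length != 2) {
            throw new IllegalArgumentException("expected lhs=>rhs: " + rule);
        }
        return String.join(" ", bigrams(sides[0].trim()))
                + " => "
                + String.join(" ", bigrams(sides[1].trim()));
    }
}
```

Given the rule "abc=>xyz", expandRule returns "ab bc => xy yz", which has the same shape as the hand-written C1C2 C2C3 rule in the description.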




[jira] Commented: (SOLR-572) Spell Checker as a Search Component

2008-05-15 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597345#action_12597345
 ] 

Noble Paul commented on SOLR-572:
-

 * the spellcheck.dictionary=default must be optional in query. The user must 
be able to name a dictionary as 'default' and that can be used as the default 
if no value is passed.
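
A minimal sketch of the lookup being asked for, assuming the component keeps its dictionaries in a map keyed by name. DictionaryResolver and both maps are hypothetical; only the parameter name spellcheck.dictionary and the dictionary name "default" come from the comment above.

```java
import java.util.Map;

/** Hypothetical resolver: picks the dictionary for a request. */
final class DictionaryResolver {
    static final String PARAM = "spellcheck.dictionary";
    static final String DEFAULT_NAME = "default";

    /** Returns the dictionary named in the request, or the one named "default". */
    static <D> D resolve(Map<String, String> requestParams, Map<String, D> dictionaries) {
        // Fall back to "default" when the parameter is absent from the request.
        String name = requestParams.getOrDefault(PARAM, DEFAULT_NAME);
        D dictionary = dictionaries.get(name);
        if (dictionary == null) {
            throw new IllegalArgumentException(
                    "No spell check dictionary named '" + name + "'");
        }
        return dictionary;
    }
}
```

A request without the parameter resolves to whatever the user registered under the name "default"; a request naming an unknown dictionary fails loudly rather than silently using another one.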
 



 Spell Checker as a Search Component
 ---

 Key: SOLR-572
 URL: https://issues.apache.org/jira/browse/SOLR-572
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Affects Versions: 1.3
Reporter: Shalin Shekhar Mangar
 Fix For: 1.3

 Attachments: SOLR-572.patch





[jira] Commented: (SOLR-572) Spell Checker as a Search Component

2008-05-15 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597351#action_12597351
 ] 

Otis Gospodnetic commented on SOLR-572:
---

I had a quick look and it all looks nice and clean.
I like the config, though I think "solr" is too specific - the source field 
could be in a vanilla Lucene index that lives somewhere on disk, for example.  
Thus, I'd change "solr" to "index".  Oh, I see, you are reading field values 
from the index of the current core.  I think that is fine, but wouldn't it also 
be good to be able to read field values from a vanilla Lucene index? (But you 
wouldn't know the field type and thus would not be able to get the Analyzer for 
the field.)

Also, and regardless of the above, instead of having indexDir and path, why 
not call them both location and maybe even let them include the file: schema 
for consistency, if it works with the code that uses those locations?

Also on TODO:
* Read dictionary from plain-text files.




[jira] Commented: (SOLR-572) Spell Checker as a Search Component

2008-05-15 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597354#action_12597354
 ] 

Shalin Shekhar Mangar commented on SOLR-572:


Otis, I agree that we should call it "index" instead of "solr" for the type, 
and that path can be renamed to location. But indexDir refers to the target for 
the spell check index whereas path currently refers to the source of the 
dictionary, so IMHO we should keep indexDir as it is (it can also be a 
relative path).

For supporting arbitrary Lucene indices, the user must specify type="index", 
field="fieldName", location="path/to/lucene/index/directory", which should be 
enough (TODO). In that case the analyzer can be fixed as something (say 
WhitespaceAnalyzer or StandardAnalyzer).

I'm not sure I understand your comment on the schema. If this is for text files 
then I was thinking more about having a text file which would have one word per 
line and all those words would go into the same dictionary.




[jira] Commented: (SOLR-572) Spell Checker as a Search Component

2008-05-15 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597358#action_12597358
 ] 

Otis Gospodnetic commented on SOLR-572:
---

I see (indexDir comment).  Might be better to make it more obvious then - 
sourceIndex (for the Lucene index that serves as the source of data) vs. 
targetIndex (or spellcheckerIndex) for the resulting spellchecker index.

For Lucene indices to be used as sources of data, type="index", 
field="fieldName", location="path/to/lucene/index/directory" makes sense.

Ignore my comment about the schema, I'm just complicating things with that.  
Yes, one word per line for plain-text file data sources - that can easily be 
digested with PlainTextDictionary class (part of Lucene SC).
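
For reference, the one-word-per-line format is simple enough to show in plain Java. The sketch below only illustrates the file format and deliberately avoids the Lucene API; in the component itself the PlainTextDictionary class mentioned above would presumably do this job. WordListReader is a hypothetical name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical reader for a plain-text dictionary with one word per line. */
final class WordListReader {

    /** Reads all words, trimming whitespace and skipping blank or repeated lines. */
    static Set<String> readWords(Reader source) throws IOException {
        Set<String> words = new LinkedHashSet<>();   // keeps file order, drops repeats
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

All words from the file end up in the same set, matching the idea that they all go into a single dictionary.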

