[jira] Updated: (SOLR-572) Spell Checker as a Search Component
[ https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-572: --- Summary: Spell Checker as a Search Component (was: Spell Checker as a Search Handler) Spell Checker as a Search Component --- Key: SOLR-572 URL: https://issues.apache.org/jira/browse/SOLR-572 Project: Solr Issue Type: New Feature Components: spellchecker Affects Versions: 1.3 Reporter: Shalin Shekhar Mangar Fix For: 1.3 Expose the Lucene contrib SpellChecker as a Search Component. Provide the following features: * Allow creating a spell index on a given field and make it possible to have multiple spell indices -- one for each field * Give suggestions on a per-field basis * Given a multi-word query, give only one consistent suggestion * Process the query with the same analyzer specified for the source field and process each token separately * Allow the user to specify minimum length for a token (optional) Consistency criteria for a multi-word query can consist of the following: * Preserve the correct words in the original query as it is * Never give duplicate words in a suggestion -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
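The two consistency criteria above can be sketched as follows. This is only an illustration under stated assumptions: the class name, method name, and the shape of the candidate map are hypothetical and are not taken from the SOLR-572 patch or from the Lucene SpellChecker API. It assumes the per-token candidates have already been looked up and are ordered best first.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; not part of the SOLR-572 patch. */
final class ConsistentSuggestion {

    /**
     * Builds a single suggestion for a multi-word query.
     *
     * @param tokens     query tokens, already analyzed
     * @param dictionary words considered correctly spelled
     * @param candidates per-token candidate corrections, best first (assumed input)
     */
    static String suggest(List<String> tokens, Set<String> dictionary,
                          Map<String, List<String>> candidates) {
        // Reserve the correctly spelled words first, so a correction picked
        // for another token can never duplicate one of them.
        Set<String> used = new LinkedHashSet<>();
        for (String token : tokens) {
            if (dictionary.contains(token)) {
                used.add(token);
            }
        }

        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (dictionary.contains(token)) {
                out.add(token); // criterion 1: preserve correct words as they are
                continue;
            }
            String chosen = token; // fall back to the original token
            for (String candidate : candidates.getOrDefault(token, List.of())) {
                if (!used.contains(candidate)) { // criterion 2: no duplicate words
                    chosen = candidate;
                    break;
                }
            }
            used.add(chosen);
            out.add(chosen);
        }
        return String.join(" ", out);
    }
}
```

With candidates `aple -> [apple, ample]` and `appel -> [apple, appeal]`, the query `aple appel` yields `apple appeal`: the second token skips `apple` because it is already used.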
[jira] Updated: (SOLR-553) Highlighter does not match phrase queries correctly
[ https://issues.apache.org/jira/browse/SOLR-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bojan Smid updated SOLR-553:
Attachment: Solr-553.patch

Added unit test for this fix to the patch.

Highlighter does not match phrase queries correctly
Key: SOLR-553
URL: https://issues.apache.org/jira/browse/SOLR-553
Project: Solr
Issue Type: New Feature
Components: highlighter
Affects Versions: 1.2
Environment: all
Reporter: Brian Whitman
Assignee: Otis Gospodnetic
Attachments: highlighttest.xml, Solr-553.patch, Solr-553.patch

http://www.nabble.com/highlighting-pt2%3A-returning-tokens-out-of-order-from-PhraseQuery-to16156718.html

Say we search for the band "I Love You But I've Chosen Darkness":

.../select?rows=100&q=%22I%20Love%20You%20But%20I\'ve%20Chosen%20Darkness%22&fq=type:html&hl=true&hl.fl=content&hl.fragsize=500&hl.snippets=5&hl.simple.pre=%3Cspan%3E&hl.simple.post=%3C/span%3E

The highlight returns a snippet that does have the name altogether:

Lights (Live) : <span>I</span> <span>Love</span> <span>You</span> But <span>I've</span> <span>Chosen</span> <span>Darkness</span> :

But also returns unrelated snips from the same page:

Black Francis Shop <span>I</span> Think <span>I</span> <span>Love</span> <span>You</span>

A correct highlighter should not return snippets that do not match the phrase exactly. LUCENE-794 (not yet committed, but seems to be ready) fixes up the problem from the Lucene end. Solr should get it too.

Related: SOLR-575
[jira] Commented: (SOLR-379) KStem Token Filter
[ https://issues.apache.org/jira/browse/SOLR-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597185#action_12597185 ]

Otis Gospodnetic commented on SOLR-379:

It would be great to have this available in Solr. Because of Kstem's incompatible library, I don't know how we can handle this.

Incompatible license really just means we cannot distribute the KStem code (and cannot have it in the Lucene/Solr svn repository). Usually when incompatible licensing is a problem we say "modify the build script to download the needed library on demand if it's not present locally". This is what some of the Lucene contrib components do, for example.

However, looking at your ZIP file I see:

-rw-r--r-- 2836 15-Oct-2007 17:16:46 src/java/org/apache/solr/analysis/KStemFilterFactory.java
-rw-r--r-- 4 15-Oct-2007 16:28:08 src/java/org/apache/lucene/analysis/KStemmer.java
-rw-r--r-- 4501 15-Oct-2007 17:08:38 src/java/org/apache/lucene/analysis/KStemFilter.java
-rw-r--r-- 34259 15-Oct-2007 16:28:24 src/java/org/apache/lucene/analysis/KStemData8.java
-rw-r--r-- 39918 15-Oct-2007 16:28:28 src/java/org/apache/lucene/analysis/KStemData7.java
-rw-r--r-- 41412 15-Oct-2007 16:28:34 src/java/org/apache/lucene/analysis/KStemData6.java
-rw-r--r-- 40457 15-Oct-2007 16:28:40 src/java/org/apache/lucene/analysis/KStemData5.java
-rw-r--r-- 40823 15-Oct-2007 16:28:44 src/java/org/apache/lucene/analysis/KStemData4.java
-rw-r--r-- 39808 15-Oct-2007 16:28:50 src/java/org/apache/lucene/analysis/KStemData3.java
-rw-r--r-- 42696 15-Oct-2007 16:29:00 src/java/org/apache/lucene/analysis/KStemData2.java
-rw-r--r-- 40020 15-Oct-2007 16:29:14 src/java/org/apache/lucene/analysis/KStemData1.java

But this is really just a duplicate of what's in http://ciir.cs.umass.edu/downloads/files/KStem.jar, plus the Solr-specific KStemFilterFactory.java. So, could we simply download KStem.jar on demand? And is KStemFilterFactory.java really copyright CIIR? If we can change that to ASL, then we can include it in the repo, and with a modified build that downloads KStem.jar before compiling, this class would compile.

KStem Token Filter
Key: SOLR-379
URL: https://issues.apache.org/jira/browse/SOLR-379
Project: Solr
Issue Type: New Feature
Components: search
Reporter: Pieter Berkel
Priority: Minor
Attachments: KStemSolr.zip

A Lucene / Solr implementation of the KStem stemmer. Full credit goes to Harry Wagner for adapting the Lucene version found here: http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi

Background discussion to this stemmer (including licensing issues) can be found in this thread: http://www.nabble.com/Embedded-about-50--faster-for-indexing-tf4325720.html#a12376295

I've made some minor changes to KStemFilterFactory so that it compiles cleanly against trunk:
1) removed some unnecessary imports
2) changed the init() method parameters introduced by SOLR-215
3) moved KStemFilterFactory into package org.apache.solr.analysis

Once compiled and included in your Solr war (or as a jar in your lib directory), the KStem filter can be used in your schema very easily:

    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KStemFilterFactory" cacheSize="2"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
Accessing IndexReader during core initialization hangs init
Hi,

While working on SOLR-572, I found that if I try to access the IndexReader using SolrCore.getSearcher().get().getReader() within the SolrCoreAware.inform method, the initialization process hangs. Basically, SolrCore.getSearcher halts at the searcherLock.wait() call in the snippet below:

    // check to see if we can wait for someone else's searcher to be set
    if (onDeckSearchers>0 && !forceNew && _searcher==null) {
      try {
        searcherLock.wait();
      } catch (InterruptedException e) {
        log.info(SolrException.toStr(e));
      }
    }

Is this by design? Are SearchComponents not supposed to access the IndexReader in this way? I needed access to the IndexReader so that I can create the spell check index during core initialization. For now, I've moved the index creation to the first query coming into SpellCheckComponent (note to myself: review thread-safety in the init code).

--
Regards,
Shalin Shekhar Mangar.
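On the thread-safety note: one common way to defer a one-time build to the first request is a lazily initialized holder with double-checked locking. The sketch below is a generic illustration, not Solr code; `LazyInit` and the builder supplier are hypothetical names, and it assumes the builder never returns null.

```java
import java.util.function.Supplier;

/** Hypothetical holder; illustrative only, not Solr API. */
final class LazyInit<T> {
    private final Supplier<T> builder;
    // volatile so that a fully constructed value is safely published to other threads
    private volatile T value;

    LazyInit(Supplier<T> builder) {
        this.builder = builder;
    }

    /** Builds the value on the first call; later calls return the cached value. */
    T get() {
        T v = value;
        if (v == null) {                 // first check, without locking
            synchronized (this) {
                v = value;
                if (v == null) {         // second check, under the lock
                    v = builder.get();   // assumed to return non-null
                    value = v;
                }
            }
        }
        return v;
    }
}
```

A component could hold something like `new LazyInit<>(() -> buildSpellIndex())` (where `buildSpellIndex` is a placeholder for the real work) and call `get()` when a query arrives, so concurrent first queries trigger only one build.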
[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values
[ https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597226#action_12597226 ]

Mike Klaas commented on SOLR-556:

Thanks for the report, Lars. I'll take a look at this shortly.

Highlighting of multi-valued fields returns snippets which span multiple different values
Key: SOLR-556
URL: https://issues.apache.org/jira/browse/SOLR-556
Project: Solr
Issue Type: Bug
Components: highlighter
Affects Versions: 1.3
Environment: Tomcat 5.5
Reporter: Lars Kotthoff
Priority: Minor
Attachments: solr-highlight-multivalued-example.xml, solr-highlight-multivalued.patch

When highlighting multi-valued fields, the highlighter sometimes returns snippets which span multiple values, e.g. with values "foo" and "bar" and search term "ba" the highlighter will create the snippet "foo<em>ba</em>r". Furthermore it sometimes returns smaller snippets than it should, e.g. with value "foobar" and search term "oo" it will create the snippet "<em>oo</em>" regardless of hl.fragsize.

I have been unable to determine the real cause for this, or indeed what actually goes on at all. To reproduce the problem, I've used the following steps:
* create an index with multi-valued fields, one document should have at least 3 values for these fields (in my case strings of length between 5 and 15 Japanese characters -- as far as I can tell plain old ASCII should produce the same effect though)
* search for part of a value in such a field with highlighting enabled, the additional parameters I use are hl.fragsize=70, hl.requireFieldMatch=true, hl.mergeContiguous=true (changing the parameters does not seem to have any effect on the result though)
* highlighted snippets should show effects described above
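The expected behaviour from the report can be pinned down with a toy example. The sketch below is not the Solr highlighter and does not reflect the attached patch; `PerValueHighlighter` is a hypothetical name, it uses plain substring matching, and it ignores hl.fragsize. It only shows the property at stake: each value is highlighted on its own, so a snippet can never span two values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not the Solr highlighter. */
final class PerValueHighlighter {

    /** Returns one highlighted snippet for every value that contains the term. */
    static List<String> highlight(List<String> values, String term) {
        List<String> snippets = new ArrayList<>();
        if (term.isEmpty()) {
            return snippets; // nothing to highlight
        }
        for (String value : values) {
            // Each value is processed separately, so text from a
            // neighbouring value can never leak into the snippet.
            if (value.contains(term)) {
                snippets.add(value.replace(term, "<em>" + term + "</em>"));
            }
        }
        return snippets;
    }
}
```

For the values `foo` and `bar` with the term `ba`, this returns the single snippet `<em>ba</em>r`; for the value `foobar` with the term `oo`, it returns `f<em>oo</em>bar`.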
[jira] Assigned: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values
[ https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Klaas reassigned SOLR-556:
Assignee: Mike Klaas

Highlighting of multi-valued fields returns snippets which span multiple different values
Key: SOLR-556
URL: https://issues.apache.org/jira/browse/SOLR-556
[jira] Created: (SOLR-576) Make DocSetHitCollector public
Make DocSetHitCollector public
Key: SOLR-576
URL: https://issues.apache.org/jira/browse/SOLR-576
Project: Solr
Issue Type: Improvement
Components: search
Affects Versions: 1.3
Reporter: Jason Rutherglen
Priority: Minor

Make org.apache.solr.search.DocSetHitCollector public for use by other code
Re: svn commit: r656826 - in /lucene/solr/trunk/src: java/org/apache/solr/update/DirectUpdateHandler2.java test/org/apache/solr/update/AutoCommitTest.java
On Thu, May 15, 2008 at 4:39 PM, [EMAIL PROTECTED] wrote:
> remove last vestiges of maxPendingDeletes from DUH2

Oops, thanks - I guess I missed that (I previously did a quick grep and didn't see anything).

-Yonik
[jira] Assigned: (SOLR-319) changes SynonymFilterFactory to Analyze synonyms file
[ https://issues.apache.org/jira/browse/SOLR-319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi reassigned SOLR-319:
Assignee: Koji Sekiguchi

changes SynonymFilterFactory to Analyze synonyms file
Key: SOLR-319
URL: https://issues.apache.org/jira/browse/SOLR-319
Project: Solr
Issue Type: Improvement
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
Attachments: SOLR-319.patch, SOLR-319.patch, SOLR-319.patch

WHAT:
Currently, SynonymFilterFactory works very well with an N-gram tokenizer (CJKTokenizer, for example). But we have to take care with the rules in synonyms.txt. For example, if I use CJKTokenizer (which works as a bi-gram tokenizer for CJK chars) and want C1C2C3 to map to C4C5C6, I have to write the rule as follows:

C1C2 C2C3 => C4C5 C5C6

But I want to write it as C1C2C3 => C4C5C6. This patch allows it. It is also helpful for sharing synonyms.txt.

HOW:
A tokenFactory attribute is added to <filter class="solr.SynonymFilterFactory"/>. If the attribute is specified, SynonymFilterFactory uses the TokenizerFactory to create a Tokenizer. SynonymFilterFactory then uses the Tokenizer to get tokens from the rules in the synonyms.txt file.

sample-1: CJKTokenizer

    <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.CJKTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ja.txt" ignoreCase="true" expand="true" tokenFactory="solr.CJKTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.CJKTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

sample-2: NGramTokenizer

    <fieldtype name="text_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ngram.txt" ignoreCase="true" expand="true" tokenFactory="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

backward compatibility:
Yes. If you omit the tokenFactory attribute from the <filter class="solr.SynonymFilterFactory"/> tag, it works as usual.
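The idea of running the tokenizer over each side of a synonym rule can be sketched without any Lucene dependency. The code below is a simplified stand-in, not the patch: `BigramSynonymRule` and its methods are hypothetical, and the bigram splitter is a naive character-based one that only approximates what CJKTokenizer or NGramTokenizer actually produce.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not the SOLR-319 patch. */
final class BigramSynonymRule {

    /** Splits text into overlapping two-character tokens, e.g. "abc" -> [ab, bc]. */
    static List<String> bigrams(String text) {
        List<String> grams = new ArrayList<>();
        if (text.length() < 2) {
            if (!text.isEmpty()) {
                grams.add(text); // a single character is kept as its own token
            }
            return grams;
        }
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }

    /** Rewrites a rule "lhs => rhs" so that both sides are expressed as bigram tokens. */
    static String analyzeRule(String rule) {
        String[] sides = rule.split("=>", 2);
        if (sides.length != 2) {
            throw new IllegalArgumentException("expected 'lhs => rhs' but got: " + rule);
        }
        return String.join(" ", bigrams(sides[0].trim()))
                + " => "
                + String.join(" ", bigrams(sides[1].trim()));
    }
}
```

Here `analyzeRule("abc => xyz")` produces `ab bc => xy yz`, which mirrors the hand-written form `C1C2 C2C3 => C4C5 C5C6` from the description.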
[jira] Commented: (SOLR-572) Spell Checker as a Search Component
[ https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597345#action_12597345 ]

Noble Paul commented on SOLR-572:

* spellcheck.dictionary=default must be optional in the query. The user must be able to name a dictionary 'default', and that dictionary is used as the default if no value is passed.

Spell Checker as a Search Component
Key: SOLR-572
URL: https://issues.apache.org/jira/browse/SOLR-572
Project: Solr
Issue Type: New Feature
Components: spellchecker
Affects Versions: 1.3
Reporter: Shalin Shekhar Mangar
Fix For: 1.3
Attachments: SOLR-572.patch
[jira] Commented: (SOLR-572) Spell Checker as a Search Component
[ https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597351#action_12597351 ]

Otis Gospodnetic commented on SOLR-572:

I had a quick look and it all looks nice and clean. I like the config, though I think "solr" is too specific - the source field could be in a vanilla Lucene index that lives somewhere on disk, for example. Thus, I'd change "solr" to "index".

Oh, I see, you are reading field values from the index of the current core. I think that is fine, but wouldn't it also be good to be able to read field values from a vanilla Lucene index? (but you wouldn't know the field type and thus would not be able to get the Analyzer for the field)

Also, and regardless of the above, instead of having indexDir and path, why not call them both location and maybe even let them include the file: schema for consistency, if it works with the code that uses those locations?

Also on TODO:
* Read dictionary from plain-text files.
[jira] Commented: (SOLR-572) Spell Checker as a Search Component
[ https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597354#action_12597354 ]

Shalin Shekhar Mangar commented on SOLR-572:

Otis, I agree that we should call it "index" instead of "solr" for the type, and path can be renamed to location. But indexDir refers to the target for the spell check index whereas path currently refers to the source of the dictionary, so IMHO we should keep indexDir as it is (it can also be a relative path).

For supporting arbitrary Lucene indices, the user must specify type=index, field=fieldName, location=path/to/lucene/index/directory, which should be enough (TODO). In that case the analyzer can be fixed as something (say WhitespaceAnalyzer or StandardAnalyzer).

I'm not sure I understand your comment on the schema. If this is for text files then I was thinking more about having a text file which would have one word per line, and all those words would go into the same dictionary.
[jira] Commented: (SOLR-572) Spell Checker as a Search Component
[ https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597358#action_12597358 ]

Otis Gospodnetic commented on SOLR-572:

I see (indexDir comment). Might be better to make it more obvious then - sourceIndex (for the Lucene index that serves as the source of data) vs. targetIndex (or spellcheckerIndex) for the resulting spellchecker index.

For Lucene indices to be used as sources of data, type=index, field=fieldName, location=path/to/lucene/index/directory makes sense.

Ignore my comment about the schema, I'm just complicating things with that. Yes, one word per line for plain-text file data sources - that can easily be digested with the PlainTextDictionary class (part of Lucene SC).
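The one-word-per-line format is simple enough to sketch with the standard library alone. The code below is a stand-in for illustration and does not use or reproduce Lucene's PlainTextDictionary; `WordListReader` is a hypothetical name, and skipping blank lines and dropping duplicates are assumptions about how such a file might be treated.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical reader for a one-word-per-line dictionary file. */
final class WordListReader {

    /** Reads all words from the source, trimming whitespace and skipping blank lines. */
    static Set<String> read(Reader source) {
        Set<String> words = new LinkedHashSet<>(); // keeps file order, drops duplicates
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

For example, `WordListReader.read(new java.io.StringReader("solr\n\nlucene\nsolr\n"))` returns a set containing `solr` and `lucene`.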