RE: default text type and stop words
I don't know if the problem is in Lucene; I didn't investigate further. Maybe it's considered a feature, not a bug, by someone with different expectations. Given that Solr and Lucene have different release schedules, even if the problem is in Lucene and it's addressed there, that doesn't guarantee it's solved in Solr. You would have to change from a known stable version of Lucene to some nightly release that included a hypothetical patch, or to a patched custom version, for this one little edge case. It's probably unlikely that either of those is going to happen. Or consider changing a line of XML... I only suggested considering it. There is also the concept of an anti-corruption layer in domain-driven design. There are issues of time frames, release schedules, and priorities, and I'm not assuming this edge case is a high priority. I merely pointed out an issue in the defaults. I also didn't say not to deal with a bug that hypothetically could be in a tightly coupled dependency.

Paul

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 02, 2007 11:02 PM
To: solr-dev@lucene.apache.org
Subject: Re: default text type and stop words

In a message dated 11/2/07 6:54:25 PM, [EMAIL PROTECTED] writes:

> Even if the actual problem is at the Lucene level, perhaps it would be
> worth considering changes to the default to get around it.

newbie here. is this common practice? find a bug in a tightly coupled dependency and not deal with it there?

regards,
billy
default text type and stop words
I noticed very unexpected results when using stop words in queries, with and without AND conditions, using the default text type. A normal query on a stop word alone returns no results, as expected. For example, with 'an' being a stop word:

  movieName:an        (results: 0, since it's a stop word)
  movieName:another   (results: 237)
  rating:PG-13        (results: 76095)

When clauses are combined with AND, the result count for a normal non-stop word like 'another' is less than or equal to the smaller of the counts being ANDed, as expected. So adding an AND clause on a stop word should give 0 results. Instead:

  rating:PG-13 AND movieName:another   (results: 46)
  rating:PG-13 AND movieName:an        (results: 76095, should be 0)

So instead of ANDing the stop word clause, it seems to be ignored. Commenting out the stop word filter from the query analyzer of the text type corrects this behavior, although I'm not sure that's a real solution. Even if the actual problem is at the Lucene level, perhaps it would be worth considering changes to the default to get around it.

Workaround:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- comment out to prevent strange behavior
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Paul Sundling
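One way to see why the AND clause vanishes: if the query-time analyzer drops the stop word, the query parser is left with no tokens for that field, so no clause is produced and only the remaining clause constrains the result. A minimal plain-Java sketch of that effect (this simulates the filter, it is not Solr's actual analyzer classes, and the stop word list here is illustrative rather than the contents of stopwords.txt):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordQueryDemo {
    // Illustrative stop word list; Solr's real list comes from stopwords.txt
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Simulate the query-time analyzer: only tokens that survive the
    // stop filter become query clauses
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (!STOP_WORDS.contains(tok)) {
                tokens.add(tok);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "an" analyzes to no tokens, so movieName:an contributes no clause
        // and rating:PG-13 AND movieName:an degenerates to rating:PG-13
        System.out.println(analyze("an"));       // prints []
        System.out.println(analyze("another"));  // prints [another]
    }
}
```

With stop filtering only at index time (the workaround above), the query-side token survives and matches nothing, which yields the expected 0 results.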
RE: [Solr Wiki] Update of SolrPerformanceFactors by paulsundling
Sorry, I replied to a subset of that question on the user list. I'll include my whole message, since it also relates to a past topic on this list (Time for a cleaner API):

The embedded approach is at http://wiki.apache.org/solr/EmbeddedSolr

For my testing I have a tunable setting for records to submit, and did 10 per batch. Both approaches committed after every 1000 records, also tunable. A custom Lucene implementation I helped implement was even faster than embedded, using a RAM drive as a double buffer, although that did require a much larger memory footprint.

The embedded classes have little to no documentation and almost look like stub implementations, but they work well. While this project will succeed in large part because of how easy it is to integrate with non-Java clients, I would actually like to see the project become more Java-friendly, for example with a reference indexing implementation. There are a lot of tools that could be more widely useful, like SimplePostTool. With a few API changes it could serve the demo as well as being a useful library. Instead I extended it, then had to abandon that and resort to cut-and-paste reuse in the end. The functionality was 95% there; it just needed API tweaks to make it usable.

It also seems unusual to expose fields directly instead of using accessors in the Java code. Accessors can give a lot of flexibility that field access doesn't have. It would also be nice to be able to get Java objects back besides XML and JSON, like an embedded equivalent for search. That way you could integrate more easily with Spring MVC, etc. There may also be some performance gains there.
Paul Sundling

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley
Sent: Friday, August 24, 2007 1:07 PM
To: solr-dev@lucene.apache.org
Subject: Re: [Solr Wiki] Update of SolrPerformanceFactors by paulsundling

On 8/24/07, Apache Wiki [EMAIL PROTECTED] wrote:
> + Using an [EmbeddedSolr] for indexing can be over 50% faster than one
> + using XML messages that are posted.

Paul, were the documents posted one per message, or did you try multiple (like 50 to 100) per message? If one per message, the best way to increase performance is to have multiple threads adding docs. I'd be curious to know how a single CSV file would clock in as well...

-Yonik
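On the docs-per-message point above: Solr's XML update format accepts multiple doc elements in a single add message, so documents can be batched even when posting over HTTP. A sketch of the message shape (the field names follow the movieName example earlier in this digest, and the document contents are illustrative):

```xml
<add>
  <doc>
    <field name="id">1</field>
    <field name="movieName">Another Day</field>
    <field name="rating">PG-13</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="movieName">Another Night</field>
    <field name="rating">PG-13</field>
  </doc>
  <!-- ... more docs per message, POSTed to the update handler -->
</add>
```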
RE: [jira] Updated: (SOLR-326) cleanup eclipse warnings
So would it be useful to keep that JIRA issue open after the patch is submitted, to allow ongoing patch submissions?

Paul Sundling

-----Original Message-----
From: Paul Sundling (JIRA) [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 02, 2007 4:24 PM
To: solr-dev@lucene.apache.org
Subject: [jira] Updated: (SOLR-326) cleanup eclipse warnings

[ https://issues.apache.org/jira/browse/SOLR-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Sundling updated SOLR-326:
-------------------------------
Attachment: remove_unused_imports_patch.txt

This should remove the unused-import Eclipse warnings.

> cleanup eclipse warnings
>
> Key: SOLR-326
> URL: https://issues.apache.org/jira/browse/SOLR-326
> Project: Solr
> Issue Type: Improvement
> Reporter: Paul Sundling
> Priority: Minor
> Attachments: remove_unused_imports_patch.txt
>
> On default settings, Eclipse had 628 warnings. This patch removes 119 of
> those warnings, related to unused imports. These are the safest warnings
> to fix and shouldn't require any testing other than confirming that the
> build still works. The general idea of removing warnings is cleaner code,
> and also making it harder for interesting warnings to get hidden among
> uninteresting ones.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
FW: EdgeNGramTokenizer errors in eclipse
Has anyone else noticed this? I didn't get any response on the user list.

Paul Sundling

-----Original Message-----
From: Sundling, Paul
Sent: Tuesday, July 24, 2007 4:55 PM
To: [EMAIL PROTECTED]
Subject: EdgeNGramTokenizer errors in eclipse

I checked out the latest Solr source code from Subversion and put it in an Eclipse project. I used all the jars for the project (had to add junit). I get errors in Eclipse about two constants not being defined in one of the library jars (based on the import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer): EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE and EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE are not defined. So was a class that this Solr class depends on changed? The error happens in org.apache.solr.analysis.EdgeNGramTokenizerFactory:

    maxGramSize = (maxArg != null
        ? Integer.parseInt(maxArg)
        : EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE);
    String minArg = args.get("minGramSize");
    minGramSize = (minArg != null
        ? Integer.parseInt(minArg)
        : EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE);

Am I doing something wrong?

Paul Sundling
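If those constants really are absent from the Lucene jar on the classpath (e.g. a jar older than the Solr checkout expects), one stopgap while sorting out the jar versions is to define local fallbacks and keep the factory's parse-or-default pattern. A sketch in plain Java; the class name and fallback values here are hypothetical stand-ins, not Lucene's actual constants, so verify the values against the EdgeNGramTokenizer in the jar you actually build with:

```java
import java.util.HashMap;
import java.util.Map;

public class GramSizeConfig {
    // Hypothetical local fallbacks standing in for
    // EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE / DEFAULT_MAX_GRAM_SIZE;
    // check these against the Lucene jar on your classpath
    static final int FALLBACK_MIN_GRAM_SIZE = 1;
    static final int FALLBACK_MAX_GRAM_SIZE = 1;

    // Same pattern as the factory: use the configured arg if present,
    // otherwise fall back to the default
    static int parseOrDefault(Map<String, String> args, String key, int dflt) {
        String value = args.get(key);
        return value != null ? Integer.parseInt(value) : dflt;
    }

    public static void main(String[] args) {
        Map<String, String> factoryArgs = new HashMap<>();
        factoryArgs.put("maxGramSize", "3");  // minGramSize left unset
        int max = parseOrDefault(factoryArgs, "maxGramSize", FALLBACK_MAX_GRAM_SIZE);
        int min = parseOrDefault(factoryArgs, "minGramSize", FALLBACK_MIN_GRAM_SIZE);
        System.out.println(min + ".." + max);  // prints 1..3
    }
}
```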