Re: default text type and stop words
Thanks Mike. I realized the issue was not defined. I appreciate the guidance about process... very much.

Billy

Mike Klaas wrote:
On 2-Nov-07, at 11:02 PM, [EMAIL PROTECTED] wrote:
In a message dated 11/2/07 6:54:25 PM, [EMAIL PROTECTED] writes:
Even if the actual problem is at the Lucene level, perhaps it would be worth considering changes to the default to get around it.

newbie here. is this common practice? find a bug in a tightly coupled dependency and not deal with it there?

No--we fix bugs in Lucene if there is a genuine problem there. However, it isn't clear that this is a Lucene-caused problem.

-Mike
Re: default text type and stop words
Another alternative is to use stopwords selectively, as in phrases or other places where they have meaning. In the past, stopword removal was mostly done to save disk space and some computation, but disk is cheap, and as for computation, stopwords can help you get better results if handled right, so the computation cost may be worth it. If they truly were meaningless, why would they be in the language to begin with? :-)

-Grant

On Nov 6, 2007, at 1:36 AM, Walter Underwood wrote:
I also said, "Stopword removal is a reasonable default because it works fairly well for a general text corpus." Ultraseek keeps stopwords but most engines don't. I think it is fine as a default. I also think you have to understand stopwords at some point.

wunder

On 11/5/07 9:59 PM, Chris Hostetter [EMAIL PROTECTED] wrote:
: This isn't a problem in Lucene or Solr. It is a result of the analyzers
: you have chosen to use. If you choose to remove stopwords, you will not
: be able to match stopwords.

I believe paul's point was that this use of stopwords is in the "text" fieldtype in the example schema.xml ... which many people use as is.

I'm personally of the mindset that it's fine like it is. While people who understand that "an" is a stop word might ask "why does 'rating:PG AND name:an' match 40K movies, it should match 0?" there is another (probably larger) group of people who won't know how the search is implemented, or that "an" is a stop word, and they will look at the same results and ask "why am i getting 40K results? most of these don't have 'an' in the title? i should only be getting X results." That second group of people aren't going to be any happier if you give them 0 results instead -- at least this way people get some results to work with.

-Hoss

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
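Grant's suggestion of using stopwords selectively can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not how Lucene or Solr implement it: the stop list, the function names (`build_positional_index`, `phrase_match`, `term_match`) and the sample documents are all hypothetical. The index keeps every word with its position; stopwords are dropped only from loose term queries, while phrase queries keep them.

```python
# Hypothetical stop list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "the", "to", "be", "or", "not"}


def tokenize(text):
    """Lowercase and split on whitespace; nothing is removed here."""
    return text.lower().split()


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, keeping stopwords in the index."""
    index = {}
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(tokenize(text)):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return ids of docs containing the exact phrase, stopwords included."""
    terms = tokenize(phrase)
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    hits = []
    for doc_id, starts in postings[0].items():
        for start in starts:
            # Every later term must appear at the next consecutive position.
            if all(start + i in postings[i].get(doc_id, [])
                   for i in range(1, len(terms))):
                hits.append(doc_id)
                break
    return sorted(hits)


def term_match(index, query):
    """Return ids of docs containing all non-stopword query terms."""
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]
    if not terms:
        # Design choice of this sketch: an all-stopword term query matches nothing.
        return []
    return sorted(set.intersection(*(set(index.get(t, {})) for t in terms)))


docs = [
    "to be or not to be",
    "the play is not to be missed",
    "a question of being",
]
index = build_positional_index(docs)
```

With this setup, `phrase_match(index, "to be or not to be")` finds only the first document even though every word in the phrase is on the stop list, while `term_match(index, "the play")` ignores "the" and matches on "play" alone. The cost is the one Grant mentions: a larger index and more work per phrase query.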
Re: default text type and stop words
On 2-Nov-07, at 11:02 PM, [EMAIL PROTECTED] wrote:
In a message dated 11/2/07 6:54:25 PM, [EMAIL PROTECTED] writes:
Even if the actual problem is at the Lucene level, perhaps it would be worth considering changes to the default to get around it.

newbie here. is this common practice? find a bug in a tightly coupled dependency and not deal with it there?

No--we fix bugs in Lucene if there is a genuine problem there. However, it isn't clear that this is a Lucene-caused problem.

-Mike
RE: default text type and stop words
I don't know if the problem is in Lucene; I didn't investigate further. Maybe it's considered a feature, not a bug, by someone with different expectations.

Solr and Lucene have different release schedules, so even if the problem is in Lucene and it's addressed there, that doesn't guarantee it's solved in Solr. You would have to change from using a known stable version of Lucene to some nightly release that included a hypothetical patch, or to a patched custom version, for this one little edge case. It's unlikely that either of those is going to happen. Or consider changing a line of XML... I only suggested considering it.

There is also the concept of an anti-corruption layer in domain-driven design. There are issues of time frames, release schedules, and priorities, and I'm not assuming this edge case is a high priority. I merely pointed out an issue in the defaults. I also didn't say not to deal with a bug that hypothetically could be in a tightly coupled dependency.

Paul

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, November 02, 2007 11:02 PM
To: solr-dev@lucene.apache.org
Subject: Re: default text type and stop words

In a message dated 11/2/07 6:54:25 PM, [EMAIL PROTECTED] writes:
Even if the actual problem is at the Lucene level, perhaps it would be worth considering changes to the default to get around it.

newbie here. is this common practice? find a bug in a tightly coupled dependency and not deal with it there?

regard, billy
Re: default text type and stop words
On 11/5/07, Sundling, Paul [EMAIL PROTECTED] wrote:
I don't know if the problem is in Lucene, I didn't investigate further.

Yes, this is standard Lucene behavior, working as designed. Stop words are removed from the query as if they never existed at all (this makes some sense because they were removed during indexing too).

-Yonik
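The behavior Yonik describes can be reproduced with a toy model. This is a minimal sketch, assuming a whitespace tokenizer and a made-up stop list; it is not Lucene's or Solr's code, and `analyze`, `build_index`, `search` and the sample movies are hypothetical names and data invented for the illustration. The same analysis runs at index time and at query time, so a stop word produces no query clause at all, and 'rating:PG AND name:an' reduces to 'rating:PG'.

```python
# Hypothetical stop list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "the", "of"}


def analyze(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def build_index(docs, field):
    """Map each surviving term of `field` to the set of doc ids containing it."""
    index = {}
    for doc_id, doc in enumerate(docs):
        for term in analyze(doc[field]):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(docs, name_index, rating, name_query):
    """Evaluate rating:<rating> AND name:<name_query> against the toy index."""
    result = {i for i, d in enumerate(docs) if d["rating"] == rating}
    # Each term that survives analysis is a required clause. A stop word
    # is removed before this loop, so it adds no clause and filters nothing.
    for term in analyze(name_query):
        result &= name_index.get(term, set())
    return sorted(result)


# Made-up sample data.
docs = [
    {"name": "An Example Movie", "rating": "R"},
    {"name": "Green Ogre", "rating": "PG"},
    {"name": "An Imaginary Tale", "rating": "G"},
    {"name": "Secret Agent Kids", "rating": "PG"},
]
name_index = build_index(docs, "name")
```

Here `search(docs, name_index, "PG", "an")` returns both PG movies, neither of which has "an" in its name, while `search(docs, name_index, "PG", "kids")` narrows the result to the single matching title. That is the small-scale version of the 40K-results case discussed in this thread.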
Re: default text type and stop words
: This isn't a problem in Lucene or Solr. It is a result of the analyzers
: you have chosen to use. If you choose to remove stopwords, you will not
: be able to match stopwords.

I believe paul's point was that this use of stopwords is in the "text" fieldtype in the example schema.xml ... which many people use as is.

I'm personally of the mindset that it's fine like it is. While people who understand that "an" is a stop word might ask "why does 'rating:PG AND name:an' match 40K movies, it should match 0?" there is another (probably larger) group of people who won't know how the search is implemented, or that "an" is a stop word, and they will look at the same results and ask "why am i getting 40K results? most of these don't have 'an' in the title? i should only be getting X results." That second group of people aren't going to be any happier if you give them 0 results instead -- at least this way people get some results to work with.

-Hoss
Re: default text type and stop words
I also said, "Stopword removal is a reasonable default because it works fairly well for a general text corpus." Ultraseek keeps stopwords but most engines don't. I think it is fine as a default. I also think you have to understand stopwords at some point.

wunder

On 11/5/07 9:59 PM, Chris Hostetter [EMAIL PROTECTED] wrote:
: This isn't a problem in Lucene or Solr. It is a result of the analyzers
: you have chosen to use. If you choose to remove stopwords, you will not
: be able to match stopwords.

I believe paul's point was that this use of stopwords is in the "text" fieldtype in the example schema.xml ... which many people use as is.

I'm personally of the mindset that it's fine like it is. While people who understand that "an" is a stop word might ask "why does 'rating:PG AND name:an' match 40K movies, it should match 0?" there is another (probably larger) group of people who won't know how the search is implemented, or that "an" is a stop word, and they will look at the same results and ask "why am i getting 40K results? most of these don't have 'an' in the title? i should only be getting X results." That second group of people aren't going to be any happier if you give them 0 results instead -- at least this way people get some results to work with.

-Hoss
Re: default text type and stop words
In a message dated 11/2/07 6:54:25 PM, [EMAIL PROTECTED] writes:
Even if the actual problem is at the Lucene level, perhaps it would be worth considering changes to the default to get around it.

newbie here. is this common practice? find a bug in a tightly coupled dependency and not deal with it there?

regard, billy