Re: default text type and stop words

2007-11-07 Thread Will Martin
Thanks Mike. I realized the issue was not defined. I appreciate the 
guidance about process...very much.


Billy

Mike Klaas wrote:

On 2-Nov-07, at 11:02 PM, [EMAIL PROTECTED] wrote:

In a message dated 11/2/07 6:54:25 PM, [EMAIL PROTECTED] writes:

Even if the actual problem is at the Lucene level, perhaps it would be worth
considering changes to the default to get around it.

newbie here. is this common practice? find a bug in a tightly coupled
dependency and not deal with it there?


No--we fix bugs in Lucene if there is a genuine problem there.  
However, it isn't clear that this is a Lucene-caused problem.


-Mike


Re: default text type and stop words

2007-11-06 Thread Grant Ingersoll
Another alternative is to use stopwords selectively, as in phrases
or other places where they have meaning.  In the past, stopword
removal was mostly done to save disk space and some computation, but
disk is cheap, and as for computation, well, stopwords can help you get
better results if handled right, so the computation cost may be worth it.
If they truly were meaningless, why would they be in the language to
begin with? :-)
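
One way to read the suggestion above, sketched in Python under stated assumptions: keep every token in the indexed text, ignore stop words for ordinary term queries, but honor them inside phrase queries. The stop list and the function names are hypothetical, and this is not how Lucene or Solr implement phrase search.

```python
# Toy illustration of "selective" stop word handling; NOT Lucene code.
# STOP_WORDS, tokens, term_match and phrase_match are hypothetical names.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "an"}


def tokens(text):
    """Lowercase and split on whitespace, keeping stop words."""
    return text.lower().split()


def term_match(doc, query):
    """Ordinary query: stop words in the query are ignored.

    In this sketch a query made only of stop words matches nothing,
    because no terms are left to look for.
    """
    wanted = [t for t in tokens(query) if t not in STOP_WORDS]
    doc_terms = set(tokens(doc))
    return bool(wanted) and all(t in doc_terms for t in wanted)


def phrase_match(doc, phrase):
    """Phrase query: every token counts, stop words included."""
    d, p = tokens(doc), tokens(phrase)
    return bool(p) and any(
        d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1)
    )


doc = "to be or not to be that is the question"
print(term_match(doc, "to be or not to be"))    # False
print(phrase_match(doc, "to be or not to be"))  # True
print(phrase_match(doc, "be to"))               # False
```

The cost is the one mentioned above: the index has to store the stop words and their positions, which takes more disk and more work at query time.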


-Grant

On Nov 6, 2007, at 1:36 AM, Walter Underwood wrote:

I also said, Stopword removal is a reasonable default because it works
fairly well for a general text corpus. Ultraseek keeps stopwords but
most engines don't. I think it is fine as a default. I also think you
have to understand stopwords at some point.

wunder

On 11/5/07 9:59 PM, Chris Hostetter [EMAIL PROTECTED] wrote:

: This isn't a problem in Lucene or Solr. It is a result of the analyzers
: you have chosen to use. If you choose to remove stopwords, you will not
: be able to match stopwords.

I believe paul's point was that this use of stopwords is in the text
fieldtype in the example schema.xml ... which many people use as is.

I'm personally of the mindset that it's fine like it is.  While people who
understand that an is a stop word might ask why does 'rating:PG AND
name:an' match 40K movies, it should match 0? there is another (probably
larger) group of people who won't know how the search is implemented, or
that an is a stop word, and they will look at the same results and ask
why am i getting 40K results? most of these don't have 'an' in the title?
i should only be getting X results.

That second group of people aren't going to be any happier if you
give them 0 results instead -- at least this way people get some results
to work with.

-Hoss





--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




Re: default text type and stop words

2007-11-05 Thread Mike Klaas

On 2-Nov-07, at 11:02 PM, [EMAIL PROTECTED] wrote:

In a message dated 11/2/07 6:54:25 PM, [EMAIL PROTECTED] writes:

Even if the actual problem is at the Lucene level, perhaps it would be worth
considering changes to the default to get around it.

newbie here. is this common practice? find a bug in a tightly coupled
dependency and not deal with it there?


No--we fix bugs in Lucene if there is a genuine problem there.   
However, it isn't clear that this is a Lucene-caused problem.


-Mike


RE: default text type and stop words

2007-11-05 Thread Sundling, Paul
I don't know if the problem is in Lucene, I didn't investigate further.
Maybe it's considered a feature, not a bug for someone with different
expectations.

Solr and Lucene have different release schedules, so even if the problem
is in Lucene and it's addressed there, that doesn't guarantee it's solved
in Solr.  You would have to change from using a known stable version of
Lucene to some nightly release that included a hypothetical patch, or to
a patched custom version, for this one little edge case.  It's unlikely
that either of those is going to happen.  Or consider changing a line of
XML...

I only suggested considering it.  There is also the concept of an
anti-corruption layer in domain-driven design.  There are issues of time
frames, release schedules, and priorities, and I'm not assuming this edge
case is a high priority.  I merely pointed out an issue in the defaults.

I also didn't say not to deal with a bug that hypothetically could be in
a tightly coupled dependency.

Paul

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 02, 2007 11:02 PM
To: solr-dev@lucene.apache.org
Subject: Re: default text type and stop words



In a message dated 11/2/07 6:54:25 PM, [EMAIL PROTECTED]
writes:


 Even if
 the actual problem is at the Lucene level, perhaps it would be worth 
 considering changes to the default to get around it.
 

newbie here. is this common practice? find a bug in a tightly coupled 
dependency and not deal with it there?

regards,
billy




Re: default text type and stop words

2007-11-05 Thread Yonik Seeley
On 11/5/07, Sundling, Paul [EMAIL PROTECTED] wrote:
 I don't know if the problem is in Lucene, I didn't investigate further.

Yes, this is standard Lucene behavior, working as designed.
Stop words are removed from the query as if they never existed at all
(this makes some sense because they were removed during indexing too).
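
A minimal Python sketch of the behavior described above, not Lucene's actual code. The stop list, the `analyze` function, and the toy inverted index are all assumptions made for illustration. The only point is that the same analysis runs at index time and at query time, so a stop word in the query produces no term to look up.

```python
# Toy model of stop word removal; NOT Lucene's implementation.
# STOP_WORDS and all function names here are hypothetical.
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "of"}


def analyze(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def build_index(docs):
    """Map each surviving term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


docs = {1: "An Officer and a Gentleman", 2: "The Matrix"}
index = build_index(docs)

# "an" was dropped at index time, so it is not a term in the index.
print("an" in index)          # False
# The same analysis at query time leaves nothing to search for.
print(analyze("an"))          # []
print(analyze("an officer"))  # ['officer']
```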

-Yonik


Re: default text type and stop words

2007-11-05 Thread Chris Hostetter

: This isn't a problem in Lucene or Solr. It is a result of the analyzers
: you have chosen to use. If you choose to remove stopwords, you will not
: be able to match stopwords.

I believe Paul's point was that this use of stopwords is in the "text"
fieldtype in the example schema.xml ... which many people use as is.

I'm personally of the mindset that it's fine like it is.  While people who
understand that "an" is a stop word might ask "why does 'rating:PG AND
name:an' match 40K movies? it should match 0", there is another (probably
larger) group of people who won't know how the search is implemented, or
that "an" is a stop word, and they will look at the same results and ask
"why am i getting 40K results? most of these don't have 'an' in the title,
i should only be getting X results."

That second group of people aren't going to be any happier if you
give them 0 results instead -- at least this way people get some results
to work with.
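
The two outcomes being compared can be sketched in Python. This is a toy model under stated assumptions, not Solr's query parser: the stop list, the movie data, and the `drop_empty_clause` switch are all hypothetical. With the clause dropped, 'rating:PG AND name:an' behaves like 'rating:PG' alone; with the clause treated as unmatchable, the query returns nothing.

```python
# Toy model of a two-clause AND query; NOT Solr's query parser.
# STOP_WORDS, the movie data, and the function names are hypothetical.
STOP_WORDS = {"a", "an", "the"}

movies = [
    {"rating": "PG", "name": "An American Tail"},
    {"rating": "PG", "name": "Babe"},
    {"rating": "R", "name": "An Officer and a Gentleman"},
]


def analyze(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def search(rating, name_query, drop_empty_clause=True):
    """AND together a rating clause and a name clause.

    If the name clause analyzes to no terms, either drop the clause
    (the behavior discussed in this thread) or return no results.
    """
    terms = analyze(name_query)
    if not terms and not drop_empty_clause:
        return []
    hits = []
    for movie in movies:
        if movie["rating"] != rating:
            continue
        name_terms = analyze(movie["name"])
        if all(t in name_terms for t in terms):
            hits.append(movie["name"])
    return hits


print(search("PG", "an"))                           # ['An American Tail', 'Babe']
print(search("PG", "an", drop_empty_clause=False))  # []
```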


-Hoss



Re: default text type and stop words

2007-11-05 Thread Walter Underwood
I also said that stopword removal is a reasonable default because it works
fairly well for a general text corpus. Ultraseek keeps stopwords but
most engines don't. I think it is fine as a default. I also think you
have to understand stopwords at some point.

wunder

On 11/5/07 9:59 PM, Chris Hostetter [EMAIL PROTECTED] wrote:

 
 : This isn't a problem in Lucene or Solr. It is a result of the analyzers
 : you have chosen to use. If you choose to remove stopwords, you will not
 : be able to match stopwords.
 
 I believe paul's point was that this use of stopwords is in the text
 fieldtype in the example schema.xml ... which many people use as is.
 
 I'm personally of the mindset that it's fine like it is.  While people who
 understand that an is a stop word might ask why does 'rating:PG AND
 name:an' match 40K movies, it should match 0? there is another (probably
 larger) group of people who won't know how the search is implemented, or
 that an is a stop word, and they will look at the same results and ask
 why am i getting 40K results? most of these don't have 'an' in the title?
 i should only be getting X results.
 
 That second group of people aren't going to be any happier if you
 give them 0 results instead -- at least this way people get some results
 to work with.
 
 -Hoss




Re: default text type and stop words

2007-11-02 Thread MartinWmMo

In a message dated 11/2/07 6:54:25 PM, [EMAIL PROTECTED] writes:


 Even if
 the actual problem is at the Lucene level, perhaps it would be worth
 considering changes to the default to get around it.
 

newbie here. is this common practice? find a bug in a tightly coupled 
dependency and not deal with it there?

regards,
billy

