Re: wildcards, stemming and searching

2005-02-10 Thread Erik Hatcher
How would you deal with a query like a*z though?
I suspect, however, that you only care about suffix queries and 
stemming those.  If that's the case, then you could subclass QueryParser, 
override getWildcardQuery and do the stemming internally (remove the 
trailing wildcard, run the rest through the analyzer directly there) and 
return a modified WildcardQuery instance.

With wildcard queries though, this is risky.  Prefixes won't 
necessarily stem to what the full word would stem to.
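
A rough sketch of that idea in plain Java, without Lucene on the classpath. Everything here is an assumption for illustration: `toyStem` is a hypothetical stand-in for the Porter stemmer that only strips a few suffixes, and `rewriteTrailingWildcard` is an invented helper name, not a QueryParser method. It also shows the risk mentioned above: with this toy stemmer the prefix "unite" stays "unite", while the indexed word "united" became "unit", so "unite*" would miss it.

```java
final class TrailingWildcardSketch {

    // Hypothetical stand-in for a real stemmer: strips the first matching suffix.
    static String toyStem(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Stems the text before a single trailing '*'; any other pattern passes through unchanged.
    static String rewriteTrailingWildcard(String pattern) {
        boolean trailingOnly = pattern.endsWith("*")
                && pattern.indexOf('*') == pattern.length() - 1
                && pattern.indexOf('?') < 0;
        if (!trailingOnly) {
            return pattern; // e.g. "a*z": no obvious way to stem this
        }
        String prefix = pattern.substring(0, pattern.length() - 1).toLowerCase();
        return toyStem(prefix) + "*";
    }

    public static void main(String[] args) {
        System.out.println(rewriteTrailingWildcard("united*")); // unit*
        System.out.println(rewriteTrailingWildcard("a*z"));     // a*z
        // Indexed "united" is stored as "unit", which does not start with "unite".
        System.out.println(toyStem("united").startsWith(toyStem("unite"))); // false
    }
}
```

In a real setup the rewritten pattern would be handed to a WildcardQuery on the same field; that wiring is left out here.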

Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: wildcards, stemming and searching

2005-02-10 Thread aaz
> How would you deal with a query like a*z though?

Yeah I know, a user submitting that is certainly possible. I have no idea. I 
am starting to think that NOT stemming on indexing might be the safest 
solution.



wildcards, stemming and searching

2005-02-09 Thread aaz
Hi,
We are not using QueryParser and have some custom Query construction.

We have an index that indexes various documents. Each document is analyzed and 
indexed via

StandardTokenizer() -> StandardFilter() -> LowerCaseFilter() -> StopFilter() -> 
PorterStemFilter()
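
To make the effect of such a chain concrete, here is a minimal plain-Java sketch that imitates it with ordinary functions. It does not use the Lucene classes named above; the stop list and `toyStem` are hypothetical stand-ins chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AnalysisChainSketch {

    // Tiny illustrative stop list, not Lucene's default.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "of", "a", "and"));

    // Hypothetical stand-in for PorterStemFilter: strips one common suffix.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // tokenize -> lowercase -> drop stop words -> stem
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase();
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            out.add(toyStem(token));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The United States"));
        // The '*' is treated as a separator, so the wildcard is silently lost.
        System.out.println(analyze("united*"));
    }
}
```

Note that the tokenizer step already discards the "*", before the stemmer ever sees the word.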

We also want to support wildcard queries, so on an inbound query we need to 
deal with "*" in the value side of the comparison. We also need to analyze 
the value side of the query with the same analyzer that the index was 
built with. This leads to some problems, and we would like your opinion on a solution.

User queries:

somefield = united*

After the analyzer hits "united*", we get back "unit". Hence we cannot detect 
that the user requested a wildcard.

Let's say we come up with some way to escape the * char before the 
Analyzer hits it. For example:

somefield = united*  ->  unitedXXWILDCARDXX

After analysis this becomes "unitedxxwildcardxx", which we can then turn 
into a WildcardQuery for united*

The problem here is that the term "united" will never exist in the index: 
indexing stemmed it to "unit", but our escape mechanism kept the query term 
from being stemmed the same way.

How can I solve this problem?



RE: Stemming

2005-01-24 Thread Kevin L. Cobb
Do stemming algorithms take into consideration abbreviations too? Some
examples:

mg = milligrams
US = United States
CD = compact disc
vcr = video cassette recorder

And, the next logical question, if stemming does not take care of
abbreviations, are there any solutions that include abbreviations inside
or outside of Lucene?

Thanks,

Kevin




Re: Stemming

2005-01-24 Thread Erik Hatcher
On Jan 24, 2005, at 7:24 AM, Kevin L. Cobb wrote:
> Do stemming algorithms take into consideration abbreviations too?

No, they don't.  Adding abbreviations, aliases, synonyms, etc. is not 
stemming.

> And, the next logical question, if stemming does not take care of
> abbreviations, are there any solutions that include abbreviations 
> inside or outside of Lucene?

Nothing built into Lucene does this, but the infrastructure allows it 
to be added in the form of a custom analysis step.  There are two basic 
approaches: adding aliases at indexing time, or adding them at query 
time by expanding the query.  I created some example analyzers in 
Lucene in Action (grab the source code from the site linked below) that 
demonstrate how this can be done using WordNet (and mock) synonym 
lookup.  You could extrapolate this into looking up abbreviations and 
adding them into the token stream.
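
A minimal sketch of the abbreviation lookup, in plain Java rather than as a Lucene TokenFilter. The lookup table and the `expand` helper are hypothetical; this is not the code from the book. In a real filter the added tokens would typically be placed at the same position as the original token, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AbbreviationExpansionSketch {

    // Hypothetical lookup table keyed by lowercase abbreviation.
    static final Map<String, List<String>> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put("mg", Arrays.asList("milligrams"));
        EXPANSIONS.put("cd", Arrays.asList("compact", "disc"));
        EXPANSIONS.put("vcr", Arrays.asList("video", "cassette", "recorder"));
    }

    // Emits every token unchanged, followed by its expansion words if it is a known abbreviation.
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            List<String> extra = EXPANSIONS.get(token.toLowerCase());
            if (extra != null) {
                out.addAll(extra);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("500", "mg")));
    }
}
```

The same function can run on document tokens at indexing time or on query terms at search time, which are the two approaches described above.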

http://www.lucenebook.com/search?query=synonyms
Erik


Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-21 Thread Andrzej Bialecki
Morus Walter wrote:
> Owen Densmore writes:
>
>> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) 
>> apparently produces non-word stems .. i.e. not really human readable.  
>> (Example: generate, generates, generated, generating -> generat) 
>> Although in typical queries this is not important because the result of 
>> the search is a document list, it *would* be important if we use the 
>> stems within a graphical navigation interface.
>> So the question is: Is there a way to have the stemmer produce 
>> English base forms of the words being stemmed?
>
> Rule-based stemmers such as Porter/Snowball cannot do that.
> But there are (commercial) dictionary-based tools that can, e.g. the
> Canoo lemmatizer.
> You might also have a look at Egothor's stemmers, which are word-list based.

Egothor stemmers are algorithmic; they only use word lists for training. 
Stems produced by them are usually closer to lemmas than those of e.g. 
Porter's stemmer, but there is still a significant number of stems like 
the one in the example above.


--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-21 Thread mark harwood
> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) 
> apparently produces non-word stems .. i.e. not really human readable. 

It is possible to derive the human-readable form of a
stemmed term using either re-analysis of indexed
content or TermPositionVector. Either of these
techniques should give you the position data required
to discover the original form. 
The highlighter package is one example of where this
technique is used.
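
A simplified sketch of the general idea, in plain Java. It does not use TermPositionVector or the highlighter; instead it assumes you can record each original word at analysis time and later pick the most frequent original form for a stem. `toyStem` and the class name are hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;

final class ReadableStemSketch {

    // stem -> (original form -> occurrence count)
    private final Map<String, Map<String, Integer>> forms = new HashMap<>();

    // Hypothetical stand-in for a real stemmer.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "es", "ed", "e", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Call once per word seen while (re-)analyzing the content.
    void record(String originalWord) {
        String word = originalWord.toLowerCase();
        forms.computeIfAbsent(toyStem(word), k -> new HashMap<>()).merge(word, 1, Integer::sum);
    }

    // Most frequent original form for a stem; ties go to the alphabetically first form.
    String readableForm(String stem) {
        Map<String, Integer> counts = forms.get(stem);
        if (counts == null) {
            return stem; // never seen: fall back to the raw stem
        }
        String best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int bestCount = best == null ? -1 : counts.get(best);
            if (e.getValue() > bestCount
                    || (e.getValue() == bestCount && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ReadableStemSketch sketch = new ReadableStemSketch();
        for (String w : new String[] {"generates", "generate", "generating", "generate"}) {
            sketch.record(w);
        }
        System.out.println(sketch.readableForm("generat")); // generate
    }
}
```

A navigation interface could then display the readable form while still searching on the stem.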

Cheers
Mark








Stemming

2005-01-21 Thread Kevin L. Cobb
I want to understand how Lucene uses stemming but can't find any
documentation on the Lucene site. I'll continue to google but hope that
this list can help narrow my search. I have several questions on the
subject currently but hesitate to list them here since finding a good
document on the subject may answer most of them.

Thanks in advance for any pointers,

Kevin



Re: Stemming

2005-01-21 Thread Otis Gospodnetic
Hi Kevin,

Stemming is an optional operation and is done in the analysis step. 
Lucene comes with a Porter stemmer and a Filter that you can use in an
Analyzer:

./src/java/org/apache/lucene/analysis/PorterStemFilter.java
./src/java/org/apache/lucene/analysis/PorterStemmer.java

You can find more about it here:
http://www.lucenebook.com/search?query=stemming
You can also see mentions of SnowballAnalyzer in those search results,
and you can find an adapter for SnowballAnalyzers in Lucene Sandbox.

Otis




RE: Stemming

2005-01-21 Thread Kevin L. Cobb
OK, OK ... I'll buy the book. I guess it's about time since I am deeply
and forever in love with Lucene. Might as well take the final plunge.






Re: Stemming

2005-01-21 Thread Chris Lamprecht
Also if you can't wait, see page 2 of
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

or the LIA e-book ;)




Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread Owen Densmore
Hi .. I'm new to the list so forgive a dumb question or two as I get 
started.

We're in the midst of converting a small collection (1200-1500 
currently) of scientific literature to be easily searchable/navigable.  
We'll likely provide both a text query interface as well as a graphical 
way to search and discover.

Our initial approach will be vector based, looking at Latent Semantic 
Indexing (LSI) as a potential tool, although if that's not needed, 
we'll stop at reasonably simple stemming with a weighted document term 
matrix (DTM).  (Bear in mind I couldn't even pronounce most of these 
concepts last week, so go easy if I'm incoherent!)

It looks to me that Lucene has a quite well factored architecture.  I 
should at the very least be able to use the analyzer and stemmer to 
create a good starting point in the project.  I'd also like to leave a 
nice architecture behind in case we or others end up experimenting 
with, or extending, the system.

So a couple of questions:
1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) 
apparently produces non-word stems .. i.e. not really human readable.  
(Example: generate, generates, generated, generating -> generat) 
Although in typical queries this is not important because the result of 
the search is a document list, it *would* be important if we use the 
stems within a graphical navigation interface.
So the question is: Is there a way to have the stemmer produce 
English base forms of the words being stemmed?

2 - We're probably using Lucene in ways it was not designed for, such 
as DTM/LSI and graphical clustering and navigation.  Naturally we'll 
provide code for these parts that are not in Lucene.
But the question arises: is this kinda dumb?!  Has anyone stretched 
Lucene's design center with positive results?  Are we barking up the 
wrong tree?

3 - A nit on hyphenation: Our collection is scientific so has many 
hyphenated words.  I'm wondering about your experiences with 
hyphenation.  In our collection, things like self-organization, 
power-law, space-time, small-world, agent-based, etc. occur often, for 
example.
So the question is: Do folks break up hyphenated words?  If not, do 
you stem the parts and glue them back together?  Do you apply 
stoplists to the parts?

Thanks for any help and pointers you can fling along,
Owen
http://backspaces.net/
http://redfish.com/


Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread jian chen
Hi,

One thing to point out. I think Lucene is not using LSI as the
underlying retrieval model. It uses vector space model and also
proximity based retrieval.

Personally, I don't know much about LSI and I don't think the fancy
stuff like LSI is workable in industry. I believe we are far away from
the era of artificial intelligence and using any elusive way to do
information retrieval.

Cheers,

Jian





RE: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread Chuck Williams
Like any other field, A.I. is only elusive until you master it.  There
are plenty of companies using A.I. techniques in various IR applications
successfully. LSI in particular has been around a long time and is well
understood.

Chuck




Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread Morus Walter
Owen Densmore writes:

> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) 
> apparently produces non-word stems .. i.e. not really human readable.  
> (Example: generate, generates, generated, generating -> generat) 
> Although in typical queries this is not important because the result of 
> the search is a document list, it *would* be important if we use the 
> stems within a graphical navigation interface.
> So the question is: Is there a way to have the stemmer produce 
> English base forms of the words being stemmed?

Rule-based stemmers such as Porter/Snowball cannot do that.
But there are (commercial) dictionary-based tools that can, e.g. the
Canoo lemmatizer.
You might also have a look at Egothor's stemmers, which are word-list based.

HTH
Morus






Query based stemming

2005-01-07 Thread Peter Kim
Hi,

I'm new to Lucene, so I apologize if this issue has been discussed
before (I'm sure it has), but I had a hard time finding an answer using
google. (Maybe this would be a good candidate for the FAQ!) :)

Is it possible to enable stem queries on a per-query basis? It doesn't
seem to be possible since the stem tokenizing is done during the
indexing process. Are people basically stuck with having all their
queries stemmed or none at all?

Thanks!
Peter




Re: Query based stemming

2005-01-07 Thread Jim Lynch
From what I've read, if you want to have a choice, the easiest way is 
to index the documents twice. Once with stemming on and once with it off 
placing the results in two different indexes.  Then at query time, 
select which index you want to use based on whether you want stemming on 
or off.

Jim.


Re: Query based stemming

2005-01-07 Thread Chris Hostetter

: Is it possible to enable stem queries on a per-query basis? It doesn't
: seem to be possible since the stem tokenizing is done during the
: indexing process. Are people basically stuck with having all their
: queries stemmed or none at all?

:  From what I've read, if you want to have a choice, the easiest way is
: to index the documents twice. Once with stemming on and once with it off
: placing the results in two different indexes.  Then at query time,
: select which index you want to use based on whether you want stemming on
: or off.

As I understand it, the intended place to implement stemming is in an
Analyzer Filter (not to be confused with a search Filter).  Since you
can specify an Analyzer when you call addDocument, you don't have to
actually have two separate indexes; you could just have all the docs in
one index and use a search Filter to indicate which docs to look at.

Alternatively: the Analyzer's tokenStream method is given the fieldName
being analyzed, so you could write an Analyzer with a set of rules
telling it to only apply your stemming filter to certain fields, and
then instead of having twice as many documents, you can just index your
text in two separate fields (which should be a little easier than
separate docs because you are only duplicating the fields where stemming
is relevant).  Then at search time you don't have to filter anything, just
search the field that's applicable to your current desire (stemmed or
unstemmed).
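
A minimal sketch of the per-field idea in plain Java, not a real Lucene Analyzer. The field names "body" and "body_stemmed" and the `toyStem` helper are hypothetical; the point is only that the field name decides whether the stemming step runs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class PerFieldStemmingSketch {

    // Hypothetical field name: the stemmed copy of the "body" field.
    static final Set<String> STEMMED_FIELDS = Collections.singleton("body_stemmed");

    // Hypothetical stand-in for a real stemming filter.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Same tokenizing and lowercasing for every field; stemming only for the listed fields.
    static List<String> analyze(String fieldName, String text) {
        boolean stem = STEMMED_FIELDS.contains(fieldName);
        List<String> out = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase();
            out.add(stem ? toyStem(token) : token);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("body", "United Nations"));         // [united, nations]
        System.out.println(analyze("body_stemmed", "United Nations")); // [unit, nation]
    }
}
```

At search time the caller would pick "body" or "body_stemmed" depending on whether stemming is wanted for that query.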

Lastly: although it's tricky to get correct, there's no law saying you
have to use the same Analyzer when you query as when you index.  You could
index your documents using an Analyzer that does no stemming, and then at
search time (if you want stemming) use an Analyzer that does reverse
stemming to expand your query terms out to all the possible variants.
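
One way to picture "reverse stemming" is a lookup from stem to every unstemmed word known to share it. The sketch below is plain Java under stated assumptions: the vocabulary is passed in by hand (in practice it would have to come from the index's term list), and `toyStem` is a hypothetical stand-in for a real stemmer.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class ReverseStemmingSketch {

    // stem -> all vocabulary words that reduce to it
    private final Map<String, SortedSet<String>> variantsByStem = new HashMap<>();

    // Hypothetical stand-in for a real stemmer.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "es", "ed", "e", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    ReverseStemmingSketch(Iterable<String> vocabulary) {
        for (String word : vocabulary) {
            variantsByStem.computeIfAbsent(toyStem(word), k -> new TreeSet<>()).add(word);
        }
    }

    // Returns the query term plus every known word sharing its stem, in sorted order.
    SortedSet<String> expand(String queryTerm) {
        String term = queryTerm.toLowerCase();
        SortedSet<String> variants = new TreeSet<>();
        variants.add(term);
        SortedSet<String> known = variantsByStem.get(toyStem(term));
        if (known != null) {
            variants.addAll(known);
        }
        return variants;
    }

    public static void main(String[] args) {
        ReverseStemmingSketch sketch = new ReverseStemmingSketch(
                Arrays.asList("generate", "generates", "generated", "general"));
        System.out.println(sketch.expand("generating"));
    }
}
```

The expanded set would presumably be turned into an OR over the individual terms; the cost grows with the number of variants per stem.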


(NOTE: I've never actually tried this, but I think the theory is sound).


-Hoss





about Stemming

2004-11-13 Thread Miguel Angel
Hi, I have used the Lucene demos and I want to know how stemming
can be added to my applications.

-- 
Miguel Angel Angeles R.
Asesoria en Conectividad y Servidores
Telf. 97451277




Re: about Stemming

2004-11-13 Thread Bernhard Messer
Miguel Angel wrote:
> Hi, I have used the Lucene demos and I want to know how stemming
> can be added to my applications.

Have a look at the lucene-sandbox. Under contributions there are 
stemmers for many different languages.




Re: Stemming Oddness

2004-11-06 Thread Pete Lewis
Hi Yousef

You are not doing anything wrong - it's just how the Porter stemmer works!

The problem with Porter is that it tries to do everything in a purely algorithmic way 
- which doesn't cater for irregular conjugations etc.

Don't worry too much though: as long as you do the same stemming on the query string 
as you did while indexing, you should be able to find what you are looking for, although 
you can have some issues with trailing wildcards.

If you want a better stemmer, look for something that has a dictionary as well as 
algorithmic rules - a quick one that is readily available is Kstem, which while not 
perfect I think is quite a bit better than Porter.

You can get the source code (Kstem.jar) from the following website:

http://ciir.cs.umass.edu/downloads/

For more info on Kstem see the paper by its designer Bob Krovetz at:

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Cheers

Pete


- Original Message - 
From: Yousef Ourabi [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Saturday, November 06, 2004 1:13 AM
Subject: Stemming Oddness


 Hey,
 Thanks for everyone's reply to my last post. I have
 a question. I imported the PorterStemmer and when I
 did the following
 
 PorterStemmer ps = new PorterStemmer();
 String r1 = ps.stem("elephant");
 r1 is 'eleph'
 
 Also, "buying" stems to "bui". Is this normal? Am I doing
 something wrong?
 
 I am calling reset in between function calls.
 
 Thanks,
 Yousef
 
 
 

Stemming options

2004-04-11 Thread Boris Goldowsky
Has anyone on the list implemented a dictionary-based English stemmer
with Lucene?  Perhaps based on the freely-available ispell dictionaries
or something like that?  The Porter and Snowball stemmers have not
worked that well for our application, but it is a bit daunting to start
from scratch in developing an alternate stemmer.

Alternatively, is there an algorithmic stemmer that anyone has used
which is a little less aggressive than the Porter algorithm?  We've been
having problems with searches for "conversion" returning "converse" and
"conversational", and "animal" returning "animate".  Yes, these are
morphologically related, but in our particular application it would be
better to stick with removing simple inflections.
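
For illustration only, a stemmer that removes nothing but plural endings could be
sketched as below. The rules are loosely modelled on the plural-stripping approach
usually called the "S" stemmer; the class name, the rule order and the length guard
are assumptions of this sketch, it is not part of Lucene, and it does not handle
"-ing" or "-ed" forms.

```java
class LightPluralStemmer {

    // Strips common English plural endings and nothing else.
    // Assumption: only the first matching rule is applied, and very
    // short words are left alone.
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.length() <= 3) {
            return w;
        }
        // "ponies" -> "pony"
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        // "horses" -> "horse"
        if (w.endsWith("es") && !w.endsWith("aes") && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, w.length() - 1);
        }
        // "animals" -> "animal", but "class" and "status" are kept
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        String[] words = {"conversions", "conversion", "converse",
                          "conversational", "animals", "animate"};
        for (String word : words) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

With rules this conservative, "conversion", "converse" and "conversational" remain
three distinct terms, as do "animal" and "animate".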

Thanks for any pointers --

Boris






Problem with tokenizing/stemming in GermanAnalyzer

2003-02-17 Thread Volker Luedeling
Hi,

my application uses a GermanAnalyzer for tokenizing a search string and
constructing Query classes:

    Analyzer an = new org.apache.lucene.analysis.de.GermanAnalyzer();
    TokenStream ts = an.tokenStream(fieldName, new StringReader(fieldText));

I have noticed a strange problem with capitalization. Searching for
"computer" results in the token "compu". Searching for "Computer", however,
results in "comput". The search is supposed to be case-insensitive, so
this must be a bug, right?

Volker






Re: Problem with tokenizing/stemming in GermanAnalyzer

2003-02-17 Thread Christoph Kiehl
 For now you could check out the current lucene version from cvs and
 just comment out the following line:

  uppercase = Character.isUpperCase( term.charAt( 0 ) );

In GermanStemmer.java of course ;))

 Regards
 Christoph








stemming feature

2002-12-10 Thread M Srinivas Rao
Hi all

Does Lucene do stemming of words?  If yes, can anyone
say how to do it in Java using the Lucene API?

Thanks
rgds
srinivas





stemming feature

2002-12-10 Thread M Srinivas Rao
Hi all

Can anyone tell me where I can get process flow diagrams, or something
similar, for Lucene? I want to know how Lucene works.

Thanks
rgds
srinivas





Re: stemming feature

2002-12-10 Thread Otis Gospodnetic
Check out the PorterStemFilter class and the Analyzer class.  Then look at
some Analyzer implementations and see how to implement your own
PorterAnalyzer.

Otis

--- M Srinivas Rao [EMAIL PROTECTED] wrote:
 Hi all
 
 Does Lucene do stemming of words?  If yes, can anyone
 say how to do it in Java using the Lucene API?
 
 Thanks
 rgds
 srinivas
 
 






Stemming

2002-05-02 Thread Joel Bernstein

In our search application the user can turn stemming off and on.

With Lucene, will I have to maintain two sets of indexes to create this functionality,
one stemming and one non-stemming index?

Or

Is there a way to query a stemming index so that it does not return stems?


Thanks,
Joel



Re: Stemming

2002-05-02 Thread Otis Gospodnetic

You could have a single index with both stemmed and non-stemmed terms,
using different field names for each and searching a different set of
fields depending on the type of search.
You'd also have to use 2 types of analyzers/filters, I think.
Roughly :)
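
A rough, library-free Java sketch of that idea follows. It is an in-memory toy, not
Lucene code: the field names "exact" and "stemmed", the toyStem helper and the
DualFieldIndexDemo class are all hypothetical, and a real setup would use one
analyzer per field instead.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DualFieldIndexDemo {

    // field name -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<Integer>>> fields = new HashMap<>();

    // Toy stemmer (assumption): drops a single trailing "s".
    static String toyStem(String word) {
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Index every token twice: as-is and stemmed, under different field names.
    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            post("exact", token, docId);
            post("stemmed", toyStem(token), docId);
        }
    }

    private void post(String field, String term, int docId) {
        fields.computeIfAbsent(field, f -> new HashMap<>())
              .computeIfAbsent(term, t -> new TreeSet<>())
              .add(docId);
    }

    // The stemming switch only decides which field is searched
    // and whether the query word is stemmed first.
    Set<Integer> search(String word, boolean stemming) {
        String field = stemming ? "stemmed" : "exact";
        String term = stemming ? toyStem(word.toLowerCase()) : word.toLowerCase();
        Map<String, Set<Integer>> postings = fields.get(field);
        if (postings == null) {
            return Collections.<Integer>emptySet();
        }
        Set<Integer> hits = postings.get(term);
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    public static void main(String[] args) {
        DualFieldIndexDemo index = new DualFieldIndexDemo();
        index.addDocument(1, "Stemming reduces words");
        index.addDocument(2, "one word");
        System.out.println(index.search("word", true));   // stemming on
        System.out.println(index.search("word", false));  // stemming off
    }
}
```

With stemming on, the query for "word" reaches both documents; with stemming off,
only the document that literally contains "word" is returned. The cost is a larger
index, since every term is stored under both fields.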

Otis


--- Joel Bernstein [EMAIL PROTECTED] wrote:
 In our search application the user can turn stemming off and on.
 
 With Lucene will I have to maintain two sets of indexes to create
 this functionality, one
 stemming and one non-stemming index?
 
 Or
 
 Is there a way to query a stemming index so that it does not return
 stems?
 
 
 Thanks,
 Joel
 

