subject:"Stemming"

Stemming problem question

2005-02-23 Thread Weir, Michael

I'm getting complaints that I assume are related to stemming, e.g.
"Stamping" (the department) being indexed as "stamp" and not found using
'stamp*' in a query.  Somewhere I read someone suggesting that text be
indexed as two fields, one with the stemmer and one without.

Rather than doing this, does it make sense to implement a
'MultiAnalyzer' class that can be associated with several Analyzers and
returns a 'MultiTokenStream' that reads tokens from each Analyzer in
turn, resetting the Reader between each?

If such a thing makes sense (and hasn't already been implemented) I
would be glad to share it.

Thanks,
Michael Weir 
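
A rough sketch of the MultiAnalyzer idea described above, written without the Lucene API so that it stands alone. Everything here is hypothetical: the class name, the `plain` and `stemmed` stand-ins for real analyzers, and the suffix stripping, which is a toy and not the Porter algorithm. It only shows the shape of "read tokens from each analyzer in turn over the same text".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

class MultiAnalyzerSketch {

    // Hypothetical unstemmed analyzer: lower-case, split on non-alphanumerics.
    static List<String> plain(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Hypothetical stemming analyzer: toy "-ing" stripping, NOT Porter.
    static List<String> stemmed(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : plain(text)) {
            boolean strip = t.endsWith("ing") && t.length() > 5;
            tokens.add(strip ? t.substring(0, t.length() - 3) : t);
        }
        return tokens;
    }

    // Run every analyzer over the same text and concatenate the results,
    // which is the role the proposed MultiTokenStream would play.
    static List<String> analyzeAll(String text,
                                   List<Function<String, List<String>>> analyzers) {
        List<String> out = new ArrayList<>();
        for (Function<String, List<String>> analyzer : analyzers) {
            out.addAll(analyzer.apply(text));
        }
        return out;
    }

    // Convenience wrapper: stemmed tokens first, then the unstemmed ones.
    static List<String> analyze(String text) {
        List<Function<String, List<String>>> analyzers = new ArrayList<>();
        analyzers.add(MultiAnalyzerSketch::stemmed);
        analyzers.add(MultiAnalyzerSketch::plain);
        return analyzeAll(text, analyzers);
    }

    public static void main(String[] args) {
        System.out.println(analyze("Stamping Department"));
    }
}
```

One consequence of this design: the second analyzer's tokens are appended after the first's, so token positions no longer follow the original word order. That would presumably matter for phrase queries, which is one argument for the two-field approach instead.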
  
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: wildcards, stemming and searching

2005-02-10 Thread aaz

> How would you deal with a query like "a*z" though?

Yeah, I know, a user submitting that is certainly possible. I have no idea. I
am starting to think that NOT stemming at indexing time might be the safest
solution.


Re: wildcards, stemming and searching

2005-02-10 Thread Erik Hatcher

How would you deal with a query like "a*z" though?
I suspect, however, that you only care about suffix queries and
stemming those.  If that's the case, then you could override
getWildcardQuery in a QueryParser subclass and do internal stemming
(remove the trailing wildcard, run the rest through the analyzer
directly there) and return a modified WildcardQuery instance.

With wildcard queries though, this is risky.  Prefixes won't
necessarily stem to what the full word would stem to.

Erik
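
The suggestion above (strip the trailing wildcard, stem what is left, put the wildcard back) can be sketched as follows. This is a standalone illustration, not Lucene code: `WildcardStemSketch`, `rewrite`, and `toyStem` are hypothetical names, and `toyStem` is a crude suffix stripper standing in for whatever stemmer the index was built with.

```java
import java.util.Locale;

class WildcardStemSketch {

    // Toy stand-in for the index-time stemmer (NOT the Porter algorithm).
    static String toyStem(String term) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (term.endsWith(suffix) && term.length() - suffix.length() >= 3) {
                return term.substring(0, term.length() - suffix.length());
            }
        }
        return term;
    }

    // Rewrites a pattern only when its sole wildcard is one trailing '*'.
    // Anything else (for example "a*z") is returned unchanged.
    static String rewrite(String pattern) {
        int star = pattern.indexOf('*');
        boolean trailingOnly = star > 0
                && star == pattern.length() - 1
                && pattern.indexOf('?') < 0;
        if (!trailingOnly) {
            return pattern;
        }
        String prefix = pattern.substring(0, star).toLowerCase(Locale.ROOT);
        return toyStem(prefix) + "*";
    }

    public static void main(String[] args) {
        System.out.println(rewrite("united*"));  // stemmed prefix plus wildcard
        System.out.println(rewrite("a*z"));      // internal wildcard, untouched
        System.out.println(rewrite("stampi*"));  // prefix is not a full word
    }
}
```

The last call shows the risk mentioned above: with this toy stemmer "stamping" is indexed as "stamp", but the prefix "stampi" stems to itself, so "stampi*" finds nothing.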

wildcards, stemming and searching

2005-02-09 Thread aaz

Hi,
We are not using QueryParser and have some custom Query construction.

We have an index that indexes various documents. Each document is Analyzed and 
indexed via

StandardTokenizer() -> StandardFilter() -> LowerCaseFilter() -> StopFilter() ->
PorterStemFilter()

We also want to support wildcard queries, so on an inbound query we need to
deal with "*" in the value side of the comparison. We also need to "analyze"
the value side of the query with the same analyzer that the index was built
with. This leads to some problems, and we would like your opinion on a
solution.

User queries.

somefield = united*

After the analyzer hits "united*", we get back "unit". Hence we cannot detect 
that the user requested a wildcard.

Let's say we come up with some solution to "escape" the "*" char before the
Analyzer hits it. For example

somefield = united*  -> unitedXXWILDCARDXX

After analysis this then becomes "unitedxxwildcardxx", which we can then turn
into a WildcardQuery "united*"

The problem here is that the term "united" will never exist in the index:
indexing stemmed it, but our escape mechanism kept the query term from being
stemmed the same way.

How can I solve this problem?

Re: Stemming

2005-01-24 Thread Erik Hatcher

On Jan 24, 2005, at 7:24 AM, Kevin L. Cobb wrote:
> Do stemming algorithms take into consideration abbreviations too?

No, they don't.  Adding abbreviations, aliases, synonyms, etc. is not
stemming.

> And, the next logical question, if stemming does not take care of
> abbreviations, are there any solutions that include abbreviations
> inside or outside of Lucene?

Nothing built into Lucene does this, but the infrastructure allows it
to be added in the form of a custom analysis step.  There are two basic
approaches: adding aliases at indexing time, or adding them at query
time by expanding the query.  I created some example analyzers in
Lucene in Action (grab the source code from the site linked below) that
demonstrate how this can be done using WordNet (and mock) synonym
lookup.  You could extrapolate this into looking up abbreviations and
adding them into the token stream.

http://www.lucenebook.com/search?query=synonyms
Erik
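
As a rough illustration of the index-time approach described above, the sketch below injects an alias next to each abbreviation it recognises. It is not the Lucene in Action code and does not use the Lucene API: the `Token` class, the `ALIASES` table, and its single entry are all assumptions made for the example. A position increment of 0 marks the alias as sharing the position of the token before it, which is, as far as I know, the same convention Lucene uses for synonyms. Multi-word expansions such as "United States" need more care and are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AliasInjectionSketch {

    // Minimal token: text plus its distance from the previous token.
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    // Hypothetical abbreviation table; a real one would be loaded from data.
    static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("mg", "milligrams");
    }

    // Emit each input token, followed by its alias (if any) at the same position.
    static List<Token> expand(List<String> tokens) {
        List<Token> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(new Token(t, 1));
            String alias = ALIASES.get(t);
            if (alias != null) {
                out.add(new Token(alias, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        input.add("500");
        input.add("mg");
        System.out.println(expand(input));
    }
}
```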

RE: Stemming

2005-01-24 Thread Kevin L. Cobb

Do stemming algorithms take into consideration abbreviations too? Some
examples:

mg = milligrams
US = United States
CD = compact disc
vcr = video cassette recorder

And, the next logical question, if stemming does not take care of
abbreviations, are there any solutions that include abbreviations inside
or outside of Lucene?

Thanks,

Kevin



Re: Stemming

2005-01-21 Thread Chris Lamprecht

Also if you can't wait, see page 2 of
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

or the LIA e-book ;)

On Fri, 21 Jan 2005 09:27:42 -0500, Kevin L. Cobb
<[EMAIL PROTECTED]> wrote:
> OK, OK ... I'll buy the book. I guess its about time since I am deeply
> and forever in love with Lucene. Might as well take the final plunge.
> 
> 

RE: Stemming

2005-01-21 Thread Kevin L. Cobb

OK, OK ... I'll buy the book. I guess it's about time since I am deeply
and forever in love with Lucene. Might as well take the final plunge.


Re: Stemming

2005-01-21 Thread Otis Gospodnetic

Hi Kevin,

Stemming is an optional operation and is done in the analysis step. 
Lucene comes with a Porter stemmer and a Filter that you can use in an
Analyzer:

./src/java/org/apache/lucene/analysis/PorterStemFilter.java
./src/java/org/apache/lucene/analysis/PorterStemmer.java

You can find more about it here:
http://www.lucenebook.com/search?query=stemming
You can also see mentions of SnowballAnalyzer in those search results,
and you can find an adapter for SnowballAnalyzers in Lucene Sandbox.

Otis

--- "Kevin L. Cobb" <[EMAIL PROTECTED]> wrote:

> I want to understand how Lucene uses stemming but can't find any
> documentation on the Lucene site. I'll continue to google but hope that
> this list can help narrow my search. I have several questions on the
> subject currently but hesitate to list them here since finding a good
> document on the subject may answer most of them.
>
> Thanks in advance for any pointers,
>
> Kevin


Stemming

2005-01-21 Thread Kevin L. Cobb

I want to understand how Lucene uses stemming but can't find any
documentation on the Lucene site. I'll continue to google but hope that
this list can help narrow my search. I have several questions on the
subject currently but hesitate to list them here since finding a good
document on the subject may answer most of them. 

 

Thanks in advance for any pointers,

 

Kevin

Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-21 Thread mark harwood

>>1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
>>apparently produces non-word stems .. i.e. not really human readable.

It is possible to derive the human-readable form of a
stemmed term using either re-analysis of indexed
content or TermPositionVector. Either of these
techniques should give you the position data required
to discover the original form. 
The highlighter package is one example of where this
technique is used.

Cheers
Mark
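
The re-analysis route mentioned above could look something like this sketch: run the text through the analysis step again and remember which original word forms produced each stem, then show the most frequent one. It is a simplification that ignores positions and does not use TermPositionVector or the highlighter. `ReadableStemSketch`, `readableForms`, and `toyStem` are hypothetical names, and `toyStem` is a toy suffix stripper rather than Porter.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class ReadableStemSketch {

    // Toy stand-in for the index-time stemmer (NOT the Porter algorithm).
    static String toyStem(String term) {
        for (String suffix : new String[] {"ing", "es", "ed", "e", "s"}) {
            if (term.endsWith(suffix) && term.length() - suffix.length() >= 3) {
                return term.substring(0, term.length() - suffix.length());
            }
        }
        return term;
    }

    // Map each stem to the most frequent original word that produced it.
    // Ties go to the alphabetically first word (TreeMap iteration order).
    static Map<String, String> readableForms(String text) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (word.isEmpty()) {
                continue;
            }
            counts.computeIfAbsent(toyStem(word), k -> new TreeMap<>())
                  .merge(word, 1, Integer::sum);
        }
        Map<String, String> best = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            String top = null;
            int topCount = 0;
            for (Map.Entry<String, Integer> form : e.getValue().entrySet()) {
                if (form.getValue() > topCount) {
                    top = form.getKey();
                    topCount = form.getValue();
                }
            }
            best.put(e.getKey(), top);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(
            readableForms("generates generating generates generated"));
    }
}
```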






Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-21 Thread Andrzej Bialecki

Morus Walter wrote:
> Owen Densmore writes:
>
>> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
>> apparently produces non-word stems .. i.e. not really human readable.
>> (Example: generate, generates, generated, generating  -> generat)
>> Although in typical queries this is not important because the result of
>> the search is a document list, it *would* be important if we use the
>> stems within a graphical navigation interface.
>> So the question is: Is there a way to have the stemmer produce english
>> base forms of the words being stemmed?
>
> rule based stemmers such as porter/snowball cannot do that.
> But there are (commercial) dictionary based tools that can. E.g. the
> canoo lemmatizer.
> You might also have a look at egothors stemmer, that are word list based.

Egothor stemmers are algorithmic; they only use word lists for training.
Stems produced by them are usually closer to lemmas than with e.g.
Porter's stemmer, but a significant number of stems still look like the
one in the example above.


--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread Morus Walter

Owen Densmore writes:

> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) 
> apparently produces non-word stems .. i.e. not really human readable.  
> (Example: generate, generates, generated, generating  -> generat) 
> Although in typical queries this is not important because the result of 
> the search is a document list, it *would* be important if we use the 
> stems within a graphical navigation interface.
> So the question is: Is there a way to have the stemmer produce english
> base forms of the words being stemmed?
>
Rule-based stemmers such as Porter/Snowball cannot do that.
But there are (commercial) dictionary-based tools that can, e.g. the
Canoo lemmatizer.
You might also have a look at Egothor's stemmers, which are word-list based.

HTH
Morus




RE: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread Chuck Williams

Like any other field, A.I. is only elusive until you master it.  There
are plenty of companies using A.I. techniques in various IR applications
successfully. LSI in particular has been around a long time and is well
understood.

Chuck

  > -Original Message-
  > From: jian chen [mailto:[EMAIL PROTECTED]
  > Sent: Thursday, January 20, 2005 2:10 PM
  > To: Lucene Users List
  > Subject: Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
  >
  > Hi,
  >
  > One thing to point out. I think Lucene is not using LSI as the
  > underlying retrieval model. It uses vector space model and also
  > proximity based retrieval.
  >
  > Personally, I don't know much about LSI and I don't think the fancy
  > stuff like LSI is workable in industry. I believe we are far away from
  > the era of artificial intelligence and using any elusive way to do
  > information retrieval.
  >
  > Cheers,
  >
  > Jian
  > 
  > 

Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread jian chen

Hi,

One thing to point out. I think Lucene is not using LSI as the
underlying retrieval model. It uses vector space model and also
proximity based retrieval.

Personally, I don't know much about LSI and I don't think the fancy
stuff like LSI is workable in industry. I believe we are far away from
the era of artificial intelligence and using any elusive way to do
information retrieval.

Cheers,

Jian



Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread Owen Densmore

Hi .. I'm new to the list so forgive a dumb question or two as I get 
started.

We're in the midst of converting a small collection (1200-1500 
currently) of scientific literature to be easily searchable/navigable.  
We'll likely provide both a text query interface as well as a graphical 
way to search and discover.

Our initial approach will be vector based, looking at Latent Semantic 
Indexing (LSI) as a potential tool, although if that's not needed, 
we'll stop at reasonably simple stemming with a weighted document term 
matrix (DTM).  (Bear in mind I couldn't even pronounce most of these 
concepts last week, so go easy if I'm incoherent!)

It looks to me that Lucene has a quite well factored architecture.  I 
should at the very least be able to use the analyzer and stemmer to 
create a good starting point in the project.  I'd also like to leave a 
nice architecture behind in case we or others end up experimenting 
with, or extending, the system.

So a couple of questions:

1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
apparently produces non-word stems .. i.e. not really human readable.
(Example: generate, generates, generated, generating  -> generat)
Although in typical queries this is not important because the result of
the search is a document list, it *would* be important if we use the
stems within a graphical navigation interface.
So the question is: Is there a way to have the stemmer produce English
base forms of the words being stemmed?

2 - We're probably using Lucene in ways it was not designed for, such
as DTM/LSI and graphical clustering and navigation.  Naturally we'll
provide code for these parts that are not in Lucene.
But the question arises: is this kinda dumb?!  Has anyone stretched
Lucene's design center with positive results?  Are we barking up the
wrong tree?

3 - A nit on hyphenation: Our collection is scientific so has many
hyphenated words.  I'm wondering about your experiences with
hyphenation.  In our collection, things like self-organization,
power-law, space-time, small-world, agent-based, etc. occur often, for
example.
So the question is: Do folks break up hyphenated words?  If not, do
you stem the parts and glue them back together?  Do you apply stoplists
to the parts?

Thanks for any help and pointers you can fling along,
Owen    http://backspaces.net/    http://redfish.com/

Re: Query based stemming

2005-01-07 Thread David Spencer

Jim Lynch wrote:
> From what I've read, if you want to have a choice, the easiest way is
> to index the documents twice. Once with stemming on and once with it off
> placing the results in two different indexes.  Then at query time,
> select which index you want to use based on whether you want stemming on
> or off.

IMHO keeping the data in the same index is easiest.
PerFieldAnalyzerWrapper is part of the magic; approximate usage follows in
my code below. The second bit of magic is to call doc.add(...) multiple
times, "redundantly".

Don't use the code below exactly, however: things like MySnowballAnalyzer
should become SnowballAnalyzer in your code...

Analyzer fa;
Analyzer getAnalyzer()
{
	Analyzer snowball = new MySnowballStopAnalyzer();
	// probably StandardAnalyzer for most people
	Analyzer def = new AlphaNumStopAnalyzer();
	PerFieldAnalyzerWrapper fa = new PerFieldAnalyzerWrapper( def);
	// the "s" in "scontents" is for stemming
	fa.addAnalyzer( "scontents", snowball);
	fa.addAnalyzer( "stitle", snowball);
	return fa;
}

...
later:
Document doc = new Document();
doc.add( Field.Text( "title", title));
// don't need recall
doc.add( Field.Text( "stitle", new StringReader( title)));
String body = ...;
// term vector
doc.add( Field.Text( "contents", new StringReader( body), true));
doc.add( Field.Text( "scontents", new StringReader( body)));
writer.addDocument( doc);
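
To make the effect of the "redundant" fields concrete, here is a tiny in-memory sketch that does not use Lucene at all. It indexes each body twice, once as-is under "contents" and once stemmed under "scontents", and picks the field per query. The class, the `search` signature, and `toyStem` (a crude suffix stripper, not Snowball) are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TwoFieldIndexSketch {

    // field name -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    // Toy stand-in for a real stemmer (NOT Snowball or Porter).
    static String toyStem(String term) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (term.endsWith(suffix) && term.length() - suffix.length() >= 3) {
                return term.substring(0, term.length() - suffix.length());
            }
        }
        return term;
    }

    // Index the same text under an unstemmed and a stemmed field.
    void add(int docId, String body) {
        for (String t : body.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (t.isEmpty()) {
                continue;
            }
            post("contents", t, docId);
            post("scontents", toyStem(t), docId);
        }
    }

    private void post(String field, String term, int docId) {
        index.computeIfAbsent(field, f -> new HashMap<>())
             .computeIfAbsent(term, t -> new TreeSet<>())
             .add(docId);
    }

    // Stemming is chosen per query simply by choosing which field to search.
    Set<Integer> search(String word, boolean stemming) {
        String w = word.toLowerCase(Locale.ROOT);
        String field = stemming ? "scontents" : "contents";
        String term = stemming ? toyStem(w) : w;
        return index.getOrDefault(field, Collections.emptyMap())
                    .getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        TwoFieldIndexSketch idx = new TwoFieldIndexSketch();
        idx.add(1, "Stamping department");
        idx.add(2, "Rubber stamps");
        System.out.println(idx.search("stamping", false));
        System.out.println(idx.search("stamping", true));
    }
}
```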



Re: Query based stemming

2005-01-07 Thread Chris Hostetter


: >Is it possible to enable stem queries on a per-query basis? It doesn't
: >seem to be possible since the stem tokenizing is done during the
: >indexing process. Are people basically stuck with having all their
: >queries stemmed or none at all?

:  From what I've read, if you want to have a choice, the easiest way is
: to index the documents twice. Once with stemming on and once with it off
: placing the results in two different indexes.  Then at query time,
: select which index you want to use based on whether you want stemming on
: or off.

As I understand it, the intented place to impliment Stemming is in an
Analyzer Filter (not to be confused with a search Filter).  Since you can
can specify an Analyzer when you call addDocument, you don't have to
acctually have two seperate indexes, you could just have all the docs in
one index - and use a search Filter to indicate which docs to look at.

Alternately: the Analyzer's tokenStream method is given the fieldName
being analyzed, so you could write an Analyzer with a set of rules
telling it to only apply your Stemming filter to certain fields, and
then instead of having twice as many documents, you can just index your
text in two seperate fields (which should be a little easier, then
seperate docs because you are only duplicating the fields where stemming
is relevant)  Then at search time you don't have to filter anything, just
search the field that's applicable to your current desire (stemmed or
unstemmed)

Lastly: although it's tricky to get correct, there's no law saying you
have to use the same Analyzer when you query as when you index.  You could
index your documents using an Analyzer that does no stemming, and then at
search time (if you want stemming) use an Analyzer that does "reverse
stemming" to expand your query terms out to all the possible variants.


(NOTE: I've never actually tried this, but I think the theory is sound.)
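
One possible reading of "reverse stemming", sketched in plain Java (not the
Lucene API): group the unstemmed index vocabulary by stem once, then expand
each query term to every indexed word sharing its stem. The class
QueryExpansionSketch, its methods, and toyStem are hypothetical names, and
toyStem is only a placeholder for a real stemmer.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustration only: expands a query term to all indexed words with the same stem.
class QueryExpansionSketch {
    // Placeholder stemmer: strips one common suffix if enough of the word remains.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.endsWith(suffix) && token.length() > suffix.length() + 2) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Group the (unstemmed) index vocabulary by stem.
    static Map<String, SortedSet<String>> variantsByStem(Collection<String> vocabulary) {
        Map<String, SortedSet<String>> byStem = new HashMap<>();
        for (String term : vocabulary) {
            byStem.computeIfAbsent(toyStem(term), k -> new TreeSet<>()).add(term);
        }
        return byStem;
    }

    // The query term itself plus every indexed word that shares its stem.
    static SortedSet<String> expand(String queryTerm, Map<String, SortedSet<String>> byStem) {
        SortedSet<String> result = new TreeSet<>();
        result.add(queryTerm);
        SortedSet<String> variants = byStem.get(toyStem(queryTerm));
        if (variants != null) {
            result.addAll(variants);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<String>> byStem = variantsByStem(
                Arrays.asList("stamp", "stamps", "stamping", "stamped", "stampede"));
        // The expanded terms would be OR-ed together against the unstemmed index.
        System.out.println(expand("stamping", byStem)); // [stamp, stamped, stamping, stamps]
    }
}
```

The trade-off is that the index stays exact while stemmed searches pay for a
larger query.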


-Hoss



Re: Query based stemming

2005-01-07 Thread Jim Lynch

From what I've read, if you want to have a choice, the easiest way is 
to index the documents twice. Once with stemming on and once with it off 
placing the results in two different indexes.  Then at query time, 
select which index you want to use based on whether you want stemming on 
or off.

Jim.
Peter Kim wrote:
Hi,
I'm new to Lucene, so I apologize if this issue has been discussed
before (I'm sure it has), but I had a hard time finding an answer using
google. (Maybe this would be a good candidate for the FAQ!) :)
Is it possible to enable stem queries on a per-query basis? It doesn't
seem to be possible since the stem tokenizing is done during the
indexing process. Are people basically stuck with having all their
queries stemmed or none at all?
Thanks!
Peter

Query based stemming

2005-01-07 Thread Peter Kim

Hi,

I'm new to Lucene, so I apologize if this issue has been discussed
before (I'm sure it has), but I had a hard time finding an answer using
google. (Maybe this would be a good candidate for the FAQ!) :)

Is it possible to enable stem queries on a per-query basis? It doesn't
seem to be possible since the stem tokenizing is done during the
indexing process. Are people basically stuck with having all their
queries stemmed or none at all?

Thanks!
Peter


Re: about Stemming

2004-11-13 Thread Bernhard Messer

Miguel Angel wrote:
> Hi, I have used the Lucene demos and I want to know how
> stemming can be added to my applications.

Have a look at the lucene-sandbox. Under contributions there are 
stemmers for many different languages.



about Stemming

2004-11-13 Thread Miguel Angel

Hi, I have used the Lucene demos and I want to know how
stemming can be added to my applications.

-- 
Miguel Angel Angeles R.
Asesoria en Conectividad y Servidores
Telf. 97451277


Re: Stemming Oddness

2004-11-06 Thread Pete Lewis

Hi Yousef

You are not doing anything wrong - it's just how the Porter stemmer works!

The problem with Porter is that it tries to do everything in a purely algorithmic way, 
which doesn't cater for irregular conjugations etc.

Don't worry too much though: as long as you do the same stemming on the query string 
as you did while indexing, you should be able to find what you are looking for, but 
you can have some issues with trailing wildcards.
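
A small plain-Java sketch of why trailing wildcards can misbehave on a stemmed
index, assuming the wildcard prefix is matched against the indexed terms
without being stemmed first. The terms "eleph" and "bui" are the Porter
outputs quoted in this thread; the class and method names are hypothetical
and this is not Lucene code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: prefix matching against terms as stored in a stemmed index.
class WildcardVsStemming {
    // "elephant" and "buying" after Porter stemming, as reported in this thread.
    static final List<String> INDEXED_TERMS = Arrays.asList("eleph", "bui");

    // Return every indexed term that starts with the raw (unstemmed) prefix.
    static List<String> prefixMatches(String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : INDEXED_TERMS) {
            if (term.startsWith(prefix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(prefixMatches("ele"));    // [eleph]
        System.out.println(prefixMatches("elepha")); // [] : prefix is longer than the stem
        System.out.println(prefixMatches("buy"));    // [] : the stem changed "y" to "i"
    }
}
```

A prefix that is longer than the stem, or that spells a letter the stemmer
rewrote, finds nothing even though the original word was indexed.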

If you want a better stemmer, look for something that has a dictionary as well as 
algorithmic rules. A quick one that is readily available is Kstem which, while not 
perfect, I think is quite a bit better than Porter.

You can get the source code (Kstem.jar) from the following website:

http://ciir.cs.umass.edu/downloads/

For more info on Kstem see the paper by its designer Bob Krovetz at:

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Cheers

Pete


- Original Message - 
From: "Yousef Ourabi" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Saturday, November 06, 2004 1:13 AM
Subject: Stemming Oddness


> Hey,
> Thanks for everyone's replies to my last post. I have
> a question. I imported the PorterStemmer and when I
> did the following
> 
> PorterStemmer ps = new PorterStemmer();
> String r1 = ps.stem("elephant");
> r1 is 'eleph'
> 
> Also, "buying" stems to "bui". Is this normal? Am I doing
> something wrong?
> 
> I am calling reset in between function calls.
> 
> Thanks,
> Yousef
> 

Stemming Oddness

2004-11-05 Thread Yousef Ourabi

Hey,
Thanks for everyone's replies to my last post. I have
a question. I imported the PorterStemmer and when I
did the following

PorterStemmer ps = new PorterStemmer();
String r1 = ps.stem("elephant");
r1 is 'eleph'

Also, "buying" stems to "bui". Is this normal? Am I doing
something wrong?

I am calling reset in between function calls.

Thanks,
Yousef


Stemming options

2004-04-11 Thread Boris Goldowsky

Has anyone on the list implemented a dictionary-based English stemmer
with Lucene?  Perhaps based on the freely-available ispell dictionaries
or something like that?  The Porter and Snowball stemmers have not
worked that well for our application, but it is a bit daunting to start
from scratch in developing an alternate stemmer.

Alternatively, is there an algorithmic stemmer that anyone has used
which is a little less aggressive than the Porter algorithm?  We've been
having problems with searches for "conversion" returning "converse" and
"conversational"; and "animal" returning "animate".  Yes, these are
morphologically related, but in our particular application it would be
better to stick with removing simple inflections.
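
As an illustration of what "removing simple inflections" could look like, here
is a plural-only stemmer in plain Java. It loosely follows the three rules
usually given for Harman's S-stemmer, as best I recall them; the class name
ConservativeStemmer and the minimum-length guard are my own assumptions, and
this is not a Lucene class.

```java
// Illustration only: a plural-stripping stemmer that leaves derivations alone.
class ConservativeStemmer {
    static String stem(String word) {
        // Assumed guard: leave very short words untouched.
        if (word.length() <= 3) {
            return word;
        }
        // Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
        if (word.endsWith("ies") && !word.endsWith("eies") && !word.endsWith("aies")) {
            return word.substring(0, word.length() - 3) + "y";
        }
        // Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
        if (word.endsWith("es") && !word.endsWith("aes")
                && !word.endsWith("ees") && !word.endsWith("oes")) {
            return word.substring(0, word.length() - 2) + "e";
        }
        // Rule 3: drop a final "s", unless the word ends in "us" or "ss".
        if (word.endsWith("s") && !word.endsWith("us") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("conversions")); // conversion
        System.out.println(stem("converse"));    // converse
        System.out.println(stem("animals"));     // animal
        System.out.println(stem("animate"));     // animate
    }
}
```

With rules this narrow, "conversion" and "converse" stay distinct, as do
"animal" and "animate".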

Thanks for any pointers --

Boris




Re: Problem with tokenizing/stemming in GermanAnalyzer

2003-02-17 Thread Christoph Kiehl

Hi Gerhard,

> I promise I will check the stemmer in the next few days... hm... not before this
> weekend; I have a martial arts challenge on Sunday. Mentally I'm not
> prepared to _fix_ anything. :)

Cool, I just started reading about stemmers etc. I'm very interested in your
solution. And good luck for your challenge ;)

Greetings
Christoph





Re: Problem with tokenizing/stemming in GermanAnalyzer

2003-02-17 Thread Gerhard Schwarz

Christoph Kiehl wrote:

> Hi Volker,
>
>> I have noticed a strange problem with capitalization. Search for
>> "computer" results in the token "compu". Search for "Computer",
>> however, results in "comput". The search is supposed to be
>> case-insensitive, so this must be a bug, right?
>
> This problem was already mentioned on the developer list. The analyzer tries
> to do some noun recognition. But it does a bad job ;)

The analyzer should not do any case recognition. After reading through 
the mailing list from the last weeks/months (I have been busy recently), I 
found out that a super simple unique-discrimination algorithm is what 
most users need. The original algorithm has more possible ways to 
extend it.

> For now you could check out the current lucene version from cvs and just
> comment out the following line:
>
>  uppercase = Character.isUpperCase( term.charAt( 0 ) );
>
> Then just run ant to build the jar. This fixes the problem you described.


I promise I will check the stemmer in the next few days... hm... not before this 
weekend; I have a martial arts challenge on Sunday. Mentally I'm not 
prepared to _fix_ anything. :)

There is another problem with the umlaut conversion that should also be 
checked.

Greets,
Gerhard



Re: Problem with tokenizing/stemming in GermanAnalyzer

2003-02-17 Thread Christoph Kiehl

> For now you could check out the current lucene version from cvs and
> just comment out the following line:
>
>  uppercase = Character.isUpperCase( term.charAt( 0 ) );

In GermanStemmer.java of course ;))

> Regards
> Christoph





Re: Problem with tokenizing/stemming in GermanAnalyzer

2003-02-17 Thread Christoph Kiehl

Hi Volker,

> I have noticed a strange problem with capitalization. Search for
> "computer" results in the token "compu". Search for "Computer",
> however, results in "comput". The search is supposed to be
> case-insensitive, so this must be a bug, right?

This problem was already mentioned on the developer list. The analyzer tries
to do some noun recognition. But it does a bad job ;)

For now you could check out the current lucene version from cvs and just
comment out the following line:

 uppercase = Character.isUpperCase( term.charAt( 0 ) );

Then just run ant to build the jar. This fixes the problem you described.

Regards
Christoph





Problem with tokenizing/stemming in GermanAnalyzer

2003-02-17 Thread Volker Luedeling

Hi,

my application uses a GermanAnalyzer for tokenizing a search string and
constructing Query classes:

Analyzer an = new org.apache.lucene.analysis.de.GermanAnalyzer();
TokenStream ts = an.tokenStream(fieldName, new StringReader(fieldText));

I have noticed a strange problem with capitalization. Search for
"computer" results in the token "compu". Search for "Computer", however,
results in "comput". The search is supposed to be case-insensitive, so
this must be a bug, right?

Volker



Re: stemming feature

2002-12-10 Thread Otis Gospodnetic

Check out PorterStemFilter class and Analyzer class.  Then look at some
Analyzer implementations and see how to implement your own
PorterAnalyzer.
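
To show the shape of such an analyzer without depending on a particular Lucene
version, here is a plain-Java sketch of the chain: tokenize, lowercase, then
stem. It is not Lucene code. ToyAnalyzerChain and its methods are hypothetical
names, and toyStem is a placeholder for the stemming step that
PorterStemFilter would perform in a real Analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: an analyzer as a tokenizer followed by a chain of filters.
class ToyAnalyzerChain {
    // Tokenizer stage: split the text on anything that is not an ASCII letter.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : text.split("[^A-Za-z]+")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    // Filter stage 1: lowercase every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Placeholder stemmer: strips one common suffix if enough of the word remains.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.endsWith(suffix) && token.length() > suffix.length() + 2) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Filter stage 2: stem every token.
    static List<String> stem(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(toyStem(token));
        }
        return out;
    }

    // The "analyzer": each stage wraps the output of the previous one.
    static List<String> analyze(String text) {
        return stem(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Indexing indexed Documents")); // [index, index, document]
    }
}
```

Using the same chain at index time and at query time keeps the stemmed forms
consistent on both sides.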

Otis

--- M Srinivas Rao <[EMAIL PROTECTED]> wrote:
> Hi all
> 
> Does Lucene do stemming of words?  If yes, can anyone
> say how to do it in Java using the Lucene API?
> 
> Thanks
> rgds
> srinivas
> 


stemming feature

2002-12-10 Thread M Srinivas Rao

Hi all

Can anyone tell me where I can get process-flow diagrams, or something
similar, for Lucene? I want to know how Lucene works.

Thanks
rgds
srinivas


stemming feature

2002-12-10 Thread M Srinivas Rao

Hi all

Does Lucene do stemming of words?  If yes, can anyone
say how to do it in Java using the Lucene API?

Thanks
rgds
srinivas


Re: Stemming

2002-05-02 Thread Otis Gospodnetic

You could have a single index with both stemmed and non-stemmed terms,
using different field names for each and searching a different set of
fields depending on the type of search.
You'd also have to use 2 types of analyzers/filters, I think.
Roughly :)
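
A rough plain-Java sketch of that single-index layout, with a toy inverted
index standing in for Lucene. The field names "exact" and "stemmed", the class
TwoFieldIndex, and toyStem (a placeholder for a real stemmer) are all
assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: one index, two fields, stemming chosen per search.
class TwoFieldIndex {
    // field name -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // Placeholder stemmer: strips one common suffix if enough of the word remains.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.endsWith(suffix) && token.length() > suffix.length() + 2) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Index every token twice: as-is in "exact", stemmed in "stemmed".
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            post("exact", token, docId);
            post("stemmed", toyStem(token), docId);
        }
    }

    private void post(String field, String term, int docId) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new TreeSet<>())
                .add(docId);
    }

    // Pick the field, and the matching query analysis, from the user's stemming switch.
    Set<Integer> search(String word, boolean stemming) {
        String lower = word.toLowerCase(Locale.ROOT);
        Map<String, Set<Integer>> field = postings.get(stemming ? "stemmed" : "exact");
        if (field == null) {
            return Collections.<Integer>emptySet();
        }
        Set<Integer> hits = field.get(stemming ? toyStem(lower) : lower);
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    public static void main(String[] args) {
        TwoFieldIndex index = new TwoFieldIndex();
        index.add(1, "Stamping department");
        index.add(2, "stamp collection");
        System.out.println(index.search("stamp", false)); // [2]
        System.out.println(index.search("stamp", true));  // [1, 2]
    }
}
```

The cost is extra index size for the duplicated field; the documents
themselves are stored only once.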

Otis

--- Joel Bernstein <[EMAIL PROTECTED]> wrote:
> In our search application the user can turn stemming off and on.
> 
> With Lucene will I have to maintain two sets of indexes to create
> this functionality, one
> stemming and one non-stemming index?
> 
> Or
> 
> Is there a way to query a stemming index so that it does not return
> stems?
> 
> 
> Thanks,
> Joel
> 


Stemming

2002-05-02 Thread Joel Bernstein


In our search application the user can turn stemming off and on.

With Lucene, will I have to maintain two sets of indexes to create this
functionality, one stemming and one non-stemming index?

Or

Is there a way to query a stemming index so that it does not return stems?


Thanks,
Joel

Stemming problem question

Re: wildcards, stemming and searching

Re: wildcards, stemming and searching

wildcards, stemming and searching

Re: Stemming

RE: Stemming

Re: Stemming

RE: Stemming

Re: Stemming

Stemming

Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

RE: Newbie: Human Readable Stemming, Lucene Architecture, etc!

Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

Newbie: Human Readable Stemming, Lucene Architecture, etc!

Re: Query based stemming

Re: Query based stemming

Re: Query based stemming

Query based stemming

Re: about Stemming

about Stemming

Re: Stemming Oddness

Stemming Oddness

Stemming options

Re: Problem with tokenizing/stemming in GermanAnalyzer

Re: Problem with tokenizing/stemming in GermanAnalyzer

Re: Problem with tokenizing/stemming in GermanAnalyzer

Re: Problem with tokenizing/stemming in GermanAnalyzer

Problem with tokenizing/stemming in GermanAnalyzer

Re: stemming feature

stemming feature

stemming feature

Re: Stemming

Stemming

35 matches
