Re: LucidWorks Solr

2010-04-22 Thread Robert Muir
On Wed, Apr 21, 2010 at 1:38 PM, Shashi Kant  wrote:

> Why do these approaches have to be mutually exclusive?
> Do a dictionary lookup, if no satisfactory match found use an
> algorithmic stemmer. Would probably save a few CPU cycles by
> algorithmic stemming iff necessary.
>
>
By the way, if you want to do this, you can do it easily in Solr trunk. Just
put a StemmerOverrideFilterFactory in front of your stemmer, pointing at a file
of tab-separated dictionary-word/stem mappings. The test-files directory has
an example of this (stemdict.txt):

# test that we can override the stemming algorithm with our own mappings
# these must be tab-separated
monkeys	monkey
otters	otter
# some crazy ones that a stemmer would never do
dogs	cat

You can use this factory, or the new KeywordMarkerFilterFactory, which is
similar but simply takes a text file like protwords.txt listing words for the
stemmer to ignore.
Both of these filters set a special attribute on the token in the
token stream that all stemmers respect, so they won't do any stemming on
that token.
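
To make the mechanism concrete outside of Solr, here is a minimal Python sketch of the same idea. It is not the Lucene code: the Token class, its keyword flag (standing in for the special attribute), and naive_stem (a toy suffix stripper) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    keyword: bool = False  # stand-in for the "do not stem me" attribute


def parse_stemdict(lines):
    """Parse tab-separated word/stem pairs, skipping comments and blank lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        word, stem = line.split("\t")
        mapping[word] = stem
    return mapping


def override_filter(tokens, mapping):
    """Replace dictionary words with their mapped stem and mark them as keywords."""
    for tok in tokens:
        if tok.text in mapping:
            yield Token(mapping[tok.text], keyword=True)
        else:
            yield tok


def keyword_marker_filter(tokens, protected):
    """Mark protected words as keywords without changing their text."""
    for tok in tokens:
        yield Token(tok.text, keyword=True) if tok.text in protected else tok


def naive_stem(word):
    """Toy suffix stripper (an assumption, not Porter or Snowball)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_filter(tokens):
    """Stem every token except those flagged as keywords."""
    for tok in tokens:
        yield tok if tok.keyword else Token(naive_stem(tok.text))


def analyze(words, mapping, protected=frozenset()):
    tokens = (Token(w) for w in words)
    tokens = override_filter(tokens, mapping)
    tokens = keyword_marker_filter(tokens, protected)
    return [t.text for t in stem_filter(tokens)]


STEMDICT = ["# these must be tab-separated", "monkeys\tmonkey", "otters\totter", "dogs\tcat"]
```

With this sketch, analyze(["dogs", "walking"], parse_stemdict(STEMDICT)) maps "dogs" through the dictionary and leaves the result alone, while "walking" falls through to the toy stemmer unless it is listed in protected.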

-- 
Robert Muir
rcm...@gmail.com


Re: LucidWorks Solr

2010-04-21 Thread MitchK

I like this discussion very much.

It is a really complex topic.

I want to add another example.
In English, you would say "it is a red dress".
In German, that would be "es ist ein rotes Kleid" (the words can be translated in
the same order).
However, the base form of "rotes" is "rot".

If your users are searching for "rotes Kleid", they may also expect matching
documents like "Kleid rot" or something like that.

To draw a conclusion on the discussion until now, I want to quote Mark:

Mark Miller-3 wrote:
> 
> are you going to 
> want documents that had run and water when you searched for running 
> water?

In most cases, everyone disagrees.

Let's take an abstract view of this topic: When should an application stem a
word, and when not?
In my opinion, it always makes sense to stem adjectives - and in most
languages one can decide whether a word is an adjective or not, even if it
is not known by any dictionary. But what about verbs?
Is a stemmed verb worth less than a stemmed adjective? In the case of titles -
which are short - I think yes. In the case of longer texts like
articles and descriptions - I am not sure.

The same applies to lemmatization. Yes, reducing the word will work fine in
a lot of cases where the word is known. However, does it really make sense
all the time?

I want to emphasize that this discussion only makes sense if we want to
talk about search applications which are tolerant and made for highly
relevant search results without exact matching.

Kind regards
- Mitch
-- 
View this message in context: 
http://n3.nabble.com/LucidWorks-Solr-tp727341p741090.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: LucidWorks Solr

2010-04-21 Thread Robert Muir
On Wed, Apr 21, 2010 at 3:29 PM, Mark Miller  wrote:

>
> Stemming/lemmatization will pretty much always improve recall at the cost of
> precision - that's nothing new. If you stem instead, are you going to want
> documents that had run and water when you searched for running water? I just
> don't see this point as an argument against lemmatization and in favor of
> stemming.
>
>
It's not really supposed to be an argument in favor of stemming. I just don't
think lemmatization/dictionary resources are any better.

Here's a test that seems to agree:
http://www.clef-campaign.org/2003/WN_web/19.pdf

(for languages with compound word forms, the lexical approach helps,
obviously, but for stuff like English, Italian, nope)

-- 
Robert Muir
rcm...@gmail.com


Re: LucidWorks Solr

2010-04-21 Thread Mark Miller

On 4/21/10 3:22 PM, Robert Muir wrote:

On Wed, Apr 21, 2010 at 2:26 PM, Mark Miller  wrote:
   

Its an orthogonal issue - running will have that problem no matter what. It
doesn't affect whether a user that types running may be just as interested
in a doc that matches all of their other terms but has ran instead of
running. Its also just a simple example.


 

Its not orthogonal, e.g. "running water"

   
Stemming/lemmatization will pretty much always improve recall at the cost 
of precision - that's nothing new. If you stem instead, are you going to 
want documents that had run and water when you searched for running 
water? I just don't see this point as an argument against lemmatization 
and in favor of stemming.
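
As a rough illustration of that recall/precision trade, here is a small Python sketch. Everything in it is made up for the example: the three documents, the AND-only matching, and toy_stem, which is a toy suffix stripper and not any stemmer that ships with Solr.

```python
def toy_stem(word):
    """Toy suffix stripper (an assumption, not a real Solr stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if stem[-1] == stem[-2]:  # undo a doubled consonant: runn -> run
                stem = stem[:-1]
            return stem
    return word


# Hypothetical mini corpus.
DOCS = {
    1: "running water in the kitchen",
    2: "they run to fetch water",
    3: "water polo",
}


def search(query, docs, normalize):
    """Return ids of docs that contain every normalized query term."""
    wanted = {normalize(t) for t in query.lower().split()}
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if wanted <= {normalize(t) for t in text.lower().split()}
    )


exact = search("running water", DOCS, lambda w: w)
stemmed = search("running water", DOCS, toy_stem)
```

Here exact contains only document 1, while stemmed also picks up document 2, the "run and water" case: more recall, and whether that extra hit is wanted is exactly the precision question.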


-  Mark


Re: LucidWorks Solr

2010-04-21 Thread Robert Muir
On Wed, Apr 21, 2010 at 2:26 PM, Mark Miller  wrote:
>
> Its an orthogonal issue - running will have that problem no matter what. It
> doesn't affect whether a user that types running may be just as interested
> in a doc that matches all of their other terms but has ran instead of
> running. Its also just a simple example.
>
>
It's not orthogonal, e.g. "running water"

-- 
Robert Muir
rcm...@gmail.com


Re: Stemming [was: LucidWorks Solr]

2010-04-21 Thread Darren Govoni
IMHO, a 'stemmer' (being a specific 'thing') is exactly that. An
algorithm for stemming. A database or lexicon is not referred to as a
'stemmer'. One can perform "stemming" using a lexicon if that's their
need. 

For me, it's more than just stemming because some words have morphology
totally separate from "stems" that need to be known in searching and NLP
(e.g. verb conjugations). Maybe some call that stemming too, but I never
have personally.

On Wed, 2010-04-21 at 10:18 -0700, Chris Hostetter wrote:

> : Regarding stemmers, I ditched them altogether a long time ago in favor
> : of a dictionary of morphologies of all known words (for any given
> : language). A simple lookup of any word morphology thus produces the set,
> : including the correct stem.
> 
> Strictly speaking: you haven't "ditched" stemmers altogether -- you've 
> ditched *algorithmic* stemmers and moved to a *dictionary* based stemmer 
> -- but it's still a stemmer.
> 
> (i just don't want people reading this thread to be confused about 
> terminology)
> 
> 
> -Hoss
> 




Re: LucidWorks Solr

2010-04-21 Thread Mark Miller

On 4/21/10 2:20 PM, Robert Muir wrote:

On Wed, Apr 21, 2010 at 2:09 PM, Mark Miller  wrote:

   

Right - I agree they both have their strengths and weakness' - but you
usually don't get things like running->ran with stemming. Like most things,
its a tradeoff. There is always a hybrid approach as well.


 

I think running/ran has more problems, the word is so ambiguous that whether
or not your search engine stems it right isn't going to matter anyway
(running for office has nothing to do with running shoes, etc)

   
It's an orthogonal issue - running will have that problem no matter what. 
It doesn't affect whether a user that types running may be just as 
interested in a doc that matches all of their other terms but has ran 
instead of running. It's also just a simple example.


- Mark


Re: LucidWorks Solr

2010-04-21 Thread Robert Muir
On Wed, Apr 21, 2010 at 2:09 PM, Mark Miller  wrote:

>
> Right - I agree they both have their strengths and weakness' - but you
> usually don't get things like running->ran with stemming. Like most things,
> its a tradeoff. There is always a hybrid approach as well.
>
>
I think running/ran has bigger problems: the word is so ambiguous that whether
or not your search engine stems it right isn't going to matter anyway
(running for office has nothing to do with running shoes, etc.)

-- 
Robert Muir
rcm...@gmail.com


Re: LucidWorks Solr

2010-04-21 Thread Mark Miller

On 4/21/10 2:02 PM, Robert Muir wrote:

On Wed, Apr 21, 2010 at 1:49 PM, Mark Miller  wrote:

   

  I believe that's covered by morphology?


 

The problem is typically a morphological analyzer emits multiple solutions,
which include POS.

So morphology can tell you that "building" has two solutions: the gerund
form which you might stem to "build", or the noun form which you would stem
to "building".
But, you need more stuff (POS tagging, etc) to decide which to pick to
arrive at a lemma... and if your users are entering very short queries you
can see how this could be inaccurate, since there isn't much context.

So what snowball does (simply stemming build, building, buildings all to
"build") might seem silly at first, but you can see how it avoids this
entire mess.

   
Right - I agree they both have their strengths and weaknesses - but you 
usually don't get things like running->ran with stemming. Like most 
things, it's a tradeoff. There is always a hybrid approach as well.


- Mark


Re: LucidWorks Solr

2010-04-21 Thread Robert Muir
On Wed, Apr 21, 2010 at 1:49 PM, Mark Miller  wrote:

>
>  I believe that's covered by morphology?
>
>
The problem is that a morphological analyzer typically emits multiple solutions,
each of which includes a part of speech (POS).

So morphology can tell you that "building" has two solutions: the gerund
form which you might stem to "build", or the noun form which you would stem
to "building".
But, you need more stuff (POS tagging, etc) to decide which to pick to
arrive at a lemma... and if your users are entering very short queries you
can see how this could be inaccurate, since there isn't much context.

So what Snowball does (simply stemming build, building, buildings all to
"build") might seem silly at first, but you can see how it avoids this
entire mess.
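
The "building" case can be sketched in a few lines of Python. The ANALYSES table is a hypothetical stand-in for a morphological analyzer's output, and toy_stem is a toy rule set, not Snowball itself.

```python
# Hypothetical analyses: (lemma, part of speech). Not taken from a real lexicon.
ANALYSES = {
    "build": [("build", "VERB")],
    "building": [("build", "VERB"), ("building", "NOUN")],
    "buildings": [("building", "NOUN")],
}


def lemmatize(word, pos=None):
    """Return a lemma only when the candidates agree or a POS tag picks one.

    Returns None when the word stays ambiguous, which is the common case
    for short queries that give a tagger little context.
    """
    candidates = ANALYSES.get(word, [(word, "UNKNOWN")])
    if pos is not None:
        candidates = [c for c in candidates if c[1] == pos]
    lemmas = {lemma for lemma, _ in candidates}
    return lemmas.pop() if len(lemmas) == 1 else None


def toy_stem(word):
    """Toy stand-in for an algorithmic stemmer: strip plural 's', then 'ing'."""
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
    return word
```

Without a POS tag, lemmatize("building") cannot decide between the two solutions, whereas toy_stem sends all three forms to "build" with no decision to make.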

-- 
Robert Muir
rcm...@gmail.com


Re: LucidWorks Solr

2010-04-21 Thread Mark Miller

On 4/21/10 1:43 PM, Walter Underwood wrote:

On Apr 21, 2010, at 10:30 AM, Mark Miller wrote:

   

But they don't usually call 'non algorithmic' stemming 'stemming'. Stemming 
usually means using a simple heuristic process. When you use vocabulary and 
morphology, its usually called lemmatization rather than stemming.

 

"stemmer" is jargon that does not have a precise definition.
   
Usually, as the wikipedia article Robert linked to states, stemming is 
done without knowledge of the context of the word. With stemming you are 
not necessarily finding lemmas - just stems. Stems can be anything as 
long as the same word always stems to the same thing - lemmas are more 
than that. I don't think the definition is super precise, but I also 
wouldn't call it jargon.

For example, the LinguistX morphological analyzers are called "stemmers" and 
they provide options that are dictionary-based inflectional, dictionary-based 
derivational, and algorithmic. You can also combine those, so you can get accurate 
dictionary-based stems, then use an algorithmic stemmer on words not in the dictionary.
   


That just sounds like a mix of stemming and lemmatization.

- Mark



Re: LucidWorks Solr

2010-04-21 Thread Mark Miller

On 4/21/10 1:43 PM, Robert Muir wrote:

On Wed, Apr 21, 2010 at 1:30 PM, Mark Miller  wrote:

   

  But they don't usually call 'non algorithmic' stemming 'stemming'.
Stemming usually means using a simple heuristic process. When you use
vocabulary and morphology, its usually called lemmatization rather than
stemming.


 

Lemmatization usually requires part-of-speech, too.

I was gonna use my build, building, buildings example but I see wikipedia
already has a nice explained example (meeting) here:
http://en.wikipedia.org/wiki/Lemmatisation


   

I believe that's covered by morphology?

- Mark


Re: LucidWorks Solr

2010-04-21 Thread Robert Muir
On Wed, Apr 21, 2010 at 1:30 PM, Mark Miller  wrote:

>
>  But they don't usually call 'non algorithmic' stemming 'stemming'.
> Stemming usually means using a simple heuristic process. When you use
> vocabulary and morphology, its usually called lemmatization rather than
> stemming.
>
>
Lemmatization usually requires part-of-speech, too.

I was gonna use my build, building, buildings example but I see wikipedia
already has a nice explained example (meeting) here:
http://en.wikipedia.org/wiki/Lemmatisation


-- 
Robert Muir
rcm...@gmail.com


Re: LucidWorks Solr

2010-04-21 Thread Walter Underwood
On Apr 21, 2010, at 10:30 AM, Mark Miller wrote:

> But they don't usually call 'non algorithmic' stemming 'stemming'. Stemming 
> usually means using a simple heuristic process. When you use vocabulary and 
> morphology, its usually called lemmatization rather than stemming.
> 

"stemmer" is jargon that does not have a precise definition.

For example, the LinguistX morphological analyzers are called "stemmers" and 
they provide options that are dictionary-based inflectional, dictionary-based 
derivational, and algorithmic. You can also combine those, so you can get 
accurate dictionary-based stems, then use an algorithmic stemmer on words not 
in the dictionary.

Stemmers may convert the surface word to a dictionary form (inflectional), to a 
root dictionary form (derivational), or to a non-word key (the Porter 
algorithm). Arabic and Hebrew stemmers often choose an intermediate form with 
some vowel marks rather than the all-consonant "Semitic root".

Language is complicated.

Maintaining a high-quality dictionary is expensive, so you probably won't find 
many free ones.

wunder
--
Walter Underwood
Lead Engineer, Mark Logic









Re: LucidWorks Solr

2010-04-21 Thread Shashi Kant
Why do these approaches have to be mutually exclusive?
Do a dictionary lookup, if no satisfactory match found use an
algorithmic stemmer. Would probably save a few CPU cycles by
algorithmic stemming iff necessary.
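
A minimal Python sketch of that lookup-then-fallback idea follows. The LEXICON contents and the suffix_stem rules are assumptions chosen for illustration; a real setup would plug in an actual dictionary and an actual stemmer.

```python
# Hypothetical exception lexicon: irregular forms that suffix rules cannot reach.
LEXICON = {"ran": "run", "mice": "mouse", "went": "go"}


def suffix_stem(word):
    """Toy algorithmic fallback (an assumption, not Porter or Snowball)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def hybrid_stem(word, lexicon=LEXICON, fallback=suffix_stem):
    """Dictionary lookup first; run the algorithmic stemmer only on a miss."""
    stem = lexicon.get(word)
    return stem if stem is not None else fallback(word)
```

In this sketch "ran" is resolved by the lexicon, while a word the lexicon has never seen, such as "tweets", still gets a usable stem from the fallback.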


On Wed, Apr 21, 2010 at 1:31 PM, Robert Muir  wrote:
> While its easy to look at the "faults" of some algorithmic stemmer, in truth its
> only purpose is to cause related forms of the word to conflate to the same
> form, and hopefully avoiding unrelated terms from conflating to this form.
>
> A dictionary-based stemmer is out-of-date the day you put it into
> production: languages aren't static. For example, you can't expect a
> dictionary-based stemmer to properly deal with forms like "googling" or
> "tweets" that have recently slipped into English vocabulary, but an
> algorithmic stemmer will likely deal with these just fine.


Re: LucidWorks Solr

2010-04-21 Thread Robert Muir
On Wed, Apr 21, 2010 at 1:18 PM, Chris Hostetter
wrote:
>
> Strictly speaking: you haven't "ditched" stemmers altogether -- you've
> ditched *algorithmic* stemmers and moved to a *dictionary* based stemmer
> -- but it's still a stemmer.
>
> (i just don't want people reading this thread to be confused about
> terminology)
>
>
I agree, and dictionary-based stemming has its own set of problems. While
it's easy to look at the "faults" of some algorithmic stemmer, in truth its
only purpose is to make related forms of a word conflate to the same
form, while hopefully keeping unrelated terms from conflating to it.

A dictionary-based stemmer is out-of-date the day you put it into
production: languages aren't static. For example, you can't expect a
dictionary-based stemmer to properly deal with forms like "googling" or
"tweets" that have recently slipped into English vocabulary, but an
algorithmic stemmer will likely deal with these just fine.
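
To show the difference in a few lines, here is a Python sketch. The LEXICON is a hypothetical dictionary frozen before these words appeared, and suffix_stem is a toy three-rule stemmer, not a real one.

```python
# Hypothetical lexicon compiled before "googling" and "tweets" entered the language.
LEXICON = {"walking": "walk", "walked": "walk", "houses": "house"}


def dictionary_stem(word):
    """Return the dictionary form, or None for a word the lexicon does not know."""
    return LEXICON.get(word)


def suffix_stem(word):
    """Toy rule-based stemmer (assumption: only three suffix rules)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

The dictionary has nothing to say about "googling", while the rules conflate "googling" and "googled" to the same key. That key ("googl") is not a word, but for matching purposes it does not need to be.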

-- 
Robert Muir
rcm...@gmail.com


Re: LucidWorks Solr

2010-04-21 Thread Mark Miller

On 4/21/10 1:18 PM, Chris Hostetter wrote:

: Regarding stemmers, I ditched them altogether a long time ago in favor
: of a dictionary of morphologies of all known words (for any given
: language). A simple lookup of any word morphology thus produces the set,
: including the correct stem.

Strictly speaking: you haven't "ditched" stemmers altogether -- you've
ditched *algorithmic* stemmers and moved to a *dictionary* based stemmer
-- but it's still a stemmer.

(i just don't want people reading this thread to be confused about
terminology)


-Hoss

   
But they don't usually call 'non algorithmic' stemming 'stemming'. 
Stemming usually means using a simple heuristic process. When you use 
vocabulary and morphology, it's usually called lemmatization rather than 
stemming.


- Mark


Re: LucidWorks Solr

2010-04-21 Thread Chris Hostetter

: Regarding stemmers, I ditched them altogether a long time ago in favor
: of a dictionary of morphologies of all known words (for any given
: language). A simple lookup of any word morphology thus produces the set,
: including the correct stem.

Strictly speaking: you haven't "ditched" stemmers altogether -- you've 
ditched *algorithmic* stemmers and moved to a *dictionary* based stemmer 
-- but it's still a stemmer.

(i just don't want people reading this thread to be confused about 
terminology)


-Hoss



Re: LucidWorks Solr

2010-04-19 Thread Andy

> Andy,
> 
> This will help with smooth injection of your multilingual
> documents into Solr (multilingual either in the sense of 1
> doc containing fields in multiple languages or 1 index
> containing documents in different languages):
> 
>   http://sematext.com/products/multilingual-indexer/index.html


Otis,

Thanks for the info.

Is multilingual indexer an open source project or a commercial product? That 
web page doesn't mention anything about either open source or a price, so it's 
hard to tell.







Re: LucidWorks Solr

2010-04-19 Thread Otis Gospodnetic
Andy,

This will help with smooth injection of your multilingual documents into Solr 
(multilingual either in the sense of 1 doc containing fields in multiple 
languages or 1 index containing documents in different languages):

  http://sematext.com/products/multilingual-indexer/index.html

Re your other question about open-source morpho dictionaries - I don't know of 
any.  Last time I looked for dictionaries I learned that they cost money.  That 
said, the market for datasets is starting to grow, so you may be able to find 
more and cheaper dictionaries now.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Andy 
> To: solr-user@lucene.apache.org
> Sent: Mon, April 19, 2010 8:45:40 AM
> Subject: Re: LucidWorks Solr
> 
> 
> Thanks for the explanation Mitch.
> 
> You're right. There can't be universal stemmers.
> 
> What about multi-language stemmers? I'm mostly interested in English, Spanish,
> German, French, Italian. Are there any stemmers that would handle those
> languages?
> 
> If not, what's the recommended way to deal with documents in multiple languages?
> 
> --- On Mon, 4/19/10, MitchK <mitc...@web.de> wrote:
> 
> > From: MitchK <mitc...@web.de>
> > Subject: Re: LucidWorks Solr
> > To: solr-user@lucene.apache.org
> > Date: Monday, April 19, 2010, 4:36 AM
> > 
> > Andy, I think it is important to know what a stemmer really is.
> > 
> > It reduces words to their infinitves. Those infinitives do not refer to the
> > real infinitive everytime, but however: for the system, it is an infinitive,
> > since all its derivates could be reduced to the same form.
> > Thats a stemmer.
> > 
> > According to this, there can't exist a stemmer for every language, because
> > every language has got its own rules of how to reduce a word to its
> > infinitive.
> > 
> > If you apply a stemmer for english language on a german document, the
> > results might be unexpected. However, sometimes it still works good enough.
> > 
> > Keep in mind that this is an algorithm. It is not important whether the
> > created infinitive is the real infinitive. It is only important that most of
> > the derivate forms can be reduced to the same basic form. Please ask, if
> > something is not clear.
> > 
> > KStem:
> > The wiki[1] says that KStem is less aggressive as the standard stemmer.
> > I guess that this means that there are more rules for how to reduce a word
> > to its infinitive and according to this the results might be better.
> > 
> > [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem
> > 
> > Kind regards
> > - Mitch
> > -- 
> > View this message in context:
> > http://n3.nabble.com/LucidWorks-Solr-tp727341p729110.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> 


Re: LucidWorks Solr

2010-04-19 Thread Erick Erickson
no big deal, just wanted to mention.

On Mon, Apr 19, 2010 at 1:24 PM,  wrote:

> > This is a little bit of hijacking going on here, but
> You are right. Accept my regrets.
>
>
> > It's algorithmic. That is, there isn't a list of variants that
> > stem to the same infinitive, and your statement
> > "always the same infintive for any derivate of the word"
> > isn't quite what happens.
> >
> > Stemmers will always produce the same infinitive for any given
> > word, just the opposite of what you said. But it is NOT guaranteed
> > that a stemmer will always produce the same infinitive for all
> > derivatives. Rather it just does a pretty darn good job with some
> > anomalies because the rules don't cover all the edge cases.
> >
> > Their *goal* is to do it perfectly, but we all know about unachievable
> > goals...
> >
> > HTH
> > Erick
> >
> > On Mon, Apr 19, 2010 at 12:28 PM, MitchK  wrote:
> >
> >>
> >> I am curious:
> >> The idea behind a stemmer is not that he produces the correct infinitive
> >> for
> >> a given word. The idea is that he produces always the same infintive for
> >> any
> >> derivate of the word.
> >>
> >> What would be, if there is an unknown word? For example something like
> >> slang? How does your solution works here? Does it scale?
> >>
> >> Thank you for sharing experiences. :)
> >>
> >> - Mitch
> >> --
> >> View this message in context:
> >> http://n3.nabble.com/LucidWorks-Solr-tp727341p730059.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >
>
>


Re: LucidWorks Solr

2010-04-19 Thread darren
> This is a little bit of hijacking going on here, but
You are right. Accept my regrets.


> It's algorithmic. That is, there isn't a list of variants that
> stem to the same infinitive, and your statement
> "always the same infintive for any derivate of the word"
> isn't quite what happens.
>
> Stemmers will always produce the same infinitive for any given
> word, just the opposite of what you said. But it is NOT guaranteed
> that a stemmer will always produce the same infinitive for all
> derivatives. Rather it just does a pretty darn good job with some
> anomalies because the rules don't cover all the edge cases.
>
> Their *goal* is to do it perfectly, but we all know about unachievable
> goals...
>
> HTH
> Erick
>
> On Mon, Apr 19, 2010 at 12:28 PM, MitchK  wrote:
>
>>
>> I am curious:
>> The idea behind a stemmer is not that he produces the correct infinitive
>> for
>> a given word. The idea is that he produces always the same infintive for
>> any
>> derivate of the word.
>>
>> What would be, if there is an unknown word? For example something like
>> slang? How does your solution works here? Does it scale?
>>
>> Thank you for sharing experiences. :)
>>
>> - Mitch
>> --
>> View this message in context:
>> http://n3.nabble.com/LucidWorks-Solr-tp727341p730059.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>



Re: LucidWorks Solr

2010-04-19 Thread darren
My use requires a more correct processing of language than what you define
as a stemmer. My experience with stemmers is that even for some words that
have no stem, they make a new word out of them. I consider those false
positives.

My approach is based on the need to recognize that walk, walked, walking
all refer to the same lemma "walk" as is correct in grammar (not some
stemmer algorithm choice).

It scales fine. In fact, I use Lucene with an Instantiated in-memory index to
perform the lookups, but one could easily use MySQL or something else.

Darren

>
> I am curious:
> The idea behind a stemmer is not that he produces the correct infinitive
> for
> a given word. The idea is that he produces always the same infintive for
> any
> derivate of the word.
>
> What would be, if there is an unknown word? For example something like
> slang? How does your solution works here? Does it scale?
>
> Thank you for sharing experiences. :)
>
> - Mitch
> --
> View this message in context:
> http://n3.nabble.com/LucidWorks-Solr-tp727341p730059.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



Re: LucidWorks Solr

2010-04-19 Thread MitchK

Yes, you are right, thank you Erick.
I overlooked this point and thought only of the common cases, not the special ones.

However, one can combine the solutions mentioned with different stem filters
in different fields, so that one can be quite (not absolutely) sure that in
most cases the application works as expected.

- Mitch
-- 
View this message in context: 
http://n3.nabble.com/LucidWorks-Solr-tp727341p730160.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: LucidWorks Solr

2010-04-19 Thread Erick Erickson
This is a little bit of hijacking going on here, but

It's algorithmic. That is, there isn't a list of variants that
stem to the same infinitive, and your statement
"always the same infintive for any derivate of the word"
isn't quite what happens.

Stemmers will always produce the same infinitive for any given
word, just the opposite of what you said. But it is NOT guaranteed
that a stemmer will always produce the same infinitive for all
derivatives. Rather it just does a pretty darn good job with some
anomalies because the rules don't cover all the edge cases.

Their *goal* is to do it perfectly, but we all know about unachievable
goals...
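
A small Python sketch of the distinction, using a made-up toy_stem (a few suffix rules, nothing like a production stemmer): the function is deterministic for any one word, yet it does not bring every form of "run" together.

```python
from collections import defaultdict


def toy_stem(word):
    """Toy suffix stripper (an assumption, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if stem[-1] == stem[-2]:  # undo a doubled consonant: runn -> run
                stem = stem[:-1]
            return stem
    return word


def conflate(words):
    """Group words by the stem the rules produce for them."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)


groups = conflate(["run", "runs", "running", "ran"])
```

Here "run", "runs" and "running" land in one group, but the irregular form "ran" ends up in a group of its own: an edge case the rules simply do not cover.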

HTH
Erick

On Mon, Apr 19, 2010 at 12:28 PM, MitchK  wrote:

>
> I am curious:
> The idea behind a stemmer is not that he produces the correct infinitive
> for
> a given word. The idea is that he produces always the same infintive for
> any
> derivate of the word.
>
> What would be, if there is an unknown word? For example something like
> slang? How does your solution works here? Does it scale?
>
> Thank you for sharing experiences. :)
>
> - Mitch
> --
> View this message in context:
> http://n3.nabble.com/LucidWorks-Solr-tp727341p730059.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: LucidWorks Solr

2010-04-19 Thread MitchK

I am curious:
The idea behind a stemmer is not that it produces the correct infinitive for
a given word. The idea is that it always produces the same infinitive for any
derivative of the word. 

What happens if there is an unknown word? For example, something like
slang? How does your solution work here? Does it scale? 

Thank you for sharing experiences. :)

- Mitch
-- 
View this message in context: 
http://n3.nabble.com/LucidWorks-Solr-tp727341p730059.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: LucidWorks Solr

2010-04-19 Thread darren
There have been some open source ones. I don't have the links handy at
this moment[1]. But I parsed through the electronic dictionary and
generated a database of each word and its morphologies. I got tired of
lame stemmers that were wrong half the time. Computers are fast enough to
do lookups on 150,000 words nowadays; there's no need for fuzzy
algorithms here, IMO.

Good luck!

[1] google will turn up some I think.

> Thanks for the tip.
>
> Are there any publicly available dictionary of morphologies that I could
> use? Or did you build your own one?
>
>
> --- On Mon, 4/19/10, Darren Govoni  wrote:
>
>> From: Darren Govoni 
>> Subject: Re: LucidWorks Solr
>> To: solr-user@lucene.apache.org
>> Date: Monday, April 19, 2010, 7:39 AM
>> Regarding stemmers, I ditched them
>> altogether a long time ago in favor
>> of a dictionary of morphologies of all known words (for any
>> given
>> language). A simple lookup of any word morphology thus
>> produces the set,
>> including the correct stem.
>>
>> Works great. 100% of the time.
>>
>> Just a tip from me.
>>
>>
>> On Mon, 2010-04-19 at 00:36 -0800, MitchK wrote:
>>
>> > Andy, I think it is important to know what a stemmer
>> really is.
>> >
>> > It reduces words to their infinitves. Those
>> infinitives do not refer to the
>> > real infinitive everytime, but however: for the
>> system, it is an infinitive,
>> > since all its derivates could be reduced to the same
>> form.
>> > Thats a stemmer.
>> >
>> > According to this, there can't exist a stemmer for
>> every language, because
>> > every language has got its own rules of how to reduce
>> a word to its
>> > infinitive.
>> >
>> > If you apply a stemmer for english language on a
>> german document, the
>> > results might be unexpected. However, sometimes it
>> still works good enough.
>> >
>> > Keep in mind that this is an algorithm. It is not
>> important whether the
>> > created infinitive is the real infinitive. It is only
>> important that most of
>> > the derivate forms can be reduced to the same basic
>> form. Please ask, if
>> > something is not clear.
>> >
>> > KStem:
>> > The wiki[1] says that KStem is less aggressive as the
>> standard stemmer.
>> > I guess that this means that there are more rules for
>> how to reduce a word
>> > to its infinitive and according to this the results
>> might be better.
>> >
>> >
>> > [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem
>> >
>> > Kind regards
>> > - Mitch
>>
>>
>>
>
>
>
>



Re: LucidWorks Solr

2010-04-19 Thread Andy
Thanks for the tip.

Is there any publicly available dictionary of morphologies that I could use? 
Or did you build your own?


--- On Mon, 4/19/10, Darren Govoni  wrote:

> From: Darren Govoni 
> Subject: Re: LucidWorks Solr
> To: solr-user@lucene.apache.org
> Date: Monday, April 19, 2010, 7:39 AM
> Regarding stemmers, I ditched them
> altogether a long time ago in favor
> of a dictionary of morphologies of all known words (for any
> given
> language). A simple lookup of any word morphology thus
> produces the set,
> including the correct stem.
> 
> Works great. 100% of the time.
> 
> Just a tip from me.
> 
> 
> On Mon, 2010-04-19 at 00:36 -0800, MitchK wrote:
> 
> > Andy, I think it is important to know what a stemmer
> really is.
> > 
> > It reduces words to their infinitves. Those
> infinitives do not refer to the
> > real infinitive everytime, but however: for the
> system, it is an infinitive,
> > since all its derivates could be reduced to the same
> form.
> > Thats a stemmer.
> > 
> > According to this, there can't exist a stemmer for
> every language, because
> > every language has got its own rules of how to reduce
> a word to its
> > infinitive.
> > 
> > If you apply a stemmer for english language on a
> german document, the
> > results might be unexpected. However, sometimes it
> still works good enough. 
> > 
> > Keep in mind that this is an algorithm. It is not
> important whether the
> > created infinitive is the real infinitive. It is only
> important that most of
> > the derivate forms can be reduced to the same basic
> form. Please ask, if
> > something is not clear.
> > 
> > KStem:
> > The wiki[1] says that KStem is less aggressive as the
> standard stemmer.
> > I guess that this means that there are more rules for
> how to reduce a word
> > to its infinitive and according to this the results
> might be better.
> > 
> > 
> > [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem
> > 
> > Kind regards
> > - Mitch
> 
> 
> 


  


Re: LucidWorks Solr

2010-04-19 Thread Andy
Thanks for the explanation Mitch.

You're right. There can't be universal stemmers.

What about multi-language stemmers? I'm mostly interested in English, Spanish, 
German, French, Italian. Are there any stemmers that would handle those 
languages?

If not, what's the recommended way to deal with documents in multiple languages?

--- On Mon, 4/19/10, MitchK  wrote:

> From: MitchK 
> Subject: Re: LucidWorks Solr
> To: solr-user@lucene.apache.org
> Date: Monday, April 19, 2010, 4:36 AM
> 
> Andy, I think it is important to know what a stemmer really is.
> 
> It reduces words to their infinitives. Those infinitives are not always
> the real infinitive, but for the system they act as one, since all derived
> forms can be reduced to the same form.
> That's a stemmer.
> 
> According to this, there can't be a single stemmer for every language,
> because every language has its own rules for reducing a word to its
> infinitive.
> 
> If you apply an English stemmer to a German document, the results might be
> unexpected. However, sometimes it still works well enough.
> 
> Keep in mind that this is an algorithm. It is not important whether the
> created infinitive is the real infinitive. It is only important that most
> of the derived forms can be reduced to the same basic form. Please ask if
> something is not clear.
> 
> KStem:
> The wiki[1] says that KStem is less aggressive than the standard stemmer.
> I guess this means that it has more rules for how to reduce a word to its
> infinitive, and that the results might therefore be better.
> 
> 
> [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem
> 
> Kind regards
> - Mitch


  


Re: LucidWorks Solr

2010-04-19 Thread Darren Govoni
Regarding stemmers, I ditched them altogether a long time ago in favor
of a dictionary of morphologies of all known words (for any given
language). A simple lookup of any word morphology thus produces the set,
including the correct stem.

Works great. 100% of the time.

Just a tip from me.
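
The lookup approach described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the MORPHOLOGY table and the names lookup_stem and word_forms are hypothetical, and a real dictionary would be loaded from a much larger resource. Note that a word missing from the table simply returns nothing.

```python
# Hypothetical morphology table: every known surface form maps to its stem.
# A real table would be loaded from a large per-language dictionary.
MORPHOLOGY = {
    "run": "run", "runs": "run", "ran": "run", "running": "run",
    "mouse": "mouse", "mice": "mouse",
}


def lookup_stem(word):
    """Return the stem of a known word, or None if the word is not listed."""
    return MORPHOLOGY.get(word.lower())


def word_forms(word):
    """Return every listed form that shares a stem with the given word."""
    stem = lookup_stem(word)
    if stem is None:
        return set()
    return {form for form, s in MORPHOLOGY.items() if s == stem}


if __name__ == "__main__":
    print(lookup_stem("mice"))
    print(sorted(word_forms("ran")))
```

Irregular forms such as "ran" or "mice" are handled by the table itself, which is what a purely rule-based stemmer tends to miss.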


On Mon, 2010-04-19 at 00:36 -0800, MitchK wrote:

> Andy, I think it is important to know what a stemmer really is.
> 
> It reduces words to their infinitives. Those infinitives are not always the
> real infinitive, but for the system they act as one, since all derived
> forms can be reduced to the same form.
> That's a stemmer.
> 
> According to this, there can't be a single stemmer for every language,
> because every language has its own rules for reducing a word to its
> infinitive.
> 
> If you apply an English stemmer to a German document, the results might be
> unexpected. However, sometimes it still works well enough.
> 
> Keep in mind that this is an algorithm. It is not important whether the
> created infinitive is the real infinitive. It is only important that most of
> the derived forms can be reduced to the same basic form. Please ask if
> something is not clear.
> 
> KStem:
> The wiki[1] says that KStem is less aggressive than the standard stemmer.
> I guess this means that it has more rules for how to reduce a word to its
> infinitive, and that the results might therefore be better.
> 
> 
> [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem
> 
> Kind regards
> - Mitch




Re: LucidWorks Solr

2010-04-19 Thread MitchK

Andy, I think it is important to know what a stemmer really is.

It reduces words to their infinitives. Those infinitives are not always the
real infinitive, but for the system they act as one, since all derived forms
can be reduced to the same form.
That's a stemmer.

According to this, there can't be a single stemmer for every language,
because every language has its own rules for reducing a word to its
infinitive.

If you apply an English stemmer to a German document, the results might be
unexpected. However, sometimes it still works well enough.

Keep in mind that this is an algorithm. It is not important whether the
created infinitive is the real infinitive. It is only important that most of
the derived forms can be reduced to the same basic form. Please ask if
something is not clear.
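
To make the point above concrete, here is a minimal sketch in Python of a toy suffix-stripping stemmer. It is not Porter or KStem; the suffix list and the name toy_stem are made up for illustration. It shows that the produced form does not have to be a real word as long as the derived forms agree on it.

```python
# Toy suffix-stripping stemmer, for illustration only (not Porter, not KStem).
# The suffix list is a made-up assumption; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("walks", "walked", "walking", "hopes", "hoped", "hoping", "hope"):
        print(w, "->", toy_stem(w))
```

With these rules "hopes", "hoped" and "hoping" all become "hop", which is not the real base form "hope", yet the three derived forms still match each other. The word "hope" itself stays "hope", so only most of the forms are conflated, not all of them.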

KStem:
The wiki[1] says that KStem is less aggressive than the standard stemmer.
I guess this means that it has more rules for how to reduce a word to its
infinitive, and that the results might therefore be better.


[1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem

Kind regards
- Mitch
-- 
View this message in context: 
http://n3.nabble.com/LucidWorks-Solr-tp727341p729110.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: LucidWorks Solr

2010-04-18 Thread Andy


--- On Sun, 4/18/10, Grant Ingersoll  wrote:
 
> 
> Sure, but I'm biased. ;-)  Hopefully, you will find it
> useful, but choose the one that best fits your needs (and
> let me know if you need help assessing that.)
> 

Thanks for the explanation Grant.

What is the advantage of KStem over the standard Solr stemmer?

On your website it was mentioned that KStem only works for English. What would 
happen if some of my documents are in other languages? What about the standard 
Solr stemmer -- does it also work on English only?

Is there a stemmer that's sort of "universal" and works on multiple languages?





Re: LucidWorks Solr

2010-04-18 Thread Grant Ingersoll

On Apr 18, 2010, at 3:53 AM, Andy wrote:

> Just wanted to know if anyone has used LucidWorks Solr. 
> 
> - How do you compare it to the standard Apache Solr?

We take a release of Solr.  We wrap it w/ an installer, tomcat/jetty, our 
reference guide, Luke, etc.  We also add in an optimized version of KStem.  
Finally, we apply certain patches that came after whatever the release was that 
didn't make it into the release (we usually delay our release by a few weeks).  
Many of these things we package simply cannot be in an ASF release b/c of ASF 
policies, others are there for convenience so that people don't have to go all 
over the web to get them.

> 
> - the non-blocking IO of LucidWorks Solr -- is that for networking IO or disk 
> IO? what are its effects?

I think this is a legacy from the 1.3 CD on our website.  I believe what this 
is referring to is in Solr 1.4, as it was a patch that was applied to trunk 
after 1.3 was released.   I'll let our web team know to update that.

> 
> - The LucidWorks website also talked about "significantly improved faceting 
> performance" -- what are the improvements, and how large are they?

Same as the previous issue.   I'll let our web team know to update that.

> 
> Would you recommend using it?
> 

Sure, but I'm biased. ;-)  Hopefully, you will find it useful, but choose the 
one that best fits your needs (and let me know if you need help assessing that.)

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: LucidWorks Solr

2010-04-18 Thread Paolo Castagna

Thanks for asking, I am interested as well in reading the response to
your questions.

Paolo

Andy wrote:
> Just wanted to know if anyone has used LucidWorks Solr.
>
> - How do you compare it to the standard Apache Solr?
>
> - the non-blocking IO of LucidWorks Solr -- is that for networking IO or
> disk IO? what are its effects?
>
> - The LucidWorks website also talked about "significantly improved faceting
> performance" -- what are the improvements, and how large are they?
>
> Would you recommend using it?
>
> Thanks.


  


LucidWorks Solr

2010-04-18 Thread Andy
Just wanted to know if anyone has used LucidWorks Solr. 

- How do you compare it to the standard Apache Solr?

- the non-blocking IO of LucidWorks Solr -- is that for networking IO or disk 
IO? what are its effects?

- The LucidWorks website also talked about "significantly improved faceting 
performance" -- what are the improvements, and how large are they?

Would you recommend using it?

Thanks.


  


Re: LucidWorks Solr

2010-03-16 Thread Kevin Osborn
For my purposes, the Porter analyzer was overly aggressive with stemming, so 
we moved to KStem. KStem no longer appears to be maintained, and Lucid claimed 
much better performance with their version, so I gave that a try and it seems 
to be working fine. I didn't do any benchmarks, though.

I just took the WAR in LucidWorks\dist. I think the install instructions also 
included a script to apply to the included source code. I ran that too, since 
I look at the source regularly.

I didn't look at LucidGaze or any of the other Lucid features.

-Kevin





From: blargy 
To: solr-user@lucene.apache.org
Sent: Tue, March 16, 2010 12:31:09 PM
Subject: Re: LucidWorks Solr


Kevin,

When you say you just included the WAR, you mean /packs/solr.war, correct?
I see that the KStemmer is nicely packed in there, but I don't see LucidGaze
anywhere. Have you had any experience using it?

So I'm guessing you would suggest using the LucidWorks solr.war over the
apache-solr-war just because of the various bug-fixes/tests. 

As a side question: is there a reason you chose the LucidKStemmer over any
other stemmers (KStemmer, Porter, etc)? I'm unsure which stemmer would
work best. Thanks again!


Kevin Osborn-2 wrote:
> 
> I used it mostly for KStemmer, but I also liked the fact that it included
> about a dozen or so stable patches since Solr 1.4 was released. We just
> use the included WAR in our project however. We don't use the installer or
> anything like that.
> 
> 
> 
> 
> 
> 
> From: blargy 
> To: solr-user@lucene.apache.org
> Sent: Tue, March 16, 2010 11:52:17 AM
> Subject: LucidWorks Solr
> 
> 
> Has anyone used this?:
> http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr
> 
> Other than the KStemmer and installer what are the other "enhancements"
> that
> this download offers? Is it worth using over the default Solr
> installation?
> 
> Thanks
> 

-- 
View this message in context: 
http://old.nabble.com/LucidWorks-Solr-tp27922870p27923359.html
Sent from the Solr - User mailing list archive at Nabble.com.


  

Re: LucidWorks Solr

2010-03-16 Thread blargy

Kevin,

When you say you just included the WAR, you mean /packs/solr.war, correct?
I see that the KStemmer is nicely packed in there, but I don't see LucidGaze
anywhere. Have you had any experience using it?

So I'm guessing you would suggest using the LucidWorks solr.war over the
apache-solr-war just because of the various bug-fixes/tests. 

As a side question: is there a reason you chose the LucidKStemmer over any
other stemmers (KStemmer, Porter, etc)? I'm unsure which stemmer would
work best. Thanks again!


Kevin Osborn-2 wrote:
> 
> I used it mostly for KStemmer, but I also liked the fact that it included
> about a dozen or so stable patches since Solr 1.4 was released. We just
> use the included WAR in our project however. We don't use the installer or
> anything like that.
> 
> 
> 
> 
> 
> 
> From: blargy 
> To: solr-user@lucene.apache.org
> Sent: Tue, March 16, 2010 11:52:17 AM
> Subject: LucidWorks Solr
> 
> 
> Has anyone used this?:
> http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr
> 
> Other than the KStemmer and installer what are the other "enhancements"
> that
> this download offers? Is it worth using over the default Solr
> installation?
> 
> Thanks
> 

-- 
View this message in context: 
http://old.nabble.com/LucidWorks-Solr-tp27922870p27923359.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: LucidWorks Solr

2010-03-16 Thread AJ Chen
I'm trying it out right now. I hope it will work well out of the box for
indexing/searching a set of documents with frequent updates.
-aj

On Tue, Mar 16, 2010 at 11:52 AM, blargy  wrote:

>
> Has anyone used this?:
> http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr
>
> Other than the KStemmer and installer what are the other "enhancements"
> that
> this download offers? Is it worth using over the default Solr installation?
>
> Thanks
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
650-283-4091
*Building social media monitoring pipeline, and connecting social customers
to CRM*


Re: LucidWorks Solr

2010-03-16 Thread Kevin Osborn
I used it mostly for KStemmer, but I also liked the fact that it included about 
a dozen or so stable patches since Solr 1.4 was released. We just use the 
included WAR in our project however. We don't use the installer or anything 
like that.






From: blargy 
To: solr-user@lucene.apache.org
Sent: Tue, March 16, 2010 11:52:17 AM
Subject: LucidWorks Solr


Has anyone used this?:
http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr

Other than the KStemmer and installer what are the other "enhancements" that
this download offers? Is it worth using over the default Solr installation?

Thanks

-- 
View this message in context: 
http://old.nabble.com/LucidWorks-Solr-tp27922870p27922870.html
Sent from the Solr - User mailing list archive at Nabble.com.


  

What is the process to build Lucidworks Solr?

2010-01-07 Thread Micah Koga
I am using LucidWorks Solr v1.4 and I would like to compile in a search
component, but the process does not seem very straightforward. The ant
script in the solr directory is the one from the stock Solr installation,
and it does not compile out of the box.

Has anyone been able to successfully compile LucidWorks Solr?