Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

jian chen Thu, 20 Jan 2005 14:09:46 -0800

Hi,

One thing to point out: I don't think Lucene uses LSI as its
underlying retrieval model. It uses the vector space model, along with
proximity-based retrieval.
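
To make "vector space model" concrete, here is a minimal sketch of
textbook TF-IDF weighting with cosine ranking. It is an illustration
only and not Lucene's actual scoring formula; the function names
(tfidf_vectors, cosine, rank) and the toy documents are made up for
this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized doc,
    plus the IDF table so queries can be weighted the same way."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection get weight 0.
    q = {term: tf * idf.get(term, 0.0) for term, tf in Counter(query).items()}
    scores = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Hypothetical toy collection, already tokenized.
docs = [
    ["power", "law", "network"],
    ["agent", "based", "model"],
    ["small", "world", "network"],
]
best_score, best_doc = rank(["agent", "model"], docs)[0]
```

Each document and the query become weighted term vectors, and results
are ordered by the angle between them. LSI would add a dimensionality
reduction step on top of a matrix like this, which the sketch does not
attempt.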


Personally, I don't know much about LSI, and I don't think fancy
techniques like LSI are workable in industry. I believe we are still
far from the era of artificial intelligence, and from using anything
that elusive to do information retrieval.
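
For the readable-stems question in the quoted message, one possible
workaround, sketched here as an assumption rather than anything Lucene
provides, is to keep stemming as usual but remember which surface forms
produced each stem, then display the most frequent one. The toy_stem
function below is a crude hypothetical stand-in for a real
Porter/Snowball stemmer, and readable_labels is likewise a made-up name.

```python
from collections import Counter, defaultdict

# Suffixes are tried in this order; this is NOT the Porter algorithm,
# just enough to reproduce the "generat" example from the question.
SUFFIXES = ("ing", "ed", "es", "e", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def readable_labels(tokens, stem=toy_stem):
    """Map each stem to the most frequent surface form that produced it.

    Ties are broken by shorter word first, then alphabetically.
    """
    forms = defaultdict(Counter)
    for token in tokens:
        forms[stem(token)][token] += 1
    return {
        s: min(counts, key=lambda w: (-counts[w], len(w), w))
        for s, counts in forms.items()
    }


tokens = ["generate", "generates", "generated", "generating", "generate"]
labels = readable_labels(tokens)
```

The index would still store the stems; only the display layer (for
example a graphical navigation interface) would look up labels[stem] to
show a real word.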

Cheers,

Jian


On Thu, 20 Jan 2005 14:50:10 -0700, Owen Densmore <[EMAIL PROTECTED]> wrote:
> Hi .. I'm new to the list so forgive a dumb question or two as I get
> started.
> 
> We're in the midst of converting a small collection (1200-1500
> currently) of scientific literature to be easily searchable/navigable.
> We'll likely provide both a text query interface as well as a graphical
> way to search and discover.
> 
> Our initial approach will be vector based, looking at Latent Semantic
> Indexing (LSI) as a potential tool, although if that's not needed,
> we'll stop at reasonably simple stemming with a weighted document term
> matrix (DTM).  (Bear in mind I couldn't even pronounce most of these
> concepts last week, so go easy if I'm incoherent!)
> 
> It looks to me that Lucene has a quite well factored architecture.  I
> should at the very least be able to use the analyzer and stemmer to
> create a good starting point in the project.  I'd also like to leave a
> nice architecture behind in case we or others end up experimenting
> with, or extending, the system.
> 
> So a couple of questions:
> 
> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
> apparently produces non-word stems .. i.e. not really human readable.
> (Example: generate, generates, generated, generating  -> generat)
> Although in typical queries this is not important because the result of
> the search is a document list, it *would* be important if we use the
> stems within a graphical navigation interface.
>      So the question is: Is there a way to have the stemmer produce
>      English base forms of the words being stemmed?
> 
> 2 - We're probably using Lucene in ways it was not designed for, such
> as DTM/LSI and graphical clustering and navigation.  Naturally we'll
> provide code for these parts that are not in Lucene.
>      But the question arises: is this kinda dumb?!  Has anyone stretched
>      Lucene's design center with positive results?  Are we barking up
>      the wrong tree?
> 
> 3 - A nit on hyphenation: Our collection is scientific so has many
> hyphenated words.  I'm wondering about your experiences with
> hyphenation.  In our collection, things like self-organization,
> power-law, space-time, small-world, agent-based, etc. occur often, for
> example.
>      So the question is: Do folks break up hyphenated words?  If not, do
>      you stem the parts and glue them back together?  Do you apply
>      stoplists to the parts?
> 
> Thanks for any help and pointers you can fling along,
> 
> Owen    http://backspaces.net/    http://redfish.com/
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>
