Hi, One thing to point out. I think Lucene is not using LSI as the underlying retrieval model. It uses vector space model and also proximity based retrieval.
Personally, I don't know much about LSI and I don't think the fancy stuff like LSI is workable in industry. I believe we are far away from the era of artificial intelligence and using any elusive way to do information retrieval. Cheers, Jian On Thu, 20 Jan 2005 14:50:10 -0700, Owen Densmore <[EMAIL PROTECTED]> wrote: > Hi .. I'm new to the list so forgive a dumb question or two as I get > started. > > We're in the midst of converting a small collection (1200-1500 > currently) of scientific literature to be easily searchable/navigable. > We'll likely provide both a text query interface as well as a graphical > way to search and discover. > > Our initial approach will be vector based, looking at Latent Semantic > Indexing (LSI) as a potential tool, although if that's not needed, > we'll stop at reasonably simple stemming with a weighted document term > matrix (DTM). (Bear in mind I couldn't even pronounce most of these > concepts last week, so go easy if I'm incoherent!) > > It looks to me that Lucene has a quite well factored architecture. I > should at the very least be able to use the analyzer and stemmer to > create a good starting point in the project. I'd also like to leave a > nice architecture behind in case we or others end up experimenting > with, or extending, the system. > > So a couple of questions: > > 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) > apparently produces non-word stems .. i.e. not really human readable. > (Example: generate, generates, generated, generating -> generat) > Although in typical queries this is not important because the result of > the search is a document list, it *would* be important if we use the > stems within a graphical navigation interface. > So the question is: Is there a way to have the stemmer produce > english > base forms of the words being stemmed? > > 2 - We're probably using Lucene in ways it was not designed for, such > as DTM/LSI and graphical clustering and navigation. Naturally we'll > provide code for these parts that are not in Lucene. > But the question arises: is this kinda dumb?! Has anyone stretched > Lucene's > design center with positive results? Are we barking up the wrong > tree? > > 3 - A nit on hyphenation: Our collection is scientific so has many > hyphenated words. I'm wondering about your experiences with > hyphenation. In our collection, things like self-organization, > power-law, space-time, small-world, agent-based, etc. occur often, for > example. > So the question is: Do folks break up hyphenated words? If not, do > you stem the > parts and glue them back together? Do you apply stoplists to the > parts? > > Thanks for any help and pointers you can fling along, > > Owen http://backspaces.net/ http://redfish.com/ > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]