Re: What is the best file system for Lucene?

2004-11-30 Thread Pete Lewis
Hi Sanyi Could you try XP on your desktop - that would take some variables out. The problem is that you are comparing OS, as well as filesystems, as well as different hardware configs. Also, unless you take your hyperthreading off, with just one index you are searching with just one half of the

Re: Lucene scalability, performance

2004-11-15 Thread Pete Lewis
Hi Venkat If you want to go against just HTML pages (maybe with Dublin Core tags) then Swish-E isn't too bad, but it won't be as portable as Lucene plus it doesn't seem to be nearly as active on the development side as Lucene (so you'll get less support in the event of problems). Swish seems ea

Re: Stemming Oddness

2004-11-06 Thread Pete Lewis
Hi Yousef You are not doing anything wrong - it's just how the Porter stemmer works! The problem with Porter is that it tries to do everything in a purely algorithmic way - which doesn't cater for irregular conjugations etc. Don't worry too much though, as long as you do the same stemming on the

Re: PorterStemmer / Levenshtein Distance

2004-11-05 Thread Pete Lewis
Hi Yousef If you want to use it for something else then go direct for the Snowball stemmers, for details go to the site: http://snowball.tartarus.org/ Cheers Pete - Original Message - From: "Yousef Ourabi" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, N

Re: PorterStemfilter

2004-09-14 Thread Pete Lewis
Hi David I like KStem more than Porter / Snowball - it still has limitations, although it performs better as it has a dictionary to augment the rules. Note that KStem will also treat "print" and "printer" as two distinct terms, probably treating them as verb and noun respectively.

Re: PorterStemfilter

2004-09-14 Thread Pete Lewis
"printer" and submit then the results will be "print" and "printer" - hence showing that the Porter stemmed versions are the same as the originals. Therefore they are both distinct terms in their own right and searches on one will not hit the other. Cheers Pete Lewis

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Pete Lewis
ction, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously, therefore, if this is the case, by reducing the heap space you can improve performance and get rid of the out of memory errors. Cheers

Re: Using MySpell iso the Snowball Analyzer

2004-09-09 Thread Pete Lewis
Hi Aad Use the stemmed result as what you index, but then also remember to stem the query terms as well - you need to do the same on the way out as on the way in. We don't use MySpell but we do use our own stemmer in this way, as there are many examples where Snowball falls down like: caught ->
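The advice above (stem on the way in and again on the way out) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration only: `toy_stem` is a hypothetical suffix stripper, not Porter, Snowball, KStem or MySpell, and `build_index` / `search` are hypothetical helpers rather than Lucene APIs.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical toy stemmer: lowercase, then strip one common suffix.

    This is NOT Porter, Snowball or KStem; it only stands in for
    "whatever stemmer you use" so the symmetry can be shown.
    """
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, normalize):
    """Build a tiny inverted index: normalized term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize):
    """Return ids of documents matching any normalized query term."""
    hits = set()
    for token in query.split():
        hits |= index.get(normalize(token), set())
    return hits


docs = {1: "printing reports", 2: "the printer jammed"}
index = build_index(docs, toy_stem)  # stemmed on the way in

# Same stemmer on the way out: "printed" and "printing" meet at "print".
print(search(index, "printed", toy_stem))
# Query only lowercased, not stemmed: "printed" is not an indexed term.
print(search(index, "printed", str.lower))
```

With this toy stemmer, "printer" is left unchanged, so a search for it only reaches document 2. That is the same kind of behaviour described for "print" and "printer" elsewhere in this listing.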

Re: Searching different types of words

2003-11-25 Thread Pete Lewis
Hi I'd recommend Kstem over Porter, it performs much better on English let alone when you get to other languages. You can get the source code for Kstem.jar at the following website: http://ciir.cs.umass.edu/downloads/ Pete - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED

Re: Index entire filesystem

2003-11-05 Thread Pete Lewis
Stefan Groschupf" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, November 05, 2003 11:01 AM Subject: Re: Index entire filesystem > There is some ongoing work for nutch.org. > May be we can bundle all work together?! > Nutc

Re: Index entire filesystem

2003-11-05 Thread Pete Lewis
Hi Stefan Using OpenOffice will enable you to parse 182 file formats, but it's not a pure Java solution and you still need an alternate solution for PDFs. I'd be interested in knowing whether anyone is working on a pure Java solution that would give us a single method for handling MS Office documm

Re: Lucene demo ideas?

2003-09-17 Thread Pete Lewis
Might want two demos, one for Unix environments and one for Windows. Most users will want a fast start that they can copy and adapt. So quick targets would be: filesystems - html / text / pdf / office documents for windows. xml - fairly simple example maybe against news items. database - again s

Reference for Lucene as a search tool built into a CD

2003-09-09 Thread Pete Lewis
Does anyone know of Lucene being packaged onto a CD to provide a search facility for the data on that CD? If so, would it be possible to get a reference? Thanks Pete

Multi-lingual synonym and homonym lists

2003-06-08 Thread Pete Lewis
Hi all Does anyone know of any synonym and homonym lists for the different European languages? Sorry for the cross-posting but I'd like to use them for query expansion in different languages. Pete

Re: RE : Parsers

2003-05-30 Thread Pete Lewis
Hi guys Thanks, Jawin looks really nice :) Pete - Original Message - From: "Andrzej Bialecki" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Thursday, May 29, 2003 9:45 AM Subject: Re: RE : Parsers > Victor Hadianto wrote: > >>I'm using successfully a combinatio

Re: RE : Parsers

2003-05-29 Thread Pete Lewis
Hi Victor Thanks. In the past I have used the Inso OutsideIn filters and found them very good; however I'd like to come up with a pure Java solution, so if there is a Java equivalent to the Inso filters I'd be grateful for any details. Failing that, I thought that I'd go for individual parsers ini

Re: Parsers

2003-05-28 Thread Pete Lewis
one... Adriano Labate -Message d'origine- De : Pete Lewis [mailto:[EMAIL PROTECTED] Envoyé : mercredi, 28 mai 2003 12:48 À : Lucene Users List Objet : Parsers Hi all, I have a rather nice html parser that I got from SourceForge. Does anyone know of any good parsers for pdf and Micr

Parsers

2003-05-28 Thread Pete Lewis
Hi all, I have a rather nice HTML parser that I got from SourceForge. Does anyone know of any good parsers for PDF and Microsoft Office Suite (.doc, .ppt, .xls, etc.)? Any help would be much appreciated. Pete Lewis