custom Tokenizer

2002-12-12 Thread Raj
Hello all, my program is indexing a string, London/Bristol/LondonEast/Scotland, using the standard analyzer. When I search with the word london it doesn't come up in the hits. If I search for london it does. Where would the problem be? Does it require a custom tokenizer
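
To make the tokenizer question concrete, here is a minimal sketch in plain Java (not the Lucene Tokenizer API) of the terms one might want from that string, assuming the goal is for each slash-separated part to become its own lowercased term. The class and method names (SlashTokenizer, tokenize) are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: splits a field value on '/'
// and lowercases each part so that terms compare case-insensitively.
class SlashTokenizer {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("/")) {
            String term = part.trim().toLowerCase(Locale.ROOT);
            if (!term.isEmpty()) {
                tokens.add(term); // skip empty parts such as those from "a//b"
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the list of terms produced for the string from the message.
        System.out.println(tokenize("London/Bristol/LondonEast/Scotland"));
    }
}
```

With terms like these, an exact-term search for london would match the first part only; matching londoneast as well would need something like a prefix search.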

phrase search woes......

2002-12-12 Thread host unknown
Hi all, I'm having a problem searching for phrases (example: bucky badger). I can search for the terms individually (using AND or OR searches with a BooleanQuery), but can't seem to do a PhraseQuery within the same BooleanQuery. See code: BooleanQuery myquery = new BooleanQuery(); for (int

Re: Accentuated characters

2002-12-12 Thread stephane vaucher
Hi Eric, Thanks for the link. I've looked at it and it has some interesting parts, like the stop words and the analyser, which I might partially include (partially, since I work with both English and French texts). Cheers, Stephane Eric Isakson wrote: Don't know if any of the code in this

Re: Accentuated characters

2002-12-12 Thread stephane vaucher
Actually, I'm just looking to remove accents from Java chars (so Unicode), only for the search (the original doc should stay the same, since I display it). I'll just implement a TokenFilter to do this. It should be relatively simple. Just wanted to know if it had already been done (perhaps in a
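
The folding step such a filter would apply to each token can be sketched as follows. This is only an illustration, not the poster's code and not a Lucene TokenFilter: it assumes a modern JDK (java.text.Normalizer was added in Java 6, after this thread), and the names AccentFolder and fold are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper: strips accents from a token for search purposes.
// The stored/displayed document text would be left untouched.
class AccentFolder {

    static String fold(String token) {
        // NFD splits an accented letter into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        // Drop the combining diacritical marks, keeping the base letters.
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // "\u00e9" is e with an acute accent; escapes avoid source-encoding issues.
        System.out.println(fold("accentu\u00e9"));
    }
}
```

Applying the same folding at index time and at query time keeps the two sides consistent. Letters that have no canonical decomposition (for example the ligature in "cœur") pass through unchanged and would need an explicit mapping.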

RE: Accentuated characters

2002-12-12 Thread Alex Murzaku
Something flexible and elegant would also be a simple FST. Here is one built for Lucene: http://sourceforge.net/projects/normalizer/ -Original Message- From: stephane vaucher [mailto:[EMAIL PROTECTED]] Sent: Thursday, December 12, 2002 12:23 PM To: Lucene Users List Subject: Re:

HTML saga continues...

2002-12-12 Thread Leo Galambos
So, I have tried this with Lucene: 1) the original JavaCC LL(k) HTML parser 2) Swing's HTML parser. In case (1) I could process about 300K HTML documents. In case (2), more than 400K. But I cannot process the complete collection (5M) and finish my hard stress tests of Lucene. Is there anyone

Re: HTML saga continues...

2002-12-12 Thread Erik Hatcher
Look in the Lucene sandbox in CVS. I contributed an Ant task that indexes HTML documents. It uses JTidy under the covers to parse HTML into title and body content, and it could be extended to pull other information such as meta keywords. Erik Leo Galambos wrote: So, I have tried this with

Re: Accentuated characters

2002-12-12 Thread stephane vaucher
Fair enough, but a protected modifier would only allow subclasses to access it. Personally, I would rather not have to use a subclass to implement my feature. I think the logic behind this is that it's an intrinsic property of a Term, thus it should be immutable, as any modifications to this object

Re: HTML saga continues...

2002-12-12 Thread Erik Hatcher
On a related note, I've also released a project that I developed for my book and for presentations that I have been giving on Ant, XDoclet, and JUnit. This project is a documentation search engine with a web (Struts) interface. It uses Lucene and the Ant task I mentioned already to index a

Re: HTML saga continues...

2002-12-12 Thread Otis Gospodnetic
Yeah, Neko is not the most straightforward, but it works. Sorry, the code is somewhere; I can't look for it now. But you could also look at LARM under the Lucene Sandbox, it's got a nice HTML parser, too. Otis --- Leo Galambos [EMAIL PROTECTED] wrote: So, I have tried this with Lucene: 1)