custom Tokenizer

2002-12-12 Thread Raj
Hello All,
   My program indexes the string
London/Bristol/LondonEast/Scotland using the standard analyzer.
When I search for the word london it doesn't come up in the hits. If I
search for "london" it does.
Where could the problem be?
Does it require a custom tokenizer that splits on the '/' characters in
the input string while indexing, or is that facility already available in
StandardTokenizer?
Thanks.
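[Editor's note: one low-cost workaround, before writing a custom tokenizer, is to replace the '/' separators with spaces before the text reaches the analyzer, so StandardAnalyzer sees the place names as separate tokens. A minimal sketch; the class and method names are illustrative, not from the post:]

```java
public class SlashSplit {
    // Turn "London/Bristol/LondonEast/Scotland" into four whitespace-
    // separated words so a standard analyzer tokenizes them individually.
    static String preSplit(String s) {
        return s.replace('/', ' ');
    }

    public static void main(String[] args) {
        // prints: London Bristol LondonEast Scotland
        System.out.println(preSplit("London/Bristol/LondonEast/Scotland"));
    }
}
```

Feeding the result of preSplit to the existing indexing code should let a search for london match, once the analyzer lowercases the token.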



--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




phrase search woes......

2002-12-12 Thread host unknown
Hi all,

I'm having a problem searching for phrases (example: "bucky badger").  I
can search for the terms individually (using AND or OR searches
(BooleanQuery)), but can't seem to do a PhraseQuery (within the same boolean
query). See the code:

BooleanQuery myquery = new BooleanQuery();

for (int i = 0; i < keyword_vector.size(); i++) {
	keyword = (String)keyword_vector.elementAt(i);

	// if the user has entered a wildcard at the end of the keyword
	if (keyword.endsWith("*")) {
		myquery.add(new PrefixQuery(new Term("rest",
			keyword.substring(0, keyword.length() - 1))), require_all, false);

	// user is looking for an exact phrase, enclosed w/ quotes
	} else if (keyword.startsWith("\"") && keyword.endsWith("\"")) {
		// remove the quotes
		keyword = keyword.substring(1, keyword.length() - 1);
		PhraseQuery pq = new PhraseQuery();
		pq.add(new Term("rest", keyword));
		pq.setSlop(new Integer((String)properties.get("PhraseSlop")).intValue());
		myquery.add(pq, require_all, false);

	// just looking for a simple term
	} else {
		myquery.add(new TermQuery(new Term("rest", keyword)), require_all, false);
	}
}
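[Editor's note: a PhraseQuery matches a sequence of single-word Terms, so adding the whole quoted string as one Term, as the phrase branch above does, will never match an index built by an analyzer that splits on whitespace. A hedged sketch of the missing step follows; the class name is illustrative, and in the code above each resulting word would get its own pq.add(new Term("rest", word)) call:]

```java
import java.util.StringTokenizer;
import java.util.Vector;

public class PhraseTerms {
    // Split a quoted phrase into the lowercased per-word terms a
    // PhraseQuery needs -- one pq.add(...) per element of the result.
    static Vector phraseTerms(String phrase) {
        Vector words = new Vector();
        StringTokenizer st = new StringTokenizer(phrase);
        while (st.hasMoreTokens()) {
            words.addElement(st.nextToken().toLowerCase());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints: [bucky, badger]
        System.out.println(phraseTerms("bucky badger"));
    }
}
```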


If I'm storing the "rest" field using Field.text(String, Reader), could this 
be the problem, since it's not storing the index verbatim?  If not, where 
should I be looking?



Thanks for your previous help; we're getting VERY close to moving to production,
Dominic
madison.com






Re: Accentuated characters

2002-12-12 Thread stephane vaucher
Hi Eric,

Thanks for the link. I've looked at it and it has some interesting parts, 
like the stop words and the analyser, which I might partially include 
(partially, since I work with both English and French texts).

Cheers,
Stephane

Eric Isakson wrote:

I don't know if any of the code in this French analyzer contributed by Patrick Talbot may apply; any reason you don't just use it? See http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgNo=870

Eric
--
Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639 http://www.sas.com


-Original Message-
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 10, 2002 2:58 PM
To: [EMAIL PROTECTED]
Subject: Accentuated characters


Hello everyone,

I wish to implement a TokenFilter that will remove accentuated 
characters so for example 'é' will become 'e'. As I would rather not 
reinvent the wheel, I've tried to find something on the web and on the 
mailing lists. I saw a mention of a contrib that could do this (see 
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), 
but I don't see anything applicable.

Has anyone done this yet, if so I would much appreciate some pointers 
(or code), otherwise, I'll be happy to contribute whatever I produce 
(but it might be very simple since I'll only need to deal with french).

Cheers,
Stephane











Re: Accentuated characters

2002-12-12 Thread stephane vaucher
Actually, I'm just looking to remove accented chars from Java strings 
(so Unicode), and only for the search (the original doc I display stays 
the same). I'll just implement a TokenFilter to do this. It should be 
relatively simple. I just wanted to know if it had already been done 
(perhaps in a generic way).

Stephane

Joshua O'Madadhain wrote:

On Tue, 10 Dec 2002, stephane vaucher wrote:


I wish to implement a TokenFilter that will remove accentuated
characters so for example 'é' will become 'e'. As I would rather not
reinvent the wheel, I've tried to find something on the web and on the
mailing lists. I saw a mention of a contrib that could do this (see
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html),
but I don't see anything applicable.



It may depend on what kind of encoding you're working with.  (E.g., HTML
documents represent such characters in a different way than Postscript
documents do.)  Probably the easiest way to handle this, if you want to
avoid such questions, would be to convert all your input documents (and
query text) to Java (Unicode) strings, and then do a search-and-replace
with the appropriate character-pair arguments.  (After this is done, you
would then do whatever Lucene processing (indexing, query parsing, etc.)
was appropriate.)  I am not aware of any code that does this, but it
should be straightforward.
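[Editor's note: that search-and-replace step can be sketched as a small lookup table. This is a hedged illustration only; the class name and the table, which covers common French accents, are the editor's, not from the thread:]

```java
public class AccentStrip {
    // Parallel tables: each char in ACCENTED maps to the char at the
    // same index in PLAIN. The table covers common French accents only.
    private static final String ACCENTED = "àâäéèêëîïôöùûüç";
    private static final String PLAIN    = "aaaeeeeiioouuuc";

    static String strip(String s) {
        char[] c = s.toCharArray();
        for (int i = 0; i < c.length; i++) {
            int j = ACCENTED.indexOf(c[i]);
            if (j >= 0) c[i] = PLAIN.charAt(j);
        }
        return new String(c);
    }

    public static void main(String[] args) {
        // prints: ete
        System.out.println(strip("été"));
    }
}
```

The same strip method could sit inside a TokenFilter's next(), applied to each token's text before it reaches the index.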

Good luck--

Joshua O'Madadhain

[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
 Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.








Re: Accentuated characters

2002-12-12 Thread stephane vaucher
There is no problem with package scopes:

This is how I remove trailing 's' chars:

   String word = token.termText();

   if (word.endsWith("s")) {
       word = word.substring(0, word.length() - 1);
   }

   if (!word.equals(token.termText())) {
       return new Token(word, token.startOffset(),
                        token.endOffset(), token.type());
   }

I'll take a look at how the Collator works to see if I can make a 
generic (maybe locale specific) string normaliser so I could specify the 
level of differences.

Stephane

Eric Isakson wrote:

If you really want to make your own TokenFilter, have a look at org.apache.lucene.analysis.LowerCaseFilter.next()

it does:
 public final Token next() throws java.io.IOException {
   Token t = input.next();

   if (t == null)
     return null;

   t.termText = t.termText.toLowerCase();

   return t;
 }

The termText member of the Token class is package scoped, so you will have to implement your filter in the org.apache.lucene.analysis package. No worries about encoding, as the termText is already a Java (Unicode) string. You will just have to provide the mechanism to get the accented characters converted to their non-accented equivalents. java.text.Collator has some magic that does this for string comparisons, but I couldn't find any public methods that give you access to convert a string to its non-accented equivalent.

Eric
--
Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639 http://www.sas.com



-Original Message-
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 10, 2002 2:58 PM
To: [EMAIL PROTECTED]
Subject: Accentuated characters


Hello everyone,

I wish to implement a TokenFilter that will remove accentuated 
characters so for example 'é' will become 'e'. As I would rather not 
reinvent the wheel, I've tried to find something on the web and on the 
mailing lists. I saw a mention of a contrib that could do this (see 
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), 
but I don't see anything applicable.

Has anyone done this yet, if so I would much appreciate some pointers 
(or code), otherwise, I'll be happy to contribute whatever I produce 
(but it might be very simple since I'll only need to deal with french).

Cheers,
Stephane






Re: Accentuated characters

2002-12-12 Thread stephane vaucher
Thanks for the reference. I basically work with French, English, or 
bilingual texts. I'll take a quick look at the lib, but it might be 
overkill.

Cheers,
Stephane

Alex Murzaku wrote:

IBM's ICU4J has a normalizer which should do what you need. It's a big
library, but if you deal with multilingual text often, it might make your
life easier.









RE: Accentuated characters

2002-12-12 Thread Alex Murzaku
Something flexible and elegant would also be a simple FST (finite-state
transducer). Here is one built for Lucene:
 http://sourceforge.net/projects/normalizer/

-Original Message-
From: stephane vaucher [mailto:[EMAIL PROTECTED]] 
Sent: Thursday, December 12, 2002 12:23 PM
To: Lucene Users List
Subject: Re: Accentuated characters


Thanks for the reference. I basically work with french, english, or 
bilingual texts. I'll take a quick look at the lib, but it might be an 
overkill.

Cheers,
Stephane

Alex Murzaku wrote:

>IBM's ICU4J has a normalizer which should do what you need. It's a big 
>library, but if you deal with multilingual text often, it might make 
>your life easier.
>
>
>







RE: Accentuated characters

2002-12-12 Thread Eric Isakson
That method works too. Putting your token filter in the org.apache.lucene.analysis 
package and replacing: 

if ( !word.equals( token.termText() ) ) {
return new Token( word, token.startOffset(),
  token.endOffset(), token.type() );
}

with

token.termText = word;
return token;

will make your code operate more efficiently, as you won't be creating a bunch of new 
Token objects that will have to be garbage collected. It would be nice if termText 
were "protected" rather than package scoped.

Eric

-Original Message-
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 12, 2002 12:12 PM
To: Lucene Users List
Subject: Re: Accentuated characters


There is no problem with package scopes:

This is how I remove trailing 's' chars:

String word = token.termText();
   
if(word.endsWith("s")){
word = word.substring(0, word.length() - 1);
}
   
if ( !word.equals( token.termText() ) ) {
return new Token( word, token.startOffset(),
  token.endOffset(), token.type() );
}

I'll take a look at how the Collator works to see if I can make a 
generic (maybe locale specific) string normaliser so I could specify the 
level of differences.

Stephane

Eric Isakson wrote:

>If you really want to make your own TokenFilter, have a look at 
>org.apache.lucene.analysis.LowerCaseFilter.next()
>
>it does:
>  public final Token next() throws java.io.IOException {
>Token t = input.next();
>
>if (t == null)
>  return null;
>
>t.termText = t.termText.toLowerCase();
>
>return t;
>  }
>
>The termText member of the Token class is package scoped, so you will have to 
>implement your filter in the org.apache.lucene.analysis package. No worries about 
>encoding as the termText is already a java (unicode) string. You will just have to 
>provide the mechanism to get the accented characters converted to there non-accented 
>equivalents. java.text.Collator has some magic that does this for string comparisons 
>but I couldn't find any public methods that give you access to convert a string to 
>its non-accented equivalent.
>
>Eric
>--
>Eric D. IsaksonSAS Institute Inc.
>Application Developer  SAS Campus Drive
>XML Technologies   Cary, NC 27513
>(919) 531-3639 http://www.sas.com
>
>
>
>-Original Message-
>From: stephane vaucher [mailto:[EMAIL PROTECTED]]
>Sent: Tuesday, December 10, 2002 2:58 PM
>To: [EMAIL PROTECTED]
>Subject: Accentuated characters
>
>
>Hello everyone,
>
>I wish to implement a TokenFilter that will remove accentuated 
>characters so for example 'é' will become 'e'. As I would rather not 
>reinvent the wheel, I've tried to find something on the web and on the 
>mailing lists. I saw a mention of a contrib that could do this (see 
>http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), 
>but I don't see anything applicable.
>
>Has anyone done this yet, if so I would much appreciate some pointers 
>(or code), otherwise, I'll be happy to contribute whatever I produce 
>(but it might be very simple since I'll only need to deal with french).
>
>Cheers,
>Stephane
>
>







HTML saga continues...

2002-12-12 Thread Leo Galambos
So, I have tried this with Lucene:
1) original JavaCC LL(k) HTML parser
2) SWING's HTML parser

In case of (1) I could process about 300K of HTML documents. In case of 
(2) more than 400K.

But I cannot process the complete collection (5M) and finish my hard stress
tests of Lucene.

Is there anyone who has an HTML parser that really works with Lucene? :) If
you think that you have one, please let me know. I wanted to try Neko, but 
it looks complicated and I do not want to affect the results with a ``robust'' 
parser.

THX

-g-
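[Editor's note: for readers who want to try option (2), the JDK's Swing HTML parser can be driven headlessly with a callback. A minimal sketch of text extraction; the class and method names are the editor's, not the poster's code:]

```java
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingTextExtract {
    // Collect all character data the parser reports; the Swing parser is
    // tolerant of sloppy markup (unclosed tags, etc.).
    static String extract(String html) throws Exception {
        final StringBuffer out = new StringBuffer();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            public void handleText(char[] data, int pos) {
                out.append(data).append(' ');
            }
        };
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```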






Re: HTML saga continues...

2002-12-12 Thread Erik Hatcher
Look in the Lucene sandbox in CVS.  I contributed an Ant task that 
indexes HTML documents.  It uses JTidy under the covers to parse HTML 
into title and body content, and it could be extended to pull other 
information such as keywords.

	Erik


Leo Galambos wrote:
So, I have tried this with Lucene:
1) original JavaCC LL(k) HTML parser
2) SWING's HTML parser

In case of (1) I could process about 300K of HTML documents. In case of 
(2) more than 400K.

But I cannot process complete collection (5M) and finish my hard stress
tests of Lucene.

Is there anyone who has HTML parser that really works with Lucene? :) If
you think that you have one, please let me know. I wanted to try Neko, but 
it looks complicated and I do not want to affect the results by ``robust'' 
parser.

THX

-g-






Re: Accentuated characters

2002-12-12 Thread stephane vaucher
Fair enough, but a "protected" would only allow subclasses access to it. 
Personally, I would rather not have to use a subclass to implement my 
feature. I think the logic behind this is that it's an intrinsic property 
of a Term, and thus it should be immutable, as any modifications to this 
object might have important side effects.

Stephane


Eric Isakson wrote:

That method works too. Putting your token filter in the org.apache.lucene.analysis package and replacing: 

   if ( !word.equals( token.termText() ) ) {
   return new Token( word, token.startOffset(),
 token.endOffset(), token.type() );
   }

with

   token.termText = word;
		return token;

will make your code operate more efficiently as you won't be creating a bunch of new Token objects that will have to be garbage collected. Would be nice if the termText was "protected" rather than package scoped.

Eric

-Original Message-
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 12, 2002 12:12 PM
To: Lucene Users List
Subject: Re: Accentuated characters


There is no problem with package scopes:

This is how I remove trailing 's' chars:

   String word = token.termText();

   if (word.endsWith("s")) {
       word = word.substring(0, word.length() - 1);
   }

   if (!word.equals(token.termText())) {
       return new Token(word, token.startOffset(),
                        token.endOffset(), token.type());
   }

I'll take a look at how the Collator works to see if I can make a 
generic (maybe locale specific) string normaliser so I could specify the 
level of differences.

Stephane

Eric Isakson wrote:

If you really want to make your own TokenFilter, have a look at org.apache.lucene.analysis.LowerCaseFilter.next()

it does:
public final Token next() throws java.io.IOException {
  Token t = input.next();

  if (t == null)
    return null;

  t.termText = t.termText.toLowerCase();

  return t;
}

The termText member of the Token class is package scoped, so you will have to implement your filter in the org.apache.lucene.analysis package. No worries about encoding as the termText is already a java (unicode) string. You will just have to provide the mechanism to get the accented characters converted to there non-accented equivalents. java.text.Collator has some magic that does this for string comparisons but I couldn't find any public methods that give you access to convert a string to its non-accented equivalent.

Eric
--
Eric D. IsaksonSAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies   Cary, NC 27513
(919) 531-3639 http://www.sas.com



-Original Message-
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 10, 2002 2:58 PM
To: [EMAIL PROTECTED]
Subject: Accentuated characters


Hello everyone,

I wish to implement a TokenFilter that will remove accentuated 
characters so for example 'é' will become 'e'. As I would rather not 
reinvent the wheel, I've tried to find something on the web and on the 
mailing lists. I saw a mention of a contrib that could do this (see 
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), 
but I don't see anything applicable.

Has anyone done this yet, if so I would much appreciate some pointers 
(or code), otherwise, I'll be happy to contribute whatever I produce 
(but it might be very simple since I'll only need to deal with french).

Cheers,
Stephane

















Re: HTML saga continues...

2002-12-12 Thread Erik Hatcher
On a related note, I've also released a project that I developed for my 
book and for presentations that I have been giving on Ant, XDoclet, and 
JUnit.  This project is a documentation search engine with a web 
(Struts) interface.  It uses Lucene and the Ant task I mentioned already 
to index a directory full of HTML and text files.  The sample data 
provided is Ant's documentation.

It's available as version 0.3 (currently; always grab the latest 
that's there) at http://www.ehatchersolutions.com/downloads/

I have not documented it well yet, but that is my plan over the next 
couple of weeks.

To get it running you need:

- Ant 1.5.1 (1.5 is not sufficient)
- JUnit 3.8 or up (3.8.1 is the latest)
- j2ee.jar - I don't provide this in the download for size (and legal?) 
reasons.

Build it this way:

	ant -Dj2ee.jar=/path/to/my/j2ee.jar

Or if you run it without the -D switch it will tell you where to place 
j2ee.jar by default.  If you have J2EE_HOME set it will pick that up 
automatically and use it appropriately.

Deploy the WAR in a web container, or the EAR in JBoss.  Navigate to:

	http://localhost:8080/ant-sample/

and search for your favorite Ant tasks or Ant related information.

Let me know if you experience any issues with it, or have comments.

	Erik

Erik Hatcher wrote:
Look in the Lucene sandbox in CVS.  I contributed an Ant task that 
indexed HTML documents.  It uses JTidy under the covers to parse HTML 
into title and body content, and it could be extended to pull other 
information such  keywords.

Erik


Leo Galambos wrote:

So, I have tried this with Lucene:
1) original JavaCC LL(k) HTML parser
2) SWING's HTML parser

In case of (1) I could process about 300K of HTML documents. In case 
of (2) more than 400K.

But I cannot process complete collection (5M) and finish my hard stress
tests of Lucene.

Is there anyone who has HTML parser that really works with Lucene? :) If
you think that you have one, please let me know. I wanted to try Neko, 
but it looks complicated and I do not want to affect the results by 
``robust'' parser.

THX

-g-






Re: HTML saga continues...

2002-12-12 Thread Otis Gospodnetic
Yeah, Neko is not the most straightforward, but it works.
Sorry, the code is somewhere; can't look for it now.
But you could also look at LARM under the Lucene Sandbox; it's got a nice
HTML parser, too.

Otis

--- Leo Galambos <[EMAIL PROTECTED]> wrote:
> So, I have tried this with Lucene:
> 1) original JavaCC LL(k) HTML parser
> 2) SWING's HTML parser
> 
> In case of (1) I could process about 300K of HTML documents. In case
> of 
> (2) more than 400K.
> 
> But I cannot process complete collection (5M) and finish my hard
> stress
> tests of Lucene.
> 
> Is there anyone who has HTML parser that really works with Lucene? :)
> If
> you think that you have one, please let me know. I wanted to try
> Neko, but 
> it looks complicated and I do not want to affect the results by
> ``robust'' 
> parser.
> 
> THX
> 
> -g-
> 
> 


