Hi,

The StandardAnalyzer fulfills all requirements in our application, apart
from one.

The input text to be indexed is:

My name is Sudhanya.Chatterjee is my second name. Cost of book is 50.6

StandardTokenizer tokenizes correctly on punctuation in general.

But if a dot is not followed by a space, it treats the text on both sides of
the dot as a single token.

In the above case, "Sudhanya" and "Chatterjee" should be two separate tokens,
but 50.6 should remain one token.

So if a dot is preceded or followed by a digit, it should be kept intact and
not treated as a token boundary.
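
To make the rule concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: a whitespace split stands in for the tokenizer, and the class name DotRuleSketch, the method tokenize, and the pattern SPLIT_DOT are hypothetical names chosen only for illustration. It shows the behaviour I am after, not how StandardTokenizer works internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DotRuleSketch {
    // Hypothetical rule: a dot is a token boundary only when it is neither
    // preceded nor followed by a digit.
    private static final Pattern SPLIT_DOT =
            Pattern.compile("(?<![0-9])\\.(?![0-9])");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Assumption: splitting on whitespace stands in for the real tokenizer.
        for (String word : text.split("\\s+")) {
            for (String piece : SPLIT_DOT.split(word)) {
                if (!piece.isEmpty()) {
                    tokens.add(piece);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
                "My name is Sudhanya.Chatterjee is my second name. Cost of book is 50.6"));
        // prints [My, name, is, Sudhanya, Chatterjee, is, my, second, name, Cost, of, book, is, 50.6]
    }
}
```

With this rule, "Sudhanya.Chatterjee" becomes two tokens and the trailing dot after "name" is dropped, while "50.6" stays one token because the dot has a digit next to it.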

 

This is the one extra rule I need in addition to the existing rules of
StandardAnalyzer.

How can I add this rule to the existing StandardAnalyzer code, given that
all of its other rules are still required?

Thanks,

Sudhanya