Re: How to tokenize with comma in standard tokenizer

Mark Miller Mon, 17 Sep 2007 08:05:48 -0700

Take the comma out of: | <#P: ("_"|"-"|"/"|"."|",") > in the .jj file (around line 92). Keep in mind that this will affect being able to find tokens that were previously indexed with the comma there (obviously). I believe the javacc target in the build file will rebuild it. You need to get JavaCC and put a properties file called build.properties next to the build file, containing: javacc.home=c:/javacc (or wherever you put JavaCC).

Also, you could consider pre-processing the strings (replacing the comma with a space or something).
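
A minimal sketch of that pre-processing idea, assuming the text is available as a plain String before it reaches the analyzer. The class and method names (CommaPreprocessor, replaceCommas) are made up for illustration and are not part of Lucene:

```java
public class CommaPreprocessor {

    // Hypothetical helper: swap every comma for a space so that the
    // tokenizer sees whitespace-separated words instead of one long token.
    public static String replaceCommas(String text) {
        return text.replace(',', ' ');
    }

    public static void main(String[] args) {
        String input = "TheRing6,Proposal6,GuyandGirl6";
        // Pass the result (not the raw input) to the analyzer.
        System.out.println(replaceCommas(input)); // prints "TheRing6 Proposal6 GuyandGirl6"
    }
}
```

Note that this replaces every comma, so a value such as "1,000" would also be split. The same replacement would have to be applied to query strings so that they match what was indexed.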


- Mark

Bhavin Pandya wrote:

Hi,

The standard tokenizer works pretty well for me, but I found one problem with
my usage.

I want to tokenize "TheRing6,Proposal6,GuyandGirl6" as three separate
tokens, while the standard analyzer treats it as one word because the token
contains a digit.

Expected three tokens:
1. thering6
2. proposal6
3. guyandgirl6

I want to change this behaviour of the standard tokenizer for this purpose, but
I don't know where to change it.
Do I need to comment out some rule in the StandardTokenizer.jj file? I am
confused by this file.

Any pointers?

- Bhavin


