This will recognize anything in double quotes as a single token, and '_' and '-' will no longer break up words. There may be some repercussions for the NUM token, since '_' and '-' are also listed in P, but nothing I'd worry about. Maybe you want to use Unicode escapes for '-' and '_' as well, to match the other LETTER entries... I wouldn't worry about it myself.
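
As a rough illustration of those two effects (not the tokenizer JavaCC generates), here is a small Java sketch that uses java.util.regex as a stand-in for the QUOTED and ALPHANUM rules. The class name QuotedTokenSketch and its tokenize method are made up for this example; it only handles ASCII and ignores every other token type in the grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene or JavaCC.
class QuotedTokenSketch {
    // Regex stand-ins for two of the grammar's tokens (ASCII only, an assumption):
    //   QUOTED   : a double quote, one or more non-quote characters, a double quote
    //   ALPHANUM : letters and digits, plus '-' and '_' as added to LETTER
    private static final Pattern TOKEN =
        Pattern.compile("\"[^\"]+\"|[A-Za-z0-9_-]+");

    // Returns the matched token texts in order; anything else is skipped.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints: [find, "hello world", in, my_file-v2, now]
        System.out.println(tokenize("find \"hello world\" in my_file-v2 now"));
    }
}
```

The quoted phrase comes back as one token with its quotes still attached, so a post-filter would have to strip them if you don't want them indexed, and my_file-v2 stays in one piece.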

- Mark


TOKEN : {                      // token patterns

 // basic word: a sequence of digits & letters
 <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >

| <QUOTED:     "\"" (~["\""])+ "\"">

 // internal apostrophes: O'Reilly, you're, O'Reilly's
 // use a post-filter to remove possessives
| <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >

 // acronyms: U.S.A., I.B.M., etc.
 // use a post-filter to remove dots
| <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >

 // company names like AT&T and [EMAIL PROTECTED]
| <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >

 // email addresses
| <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM> (("."|"-") <ALPHANUM>)+ >

 // hostname
| <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >

 // floating point, serial, model numbers, ip addresses, etc.
 // every other segment must have at least one digit
| <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
      | <HAS_DIGIT> <P> <ALPHANUM>
      | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
      | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
      | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
      | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
       )
 >
| <#P: ("_"|"-"|"/"|"."|",") >
| <#HAS_DIGIT:                      // at least one digit
   (<LETTER>|<DIGIT>)*
   <DIGIT>
   (<LETTER>|<DIGIT>)*
 >

| < #ALPHA: (<LETTER>)+>
| < #LETTER:                      // unicode letters
     [
      "\u0041"-"\u005a",
      "\u0061"-"\u007a",
      "\u00c0"-"\u00d6",
      "\u00d8"-"\u00f6",
      "\u00f8"-"\u00ff",
      "\u0100"-"\u1fff",
      "-", "_"
     ]
 >
