If you need this for external use (meaning, not within CLucene), I would recommend using ICU's or Boost's implementations.
Token classifications within CLucene are used to differentiate between words, numbers, EOF, e-mails and more. That is, the Tokenizer does not just split a string on whitespace; it also tries to keep acronyms and e-mail addresses, for instance, intact. It also supports CJK tokens. Java Lucene has a better implementation (which I think supports more cases) that we haven't ported yet.

Using another tokenizer is possible, but you'll have to derive from Tokenizer and have your own Analyzer call it. Not a difficult thing to do, though.

> I only care about tokenization of a sequence of characters into words.

If so, I recommend using other libraries that are meant specifically for that.

HTH.

Itamar.

-----Original Message-----
From: Paul J. Lucas [mailto:[email protected]]
Sent: Wednesday, February 10, 2010 2:32 AM
To: [email protected]
Subject: Re: [CLucene-dev] CLucene tokenizer vs ICU tokenizer

On Feb 9, 2010, at 2:36 PM, Itamar Syn-Hershko wrote:

> I'm not sure what you mean.

I mean the ability to know, for a given piece of text, where the token boundaries are (e.g., words).

> CLucene StandardTokenizer is meant for internal use only, and provides
> the calling Analyzer with a stream of identified tokens (it classifies
> the tokens, not just tokenizes them).

Classifies them how?  Also, one can plug in one's own tokenizer, yes?

> The ICU tokenizer is a general purpose tokenizer (like Boost's
> implementation is), with loads of extra functionality the CLucene one
> doesn't have or need.

I only care about tokenization of a sequence of characters into words.

- Paul
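To illustrate what "classifying" tokens means in the reply above, as opposed to only splitting on whitespace, here is a minimal sketch in Python. It is not CLucene's StandardTokenizer and does not use the CLucene, ICU or Boost APIs: the Token type, the tokenize() function, the kind names (EMAIL, ACRONYM, NUM, WORD) and the regular expressions are all assumptions made up for this example. It only handles ASCII text and ignores CJK entirely.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical token record: text, class, and character offsets."""
    text: str
    kind: str
    start: int
    end: int


# Illustrative patterns only (not CLucene's grammar). Order matters:
# the more specific alternatives are tried before the generic WORD.
_TOKEN_RE = re.compile(
    r"""
    (?P<EMAIL>[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)
  | (?P<ACRONYM>(?:[A-Za-z]\.){2,})
  | (?P<NUM>\d+(?:[.,]\d+)*)
  | (?P<WORD>[A-Za-z0-9]+(?:'[A-Za-z]+)?)
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> Iterator[Token]:
    """Yield classified tokens; characters matching no pattern are skipped."""
    for m in _TOKEN_RE.finditer(text):
        # m.lastgroup is the name of the alternative that matched.
        yield Token(m.group(), m.lastgroup, m.start(), m.end())


if __name__ == "__main__":
    sample = "Mail bob@example.com about the U.S.A. budget of 3.5 million"
    for tok in tokenize(sample):
        print(tok.kind, repr(tok.text), tok.start, tok.end)
```

The start and end offsets are the token boundaries Paul asks about. If boundaries are all that is needed, a general-purpose library such as ICU or Boost remains the better choice, as recommended above.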
_______________________________________________
CLucene-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/clucene-developers
