If you need this for external use (meaning, not within CLucene), I would
recommend using ICU's or Boost's implementations.

Token classifications within CLucene differentiate between words, numbers,
EOF, e-mails and more. That is, the Tokenizer does not just split a string
on whitespace; it also tries to keep acronyms and e-mails, for instance,
intact. It also supports CJK tokens. Java Lucene has a more complete
implementation (which I think supports more cases) that we haven't ported
yet.
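
To give a rough idea of what "classifying" a token means, here is a minimal
standalone sketch. It is not CLucene's StandardTokenizer: the names
TokenType, ClassifiedToken, classify and tokenize are made up for
illustration, the rules are deliberately naive, and the input is split on
whitespace only to keep the example short.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical token classes, loosely based on the categories named above.
enum class TokenType { Word, Number, Email, Acronym };

struct ClassifiedToken {
    std::string text;
    TokenType type;
};

// Classify one whitespace-delimited chunk using very simple rules.
TokenType classify(const std::string& s) {
    // Number: every character is a decimal digit.
    bool allDigits = !s.empty();
    for (unsigned char c : s) {
        if (!std::isdigit(c)) { allDigits = false; break; }
    }
    if (allDigits) return TokenType::Number;

    // E-mail: exactly one '@' with at least one character on each side.
    const std::size_t at = s.find('@');
    if (at != std::string::npos && at > 0 && at + 1 < s.size() &&
        s.find('@', at + 1) == std::string::npos) {
        return TokenType::Email;
    }

    // Acronym: two or more "letter + dot" pairs, e.g. "U.S.A."
    if (s.size() >= 4 && s.size() % 2 == 0) {
        bool ok = true;
        for (std::size_t i = 0; i < s.size(); i += 2) {
            if (!std::isalpha(static_cast<unsigned char>(s[i])) || s[i + 1] != '.') {
                ok = false;
                break;
            }
        }
        if (ok) return TokenType::Acronym;
    }

    return TokenType::Word;
}

// Split on whitespace and attach a class to every chunk.
std::vector<ClassifiedToken> tokenize(const std::string& text) {
    std::vector<ClassifiedToken> out;
    std::istringstream in(text);
    std::string chunk;
    while (in >> chunk) {
        out.push_back({chunk, classify(chunk)});
    }
    return out;
}
```

Calling tokenize("Mail bob@example.com about U.S.A. in 2010") yields six
tokens; the e-mail address and the acronym each come back as a single token
with their own class instead of being broken up at the punctuation.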

Using another tokenizer is possible, but you'll have to derive from
Tokenizer and have your own Analyzer to call it. It's not a difficult thing
to do, though.
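
The shape of that is roughly as sketched below. Note that the Token,
Tokenizer and Analyzer types here are simplified stand-ins written for this
example; CLucene's real classes have different signatures (and differ
between versions), so treat this as an outline of the pattern rather than
code to compile against the CLucene headers. WordTokenizer and WordAnalyzer
are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Stand-in types for illustration only; NOT the actual CLucene API.
struct Token {
    std::string text;
};

class Tokenizer {
public:
    virtual ~Tokenizer() = default;
    // Fills `out` with the next token; returns false at end of input.
    virtual bool next(Token& out) = 0;
};

class Analyzer {
public:
    virtual ~Analyzer() = default;
    // Creates the token stream the indexer/searcher would consume.
    virtual std::unique_ptr<Tokenizer> tokenStream(std::string text) = 0;
};

// Custom tokenizer: emits maximal runs of alphanumeric characters.
class WordTokenizer : public Tokenizer {
public:
    explicit WordTokenizer(std::string text) : text_(std::move(text)) {}

    bool next(Token& out) override {
        auto isWordChar = [](char c) {
            return std::isalnum(static_cast<unsigned char>(c)) != 0;
        };
        // Skip separators, then consume one run of word characters.
        while (pos_ < text_.size() && !isWordChar(text_[pos_])) ++pos_;
        if (pos_ == text_.size()) return false;
        const std::size_t start = pos_;
        while (pos_ < text_.size() && isWordChar(text_[pos_])) ++pos_;
        out.text = text_.substr(start, pos_ - start);
        return true;
    }

private:
    std::string text_;
    std::size_t pos_ = 0;
};

// Custom analyzer whose only job is to hand out the custom tokenizer.
class WordAnalyzer : public Analyzer {
public:
    std::unique_ptr<Tokenizer> tokenStream(std::string text) override {
        return std::make_unique<WordTokenizer>(std::move(text));
    }
};
```

With these stand-ins, WordAnalyzer().tokenStream("Hello, world 42!") hands
back a tokenizer whose next() produces "Hello", "world" and "42" and then
returns false.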

> I only care about tokenization of a sequence of characters into words.

If so, I recommend using other libraries which are meant specifically for
that.

HTH.

Itamar.

-----Original Message-----
From: Paul J. Lucas [mailto:[email protected]] 
Sent: Wednesday, February 10, 2010 2:32 AM
To: [email protected]
Subject: Re: [CLucene-dev] CLucene tokenizer vs ICU tokenizer

On Feb 9, 2010, at 2:36 PM, Itamar Syn-Hershko wrote:

> I'm not sure what you mean.

I mean the ability to know, for a given piece of text, where the token
boundaries are (e.g., words).

> CLucene StandardTokenizer is meant for internal use only, and provides 
> the calling Analyzer with a stream of identified tokens (it classifies 
> the tokens, not just tokenizes them).

Classifies them how?  Also, one can plug in one's own tokenizer, yes?

> The ICU tokenizer is a general purpose tokenizer (like Boost's 
> implementation is), with loads of extra functionality the CLucene one 
> doesn't have or need.

I only care about tokenization of a sequence of characters into words.

- Paul
------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
CLucene-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/clucene-developers



