Hi William,

I saw your change to the alphanumeric optimization in the
tokenizer.

I am aware that it is not perfect at the moment, especially
for non-English languages. In my opinion we should use Unicode
to determine what is a letter and what is a digit.
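Roughly what I have in mind, as a minimal Java sketch (just assuming
we lean on the JDK's built-in Unicode tables, nothing more):

    // Sketch: use the JDK's Unicode classification instead of
    // hard-coding the ASCII ranges [a-zA-Z0-9].
    static boolean isAlphanumeric(String token) {
        return token.codePoints().allMatch(cp ->
                Character.isLetter(cp) || Character.isDigit(cp));
    }

That way accented letters, non-Latin scripts and non-ASCII digits are
classified correctly without us maintaining character ranges ourselves.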

Since it is a performance optimization, I think we should revert
your change and instead look into the Unicode approach.

What do you think?

We might want more options anyway, e.g. a tokenization dictionary for
frequent cases. In such a dictionary the tokenizer could look up how
a certain input character sequence should be tokenized.
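Something along these lines, again only a Java sketch (the class and
the sample entry are made up, not actual code we have):

    import java.util.Map;

    class DictionaryTokenizer {
        // Map frequent surface forms to a fixed, pre-defined split.
        private static final Map<String, String[]> DICT =
                Map.of("don't", new String[] {"do", "n't"});

        static String[] tokenize(String input) {
            String[] tokens = DICT.get(input);
            // No entry: fall back to a trivial whitespace split here;
            // in reality the regular tokenizer would run instead.
            return tokens != null ? tokens : input.split("\\s+");
        }
    }

The dictionary lookup would run before the statistical tokenizer, so
frequent, tricky cases get a deterministic answer.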

Jörn
