Hi,
Let me briefly explain the background and intention of the change.
Basically, character normalization is not the responsibility of a
tokenizer and should not be performed when you "tokenize" text.
Instead, there are charFilters and tokenFilters that perform
full-width and half-width normalization.
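For example, the width normalization can be moved in front of the
tokenizer as a charFilter. A minimal sketch (untested; assumes Lucene
8.7+, where CJKWidthCharFilter is available, and the analyzer class
name is just for illustration):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.cjk.CJKWidthCharFilter;
    import org.apache.lucene.analysis.ja.JapaneseTokenizer;
    import org.apache.lucene.analysis.ja.dict.UserDictionary;

    public class WidthNormalizingJapaneseAnalyzer extends Analyzer {
      private final UserDictionary userDict;

      public WidthNormalizingJapaneseAnalyzer(UserDictionary userDict) {
        this.userDict = userDict;
      }

      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        // Fold half-width/full-width variants before the tokenizer runs,
        // so user-dictionary lookup only ever sees one width.
        return new CJKWidthCharFilter(reader);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer =
            new JapaneseTokenizer(userDict, true, JapaneseTokenizer.Mode.SEARCH);
        return new TokenStreamComponents(tokenizer);
      }
    }

Because the charFilter rewrites the text before JapaneseTokenizer
runs, user-dictionary lookups only ever see one width.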
Hi,
> The only thing that seems to differ is that the characters are full-width
> vs half-width, so I was wondering if this is intended behavior or a bug/too
> restrictive
This is intended behavior. The first column in the user dictionary
must be equal to the concatenated string of the second column (the
segmentation).
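For illustration, using the usual Kuromoji format
(surface,segmentation,readings,part-of-speech; the entries here are
hypothetical):

    # OK: concatenating the segmentation reproduces the surface form
    レコーダー,レコーダー,レコーダー,カスタム名詞
    日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞
    # Rejected: full-width surface but half-width segmentation, so the
    # concatenated segmentation no longer equals the first column
    レコーダー,ﾚｺｰﾀﾞｰ,レコーダー,カスタム名詞

For multi-segment entries, the spaces in the second column are removed
before the comparison.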
Hi Mike,
Thanks for the response! I'm actually not super familiar with
UserDictionaries, but looking at the code, it seems like a single line in
the user-provided dictionary corresponds to a single entry? In that
case, here is the line (or entry) that does have both widths and that I
believe is causing the error:
Hi Marc, I wonder if there is a workaround for this issue: e.g., could
we have entries for both widths? I wonder if there is some interaction
with an analysis chain that is doing half-width -> full-width
conversion (or vice versa)? I think the UserDictionary has to operate
on pre-analyzed tokens ...
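To make the first idea concrete, a hypothetical dictionary with one
entry per width, each internally consistent:

    レコーダー,レコーダー,レコーダー,カスタム名詞
    ﾚｺｰﾀﾞｰ,ﾚｺｰﾀﾞｰ,レコーダー,カスタム名詞

That said, if the chain already has a width-normalizing charFilter
(e.g. CJKWidthCharFilter) in front of the tokenizer, only the
normalized width can ever reach the user dictionary, so a single entry
in that width should be enough.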
Hi,
I had a question about the Japanese user dictionary. We have a user
dictionary that used to work, but after upgrading Lucene it fails with
the following error:
Caused by: java.lang.RuntimeException: Illegal user dictionary entry レコーダー
- the concatenated segmentation (レコーダー) does not match the surface form ...
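For reference, the exception is thrown when the dictionary is loaded,
before any text is analyzed. A minimal reproduction sketch (assumes a
recent Lucene with this validation; the mixed-width entry is
hypothetical):

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.ja.dict.UserDictionary;

    public class UserDictWidthRepro {
      public static void main(String[] args) throws IOException {
        // Full-width surface with half-width segmentation: recent Lucene
        // validates the entry at load time and throws the RuntimeException
        // quoted above.
        String entry = "レコーダー,ﾚｺｰﾀﾞｰ,レコーダー,カスタム名詞\n";
        UserDictionary.open(new StringReader(entry));
      }
    }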