Bill,
Do you have a WordDelimiterFilterFactory in the analysis chain (with the
"preserveOriginal" attribute likely set to 0)?
That would split the token on the period downstream in the analysis chain
even if StandardTokenizer doesn't.
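
To make the two stages concrete, here is a rough Python sketch, not Lucene
code. The first part models only how the Unicode word-break rules (UAX #29)
treat a period: it stays inside a token between two letters or between two
digits, and breaks otherwise. The second part models a delimiter-based split
with a preserve-original switch. The function names are made up for
illustration, input is assumed to be ASCII letters, digits and periods, and the
filter's other behaviours (case-change and letter/digit splitting) are ignored.

```python
import re


def keeps_period(before: str, after: str) -> bool:
    """Simplified UAX #29 view of a '.' between two characters.

    The period is kept only between two letters (rules WB6/WB7) or between
    two digits (rules WB11/WB12). Digit + '.' + letter is a break.
    """
    both_letters = before.isalpha() and after.isalpha()
    both_digits = before.isdigit() and after.isdigit()
    return both_letters or both_digits


def tokenize_like_standard(text: str) -> list[str]:
    """Hypothetical helper: split on periods using only the rule above."""
    tokens, current = [], ""
    for i, ch in enumerate(text):
        if ch != ".":
            current += ch
            continue
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        if before and after and keeps_period(before, after):
            current += ch  # period stays inside the token
        else:
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


def word_delimiter_like(token: str, preserve_original: bool) -> list[str]:
    """Hypothetical helper: split on any non-alphanumeric character.

    With preserve_original=True the unsplit token is emitted as well.
    """
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if preserve_original and parts != [token]:
        return [token] + parts
    return parts


for name in ("ab00c.tif", "ab003.tif", "test.com", "test7.com"):
    print(name, tokenize_like_standard(name), word_delimiter_like(name, False))
```

In this simplified model the tokenizer stage alone keeps ab00c.tif whole and
splits ab003.tif, while a delimiter-based filter without preserve-original
would split both. The Analysis screen in the Solr admin UI should show which
stage actually produces the split in your chain.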

-Rahul

On Thu, May 4, 2023 at 6:22 AM Mikhail Khludnev <[email protected]> wrote:

> Raised https://github.com/apache/lucene/issues/12264.
> Let's see what the devs say.
>
> On Wed, May 3, 2023 at 6:13 PM Bill Tantzen <[email protected]>
> wrote:
>
> > Shawn,
> > No, email addresses are not preserved -- from the docs:
> >
> >    - The "@" character is among the set of token-splitting punctuation, so
> >      email addresses are not preserved as single tokens.
> >
> > but the non-split on "test.com" vs the split on "test7.com" is
> > unexpected!
> > ~~Bill
> >
> >
> > On Wed, May 3, 2023 at 10:04 AM Shawn Heisey <[email protected]>
> > wrote:
> >
> > > On 5/2/23 15:30, Bill Tantzen wrote:
> > > > This works as I expected:
> > > > ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> > > >
> > > > This doesn't work as I expected
> > > > ab003.tif -- tokenizes with a result of ab003 and tif
> > >
> > > I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode
> > > handling.  I am pretty sure ICU4J is IBM's implementation of Unicode.  I
> > > think StandardTokenizer is using a different implementation.
> > >
> > > I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses
> > > reference icu4j version 70.1, which is dated Oct 28, 2021 on Maven
> > > Central.
> > >
> > > Two different Unicode implementations are doing exactly the same thing.
> > > Is it a bug, or expected behavior?  It does mean filenames are sometimes
> > > not being handled in the way you expect.
> > >
> > > I ran another check ... I had thought that StandardTokenizer preserved
> > > email addresses as a single token ... but I am seeing that
> > > [email protected] is split into two terms.  It splits [email protected]
> > > into three terms.
> > >
> > > Thanks,
> > > Shawn
> > >
> >
> >
> > --
> > Human wheels spin round and round
> > While the clock keeps the pace... -- John Mellencamp
> > ________________________________________________________________
> > Bill Tantzen    University of Minnesota Libraries
> > 612-626-9949 (U of M)    612-325-1777 (cell)
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!
>
