Re: standard tokenizer seemingly splitting on dot

Gus Heck Tue, 02 May 2023 12:56:40 -0700

I agree that the docs clearly describe the behavior you expected:
Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and
punctuation as delimiters. Delimiter characters are discarded, with the
following exceptions:

   - Periods (dots) that are not followed by whitespace are kept as part of
     the token, including Internet domain names.

   - The "@" character is among the set of token-splitting punctuation, so
     email addresses are not preserved as single tokens.
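
As a rough illustration of the documented rule read literally, here is a
minimal Python sketch. It is not Lucene's implementation: the function name
literal_doc_tokens and the regular expression are assumptions made up for
this example, and the real StandardTokenizer follows the Unicode text
segmentation (UAX #29) word-break rules, so its output can differ from this
literal reading.

```python
import re

# Hypothetical helper: models ONLY the sentence quoted from the docs.
# A dot is kept when it is directly followed by another word character;
# every other punctuation mark (including "@") and whitespace splits tokens.
_TOKEN = re.compile(r"\w+(?:\.\w+)*")


def literal_doc_tokens(text: str) -> list[str]:
    """Tokenize text according to a literal reading of the quoted docs."""
    return _TOKEN.findall(text)


if __name__ == "__main__":
    print(literal_doc_tokens("XYZ.tif"))
    print(literal_doc_tokens("user@example.com"))
    print(literal_doc_tokens("See file. Next one"))
```

Under this reading, XYZ.tif stays a single token, which is the result the
original poster expected.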


However, when I ran this through the analysis screen against the _default
configset in a local Solr (it happens to be running 10-snapshot because of
something I am working on, but that shouldn't matter), I get the documented
behavior:
   ST   | XYZ.tif
   SF   | XYZ.tif
   LCF  | xyz.tif
Can you check that your text_general field type or your *_txt dynamic field
hasn't been redefined?
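
If it is easier to check from a script than from the admin UI, something
like the following Python sketch builds the relevant request URLs. The base
URL, the core name "mycore", and the function names are placeholders I made
up for illustration; the /analysis/field handler and the Schema API paths
are the stock Solr endpoints, assuming they have not been disabled in your
configuration.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url: str, collection: str,
                       field_type: str, value: str) -> str:
    """URL that asks Solr to run `value` through the field type's analyzer."""
    params = urlencode({
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": value,
        "wt": "json",
    })
    return f"{base_url}/{collection}/analysis/field?{params}"


def schema_check_urls(base_url: str, collection: str) -> list[str]:
    """Schema API URLs showing how text_general and *_txt are defined."""
    return [
        f"{base_url}/{collection}/schema/fieldtypes/text_general",
        f"{base_url}/{collection}/schema/dynamicfields/*_txt",
    ]


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # placeholder host and port
    print(field_analysis_url(base, "mycore", "text_general", "XYZ.tif"))
    for url in schema_check_urls(base, "mycore"):
        print(url)
```

Fetching the first URL should list the tokens produced at each analysis
stage; the other two show the definitions your collection is actually
using, which can be compared against the _default configset.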


On Tue, May 2, 2023 at 3:17 PM Bill Tantzen <[email protected]>
wrote:

> Thanks Dave!
> Using a string field instead would work fine for my purposes I think...
> I'm just trying to understand why it doesn't work with a field of type
> text_general which uses the standard tokenizer in both the index and the
> query analyzer.  The docs state:
>
> This tokenizer splits the text field into tokens, treating whitespace and
> punctuation as delimiters.
> Delimiter characters are discarded, with the following exceptions:
> Periods (dots) that are not followed by whitespace are kept as part of the
> token, including Internet domain names.
>
> That's what is confusing me...  Meanwhile, I'm going to take your
> suggestion and convert the field to a string!
> ~~Bill
>
> On Tue, May 2, 2023 at 1:40 PM Dave <[email protected]> wrote:
>
> > You’re not doing anything wrong, a dot is not a character so it splits
> > the field in the index and the query. If you used a string instead it
> > theoretically would maintain the non characters but also lead to more
> > strict search constraints. If you tried this you need to re index a
> > couple documents to make sure you are getting what you want.
> >
> > -Dave
> >
> > > On May 2, 2023, at 2:22 PM, Bill Tantzen <[email protected]> wrote:
> > >
> > > I'm using the solrconfig.xml from the distribution,
> > > ./server/solr/configsets/_default/conf/solrconfig.xml
> > >
> > > But this problem extends to the index as well; using the initial
> > > example, if I search for
> > > <str name="parsedquery">metadata_txt:ab00001</str> (instead of
> > > ab00001.tif), my result set includes ab00001.tif, ab00001.jpg,
> > > ab00001.png, etc so the tokens in the index are split on dot as well,
> > > not just the query.
> > >
> > > I'm doing something wrong, or I'm misunderstanding something!!
> > > ~~Bill
> > >
> > >> On Tue, May 2, 2023 at 1:02 PM Mikhail Khludnev <[email protected]> wrote:
> > >>
> > >> Analyzer is configured in schema.xml. But literally, splitting on dot
> > >> is what I expect from StandardTokenizer.
> > >>
> > >> On Tue, May 2, 2023 at 8:48 PM Bill Tantzen <[email protected]>
> > >> wrote:
> > >>
> > >>> Mikhail,
> > >>> Thanks for the quick reply.  Here is the parser info:
> > >>>
> > >>> <str name="QParser">LuceneQParser</str>
> > >>>
> > >>> ~~Bill
> > >>>
> > >>> On Tue, May 2, 2023 at 12:43 PM Mikhail Khludnev <[email protected]> wrote:
> > >>>
> > >>>> Hello Bill,
> > >>>> Which analyzer is configured for metadata_txt?  Perhaps you need to
> > >>>> tune it accordingly.
> > >>>>
> > >>>> On Tue, May 2, 2023 at 7:40 PM Bill Tantzen <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>> In my solr 9.2 schema, I am leveraging the dynamicField
> > >>>>>
> > >>>>> <dynamicField name="*_txt" type="text_general" indexed="true"
> > >>>>> stored="true"/>
> > >>>>>
> > >>>>> which tokenizes with solr.StandardTokenizerFactory for index and
> > >>>>> query.
> > >>>>>
> > >>>>> However, when I query with, for example,
> > >>>>> <str name="q">metadata_txt:XYZ.tif</str>
> > >>>>>
> > >>>>> I see many more hits than I expect.  When I add debug=true to the
> > >>>>> query, I see:
> > >>>>> <str name="rawquerystring">metadata_txt:XYZ.tif</str>
> > >>>>> <str name="querystring">metadata_txt:XYZ.tif</str>
> > >>>>> <str name="parsedquery">metadata_txt:XYZ metadata_txt:tif</str>
> > >>>>>
> > >>>>> But I expect that dots not followed by whitespace will be kept as
> > >>>>> part of the token, that is, the parsed query should remain
> > >>>>> "metadata_txt:XYZ.tif" but solr appears to be splitting into two
> > >>>>> tokens.
> > >>>>>
> > >>>>> Can somebody point out what I am misunderstanding?
> > >>>>> Thanks,
> > >>>>> ~~Bill
> > >>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Sincerely yours
> > >>>> Mikhail Khludnev
> > >>>> https://t.me/MUST_SEARCH
> > >>>> A caveat: Cyrillic!
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Human wheels spin round and round
> > >>> While the clock keeps the pace... -- John Mellencamp
> > >>> ________________________________________________________________
> > >>> Bill Tantzen    University of Minnesota Libraries
> > >>> 612-626-9949 (U of M)    612-325-1777 (cell)
> > >>>
> > >>
> > >>
> > >>
> > >
> > >
> >
>
>
>


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
