I concur that the docs clearly state your expected behavior should be true:

Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and
punctuation as delimiters. Delimiter characters are discarded, with the
following exceptions:

- Periods (dots) that are not followed by whitespace are kept as part of the
  token, including Internet domain names.
- The "@" character is among the set of token-splitting punctuation, so email
  addresses are not preserved as single tokens.

However when I used the analysis screen vs the _default configset in a local
solr (happens to be running 10-snapshot because of something I am working on,
but shouldn't matter) I get the documented behavior.

ST   XYZ.tif
SF   XYZ.tif
LCF  xyz.tif

Can you check that your text_general type or your *_txt dynamic type hasn't
been re-defined?

On Tue, May 2, 2023 at 3:17 PM Bill Tantzen <[email protected]> wrote:

> Thanks Dave!
> Using a string field instead would work fine for my purposes I think...
> I'm just trying to understand why it doesn't work with a field of type
> text_general which uses the standard tokenizer in both the index and the
> query analyzer. The docs state:
>
> This tokenizer splits the text field into tokens, treating whitespace and
> punctuation as delimiters.
> Delimiter characters are discarded, with the following exceptions:
> Periods (dots) that are not followed by whitespace are kept as part of the
> token, including Internet domain names.
>
> That's what is confusing me... Meanwhile, I'm going to take your
> suggestion and convert the field to a string!
> ~~Bill
>
> On Tue, May 2, 2023 at 1:40 PM Dave <[email protected]> wrote:
>
> > You’re not doing anything wrong, a dot is not a character so it splits the
> > field in the index and the query. If you used a string instead it
> > theoretically would maintain the non characters but also lead to more
> > strict search constraints. If you tried this you need to re index a couple
> > documents to make sure you are getting what you want.
> >
> > -Dave
> >
> > > On May 2, 2023, at 2:22 PM, Bill Tantzen <[email protected]> wrote:
> > >
> > > I'm using the solrconfig.xml from the distribution,
> > > ./server/solr/configsets/_default/conf/solrconfig.xml
> > >
> > > But this problem extends to the index as well; using the initial example,
> > > if I search for <str name="parsedquery">metadata_txt:ab00001</str> (instead
> > > of ab00001.tif), my result set includes ab00001.tif, ab00001.jpg,
> > > ab00001.png, etc so the tokens in the index are split on dot as well, not
> > > just the query.
> > >
> > > I'm doing something wrong, or I'm misunderstanding something!!
> > > ~~Bill
> > >
> > >> On Tue, May 2, 2023 at 1:02 PM Mikhail Khludnev <[email protected]> wrote:
> > >>
> > >> Analyzer is configured in schema.xml. But literally, splitting on dot is
> > >> what I expect from StandardTokenizer.
> > >>
> > >> On Tue, May 2, 2023 at 8:48 PM Bill Tantzen <[email protected]> wrote:
> > >>
> > >>> Mikhail,
> > >>> Thanks for the quick reply. Here is the parser info:
> > >>>
> > >>> <str name="QParser">LuceneQParser</str>
> > >>>
> > >>> ~~Bill
> > >>>
> > >>> On Tue, May 2, 2023 at 12:43 PM Mikhail Khludnev <[email protected]> wrote:
> > >>>
> > >>>> Hello Bill,
> > >>>> Which analyzer is configured for metadata_txt? Perhaps you need to tune it
> > >>>> accordingly.
> > >>>>
> > >>>> On Tue, May 2, 2023 at 7:40 PM Bill Tantzen <[email protected]> wrote:
> > >>>>
> > >>>>> In my solr 9.2 schema, I am leveraging the dynamicField
> > >>>>>
> > >>>>> <dynamicField name="*_txt" type="text_general" indexed="true"
> > >>>>> stored="true"/>
> > >>>>>
> > >>>>> which tokenizes with solr.StandardTokenizerFactory for index and query.
> > >>>>>
> > >>>>> However, when I query with, for example,
> > >>>>> <str name="q">metadata_txt:XYZ.tif</str>
> > >>>>>
> > >>>>> I see many more hits than I expect.
> > >>>>> When I add debug=true to the query, I see:
> > >>>>> <str name="rawquerystring">metadata_txt:XYZ.tif</str>
> > >>>>> <str name="querystring">metadata_txt:XYZ.tif</str>
> > >>>>> <str name="parsedquery">metadata_txt:XYZ metadata_txt:tif</str>
> > >>>>>
> > >>>>> But I expect that dots not followed by whitespace will be kept as part of
> > >>>>> the token, that is, the parsed query should remain "metadata_txt:XYZ.tif"
> > >>>>> but solr appears to be splitting into two tokens.
> > >>>>>
> > >>>>> Can somebody point out what I am misunderstanding?
> > >>>>> Thanks,
> > >>>>> ~~Bill
> > >>>>
> > >>>> --
> > >>>> Sincerely yours
> > >>>> Mikhail Khludnev
> > >>>> https://t.me/MUST_SEARCH
> > >>>> A caveat: Cyrillic!
> > >>>
> > >>> --
> > >>> Human wheels spin round and round
> > >>> While the clock keeps the pace... -- John Mellencamp
> > >>> ________________________________________________________________
> > >>> Bill Tantzen University of Minnesota Libraries
> > >>> 612-626-9949 (U of M) 612-325-1777 (cell)

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
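
P.S. In case it helps with checking the field type outside the admin UI, here is a rough Python sketch that builds the two requests I would look at: the Schema API entry for text_general and the field analysis handler for a sample value. The host, port, and collection name ("mycollection") are placeholders I made up, not anything from this thread, and I am assuming the implicit /analysis/field handler has not been removed from your solrconfig.xml.

```python
import json
import urllib.parse
import urllib.request

# Assumptions (not from the thread): Solr listens on localhost:8983 and the
# collection is called "mycollection". Substitute your own values.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "mycollection"


def field_type_url(field_type, base=SOLR_BASE, collection=COLLECTION):
    """Schema API URL that returns the live definition of a field type."""
    name = urllib.parse.quote(field_type)
    return f"{base}/{collection}/schema/fieldtypes/{name}"


def field_analysis_url(field_type, value, base=SOLR_BASE, collection=COLLECTION):
    """URL for the field analysis handler, the same data the analysis screen shows."""
    params = urllib.parse.urlencode({
        "analysis.fieldtype": field_type,  # analyze with this type's index analyzer
        "analysis.fieldvalue": value,      # text to run through the analyzer chain
        "wt": "json",
    })
    return f"{base}/{collection}/analysis/field?{params}"


def fetch_json(url):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Against a running Solr, fetch_json(field_type_url("text_general")) should show which tokenizer and filters the type really uses, and fetch_json(field_analysis_url("text_general", "XYZ.tif")) should list the tokens after each stage, so you can compare them with what I pasted above.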
