Yes, that is what I am seeing. Looking in the code myself, I see no reason for this behavior. That is why I assumed I was doing something very wrong.
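For what it's worth, the heart of the filter is just an integer comparison on the token length. A simplified sketch of that logic (my own reconstruction for illustration, not the actual Lucene source, which also handles position increments) would be roughly:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Simplified length filter: drop any token whose character length falls
// outside [min, max]. Both bounds are plain ints, so a max of 300 should
// behave exactly like a max of 254.
public final class SimpleLengthFilter extends TokenFilter {
  private final int min;
  private final int max;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public SimpleLengthFilter(TokenStream in, int min, int max) {
    super(in);
    this.min = min;
    this.max = max;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      int len = termAtt.length();   // token length in chars, an int
      if (len >= min && len <= max) {
        return true;                // keep this token
      }
      // otherwise skip it and look at the next token
    }
    return false;                   // end of stream
  }
}

Nothing in a check like that should care whether max is 254 or 300.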
Below I have included an example. I set the max length to 300 and insert a record with a single token of 500 characters. I expect the token to be removed and not included in the index, yet when I query using the large token, the record is returned. I can see the same result using the analysis page in the Solr admin console. Here is a test example:

<field name="portal_package" type="text_std" indexed="true" stored="true" multiValued="true"/>

<fieldType name="text_std" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="300" />
  </analyzer>
</fieldType>

A test record:

{
  "documentKind": "test",
  "uri": "test300",
  "id": "test300",
  "portal_package": "12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"
}

Query result:

{
  "responseHeader": {
    "status": 0,
    "QTime": 55,
    "params": {
      "indent": "true",
      "q": "portal_package:12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
      "_": "1431704135745",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "documentKind": "test",
        "uri": "test300",
        "id": "test300",
        "portal_package": [
          "12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"
        ],
        "_version_": 1501249997589446700,
        "timestamp": "2015-05-15T15:26:05.205Z",
        "language": "en"
      }
    ]
  }
}

----- Original Message -----
From: "Shawn Heisey" <apa...@elyograg.org>
To: solr-user@lucene.apache.org
Sent: Friday, May 15, 2015 11:13:14 AM
Subject: Re: Problem with solr.LengthFilterFactory

On 5/15/2015 8:49 AM, Charles Sanders wrote:
> I'm seeing a problem with the LengthFilter. It appears to work fine until I
> increase the max value above 254. At that point it stops removing the very
> large token from the stream. As a result I get the error:
> java.lang.IllegalArgumentException: Document contains at least one immense
> term...... UTF8 encoding is longer than the max length 32766
>
> I'm certain I'm doing this wrong. Can someone please show me the light. :)
>
> <fieldType name="text_std" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LengthFilterFactory" min="1" max="254" />
>   </analyzer>
> </fieldType>

So with max="254", you don't get the error?

Looking at the code for LengthFilter, I can't see any way for it to behave differently with a max of 254 vs. a max of 255 or higher. All of the interfaces and classes involved use "int" for length, which means it should work perfectly with numbers above 254.

Thanks,
Shawn
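One way to double-check the chain outside Solr is to rebuild the same analyzer with Lucene's CustomAnalyzer and push a single 500-character token through it; if the length filter is working, nothing should come out the other end. A rough sketch, assuming a Lucene 5.x classpath and the standard SPI names "whitespace" and "length" for the two factories:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LengthFilterCheck {
  public static void main(String[] args) throws Exception {
    // Same chain as the text_std index analyzer: whitespace tokenizer,
    // then a length filter keeping tokens of 1..300 characters.
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("whitespace")
        .addTokenFilter("length", "min", "1", "max", "300")
        .build();

    // Build a single 500-character token, like the test record above.
    StringBuilder value = new StringBuilder();
    for (int i = 0; i < 50; i++) {
      value.append("1234567890");   // 50 * 10 = 500 characters
    }

    int emitted = 0;
    try (TokenStream ts = analyzer.tokenStream("portal_package", value.toString())) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        emitted++;
        System.out.println("kept token of length " + term.length());
      }
      ts.end();
    }
    // Expect 0 here if the filter removes the oversized token.
    System.out.println("tokens emitted: " + emitted);
  }
}

If that prints 0 but the term still makes it into the index, the problem would be somewhere in how Solr is loading or applying the schema rather than in the filter itself.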