Yes, that is what I am seeing. Looking in the code myself, I see no reason for 
this behavior. That is why I assumed I was doing something very wrong. 

Below I have included an example. I set the max length to 300 and insert a 
record with a single token of 500 characters. I expect the token to be removed 
and not included in the index. However, when I query using the large token, the 
record is returned. I see the same result using the analysis page in the Solr 
console. 

Here is a test example: 

<field name="portal_package" type="text_std" indexed="true" stored="true" 
       multiValued="true"/> 

<fieldType name="text_std" class="solr.TextField" positionIncrementGap="100"> 
  <analyzer type="index"> 
    <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
    <filter class="solr.LengthFilterFactory" min="1" max="300" /> 
  </analyzer> 
</fieldType> 
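
To make the expectation concrete, here is a minimal sketch of what I assume this analyzer chain should do: split on whitespace, then keep only tokens whose length is between min and max. It is a simplified stand-in written with plain Java strings, not the actual Lucene code, and the class and method names (LengthFilterSketch, analyze, repeat) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class LengthFilterSketch {

    // Hypothetical model of the chain: whitespace tokenizer followed by a
    // length filter. This is NOT the Lucene implementation, only the
    // behavior I expect from it.
    static List<String> analyze(String text, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            int len = token.length(); // length in Java chars (assumption)
            if (len >= min && len <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Helper to build a long token, e.g. "1234567890" repeated 50 times.
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder(unit.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bigToken = repeat("1234567890", 50); // 500 characters
        System.out.println("token length: " + bigToken.length());
        System.out.println("tokens kept with max=300: "
                + analyze(bigToken, 1, 300).size());
        System.out.println("tokens kept from mixed input: "
                + analyze("short " + bigToken, 1, 300));
    }
}
```

Under this model the 500-character token never reaches the index when max is 300, which is why the query result below surprises me.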


A test record: 

{ 
"documentKind": "test", 
"uri": "test300", 
"id": "test300", 
"portal_package": 
"12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"
 
} 


Query result: 

{ 
"responseHeader": { 
"status": 0, 
"QTime": 55, 
"params": { 
"indent": "true", 
"q": 
"portal_package:12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
 
"_": "1431704135745", 
"wt": "json" 
} 
}, 
"response": { 
"numFound": 1, 
"start": 0, 
"docs": [ 
{ 
"documentKind": "test", 
"uri": "test300", 
"id": "test300", 
"portal_package": [ 
"12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"
 
], 
"_version_": 1501249997589446700, 
"timestamp": "2015-05-15T15:26:05.205Z", 
"language": "en" 
} 
] 
} 
} 





----- Original Message -----

From: "Shawn Heisey" <apa...@elyograg.org> 
To: solr-user@lucene.apache.org 
Sent: Friday, May 15, 2015 11:13:14 AM 
Subject: Re: Problem with solr.LengthFilterFactory 

On 5/15/2015 8:49 AM, Charles Sanders wrote: 
> I'm seeing a problem with the LengthFilter. It appears to work fine until I 
> increase the max value above 254. At that point it stops removing the very 
> large token from the stream. As a result I get the error: 
> java.lang.IllegalArgumentException: Document contains at least one immense 
> term...... UTF8 encoding is longer than the max length 32766 
> 
> I'm certain I'm doing this wrong. Can someone please show me the light. :) 
> 
> <fieldType name="text_std" class="solr.TextField" positionIncrementGap="100"> 
> <analyzer type="index"> 
> <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
> <filter class="solr.LengthFilterFactory" min="1" max="254" /> 
> </analyzer> 
> </fieldType> 

So with max="254", you don't get the error? Looking at the code for 
LengthFilter, I can't see any way for it to behave differently with a 
max of 254 vs. a max of 255 or higher. All of the interfaces and 
classes involved use "int" for length, which means it should work 
perfectly with numbers above 254. 
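
As a rough illustration of that reasoning, the sketch below models the two separate checks involved: a length filter with plain int bounds, and the 32766-byte UTF-8 term limit quoted in the error message. The class and method names are hypothetical and the filter logic is a simplified assumption, not the LengthFilter source; it only shows that an int comparison has no special threshold between 254 and 255.

```java
import java.nio.charset.StandardCharsets;

public class TermLimitSketch {

    // Limit taken from the error message above, hard-coded for illustration.
    static final int MAX_TERM_BYTES = 32766;

    // Simplified model of a length filter with int bounds (assumption:
    // length is counted in Java chars).
    static boolean keptByLengthFilter(String token, int min, int max) {
        int len = token.length();
        return len >= min && len <= max;
    }

    // True if the token's UTF-8 encoding would exceed the term limit.
    static boolean isImmense(String token) {
        return token.getBytes(StandardCharsets.UTF_8).length > MAX_TERM_BYTES;
    }

    // Helper to build a token of n identical characters.
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String huge = repeat('x', 40000); // larger than the term limit
        System.out.println("immense: " + isImmense(huge));
        // The same int comparison is applied for every max value.
        for (int max : new int[] {254, 255, 300}) {
            System.out.println("max=" + max + " kept: "
                    + keptByLengthFilter(huge, 1, max));
        }
    }
}
```

If the real filter behaves like this model, a token large enough to trigger the immense-term error would be dropped for any of these max values, so the cause is likely somewhere other than the max setting itself.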

Thanks, 
Shawn 

