Just for our future reference:

1. Which schema field type analyzer combination was producing the exception?
2. Can you provide the full stack trace for the exception?

It shouldn't be possible to get the exception when using the white space
tokenizer, due to the noted issue, so which tokenizer was it?


-- Jack Krupansky

On Mon, May 18, 2015 at 12:21 PM, Charles Sanders <csand...@redhat.com>
wrote:

> No, the field has always been text. And from the error, it's obviously
> passing a very large token to the index, regardless of the tokenizer and
> filter.
>
> So I guess I will have to tokenize and filter the text before I send it to
> solr, since solr is not able to properly handle a very large token. More
> custom code.
>
> Thanks everyone for your help.
> Charles
>
>
> ----- Original Message -----
>
> From: "Jack Krupansky" <jack.krupan...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Monday, May 18, 2015 12:00:37 PM
> Subject: Re: Problem with solr.LengthFilterFactory
>
> Sorry for not spotting that earlier. Lucene itself does have such a limit.
> No way around it - an individual term is limited to 32K-2 bytes. Lucene is
> designed for searching of terms, not large blob storage.
>
> Maybe you defined that field as a string originally and later updated your
> schema to a text field, so that Lucene still knows it as an unanalyzed
> string field? You need to delete the index and start over if you want to
> change the field types like that.
>
> -- Jack Krupansky
>
> On Mon, May 18, 2015 at 8:33 AM, Charles Sanders <csand...@redhat.com>
> wrote:
>
> > Jack,
> > Thanks for the information. If I understand this correctly, the White
> > space tokenizer will break a single token of size 300 into two tokens, one
> > of size 256 and the other of size 44. If this is true, then for the single
> > test document I have used, in the index in the portal_package field, I
> > should see two tokens rather than one large single token.
> >
> > If my understanding is correct, then why in my production system, where we
> > occasionally get a single very large token, do I see this error?
> > Caused by: java.lang.IllegalArgumentException: Document contains at least
> > one immense term in field="portal_package" (whose UTF8 encoding is longer
> > than the max length 32766)
> >
> > The existence of this error would lead me to conclude that a very large
> > single token is making its way through the white space tokenizer and
> > filters to the index where it is rejected.
> >
> > I'm afraid my understanding is not complete. Can you fill in the gaps?
> >
> > Thanks,
> > Charles
> >
> >
> > ----- Original Message -----
> >
> > From: "Jack Krupansky" <jack.krupan...@gmail.com>
> > To: solr-user@lucene.apache.org
> > Sent: Friday, May 15, 2015 4:31:22 PM
> > Subject: Re: Problem with solr.LengthFilterFactory
> >
> > Sorry that my brain has turned to mush... the issue you are hitting is due
> > to a known, undocumented limit in the whitespace tokenizer:
> >
> > https://issues.apache.org/jira/browse/LUCENE-5785
> > "White space tokenizer has undocumented limit of 256 characters per token"
> >
> > If you look at the parsed query you will see that two query terms were
> > generated. This is because the whitespace tokenizer will simply split long
> > tokens every 256 characters. So, your filter will never see a long term.
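[Editorial note: a toy illustration of the splitting behavior described above, not Lucene code. It shows why a 300-character token comes out of the whitespace tokenizer as one 256-character token plus one 44-character token, so a length filter with max="300" never fires.]

```python
# Simulate the undocumented behavior: the whitespace tokenizer emits tokens
# in chunks of at most 256 characters.
MAX_TOKEN_CHARS = 256

def chunk_token(token: str, limit: int = MAX_TOKEN_CHARS):
    return [token[i:i + limit] for i in range(0, len(token), limit)]

parts = chunk_token("1234567890" * 30)  # a 300-character token
print([len(p) for p in parts])  # [256, 44]
```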
> >
> > There is a note on the Jira (evidently by me!) that you can use the pattern
> > tokenizer as a workaround. But... if your term is a string anyway, you
> > could just use the keyword tokenizer.
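[Editorial note: a hypothetical fieldType along the lines suggested above, not a tested configuration; the name "text_keyword_limited" and the max value are assumptions. With the keyword tokenizer the whole value is emitted as one token, so LengthFilterFactory actually sees the full length and can drop oversized values.]

```xml
<fieldType name="text_keyword_limited" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <!-- KeywordTokenizerFactory emits the entire field value as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Now the length filter sees the real token length and can reject it -->
    <filter class="solr.LengthFilterFactory" min="1" max="300"/>
  </analyzer>
</fieldType>
```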
> >
> >
> > -- Jack Krupansky
> >
> > On Fri, May 15, 2015 at 4:06 PM, Charles Sanders <csand...@redhat.com>
> > wrote:
> >
> > > Shawn,
> > > Thanks a bunch for working with me on this.
> > >
> > > I have deleted all records from my index. Stopped solr. Made the schema
> > > changes as requested. Started solr. Then inserted the one test record. Then
> > > searched. I still see the same results. No, portal_package is not the unique
> > > key; it's uri, which is a string field.
> > >
> > > <field name="portal_package" type="text_std" indexed="true" stored="true"
> > > multiValued="true"/>
> > >
> > > <fieldType name="text_std" class="solr.TextField"
> > > positionIncrementGap="100">
> > > <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > <filter class="solr.LengthFilterFactory" min="1" max="300" />
> > > </fieldType>
> > >
> > > {
> > > "documentKind": "test",
> > > "uri": "test300",
> > > "id": "test300",
> > >
> >
> "portal_package":"12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"
> > > }
> > >
> > >
> > > {
> > > "responseHeader": {
> > > "status": 0,
> > > "QTime": 47,
> > > "params": {
> > > "spellcheck": "true",
> > > "enableElevation": "false",
> > > "df": "allText",
> > > "echoParams": "all",
> > > "spellcheck.maxCollations": "5",
> > > "spellcheck.dictionary": "andreasAutoComplete",
> > > "spellcheck.count": "5",
> > > "spellcheck.collate": "true",
> > > "spellcheck.onlyMorePopular": "true",
> > > "rows": "10",
> > > "indent": "true",
> > > "q":
> > >
> >
> "portal_package:12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
> > > "_": "1431719989047",
> > > "debug": "query",
> > > "wt": "json"
> > > }
> > > },
> > > "response": {
> > > "numFound": 1,
> > > "start": 0,
> > > "docs": [
> > > {
> > > "documentKind": "test",
> > > "uri": "test300",
> > > "id": "test300",
> > > "portal_package": [
> > >
> > >
> >
> "12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"
> > > ],
> > > "_version_": 1501267024421060600,
> > > "timestamp": "2015-05-15T19:56:43.247Z",
> > > "language": "en"
> > > }
> > > ]
> > > },
> > > "debug": {
> > > "rawquerystring":
> > >
> >
> "portal_package:12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
> > > "querystring":
> > >
> >
> "portal_package:12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
> > > "parsedquery":
> > >
> >
> "portal_package:1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456
> > >
> >
> portal_package:7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
> > > "parsedquery_toString":
> > >
> >
> "portal_package:1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456
> > >
> >
> portal_package:7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
> > > "QParser": "LuceneQParser"
> > > }
> > > }
> > >
> > >
> > >
> > >
> > >
> > > ----- Original Message -----
> > >
> > > From: "Shawn Heisey" <apa...@elyograg.org>
> > > To: solr-user@lucene.apache.org
> > > Sent: Friday, May 15, 2015 3:29:19 PM
> > > Subject: Re: Problem with solr.LengthFilterFactory
> > >
> > > On 5/15/2015 1:23 PM, Shawn Heisey wrote:
> > > > Then I looked back at your fieldType definition and noticed that you
> > > > are only defining an index analyzer. Remove the 'type="index"' part of
> > > > the analyzer config so it happens at both index and query time,
> > > > reindex, then try again.
> > >
> > > The reindex may be very important here. I would actually completely
> > > delete your data directory and restart Solr before reindexing, to be
> > > sure you don't have old records from any previous reindexes.
> > >
> > > http://wiki.apache.org/solr/HowToReindex
> > >
> > > I think this next part is unlikely, but I'm going to ask it anyway: Is
> > > the portal_package field your schema uniqueKey? If it is, that might be
> > > an additional source of problems. Using a solr.TextField for a
> > > uniqueKey field causes Solr to behave in unexpected ways.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> > >
> >
> >
>
>