May not need a script for that:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/TruncateFieldUpdateProcessorFactory.html
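That processor drops into an update chain in solrconfig.xml; a sketch (chain name and limit here are just illustrative -- Lucene's cap is 32766 UTF-8 bytes, so leave headroom for multi-byte text):

```xml
<updateRequestProcessorChain name="trim-content" default="true">
  <processor class="solr.TruncateFieldUpdateProcessorFactory">
    <str name="fieldName">content</str>
    <!-- maxLength is in characters; stay well under Lucene's
         32766-byte UTF-8 limit to allow for multi-byte characters -->
    <int name="maxLength">30000</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```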

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 15 September 2014 11:05, Jack Krupansky <j...@basetechnology.com> wrote:
> You can use an update request processor to filter the input for large
> values. For example, you could write a script for the stateless script
> processor that ignores or trims large input values.
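A minimal sketch of such a script, assuming solr.StatelessScriptUpdateProcessorFactory wired to a JavaScript file in solrconfig.xml (field name and limit are placeholders, untested):

```javascript
// trim-large-fields.js -- referenced from a StatelessScriptUpdateProcessorFactory
// entry in solrconfig.xml.  Solr calls processAdd() for each incoming document.
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var v = doc.getFieldValue("content");
  // Lucene's cap (32766) is in UTF-8 bytes, not characters,
  // so trim characters with some headroom.
  if (v != null && v.length > 30000) {
    doc.setField("content", v.substring(0, 30000));
  }
}

// No-op stubs for the other hooks the factory may invoke.
function processDelete(cmd) { }
function processCommit(cmd) { }
function finish() { }
```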
>
> -- Jack Krupansky
>
> -----Original Message----- From: Christopher Gross
> Sent: Monday, September 15, 2014 7:58 AM
> To: solr-user
> Subject: Re: Solr Exceptions -- "immense terms"
>
>
> Yeah -- for this part I'm just trying to store it to show it later.
>
> There was a change in Lucene 4.8.x.  Before then, the exception was
> silently swallowed; now it's thrown and the offending document isn't
> indexed.
>
> Can't push the whole schema up -- but I do copy the content field into a
> "text" field (text_en_splitting) that gets used for a full text search
> (along w/ some other fields).  But then I would think I'd see the error for
> that field instead of "content."  I may try that to figure out where the
> problem is, but I do want to have the content available for doing the
> search...
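That copy is presumably a one-liner in schema.xml along these lines (names taken from the description above):

```xml
<copyField source="content" dest="text"/>
```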
>
> It's big.
>
> I'm probably going to have to tweak the schema some (probably wise anyway),
> but I'm not sure what to do about this large text.  I'm loading the content
> in via some Java code so I could trim it down, but I'd rather not exclude
> content from the page just because it's large.  I was hoping that someone
> would have a better field type to use, or an idea of what to do to
> configure it.
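If you do end up trimming in the Java loader, note that Lucene's limit is in UTF-8 bytes, not characters, so a naive String.substring can still overflow the limit or split a multi-byte character. A byte-aware sketch (class and method names are mine):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Truncate {

    /**
     * Truncates s so its UTF-8 encoding fits in maxBytes,
     * without splitting a multi-byte character.
     */
    public static String truncateUtf8(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return s;
        }
        int end = maxBytes;
        // bytes[end] is the first excluded byte; while it is a UTF-8
        // continuation byte (10xxxxxx), back up to a character boundary.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }
}
```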
>
> Thanks, Michael.
>
>
> -- Chris
>
> On Mon, Sep 15, 2014 at 10:38 AM, Michael Della Bitta <
> michael.della.bi...@appinions.com> wrote:
>
>> I just came back to this because I figured out you're trying to just store
>> this text. Now I'm baffled. How big is it? :)
>>
>> Not sure why an analyzer is running if you're just storing the content.
>> Maybe you should post your whole schema.xml... there could be a copyfield
>> that's dumping the text into a different field that has the keyword
>> tokenizer?
>>
>> Michael Della Bitta
>>
>> Applications Developer
>>
>> o: +1 646 532 3062
>>
>> appinions inc.
>>
>> “The Science of Influence Marketing”
>>
>> 18 East 41st Street
>>
>> New York, NY 10017
>>
>> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
>> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
>> w: appinions.com <http://www.appinions.com/>
>>
>> On Mon, Sep 15, 2014 at 10:37 AM, Michael Della Bitta <
>> michael.della.bi...@appinions.com> wrote:
>>
>> > If you're using a String fieldtype, you're not indexing it so much as
>> > dumping the whole content blob in there as a single term for exact
>> > matching.
>> >
>> > You probably want to look at one of the text field types for textual
>> > content.
>> >
>> > That doesn't explain the difference in behavior between Solr versions,
>> > but my hunch is that you'll be happier in general with the behavior of
>> > a field type that does tokenizing and stemming for plain text search
>> > anyway.
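For instance, something like this in schema.xml, using one of the stock tokenized types (the exact settings here are just a sketch):

```xml
<field name="content" type="text_general" indexed="true" stored="true"
       required="true"/>
```

A tokenized type breaks the blob into individual terms, so no single term comes anywhere near Lucene's 32766-byte cap.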
>> >
>> >
>> > On Mon, Sep 15, 2014 at 10:06 AM, Christopher Gross <cogr...@gmail.com>
>> > wrote:
>> >
>> >> Solr 4.9.0
>> >> Java 1.7.0_49
>> >>
>> >> I'm indexing an internal Wiki site.  I was running on an older version
>> >> of Solr (4.1) and wasn't having any trouble indexing the content, but
>> >> now I'm getting errors:
>> >>
>> >> SCHEMA:
>> >> <field name="content" type="string" indexed="false" stored="true"
>> >> required="true"/>
>> >>
>> >> LOGS:
>> >> Caused by: java.lang.IllegalArgumentException: Document contains at
>> >> least one immense term in field="content" (whose UTF8 encoding is
>> >> longer than the max length 32766), all of which were skipped.  Please
>> >> correct the analyzer to not produce such terms.  The prefix of the
>> >> first immense term is: '[60, 33, 45, 45, 32, 98, 111, 100, 121, 67,
>> >> 111, 110, 116, 101, 110, 116, 32, 45, 45, 62, 10, 9, 9, 9, 60, 100,
>> >> 105, 118, 32, 115]...', original message: bytes can be at most 32766
>> >> in length; got 183250
>> >> ....
>> >> Caused by:
>> >> org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException:
>> >> bytes can be at most 32766 in length; got 183250
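Incidentally, the byte prefix in that log decodes to plain HTML, which confirms the field is receiving the raw page markup:

```java
import java.nio.charset.StandardCharsets;

public class DecodePrefix {

    // The "prefix of the first immense term" bytes from the log above.
    static final byte[] PREFIX = {
        60, 33, 45, 45, 32, 98, 111, 100, 121, 67, 111, 110, 116, 101, 110,
        116, 32, 45, 45, 62, 10, 9, 9, 9, 60, 100, 105, 118, 32, 115
    };

    public static String decode() {
        return new String(PREFIX, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints: <!-- bodyContent --> followed by a tab-indented "<div s"
        System.out.println(decode());
    }
}
```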
>> >>
>> >> I was indexing it, but I switched that off (as you can see above), and
>> >> it still fails.  Is there a different type I should use, or a
>> >> different analyzer?  I imagine that there is a way to index very large
>> >> documents in Solr.  Any recommendations would be helpful.  Thanks!
>> >>
>> >> -- Chris
>> >>
>> >
>> >
>>
>
