I knew it was in there somewhere! But... that truncates the full field value, as opposed to an individual term for a text field. It depends on whether the immediate issue was for a text field or for a string field. The underlying issue may be that it rarely makes sense to "index" a full wiki page as a string field.

-- Jack Krupansky

-----Original Message----- From: Alexandre Rafalovitch
Sent: Monday, September 15, 2014 8:39 AM
To: solr-user
Subject: Re: Solr Exceptions -- "immense terms"

May not need a script for that:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/TruncateFieldUpdateProcessorFactory.html
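Roughly, the wiring in solrconfig.xml would look something like this (untested sketch -- the chain name, field name and length limit below are just placeholders to adjust):

<updateRequestProcessorChain name="truncate-content">
  <processor class="solr.TruncateFieldUpdateProcessorFactory">
    <str name="fieldName">content</str>
    <int name="maxLength">10000</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Then point updates at it with update.chain=truncate-content, or make it the default chain for your update handler.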

Regards,
  Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 15 September 2014 11:05, Jack Krupansky <j...@basetechnology.com> wrote:
You can use an update request processor to filter the input for large
values. You could write a script with the stateless script processor which
ignores or trims large input values.
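As a rough sketch (the chain name and script file name here are hypothetical), the chain in solrconfig.xml would just reference the script:

<updateRequestProcessorChain name="trim-large-values">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">trim-large-values.js</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The script's processAdd(cmd) function can then look at cmd.solrDoc, check the size of the offending field value, and shorten or drop it before it reaches the index.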

-- Jack Krupansky

-----Original Message----- From: Christopher Gross
Sent: Monday, September 15, 2014 7:58 AM
To: solr-user
Subject: Re: Solr Exceptions -- "immense terms"


Yeah -- for this part I'm just trying to store it to show it later.

There was a change in Lucene 4.8.x.  Before then, the exception was just
being eaten...now they throw it up and don't index that document.

Can't push the whole schema up -- but I do copy the content field into a
"text" field (text_en_splitting) that gets used for a full text search
(along w/ some other fields). But then I would think I'd see the error for
that field instead of "content."  I may try that to figure out where the
problem is, but I do want to have the content available for doing the
search...
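For illustration, a setup like the one described would look roughly like this in schema.xml (the "text" destination field and its attributes are assumptions; only "content" is confirmed below):

<field name="content" type="string" indexed="false" stored="true" required="true"/>
<field name="text" type="text_en_splitting" indexed="true" stored="false" multiValued="true"/>
<copyField source="content" dest="text"/>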

It's big.

I'm probably going to have to tweak the schema some (probably wise anyway), but I'm not sure what to do about this large text. I'm loading the content
in via some Java code so I could trim it down, but I'd rather not exclude
content from the page just because it's large.  I was hoping that someone
would have a better field type to use, or an idea of what to do to
configure it.
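If trimming in the loading code does turn out to be the way to go, a minimal sketch might look like this (SolrJ 4.x; the core URL, id value and byte limit are assumptions, only the "content" field name is from the thread):

import java.nio.charset.StandardCharsets;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class WikiPageIndexer {

    // Lucene's hard per-term limit is 32766 bytes of UTF-8; stay safely under it.
    private static final int MAX_UTF8_BYTES = 32000;

    // Returns the longest prefix whose UTF-8 encoding fits the limit.
    // A prefix of N chars needs at least N bytes, so start the search there.
    // (Crude: re-encodes on each step and ignores surrogate-pair edge cases.)
    static String trimToByteLimit(String value) {
        int end = Math.min(value.length(), MAX_UTF8_BYTES);
        while (end > 0
                && value.substring(0, end).getBytes(StandardCharsets.UTF_8).length > MAX_UTF8_BYTES) {
            end--;
        }
        return value.substring(0, end);
    }

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/wiki");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "wiki-page-1");
        doc.addField("content", trimToByteLimit(loadPageBody()));
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }

    // Stand-in for however the wiki page markup is actually fetched.
    static String loadPageBody() {
        return "<!-- bodyContent --> ... very large page markup ...";
    }
}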

Thanks Michael.


-- Chris

On Mon, Sep 15, 2014 at 10:38 AM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:

I just came back to this because I figured out you're trying to just store
this text. Now I'm baffled. How big is it? :)

Not sure why an analyzer is running if you're just storing the content.
Maybe you should post your whole schema.xml... there could be a copyfield
that's dumping the text into a different field that has the keyword
tokenizer?

Michael Della Bitta
Applications Developer
o: +1 646 532 3062
appinions inc.
“The Science of Influence Marketing”
18 East 41st Street
New York, NY 10017
t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Sep 15, 2014 at 10:37 AM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:

> If you're using a String fieldtype, you're not indexing it so much as
> dumping the whole content blob in there as a single term for exact
> matching.
>
> You probably want to look at one of the text field types for textual
> content.
>
> That doesn't explain the difference in behavior between Solr versions, but
> my hunch is that you'll be happier in general with the behavior of a field
> type that does tokenizing and stemming for plain text search anyway.
>
> On Mon, Sep 15, 2014 at 10:06 AM, Christopher Gross <cogr...@gmail.com> wrote:
>
>> Solr 4.9.0
>> Java 1.7.0_49
>>
>> I'm indexing an internal Wiki site.  I was running on an older version of
>> Solr (4.1) and wasn't having any trouble indexing the content, but now I'm
>> getting errors:
>>
>> SCHEMA:
>> <field name="content" type="string" indexed="false" stored="true"
>> required="true"/>
>>
>> LOGS:
>> Caused by: java.lang.IllegalArgumentException: Document contains at least
>> one immense term in field="content" (whose UTF8 encoding is longer than the
>> max length 32766), all of which were skipped.  Please correct the analyzer
>> to not produce such terms.  The prefix of the first immense term is: '[60,
>> 33, 45, 45, 32, 98, 111, 100, 121, 67, 111, 110, 116, 101, 110, 116, 32,
>> 45, 45, 62, 10, 9, 9, 9, 60, 100, 105, 118, 32, 115]...', original message:
>> bytes can be at most 32766 in length; got 183250
>> ....
>> Caused by:
>> org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException:
>> bytes can be at most 32766 in length; got 183250
>>
>> I was indexing it, but I switched that off (as you can see above) but it
>> still is having problems. Is there a different type I should use, or a
>> different analyzer?  I imagine that there is a way to index very large
>> documents in Solr.  Any recommendations would be helpful.  Thanks!
>>
>> -- Chris
>>
>
>


