[jira] [Commented] (SOLR-8495) Schemaless mode cannot index large text fields

Hoss Man (JIRA) Tue, 20 Sep 2016 09:44:40 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-8495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507076#comment-15507076
 ]


Hoss Man commented on SOLR-8495:
--------------------------------

The root issue here is the same as SOLR-9526: the data-driven config assumes 
an (untokenized) StrField for string values.

I think my suggestion in that issue makes the most sense -- but it doesn't 
address the surface error noted in this issue: an exception when "string" 
values are too big.

So perhaps for that we should just add TruncateFieldUpdateProcessorFactory 
to the data_driven configs with some reasonable upper limit?

{code}
 <processor class="solr.TruncateFieldUpdateProcessorFactory">
   <str name="typeClass">solr.StrField</str>
   <int name="maxLength">10000</int>
 </processor>
{code}
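
As a rough sanity check on that limit, here is a small sketch. It assumes the processor truncates by Java char count (UTF-16 code units) and that the 32766 limit in the exception applies to the UTF-8 encoded bytes of the value. Under those assumptions one code unit never accounts for more than 3 UTF-8 bytes, so 10000 chars should stay under the limit. The helper name truncate_utf16 is made up for illustration and is not part of Solr.

```python
LUCENE_MAX_TERM_BYTES = 32766  # taken from the exception message in this issue
MAX_LENGTH = 10000             # maxLength from the config snippet above


def truncate_utf16(value: str, max_units: int) -> str:
    """Hypothetical helper: keep at most max_units UTF-16 code units.

    Unlike a plain Java substring, this never splits a surrogate pair;
    that simplification is an assumption of this sketch.
    """
    kept = []
    units = 0
    for ch in value:
        # Characters outside the BMP take two UTF-16 code units.
        width = 2 if ord(ch) > 0xFFFF else 1
        if units + width > max_units:
            break
        kept.append(ch)
        units += width
    return "".join(kept)


samples = {
    "ascii": "a" * 50000,                   # 1 UTF-8 byte per code unit
    "bmp_3_byte": "\u20ac" * 50000,         # 3 UTF-8 bytes per code unit (worst case)
    "supplementary": "\U0001F600" * 50000,  # 4 UTF-8 bytes per 2 code units
}

encoded_sizes = {
    name: len(truncate_utf16(text, MAX_LENGTH).encode("utf-8"))
    for name, text in samples.items()
}

for name, size in encoded_sizes.items():
    print(name, size, size <= LUCENE_MAX_TERM_BYTES)
```

Even the worst case here (3-byte BMP characters) comes to 30000 bytes, which leaves some headroom below 32766.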

bq. Autodetect space-separated text above a (customizable? maybe 256 bytes or 
so by default?) threshold as tokenized text rather than as StrField.

I'm leery of an approach like this, because it would be extremely trappy 
depending on the order docs were indexed: similar to the float/int problems we 
have now, but probably more so, and with more confusion because it wouldn't 
necessarily be obvious at first glance when/why StrField was chosen vs 
TextField (or even that a different choice was made if the user didn't go look, 
since unlike the int/float issue the _output_ of the stored field would be the 
same "String").

You'd only ever get an error if the first doc was a "short" string and 
some other doc was above the 32K Lucene limit. If all the docs were under 
the 32K limit, but above the str/text threshold, you'd never get an error, 
regardless of the order the docs were indexed in. But one doc ordering would 
give you searchable text fields, and another doc order would give you StrFields 
that didn't match any search you tried.
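
To make the ordering trap concrete, here is a toy model of the proposed autodetection. It is only a sketch under stated assumptions: the field type is fixed by the first value seen, the threshold is the "maybe 256 bytes" guess from the quoted proposal, and index_docs is a made-up function, not Solr's actual type-guessing code.

```python
STR_TEXT_THRESHOLD = 256       # bytes; the guess from the quoted proposal
LUCENE_MAX_TERM_BYTES = 32766  # taken from the exception message in this issue


def index_docs(values):
    """Toy model: the first value decides the field type for all later docs.

    Returns the chosen type name, or raises ValueError when a value is too
    large to be indexed as a single StrField term.
    """
    field_type = None
    for value in values:
        size = len(value.encode("utf-8"))
        if field_type is None:
            # Assumed heuristic: space-separated and above the threshold => text.
            looks_like_text = " " in value and size > STR_TEXT_THRESHOLD
            field_type = "TextField" if looks_like_text else "StrField"
        if field_type == "StrField" and size > LUCENE_MAX_TERM_BYTES:
            raise ValueError("immense term in StrField")
    return field_type


short = "short title"    # well under the threshold
medium = "word " * 200   # 1000 bytes: above the threshold, under 32K
huge = "word " * 10000   # 50000 bytes: above the 32K limit

print(index_docs([short, medium]))  # first doc is short
print(index_docs([medium, short]))  # same docs, other order
print(index_docs([medium, huge]))   # text first, so no StrField term limit applies
try:
    index_docs([short, huge])
except ValueError as err:
    print("error:", err)
```

In this model the same two documents (short and medium) produce a different field type depending only on which one arrives first, and neither ordering raises an error.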



> Schemaless mode cannot index large text fields
> ----------------------------------------------
>
>                 Key: SOLR-8495
>                 URL: https://issues.apache.org/jira/browse/SOLR-8495
>             Project: Solr
>          Issue Type: Bug
>          Components: Data-driven Schema, Schema and Analysis
>    Affects Versions: 4.10.4, 5.3.1, 5.4
>            Reporter: Shalin Shekhar Mangar
>              Labels: difficulty-easy, impact-medium
>             Fix For: 5.5, 6.0
>
>
> The schemaless mode by default indexes all string fields into an indexed 
> StrField which is limited to 32KB text. Anything larger than that leads to an 
> exception during analysis.
> {code}
> Caused by: java.lang.IllegalArgumentException: Document contains at least one 
> immense term in field="text" (whose UTF8 encoding is longer than the max 
> length 32766)
> {code}



