[ https://issues.apache.org/jira/browse/SOLR-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892442#comment-13892442 ]
Hoss Man commented on SOLR-5698:
--------------------------------

Things i'm confident of...

* the limit is IndexWriter.MAX_TERM_LENGTH
* it is not configurable
* a message is written to the infoStream by DocFieldProcessor when this is exceeded...

{code}
if (docState.maxTermPrefix != null && docState.infoStream.isEnabled("IW")) {
  docState.infoStream.message("IW", "WARNING: document contains at least one immense term (whose UTF8 encoding is longer than the max length " + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8 + "), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '" + docState.maxTermPrefix + "...'");
  docState.maxTermPrefix = null;
}
{code}

Things i _think_ i understand, but am not certain of...

* by the time DocumentsWriterPerThread sees this problem and logs it to the infoStream, it is already too late to throw an exception up the call stack (because it is happening in another thread)

Rough idea, only half considered...

* update the tokenstream producers in Solr to explicitly check the terms they are about to return and throw an exception if they exceed this length (mentioning LengthFilter in that error message)
* this wouldn't help if people use their own concrete Analyzer class -- but it would solve the problem for things like StrField, or any time analysis factories are used
* we could conceivably wrap any user-configured concrete Analyzer class to do this check -- but i'm not sure we should, since it would add cycles and the Analyzer might already be well behaved.

thoughts?

> exceptionally long terms are silently ignored during indexing
> -------------------------------------------------------------
>
>                 Key: SOLR-5698
>                 URL: https://issues.apache.org/jira/browse/SOLR-5698
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>
> As reported on the user list, when a term is greater than 2^15 bytes it is
> silently ignored at indexing time -- no error is given at all.
> we should investigate:
> * if there is a way to get the lower level lucene code to propagate up an
> error we can return to the user instead of silently ignoring these terms
> * if there is no way to generate a low level error:
> ** is there at least a way to make this limit configurable so it's more obvious
> to users that this limit exists?
> ** should we make things like StrField do explicit size checking on the terms
> they produce and explicitly throw their own error?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
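The explicit size check proposed above (for StrField or the tokenstream producers) is cheap to sketch with plain JDK calls: measure the UTF-8 encoded length of each term and fail fast, rather than letting the term be silently dropped. This is only an illustrative, standalone sketch under stated assumptions -- the `checkTerm` helper name is hypothetical, and the hard-coded 32766 stands in for the real limit (IndexWriter.MAX_TERM_LENGTH, i.e. DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8); an actual patch would live inside the analysis chain:

```java
import java.nio.charset.StandardCharsets;

public class TermLengthCheck {

    // Same value as Lucene's per-term limit in UTF-8 bytes (2^15 - 2).
    static final int MAX_TERM_LENGTH_UTF8 = 32766;

    /**
     * Hypothetical helper: throw instead of silently dropping the term,
     * and point the user at LengthFilter as the comment suggests.
     */
    static void checkTerm(String term) {
        int utf8Len = term.getBytes(StandardCharsets.UTF_8).length;
        if (utf8Len > MAX_TERM_LENGTH_UTF8) {
            throw new IllegalArgumentException(
                "term is " + utf8Len + " UTF-8 bytes, longer than the max length "
                + MAX_TERM_LENGTH_UTF8
                + "; correct the analyzer (e.g. add a LengthFilter) to not produce such terms");
        }
    }

    public static void main(String[] args) {
        checkTerm("a short term");  // fine, nothing happens

        // 16384 copies of a 3-byte UTF-8 character = 49152 bytes > 32766.
        String immense = "\u20AC".repeat(16384);
        try {
            checkTerm(immense);
            throw new AssertionError("expected the immense term to be rejected");
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```

Note the check has to count encoded bytes, not `String.length()` chars: a term of 20,000 three-byte characters blows the limit even though its char length looks safe.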