[ https://issues.apache.org/jira/browse/SOLR-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892442#comment-13892442 ]
Hoss Man commented on SOLR-5698:
--------------------------------

Things i'm confident of...

* the limit is IndexWriter.MAX_TERM_LENGTH
* it is not configurable
* a message is written to the infoStream by DocFieldProcessor when this is exceeded...

{code}
if (docState.maxTermPrefix != null && docState.infoStream.isEnabled("IW")) {
  docState.infoStream.message("IW", "WARNING: document contains at least one immense term (whose UTF8 encoding is longer than the max length " + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8 + "), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '" + docState.maxTermPrefix + "...'");
  docState.maxTermPrefix = null;
}
{code}

Things i _think_ i understand, but am not certain of...

* by the time DocumentsWriterPerThread sees this problem and logs it to the infoStream, it is already too late to throw an exception up the call stack (because it is happening in another thread)

Rough idea, only half considered...

* update the tokenstream producers in Solr to explicitly check the terms they are about to return and throw an exception if they exceed this length (mentioning LengthFilter in that error message)
* this wouldn't help if people use their own concrete Analyzer class -- but it would solve the problem for things like StrField, or any time analysis factories are used
* we could conceivably wrap any user-configured concrete Analyzer class to do this check -- but i'm not sure we should, since it would add cycles and the Analyzer might already be well behaved.

thoughts?

> exceptionally long terms are silently ignored during indexing
> -------------------------------------------------------------
>
>                 Key: SOLR-5698
>                 URL: https://issues.apache.org/jira/browse/SOLR-5698
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>
> As reported on the user list, when a term is greater than 2^15 bytes it is
> silently ignored at indexing time -- no error is given at all.
> we should investigate:
> * if there is a way to get the lower level lucene code to propagate up an
> error we can return to the user instead of silently ignoring these terms
> * if there is no way to generate a low level error:
> ** is there at least a way to make this limit configurable so it's more obvious
> to users that this limit exists?
> ** should we make things like StrField do explicit size checking on the terms
> they produce and explicitly throw their own error?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
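The explicit size check proposed above (for StrField or the tokenstream producers) is cheap to sketch with plain JDK calls: measure the UTF-8 encoded length of each term and fail fast, rather than letting the term be silently dropped. This is only an illustrative, standalone sketch under stated assumptions -- the `checkTerm` helper name is hypothetical, and the hard-coded 32766 stands in for the real limit (IndexWriter.MAX_TERM_LENGTH, i.e. DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8); an actual patch would live inside the analysis chain:

```java
import java.nio.charset.StandardCharsets;

public class TermLengthCheck {

    // Same value as Lucene's per-term limit in UTF-8 bytes (2^15 - 2).
    static final int MAX_TERM_LENGTH_UTF8 = 32766;

    /**
     * Hypothetical helper: throw instead of silently dropping the term,
     * and point the user at LengthFilter as the comment suggests.
     */
    static void checkTerm(String term) {
        int utf8Len = term.getBytes(StandardCharsets.UTF_8).length;
        if (utf8Len > MAX_TERM_LENGTH_UTF8) {
            throw new IllegalArgumentException(
                "term is " + utf8Len + " UTF-8 bytes, longer than the max length "
                + MAX_TERM_LENGTH_UTF8
                + "; correct the analyzer (e.g. add a LengthFilter) to not produce such terms");
        }
    }

    public static void main(String[] args) {
        checkTerm("a short term");  // fine, nothing happens

        // 16384 copies of a 3-byte UTF-8 character = 49152 bytes > 32766.
        String immense = "\u20AC".repeat(16384);
        try {
            checkTerm(immense);
            throw new AssertionError("expected the immense term to be rejected");
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```

Note the check has to count encoded bytes, not `String.length()` chars: a term of 20,000 three-byte characters blows the limit even though its char length looks safe.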