Erick, I'm working with multiple users/clients that have documents at this size (though I don't know "Eric B"). So it happens, and sometimes the bigger the document, the more genuinely relevant it tends to be (I've seen this). So it's a challenge. Sometimes big docs can be decomposed, and then you have to contend with collapsing/grouping/joining etc., but sometimes those mechanisms won't quite work for your use-case, particularly if you're *already* using them; although they can sometimes be layered (i.e. multiple collapses). But then there are performance issues with, say, the CollapseQParser on high-cardinality fields. Just an example. So IMO we (Lucene/Solr hackers) can't really dismiss the desire to index huge docs out of hand, although I agree it's good to try to push back on the requirement, because perhaps it can be avoided in some cases.
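
(For concreteness, the collapse approach I mean is splitting a big document into per-section child docs and regrouping them at query time with something like fq={!collapse field=parent_doc_id}&expand=true, where parent_doc_id is a hypothetical field tying the sections back to their source document.)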
Cheers,
~ David

On Fri, Nov 4, 2016 at 11:34 AM Erick Erickson <erickerick...@gmail.com> wrote:

I agree that more graceful handling is desirable, but I also have to ask what _use_ indexing a single 800M document is. If it's textual data it'll generate a hit very frequently, and will likely be way down in the relevance list. If it's numerics, it'll also likely be hit a lot. Will the users even see it?

If a user did find it, would they be able to display it? They'd have to wait for the 800M doc to be transmitted just for starters, whether you get the real doc from Solr or some other system-of-record.

Where I'm going here is wondering whether it makes sense in your application to have your own cutoff and not even send it to be indexed. I've seen situations where it _is_ required, mind you; it's just a question worth asking.

Best,
Erick

On Fri, Nov 4, 2016 at 7:50 AM, Michael McCandless (JIRA) <j...@apache.org> wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636551#comment-15636551 ]
>
> Michael McCandless commented on LUCENE-7538:
> --------------------------------------------
>
> Unfortunately, yes, {{IndexWriter}} will still close itself if you try to index a too-massive text field.
>
> All my patch does is catch the int overflow, but the resulting exception isn't that much better :)
>
> {noformat}
> [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriter -Dtests.method=testMassiveField -Dtests.seed=251DAB4FB5DAD334 -Dtests.locale=uk-UA -Dtests.timezone=Atlantic/Jan_Mayen -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> [junit4] ERROR 37.4s | TestIndexWriter.testMassiveField <<<
> [junit4] > Throwable #1: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
> [junit4] >   at __randomizedtesting.SeedInfo.seed([251DAB4FB5DAD334:E1314D20F8A334C3]:0)
> [junit4] >   at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:740)
> [junit4] >   at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:754)
> [junit4] >   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1558)
> [junit4] >   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1307)
> [junit4] >   at org.apache.lucene.index.TestIndexWriter.testMassiveField(TestIndexWriter.java:2791)
> [junit4] >   at java.lang.Thread.run(Thread.java:745)
> [junit4] > Caused by: java.lang.ArithmeticException: integer overflow
> [junit4] >   at java.lang.Math.multiplyExact(Math.java:867)
> [junit4] >   at org.apache.lucene.util.UnicodeUtil.maxUTF8Length(UnicodeUtil.java:618)
> [junit4] >   at org.apache.lucene.codecs.compressing.GrowableByteArrayDataOutput.writeString(GrowableByteArrayDataOutput.java:67)
> [junit4] >   at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.writeField(CompressingStoredFieldsWriter.java:292)
> [junit4] >   at org.apache.lucene.codecs.asserting.AssertingStoredFieldsFormat$AssertingStoredFieldsWriter.writeField(AssertingStoredFieldsFormat.java:143)
> [junit4] >   at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:434)
> [junit4] >   at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:373)
> [junit4] >   at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
> [junit4] >   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
> [junit4] >   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1562)
> [junit4] >   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1307)
> [junit4] >   at org.apache.lucene.index.TestIndexWriter.testMassiveField(TestIndexWriter.java:2784)
> [junit4] >   ... 36 more
> {noformat}
>
> I'll think about how to catch it more cleanly up front w/o closing the {{IndexWriter}}...
>
>> Uploading large text file to a field causes "this IndexWriter is closed" error
>> ------------------------------------------------------------------------------
>>
>> Key: LUCENE-7538
>> URL: https://issues.apache.org/jira/browse/LUCENE-7538
>> Project: Lucene - Core
>> Issue Type: Bug
>> Affects Versions: 5.5.1
>> Reporter: Steve Chen
>> Attachments: LUCENE-7538.patch
>>
>>
>> We have seen the "this IndexWriter is closed" error after we tried to upload a large text file to a single Solr text field. The field definition in the schema.xml is:
>> {noformat}
>> <field name="fileContent" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
>> {noformat}
>> After that, the IndexWriter remained closed and couldn't be recovered until we reloaded the Solr core. The text file had a size of 800MB, containing only numbers and English characters.
>> The stack trace is shown below:
>> {noformat}
>> 2016-11-02 23:00:17,913 [http-nio-19082-exec-3] ERROR org.apache.solr.handler.RequestHandlerBase - org.apache.solr.common.SolrException: Exception writing document id 1487_0_1 to the index; possible analysis error.
>>   at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:180)
>>   at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:68)
>>   at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:934)
>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1089)
>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:712)
>>   at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
>>   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:126)
>>   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:131)
>>   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:237)
>>   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2082)
>>   at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:651)
>>   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:458)
>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
>>   at veeva.ecm.common.interfaces.web.SolrDispatchOverride.doFilter(SolrDispatchOverride.java:43)
>>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239)
>>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:212)
>>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106)
>>   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:141)
>>   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:79)
>>   at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:616)
>>   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
>>   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:521)
>>   at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1096)
>>   at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:674)
>>   at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1500)
>>   at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1456)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
>>   at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>>   at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:720)
>>   at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:734)
>>   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1473)
>>   at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:282)
>>   at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:214)
>>   at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
>>   ... 34 more
>> Caused by: java.lang.ArrayIndexOutOfBoundsException: 56
>>   at org.apache.lucene.util.UnicodeUtil.UTF16toUTF8(UnicodeUtil.java:201)
>>   at org.apache.lucene.util.UnicodeUtil.UTF16toUTF8(UnicodeUtil.java:183)
>>   at org.apache.lucene.codecs.compressing.GrowableByteArrayDataOutput.writeString(GrowableByteArrayDataOutput.java:72)
>>   at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.writeField(CompressingStoredFieldsWriter.java:292)
>>   at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:382)
>>   at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
>>   at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
>>   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
>>   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
>>   ... 37 more
>> {noformat}
>> We debugged and traced down the issue. It was an integer overflow problem that was not properly handled. The GrowableByteArrayDataOutput::writeString(String string) method is shown below:
>> {noformat}
>> @Override
>> public void writeString(String string) throws IOException {
>>   int maxLen = string.length() * UnicodeUtil.MAX_UTF8_BYTES_PER_CHAR;
>>   if (maxLen <= MIN_UTF8_SIZE_TO_ENABLE_DOUBLE_PASS_ENCODING) {
>>     // string is small enough that we don't need to save memory by falling back to double-pass approach
>>     // this is just an optimized writeString() that re-uses scratchBytes.
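>>     // (annotation added for this report, not present in the Lucene source: for a very large
>>     // string, maxLen above has already overflowed to a negative int, so this branch is
>>     // taken and grow() below receives a negative minSize)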
>>     scratchBytes = ArrayUtil.grow(scratchBytes, maxLen);
>>     int len = UnicodeUtil.UTF16toUTF8(string, 0, string.length(), scratchBytes);
>>     writeVInt(len);
>>     writeBytes(scratchBytes, len);
>>   } else {
>>     // use a double pass approach to avoid allocating a large intermediate buffer for string encoding
>>     int numBytes = UnicodeUtil.calcUTF16toUTF8Length(string, 0, string.length());
>>     writeVInt(numBytes);
>>     bytes = ArrayUtil.grow(bytes, length + numBytes);
>>     length = UnicodeUtil.UTF16toUTF8(string, 0, string.length(), bytes, length);
>>   }
>> }
>> {noformat}
>> The 800MB text file stored in the string parameter had a length of roughly 800 million characters, so maxLen became a negative integer when that length was multiplied by 3. The negative value was then passed into ArrayUtil.grow(scratchBytes, maxLen):
>> {noformat}
>> public static byte[] grow(byte[] array, int minSize) {
>>   assert minSize >= 0: "size must be positive (got " + minSize + "): likely integer overflow?";
>>   if (array.length < minSize) {
>>     byte[] newArray = new byte[oversize(minSize, 1)];
>>     System.arraycopy(array, 0, newArray, 0, array.length);
>>     return newArray;
>>   } else
>>     return array;
>> }
>> {noformat}
>> Assertions were disabled in production, so execution did not stop there. The original array was returned without being resized, which caused an ArrayIndexOutOfBoundsException to be thrown. The ArrayIndexOutOfBoundsException was wrapped in an AbortingException and later caused the IndexWriter to be closed.
>> The code should fail faster with a more specific error for the integer overflow problem, and it shouldn't cause the IndexWriter to be closed.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
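
A minimal sketch of the fail-fast idea discussed above, under the assumption that the guard runs before the field value ever reaches IndexWriter (for example in the indexing client, per Erick's cutoff suggestion). The FieldLengthGuard class, rejectIfTooLong method, and MAX_UTF8_BYTES cap are illustrative names, not Lucene or Solr APIs; only UnicodeUtil.MAX_UTF8_BYTES_PER_CHAR is real. The point is simply to do the worst-case size math in long arithmetic so it cannot overflow:

{noformat}
import org.apache.lucene.util.UnicodeUtil;

public final class FieldLengthGuard {
  // Illustrative cap on the worst-case UTF-8 size of a single stored string field.
  private static final long MAX_UTF8_BYTES = Integer.MAX_VALUE - 1024;

  /** Rejects an oversized field value with a clear exception instead of letting
   *  the int overflow propagate into the stored-fields writer. */
  public static void rejectIfTooLong(String fieldName, String value) {
    // Compute the worst case in long arithmetic; the equivalent int multiply overflows here.
    long maxUtf8Len = (long) value.length() * UnicodeUtil.MAX_UTF8_BYTES_PER_CHAR;
    if (maxUtf8Len > MAX_UTF8_BYTES) {
      throw new IllegalArgumentException("field '" + fieldName + "' is too large to index: "
          + value.length() + " chars (up to " + maxUtf8Len + " UTF-8 bytes)");
    }
  }
}
{noformat}

For an 800-million-character value this rejects the document up front with a readable message, rather than aborting and closing the IndexWriter.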