Jay Hacker created SOLR-4090:
--------------------------------

             Summary: Indexing mangles documents under load
                 Key: SOLR-4090
                 URL: https://issues.apache.org/jira/browse/SOLR-4090
             Project: Solr
          Issue Type: Bug
    Affects Versions: 4.0
         Environment: Red Hat 5.8 x86_64, Oracle JDK 1.7.0_09
            Reporter: Jay Hacker


Sometimes when indexing documents with multiple concurrent processes, I get 
intermittent data corruption errors.  The data submitted for indexing is valid, 
but Solr will complain that it is malformed and return an HTTP 500 or 400 
error.  The errors are similar whether submitting as XML, JSON, or CSV.  The 
problem has not occurred using a single process (i.e., one process POSTing to 
Solr), is rare with four processes, and is pretty reproducible with 16 (or 64, 
or 128).  I've seen this problem since at least Solr 1.4; we generally just use 
four threads and hope for the best.  It seems a bit more common with Solr 4, so 
I decided to track it down.  

I have dummy documents for the example ("collection1") schema that, when posted 
using many simultaneous processes, often generate parsing errors.  I use 
something like this to pound Solr with repeated POSTs of the same document:

{code}
yes docs.xml | xargs -P64 -I{} curl -s --data-binary @{} --header 
'Content-type: text/xml' 'http://localhost:8983/solr/collection1/update'
{code}

Which fairly reliably gives this error:

{noformat}
SEVERE: org.apache.solr.common.SolrException: Unexpected '<'  in attribute value
 at [row,col {unknown-source}]: [2765,74]
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:159)
        at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
        at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
        at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
        at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
        at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
        at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Thread.java:722)
Caused by: com.ctc.wstx.exc.WstxParsingException: Unexpected '<'  in attribute 
value
 at [row,col {unknown-source}]: [2765,74]
        at 
com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
        at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
        at 
com.ctc.wstx.sr.BasicStreamReader.parseNormalizedAttrValue(BasicStreamReader.java:1951)
        at 
com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3037)
        at 
com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2936)
        at 
com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2848)
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
        at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:370)
        at 
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:229)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:157)
        ... 31 more
{noformat}


When posting the same data as JSON:

{code}
yes docs.json | xargs -P64 -I{} curl -s --data-binary @{} --header 
'Content-type: application/json' 'http://localhost:8983/solr/collection1/update'
{code}

The errors look like this:

{noformat}
SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=3] multiple values 
encountered for non multiValued field f0008_t: 
[388888888888888888888888888888888888888888888888888, 
9888888888888888888888888888888888888888888888888888]
        at 
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:243)
        at 
org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
        at 
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:208)
        at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
        at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:432)
        at 
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:557)
        at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:325)
        at 
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at 
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:386)
        at 
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:111)
        at 
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:95)
        at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:59)
        at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
        at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
        at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
        at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
        at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
        at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Thread.java:722)
{noformat}

I would like to emphasize that the stack traces are misleading: there is no 
problem with the data. Often the exact same document will index just fine a 
thousand times before throwing an error.  Somewhere between being submitted and 
parsed, the data is being occasionally corrupted.  The errors are intermittent, 
but when the do happen, they often complain about exactly the same byte in the 
input.

Here's what I know:

* Small POSTs, or single large documents, don't seem to trigger the problem; it 
seems to take multiple documents totaling ~150K or more.
* More threads trigger the error more often.
* Restricting the JVM heap to say -Xmx384M seems to trigger the problem more 
often.
* Setting {{commit=true}} with each post seems to alleviate the problem.
* I can't seem to reproduce it with 3.6.1, or 4.1-2012-11-16_01-01-43 nightly 
-- maybe related to [SOLR-3621|https://issues.apache.org/jira/browse/SOLR-3621]?
* To test if it was Jetty, I wrote a small servlet that just decodes JSON and 
pounded it under both Jetty 8.1.2 and 8.1.7  without problems -- however, 
[SOLR-4031|https://issues.apache.org/jira/browse/SOLR-4031] sounds awfully 
similar, and...
* If I unzip the 4.0 tarball, and replace jetty (example/{lib/,start.jar}) with 
the jetty 8.1.7 from 4.1 nightly, it seems to fix the problem.

I'm posting this in case anyone runs into similar problems, and meanwhile I 
will keep testing 4.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to