[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136911#comment-14136911 ]

Michael McCandless commented on LUCENE-5400:

Thanks for backporting [~sar...@syr.edu]!

Hmm, now I'm hitting this test failure on 4.9.x:

{noformat}
ant test -Dtestcase=TestStandardAnalyzer -Dtests.method=testRandomHugeStringsGraphAfter -Dtests.seed=65FB3AF41D805AF9 -Dtests.locale=mk_MK -Dtests.timezone=Etc/GMT+5 -Dtests.file.encoding=UTF-8

[junit4] FAILURE 0.41s | TestStandardAnalyzer.testRandomHugeStringsGraphAfter
[junit4] Throwable #1: java.lang.AssertionError
[junit4]   at __randomizedtesting.SeedInfo.seed([65FB3AF41D805AF9:CA1B98C5DDF4A2CB]:0)
[junit4]   at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:751)
[junit4]   at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:614)
[junit4]   at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:513)
[junit4]   at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:437)
[junit4]   at org.apache.lucene.analysis.core.TestStandardAnalyzer.testRandomHugeStringsGraphAfter(TestStandardAnalyzer.java:402)
[junit4]   at java.lang.Thread.run(Thread.java:745)
[junit4]   2> NOTE: test params are: codec=Lucene46, sim=RandomSimilarityProvider(queryNorm=false,coord=no): {}, locale=mk_MK, timezone=Etc/GMT+5
[junit4]   2> NOTE: Linux 3.13.0-32-generic amd64/Oracle Corporation 1.7.0_55 (64-bit)/cpus=8,threads=1,free=378278472,total=503316480
[junit4]   2> NOTE: All tests run in this JVM: [TestStandardAnalyzer]
{noformat}

I dug just a bit... looks like we are passing len=0 to MockReaderWrapper.read(char[], int, int), which it can't handle (it calls {{realLen = TestUtil.nextInt(random, 1, len);}}) ... I'm not sure why we don't hit this on 4.x/trunk...
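Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization

Key: LUCENE-5400
URL: https://issues.apache.org/jira/browse/LUCENE-5400
Project: Lucene - Core
Issue Type: Bug
Affects Versions: 4.5
Reporter: Chris Geeringh
Assignee: Steve Rowe
Fix For: 4.10, 5.0, 4.9.1

This is a pretty nasty bug, and causes the cluster to stop accepting updates. I'm not sure how to consistently reproduce it but I have done so numerous times. Switching to a whitespace tokenizer improved indexing speed, and I never got the issue again.

I'm running a 4.6 Snapshot - I had issues with deadlocks with numerous versions of Solr, and have finally narrowed down the problem to this code, which affects many/all(?) versions of Solr.

When the thread hits this issue it uses 100% CPU, restarting the node which has the error allows indexing to continue until hit again.

Here is thread dump:

http-bio-8080-exec-45 (201)
org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl.getNextToken(UAX29URLEmailTokenizerImpl.java:4343)
org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer.incrementToken(UAX29URLEmailTokenizer.java:147)
org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1517)
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:217)
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:583)
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:719)
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:449)
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:89)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:151)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:131)
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:116)
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:158)
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:99)
org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)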
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136942#comment-14136942 ]

Michael McCandless commented on LUCENE-5400:

I think this passing len=0 was fixed in 4.x/trunk by one of the JFlex upgrades? When I diff StandardTokenizerImpl.java from 4.9.x to 4.x I see this difference:

{noformat}
1025,1027c523,532
<     /* finally: fill the buffer with new input */
<     int numRead = zzReader.read(zzBuffer, zzEndRead,
<                                 zzBuffer.length-zzEndRead);
---
>     /* fill the buffer with new input */
>     int requested = zzBuffer.length - zzEndRead - zzFinalHighSurrogate;
>     int totalRead = 0;
>     while (totalRead < requested) {
>       int numRead = zzReader.read(zzBuffer, zzEndRead + totalRead, requested - totalRead);
>       if (numRead == -1) {
>         break;
>       }
>       totalRead += numRead;
>     }
{noformat}

I could fix this by having MockReaderWrapper.read immediately return 0 if len is 0, but this seems scary, i.e. is there a real bug in StandardTokenizerImpl...
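The refill loop on the 4.x side of the diff above can be tried in isolation. The following is a minimal sketch, not the JFlex-generated code: `RefillSketch`, `fill` and `oneCharAtATime` are hypothetical names, and a plain `StringReader` stands in for `zzReader`. It illustrates why such a loop never issues a read with len=0: the body only runs while `requested - totalRead` is positive.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class RefillSketch {

  /**
   * Reads into buf starting at 'start' until 'requested' chars have been
   * stored or the reader reports EOF. Returns the number of chars stored.
   * When requested == 0 the loop body never runs, so read() is not called.
   */
  static int fill(Reader in, char[] buf, int start, int requested) throws IOException {
    int totalRead = 0;
    while (totalRead < requested) {
      int numRead = in.read(buf, start + totalRead, requested - totalRead);
      if (numRead == -1) {
        break; // EOF: keep whatever was read so far
      }
      totalRead += numRead;
    }
    return totalRead;
  }

  /** Hypothetical reader that returns at most one char per call, to simulate short reads. */
  static Reader oneCharAtATime(String s) {
    return new StringReader(s) {
      @Override
      public int read(char[] cbuf, int off, int len) throws IOException {
        return super.read(cbuf, off, Math.min(len, 1));
      }
    };
  }

  public static void main(String[] args) throws IOException {
    char[] buf = new char[8];
    int n = fill(oneCharAtATime("lucene"), buf, 0, buf.length);
    System.out.println(n + " " + new String(buf, 0, n));
  }
}
```

One caveat of this shape: it relies on the reader returning at least one char (or -1) whenever len is positive, which is what a blocking `java.io.Reader` is expected to do.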
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137218#comment-14137218 ]

Steve Rowe commented on LUCENE-5400:

Thanks for finding the bug, [~mikemccand].

This problem doesn't exist on trunk or branch_4x because JFlex 1.6's {{zzRefill()}} doesn't call {{Reader.read()}} with len=0. It's only a problem on {{lucene_solr_4_9}} because when I adjusted the generated scanner munging in analysis-common's {{run-jflex-and-disable-buffer-expansion}} macro to work with JFlex 1.5-generated code for the 4.9.1 backport, I didn't also modify the code to not call {{Reader.read()}} with len=0.

I've changed the munging code locally and {{TestStandardAnalyzer.testRandomHugeStringsGraphAfter()}} now passes with the above-mentioned seed. Here's what {{StandardTokenizerImpl.zzRefill()}} has now:

{code:java}
    /* finally: fill the buffer with new input */
    int numRead = 0, requested = zzBuffer.length - zzEndRead;
    if (requested > 0) numRead = zzReader.read(zzBuffer, zzEndRead, requested);
{code}

I'm currently beasting {{TestStandardAnalyzer}} and {{TestUAX29URLEmailTokenizer}} (no failures yet after 100 and 50 runs, respectively). Committing the fix shortly.
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137226#comment-14137226 ]

ASF subversion and git services commented on LUCENE-5400:

Commit 1625586 from [~sar...@syr.edu] in branch 'dev/branches/lucene_solr_4_9'
[ https://svn.apache.org/r1625586 ]

LUCENE-5897, LUCENE-5400: change JFlex-generated source munging so that zzRefill() doesn't call Reader.read(buffer,start,len) with len=0
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137341#comment-14137341 ]

Uwe Schindler commented on LUCENE-5400:

This fix is fine, because it spares one method call. But in any case the MockReader impl is wrong. You can always call Reader.read() with len=0; this is not disallowed, and all other readers support it. So MockReader may just need a condition like {{if (len==0) return 0;}}
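The condition suggested in the comment above could look roughly like this. It is a minimal sketch under stated assumptions, not Lucene's actual MockReaderWrapper: the class name `LenientShortReadReader` is hypothetical, and `java.util.Random.nextInt` stands in for the `TestUtil.nextInt(random, 1, len)` call quoted earlier in the thread.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Random;

/** Hypothetical test wrapper that returns short reads but tolerates len == 0. */
final class LenientShortReadReader extends FilterReader {
  private final Random random;

  LenientShortReadReader(Random random, Reader in) {
    super(in);
    this.random = random;
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    if (len == 0) {
      return 0; // nothing requested, nothing read; avoids picking from the empty range [1, 0]
    }
    // Ask the wrapped reader for a random amount in [1, len] to exercise short reads.
    int realLen = 1 + random.nextInt(len);
    return in.read(cbuf, off, realLen);
  }

  public static void main(String[] args) throws IOException {
    Reader r = new LenientShortReadReader(new Random(42), new StringReader("hello world"));
    char[] buf = new char[16];
    System.out.println(r.read(buf, 0, 0)); // zero-length read returns 0 instead of failing
  }
}
```

The trade-off is the one raised in the thread: a wrapper written this way follows the reader contract, but it no longer flags callers that issue zero-length reads.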
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137350#comment-14137350 ]

Michael McCandless commented on LUCENE-5400:

But then again I sort of want to know when a Lucene tokenizer is passing len=0 ... that's ... a strange thing to be doing.
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137345#comment-14137345 ]

Michael McCandless commented on LUCENE-5400:

+1 to fix MockReaderWrapper.
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137489#comment-14137489 ] Uwe Schindler commented on LUCENE-5400: --- Yeah, so we have to decide: - If MockReaderWrapper is standards conformant - of If we want to detect bugs For the latter we should keep it as it is. Maybe make it explicit and print a good message in the assert. Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization - Key: LUCENE-5400 URL: https://issues.apache.org/jira/browse/LUCENE-5400 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.5 Reporter: Chris Geeringh Assignee: Steve Rowe Fix For: 4.9.1, 4.10, 5.0 This is a pretty nasty bug, and causes the cluster to stop accepting updates. I'm not sure how to consistently reproduce it but I have done so numerous times. Switching to a whitespace tokenizer improved indexing speed, and I never got the issue again. I'm running a 4.6 Snapshot - I had issues with deadlocks with numerous versions of Solr, and have finally narrowed down the problem to this code, which affects many/all(?) versions of Solr. When the thread hits this issue it uses 100% CPU, restarting the node which has the error allows indexing to continue until hit again. 
Here is thread dump:

http-bio-8080-exec-45 (201)
org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl.getNextToken(UAX29URLEmailTokenizerImpl.java:4343)
org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer.incrementToken(UAX29URLEmailTokenizer.java:147)
org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1517)
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:217)
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:583)
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:719)
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:449)
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:89)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:151)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:131)
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:116)
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:158)
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:99)
org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137547#comment-14137547 ]

Michael McCandless commented on LUCENE-5400:
--------------------------------------------

+1 to make MRW anal and throw an exc on len==0 explaining that it's actually OK but WTF is your tokenizer doing...
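The strict behaviour proposed in the comments above (reject a zero-length read with an explanatory message rather than failing somewhere less obvious) could be sketched roughly as follows. This is a minimal illustration only: `StrictReaderWrapper` and `StrictReaderDemo` are hypothetical names, the exception type and message wording are assumptions, and none of this is the actual MockReaderWrapper code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical test-only wrapper; NOT Lucene's MockReaderWrapper. */
class StrictReaderWrapper extends Reader {
  private final Reader in;

  StrictReaderWrapper(Reader in) {
    this.in = in;
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    if (len == 0) {
      // Assumption for this sketch: fail loudly so the caller's bug is visible.
      // The thread describes len == 0 as legal, but suspicious for a tokenizer.
      throw new IllegalArgumentException(
          "read() called with len=0: this is allowed, but the consumer is probably "
              + "asking for more input while its buffer is already full");
    }
    return in.read(cbuf, off, len);
  }

  @Override
  public void close() throws IOException {
    in.close();
  }
}

class StrictReaderDemo {
  public static void main(String[] args) throws IOException {
    Reader r = new StrictReaderWrapper(new StringReader("abc"));
    char[] buf = new char[8];
    System.out.println("chars read: " + r.read(buf, 0, 2));
    try {
      r.read(buf, 2, 0); // zero-length request is rejected by the wrapper
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Whether such a check should throw or merely return 0 is exactly the trade-off raised in the thread: throwing surfaces tokenizer bugs in tests, while returning 0 would be the lenient choice.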
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136735#comment-14136735 ]

ASF subversion and git services commented on LUCENE-5400:
---------------------------------------------------------

Commit 1625458 from [~sar...@syr.edu] in branch 'dev/branches/lucene_solr_4_9'
[ https://svn.apache.org/r1625458 ]

LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences. (merged branch_4x r1619773)
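To picture what "scanner buffer growth was disabled" in the commit message above can mean in practice, here is a toy scanner with a fixed-size buffer. It is a hedged sketch, not the JFlex-generated scanner: `FixedBufferScanner`, its whitespace-only token rule, and the choice to emit a full buffer as its own chunk are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical illustration only; NOT the JFlex-generated Lucene scanner. */
class FixedBufferScanner {
  private final Reader in;
  private final char[] buf; // allocated once, never grown

  FixedBufferScanner(Reader in, int bufferSize) {
    if (bufferSize <= 0) {
      throw new IllegalArgumentException("bufferSize must be positive");
    }
    this.in = in;
    this.buf = new char[bufferSize];
  }

  /** Returns the next non-whitespace run (at most buf.length chars), or null at end of input. */
  String next() throws IOException {
    int len = 0;
    int c;
    while ((c = in.read()) != -1) {
      if (Character.isWhitespace(c)) {
        if (len > 0) {
          break; // whitespace ends the current token
        }
        continue; // skip leading whitespace
      }
      buf[len++] = (char) c;
      if (len == buf.length) {
        break; // buffer full: emit what we have instead of growing the buffer
      }
    }
    return len == 0 ? null : new String(buf, 0, len);
  }
}

class FixedBufferScannerDemo {
  public static void main(String[] args) throws IOException {
    FixedBufferScanner s = new FixedBufferScanner(new StringReader("ab abcdefgh x"), 4);
    for (String t = s.next(); t != null; t = s.next()) {
      System.out.println(t); // prints ab, abcd, efgh, x on separate lines
    }
  }
}
```

The point of the sketch is only that the work done per token is bounded by the buffer size, so a very long run of matching characters cannot force ever-larger buffers and rescans.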
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106685#comment-14106685 ]

ASF subversion and git services commented on LUCENE-5400:
---------------------------------------------------------

Commit 1619730 from [~sar...@syr.edu] in branch 'dev/trunk'
[ https://svn.apache.org/r1619730 ]

LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences.
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106759#comment-14106759 ]

ASF subversion and git services commented on LUCENE-5400:
---------------------------------------------------------

Commit 1619773 from [~sar...@syr.edu] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1619773 ]

LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences. (merged trunk r1619730)
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106967#comment-14106967 ]

ASF subversion and git services commented on LUCENE-5400:
---------------------------------------------------------

Commit 1619836 from [~rjernst] in branch 'dev/branches/lucene_solr_4_10'
[ https://svn.apache.org/r1619836 ]

LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences. (merged branch_4x r1619773)
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106977#comment-14106977 ]

ASF subversion and git services commented on LUCENE-5400:
---------------------------------------------------------

Commit 1619840 from [~rjernst] in branch 'dev/trunk'
[ https://svn.apache.org/r1619840 ]

LUCENE-5672,LUCENE-5897,LUCENE-5400: move changes entry to 4.10.0
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106980#comment-14106980 ] ASF subversion and git services commented on LUCENE-5400: - Commit 1619841 from [~rjernst] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1619841 ] LUCENE-5672,LUCENE-5897,LUCENE-5400: move changes entry to 4.10.0 Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization - Key: LUCENE-5400 URL: https://issues.apache.org/jira/browse/LUCENE-5400 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.5 Reporter: Chris Geeringh Assignee: Steve Rowe This is a pretty nasty bug, and causes the cluster to stop accepting updates. I'm not sure how to consistently reproduce it but I have done so numerous times. Switching to a whitespace tokenizer improved indexing speed, and I never got the issue again. I'm running a 4.6 Snapshot - I had issues with deadlocks with numerous versions of Solr, and have finally narrowed down the problem to this code, which affects many/all(?) versions of Solr. When the thread hits this issue it uses 100% CPU, restarting the node which has the error allows indexing to continue until hit again. 
Here is thread dump:

http-bio-8080-exec-45 (201)
org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl.getNextToken(UAX29URLEmailTokenizerImpl.java:4343)
org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer.incrementToken(UAX29URLEmailTokenizer.java:147)
org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1517)
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:217)
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:583)
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:719)
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:449)
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:89)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:151)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:131)
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:116)
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:158)
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:99)
org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
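A thread dump like the one in this report can also be checked programmatically. The following is a minimal sketch, not part of the original report, that uses only the JDK to list threads whose current stack contains a given class. The class name `TokenizerStackScan` and its helper methods are hypothetical names introduced for illustration; the only detail taken from the report is the `UAX29URLEmailTokenizerImpl` frame. It assumes the check runs inside the same JVM as the indexing threads (for example from a diagnostic hook); for an external process, a tool such as `jstack` would be needed instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical diagnostic helper; not part of Lucene or Solr.
public class TokenizerStackScan {
    // Class name copied from the top frame of the reported thread dump.
    static final String FRAME =
        "org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl";

    /** Returns true if any frame's declaring class starts with the given prefix. */
    static boolean containsFrame(StackTraceElement[] stack, String classPrefix) {
        for (StackTraceElement e : stack) {
            if (e.getClassName().startsWith(classPrefix)) {
                return true;
            }
        }
        return false;
    }

    /** Names of live threads in this JVM whose current stack contains the prefix. */
    static List<String> threadsIn(String classPrefix) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> en
                : Thread.getAllStackTraces().entrySet()) {
            if (containsFrame(en.getValue(), classPrefix)) {
                names.add(en.getKey().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints the names of threads currently inside the tokenizer, if any.
        System.out.println(threadsIn(FRAME));
    }
}
```

A single snapshot only shows where a thread is at that instant. A thread that appears in the tokenizer across several samples taken seconds apart would be more consistent with the 100% CPU symptom described in the report.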
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106985#comment-14106985 ]

ASF subversion and git services commented on LUCENE-5400:

Commit 1619842 from [~rjernst] in branch 'dev/branches/lucene_solr_4_10'
[ https://svn.apache.org/r1619842 ]

LUCENE-5672,LUCENE-5897,LUCENE-5400: move changes entry to 4.10.0
[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
[ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13897182#comment-13897182 ]

Edu Garcia commented on LUCENE-5400:

Hi. We've hit this bug in Atlassian Confluence (https://jira.atlassian.com/browse/CONF-32566) and it's causing a bit of customer pain.

Is [~steve_rowe]'s solution a viable one, or is someone working on a better one?

Thank you!