Large import making solr unresponsive
This is an issue we've only started running into lately, so I'm not sure what to make of it. We have two cores on a Solr machine right now: one is about 10k documents, the other about 1.5 million. None of the documents are very large, only about 30 short attributes each. We also have about 10 requests/sec hitting the smaller core and fewer on the larger one.

Whenever we do a full import on the smaller core everything is fine; the response times stay the same during the whole 30 seconds it takes to run the indexer, and the CPU also stays fairly low. When we run a full import on the larger core, the response times on all cores tank from about 10ms to over 8 seconds.

We have a 4-core machine (a VM), and I've noticed one core stays pegged the entire time, which is understandable since the DIH, as I understand it, is single threaded. Also, from what I can tell there is no disk, network, or memory pressure (8GB) either, and the other cores do virtually nothing. The responses from Solr still come back with a <10ms QTime. My best guess at this point is that Tomcat is having issues when the single core gets pegged, but I'm at a loss on how to narrow this down to a Tomcat issue or something odd that Solr is doing. Has anyone run into this before, or have ideas about what might be happening?
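One thing worth ruling out when QTime is low but client-observed latency is high is request queuing in the servlet container, before Solr ever sees the request. A hedged sketch of the relevant Tomcat connector settings (the port and values below are illustrative assumptions, not the poster's actual configuration):

```xml
<!-- conf/server.xml: HTTP connector fronting the Solr webapp.
     maxThreads  = worker threads serving requests concurrently.
     acceptCount = OS-level connection queue once all workers are busy.
     If requests pile up here while one CPU is pegged, Solr's QTime
     stays low but wall-clock response time climbs. -->
<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="200"
           acceptCount="100"
           connectionTimeout="20000" />
```

Comparing a thread dump taken during the import against these limits would show whether the worker pool is actually saturated.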
RE: problem adding new fields in DIH
Thanks for the explanation and bug report, Robert!

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Monday, July 09, 2012 3:18 PM
To: solr-user@lucene.apache.org
Subject: Re: problem adding new fields in DIH

Thanks again for reporting this Brent. I opened a JIRA issue:
https://issues.apache.org/jira/browse/SOLR-3610

On Mon, Jul 9, 2012 at 3:36 PM, Brent Mills wrote:
> We're having an issue when we add or change a field in the db-data-config.xml
> and schema.xml files in solr. Basically whenever I add something new to
> index I add it to the database, then the data config, then add the field to
> the schema to index, reload the core, and do a full import. This has worked
> fine until we upgraded to an iteration of 4.0 (we are currently on 4.0
> alpha). Now sometimes when we go through this process solr throws errors
> about the field not being found. The only way to fix this is to restart
> tomcat and everything immediately starts working fine again.
>
> The interesting thing is that this is only a problem if the database is
> returning a value for that field and only in the documents that have a value.
> The field shows up in the schema browser in solr, it just has no data in it.
> If I completely remove it from the database but leave it in the schema and
> dataconfig files there is no issue. Also of note, this is happening on 2
> different machines.
>
> Here's the trace:
>
> SEVERE: Exception while solr commit.
> java.lang.IllegalArgumentException: no such field test
>     at org.apache.solr.core.DefaultCodecFactory$1.getPostingsFormatForField(DefaultCodecFactory.java:49)
>     at org.apache.lucene.codecs.lucene40.Lucene40Codec$1.getPostingsFormatForField(Lucene40Codec.java:52)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:94)
>     at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335)
>     at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
>     at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
>     at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
>     at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82)
>     at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:480)
>     at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
>     at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:554)
>     at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2547)
>     at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2683)
>     at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2663)
>     at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:414)
>     at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:82)
>     at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:919)
>     at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
>     at org.apache.solr.handler.dataimport.SolrWriter.commit(SolrWriter.java:107)
>     at org.apache.solr.handler.dataimport.DocBuilder.finish(DocBuilder.java:304)
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:256)
>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:399)
>     at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:380)

--
lucidimagination.com
problem adding new fields in DIH
We're having an issue when we add or change a field in the db-data-config.xml and schema.xml files in Solr. Basically, whenever I add something new to index I add it to the database, then the data config, then add the field to the schema to index, reload the core, and do a full import. This has worked fine until we upgraded to an iteration of 4.0 (we are currently on 4.0 alpha). Now sometimes when we go through this process Solr throws errors about the field not being found. The only way to fix this is to restart Tomcat, and everything immediately starts working fine again.

The interesting thing is that this is only a problem if the database is returning a value for that field, and only in the documents that have a value. The field shows up in the schema browser in Solr, it just has no data in it. If I completely remove it from the database but leave it in the schema and dataconfig files there is no issue. Also of note, this is happening on 2 different machines.

Here's the trace:

SEVERE: Exception while solr commit.
java.lang.IllegalArgumentException: no such field test
    at org.apache.solr.core.DefaultCodecFactory$1.getPostingsFormatForField(DefaultCodecFactory.java:49)
    at org.apache.lucene.codecs.lucene40.Lucene40Codec$1.getPostingsFormatForField(Lucene40Codec.java:52)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:94)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335)
    at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
    at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
    at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
    at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82)
    at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:480)
    at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
    at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:554)
    at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2547)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2683)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2663)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:414)
    at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:82)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:919)
    at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
    at org.apache.solr.handler.dataimport.SolrWriter.commit(SolrWriter.java:107)
    at org.apache.solr.handler.dataimport.DocBuilder.finish(DocBuilder.java:304)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:256)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:399)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:380)
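For reference, the add-a-field workflow described above touches two files before the core reload. A minimal sketch with hypothetical column and type names (the post only ever names the field "test"):

```xml
<!-- db-data-config.xml: map the new database column to a Solr field -->
<entity name="item" query="SELECT id, test FROM items">
  <field column="id"   name="id" />
  <field column="test" name="test" />
</entity>

<!-- schema.xml: declare the field before reloading the core -->
<field name="test" type="string" indexed="true" stored="true" />
```

The stack trace shows the per-field codec lookup failing at commit time, which is consistent with the index writer still holding a pre-reload view of the schema, hence the restart clearing it.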
RE: Nested CachedSqlEntityProcessor running for each entity row with Solr 3.6?
Hi James,

I just pulled down the newest nightly build of 4.0 and it solves an issue I had been having with Solr ignoring the caching of the child entities. It was basically opening a new connection for each iteration even though everything was specified correctly. This was present in my previous build of 4.0, so it looks like you fixed it with one of those patches. Thanks for all your work on the DIH; the caching improvements are a big help with some of the things we will be rolling out in production soon.

-Brent

-----Original Message-----
From: Dyer, James [mailto:james.d...@ingrambook.com]
Sent: Monday, May 07, 2012 1:47 PM
To: solr-user@lucene.apache.org
Cc: Brent Mills; dye.kel...@gmail.com; keithn...@dswinc.com
Subject: RE: Nested CachedSqlEntityProcessor running for each entity row with Solr 3.6?

Dear Kellen, Brent & Keith,

There now are fixes available for 2 cache-related bugs that unfortunately made their way into the 3.6.0 release. These were addressed on these 2 JIRA issues, which have been committed to the 3.6 branch (as of today):

- https://issues.apache.org/jira/browse/SOLR-3430
- https://issues.apache.org/jira/browse/SOLR-3360

These problems were also affecting Trunk/4.x, with both fixes being committed to Trunk under SOLR-3430. Should Solr 3.6.1 be released, these fixes will become generally available at that time. They will also be part of the 4.0 release, which the development community hopes will be later this year.

In the meantime, I am hoping each of you can test these fixes with your installation. The best way to do this is to get a fresh SVN checkout of the 3.6 branch (http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/), switch to the "solr" directory, then run "ant dist". I believe you need Ant 1.8 to build.
If you are unable to build it yourself, I put an *unofficial* snapshot of the DIH jar here:
http://people.apache.org/~jdyer/unofficial/apache-solr-dataimporthandler-3.6.1-SNAPSHOT-r1335176.jar

Please let me know if this solves your problems with DIH caching, giving you the functionality you had with 3.5 and prior. Your feedback is greatly appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: not interesting [mailto:dye.kel...@gmail.com]
Sent: Monday, May 07, 2012 9:43 AM
To: solr-user@lucene.apache.org
Subject: Nested CachedSqlEntityProcessor running for each entity row with Solr 3.6?

I just upgraded from Solr 3.4 to Solr 3.6; I'm using the same data-import.xml for both versions. The import functioned properly with 3.4.

I'm using a nested entity to fetch authors associated with each document, and I'm using CachedSqlEntityProcessor to avoid hitting the DB an unreasonable number of times. However, when indexing, Solr indexes very slowly and appears to be fetching all authors in the DB for each document. The index should be ~500 megs; I aborted the indexing when it reached ~6 gigs. If I comment out the nested author entity below, Solr will index normally. Am I missing something obvious, or is this a bug?

Also posted at SO if you prefer to answer there:
http://stackoverflow.com/questions/10482484/nested-cachedsqlentityprocessor-running-for-each-entity-row-with-solr-3-6

Kellen
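For anyone hitting the same symptom, a minimal sketch of the kind of nested cached child entity being discussed here, using the cacheKey/cacheLookup attributes introduced around 3.6 (table and column names are hypothetical, not from the original posts):

```xml
<!-- db-data-config.xml: the child entity's query runs once and is
     cached in memory; cacheKey/cacheLookup join cached rows to each
     parent row instead of issuing one SQL query per parent. -->
<entity name="doc" query="SELECT id, title FROM documents">
  <entity name="author"
          query="SELECT doc_id, author_name FROM authors"
          processor="CachedSqlEntityProcessor"
          cacheKey="doc_id"
          cacheLookup="doc.id" />
</entity>
```

The bug under discussion made such an entity behave as if uncached, re-querying the authors table for every document, which matches the "opening a new connection for each iteration" and runaway index-size symptoms reported in this thread.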
Upgrading to 3.6 broke cachedsqlentityprocessor
I've read some things in JIRA about the new caching functionality that was put into the DIH, but I wouldn't think it should break the old behavior. It doesn't look as though any errors are being thrown; it's just ignoring the caching part and opening a ton of connections. Also, I cannot find any documentation on the new functionality that was added, so I'm not sure what syntax is valid and what's not. Here is my entity, which worked in 3.1 but no longer works in 3.6:
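The entity definition itself is not shown above. As an illustration only, a cached nested entity in the 3.1-era syntax generally looked like this, where the where attribute both expressed the join and keyed the cache (names here are hypothetical):

```xml
<!-- Pre-3.6 style: with CachedSqlEntityProcessor, the query runs once
     and where="childColumn=parent.column" is used as the cache key,
     so each parent row is joined from memory rather than via SQL. -->
<entity name="author"
        query="SELECT doc_id, author_name FROM authors"
        processor="CachedSqlEntityProcessor"
        where="doc_id=doc.id" />
```

The 3.6 cache rework (SOLR-2382) moved toward explicit cacheKey/cacheLookup attributes, which is the likely source of the syntax confusion described here.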
RE: Sharing dih "dictionaries"
You're totally correct. There's actually a link on the DIH page now which wasn't there when I had read it a long time ago. I'm really looking forward to 4.0; it's got a ton of great new features. Thanks for the links!!

-----Original Message-----
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
Sent: Monday, December 05, 2011 10:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Sharing dih "dictionaries"

It looks like https://issues.apache.org/jira/browse/SOLR-2382 or even https://issues.apache.org/jira/browse/SOLR-2613. I guess by using SOLR-2382 you can specify your own SortedMapBackedCache subclass which is able to share your Dictionary.

Regards

On Tue, Dec 6, 2011 at 12:26 AM, Brent Mills wrote:
> I'm not really sure how to title this but here's what I'm trying to do.
>
> I have a query that creates a rather large dictionary of codes that
> are shared across multiple fields of a base entity. I'm using the
> CachedSqlEntityProcessor but I was curious if there was a way to join
> this multiple times to the base entity so I can avoid having to reload
> it for each column join.
>
> Ex:
>
> Kind of a simplified example but in this case the dictionary query has
> to be run 3 times to join 3 different columns. It would be nice if I
> could load the data set once as an entity and specify how to join it
> in code without requiring a separate sql query. Any ideas?

--
Sincerely yours
Mikhail Khludnev
Developer
Grid Dynamics
tel. 1-415-738-8644
Skype: mkhludnev
<http://www.griddynamics.com>
Sharing dih "dictionaries"
I'm not really sure how to title this, but here's what I'm trying to do.

I have a query that creates a rather large dictionary of codes that are shared across multiple fields of a base entity. I'm using the CachedSqlEntityProcessor, but I was curious if there was a way to join this multiple times to the base entity so I can avoid having to reload it for each column join.

Ex:

Kind of a simplified example, but in this case the dictionary query has to be run 3 times to join 3 different columns. It would be nice if I could load the data set once as an entity and specify how to join it in code without requiring a separate SQL query. Any ideas?
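The inline example did not survive, but the shape of the problem can be sketched as follows (all table and column names are hypothetical): the same dictionary query appears as three cached child entities, one per code column, so the dictionary SQL is executed, and its result cached, three separate times:

```xml
<!-- Each child caches its own full copy of the codes table; the only
     difference between them is which parent column drives the lookup. -->
<entity name="product"
        query="SELECT id, color_code, size_code, type_code FROM products">
  <entity name="color" query="SELECT code, label FROM codes"
          processor="CachedSqlEntityProcessor"
          cacheKey="code" cacheLookup="product.color_code" />
  <entity name="size" query="SELECT code, label FROM codes"
          processor="CachedSqlEntityProcessor"
          cacheKey="code" cacheLookup="product.size_code" />
  <entity name="type" query="SELECT code, label FROM codes"
          processor="CachedSqlEntityProcessor"
          cacheKey="code" cacheLookup="product.type_code" />
</entity>
```

The ask in this thread is a way to load `codes` once and reuse the cache for all three joins, which is what the SOLR-2382/SOLR-2613 pluggable-cache work mentioned in the reply is aimed at.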