One of three cores is missing userData and lastModified fields from /admin/cores
Hey All,

On a Solr server running 4.10.2 with three cores, two return the expected info from /solr/admin/cores?wt=json but the third is missing userData and lastModified. The first (artists) and third (tracks) cores from the linked screenshot are the ones I care about. Unfortunately, the third (tracks) is the one missing lastModified. As far as I can see, that value comes from: https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_2/solr/core/src/java/org/apache/solr/handler/admin/LukeRequestHandler.java#L568

I can't trace back to see what would possibly cause getUserData() to return an empty object, but that appears to be what is happening. For these servers, indexes that are pre-optimized are shipped over to the server and the server is restarted... nothing is ever actually committed on the live servers. This should behave exactly the same for artists and tracks, even though tracks is the one always missing lastModified.

Here's the output in image form; I'll paste the full JSON[1] below: http://monosnap.com/image/XMyAfk5z3AvHgY39m0qAKAGlc3RACI.png

I'd like to be able to give clients access to the lastModified time for both indices so that they can see how old/stale the data they are getting results back from is...

...alternately, is there any other way to easily expose how old (last modified time?) the index for a core is?
Thanks,
Aaron

1: Full JSON
---snip---
{
  "responseHeader": { "status": 0, "QTime": 10 },
  "defaultCoreName": "collection1",
  "initFailures": {},
  "status": {
    "artists": {
      "name": "artists",
      "isDefaultCore": false,
      "instanceDir": "/opt/solr/search/solr/artists/",
      "dataDir": "/opt/solr/search/solr/artists/",
      "config": "solrconfig.xml",
      "schema": "schema.xml",
      "startTime": "2015-03-24T14:12:23.667Z",
      "uptime": 7335696,
      "index": {
        "numDocs": 3360380,
        "maxDoc": 3360380,
        "deletedDocs": 0,
        "indexHeapUsageBytes": 63366952,
        "version": 421,
        "segmentCount": 1,
        "current": true,
        "hasDeletions": false,
        "directory": "org.apache.lucene.store.MMapDirectory:MMapDirectory@/opt/solr/search/solr/artists/index lockFactory=NativeFSLockFactory@/opt/solr/search/solr/artists/index",
        "userData": { "commitTimeMSec": "1427133705908" },
        "lastModified": "2015-03-23T18:01:45.908Z",
        "sizeInBytes": 25341305528,
        "size": "23.6 GB"
      }
    },
    "banana-int": {
      "name": "banana-int",
      "isDefaultCore": false,
      "instanceDir": "/opt/solr/search/solr/banana-int/",
      "dataDir": "/opt/solr/search/solr/banana-int/data/",
      "config": "solrconfig.xml",
      "schema": "schema.xml",
      "startTime": "2015-03-24T14:12:22.895Z",
      "uptime": 7336472,
      "index": {
        "numDocs": 3,
        "maxDoc": 3,
        "deletedDocs": 0,
        "indexHeapUsageBytes": 17448,
        "version": 135,
        "segmentCount": 3,
        "current": true,
        "hasDeletions": false,
        "directory": "org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/opt/solr/search/solr/banana-int/data/index lockFactory=NativeFSLockFactory@/opt/solr/search/solr/banana-int/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0)",
        "userData": { "commitTimeMSec": "1412796723183" },
        "lastModified": "2014-10-08T19:32:03.183Z",
        "sizeInBytes": 16196,
        "size": "15.82 KB"
      }
    },
    "tracks": {
      "name": "tracks",
      "isDefaultCore": false,
      "instanceDir": "/opt/solr/search/solr/tracks/",
      "dataDir": "/opt/solr/search/solr/tracks/",
      "config": "solrconfig.xml",
      "schema": "schema.xml",
      "startTime": "2015-03-24T14:12:23.656Z",
      "uptime": 7335713,
      "index": {
        "numDocs": 53268126,
        "maxDoc": 53268126,
        "deletedDocs": 0,
        "indexHeapUsageBytes": 517650552,
        "version": 100,
        "segmentCount": 1,
        "current": true,
        "hasDeletions": false,
        "directory": "org.apache.lucene.store.MMapDirectory:MMapDirectory@/opt/solr/search/solr/tracks/index lockFactory=NativeFSLockFactory@/opt/solr/search/solr/tracks/index",
        "userData": {},
        "sizeInBytes": 122892905007,
        "size": "114.45 GB"
      }
    }
  }
}
---snip---
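For what it's worth, one plausible (unconfirmed) explanation: commitTimeMSec is written into the commit's userData by Solr's own update handler at commit time, so an index built and optimized elsewhere with a raw Lucene IndexWriter can carry empty userData - which would match the ship-preoptimized-and-restart workflow described above. In the meantime a client-side fallback is easy to sketch (Python; last_modified is a made-up helper name, and the dict shape is the /solr/admin/cores?wt=json output quoted above):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def last_modified(core_status):
    """Return the last-modified time for one core's entry from the
    'status' map of /solr/admin/cores?wt=json output, or None.

    Prefers index.lastModified; otherwise derives the same value from
    index.userData.commitTimeMSec (lastModified is just that commit
    timestamp rendered as a date).
    """
    index = core_status.get("index", {})
    if "lastModified" in index:
        return index["lastModified"]
    commit_ms = index.get("userData", {}).get("commitTimeMSec")
    if commit_ms is None:
        return None  # e.g. the tracks core above: empty commit userData
    dt = EPOCH + timedelta(milliseconds=int(commit_ms))
    return dt.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
```

If neither field is present (the tracks case), there is nothing in the response to fall back on; exposing the index directory's file mtime out of band, or performing a single no-op commit through Solr after the index is shipped, are the obvious alternatives.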
Re: Understanding fieldNorm differences between 3.6.1 and 4.9 solrs
Wow - so apparently I have terrible recall and should re-read the thread I started on this same topic almost two years ago, when upgrading from 1.4 to 3.6 and hitting a very similar fieldNorm issue! =)
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201207.mbox/%3CCALyTvnpwZMj4zxPbK0abVpnyRJny=qauijdqmj7e3zgnv7u...@mail.gmail.com%3E

In the meantime, I'm still happy to hear any new thoughts / suggestions on keeping similarity consistent across upgrades.

Thanks again,
Aaron

On Tue, Jul 1, 2014 at 11:14 PM, Aaron Daubman daub...@gmail.com wrote:

In trying to track down some subtle scoring differences (causing occasionally significant ordering differences) among search results, I wrote a parser to normalize debug.explain.structured JSON output. It appears that every score that differs comes down to a difference in fieldNorm: the 3.6.1 Solr uses 0.109375 as the fieldNorm where the 4.9 Solr uses 0.125. [1]

What would cause the different versions to use different field norms (and rather infrequently, as the majority of scores are identical, as desired)?
Thanks,
Aaron

[1] Here's a snippet of the diff (of the output from my debug.explain.structured normalizer) for one such difference; differing values are shown as "3.6.1 | 4.9":

06808040cd523a296abaf26025148c85: {
  _value: 0.839616605 | 0.854748135,
  description: product of:,
  details: [
    {
      _value: 2.623802 | 2.67108801,
      description: sum of:,
      details: [
        {
          _value: 0.0644619693 | 0.0736708307,
          description: weight(t_style:alternative,
          details: [
            {
              _value: 0.0629802298,
              description: queryWeight,
              details: [
                { _value: 4.18500798, description: idf(137871) }
              ]
            },
            {
              _value: 1.02352709 | 1.1697453,
              description: fieldWeight,
              details: [
                { _value: 2.23606799, description: tf(freq=5) },
                { _value: 4.18500798, description: idf(137871) },
                { _value: 0.109375 | 0.125, description: fieldNorm }
              ]
            }
          ]
        }
      ]
    }
  ]
}
Understanding fieldNorm differences between 3.6.1 and 4.9 solrs
In trying to track down some subtle scoring differences (causing occasionally significant ordering differences) among search results, I wrote a parser to normalize debug.explain.structured JSON output. It appears that every score that differs comes down to a difference in fieldNorm: the 3.6.1 Solr uses 0.109375 as the fieldNorm where the 4.9 Solr uses 0.125. [1]

What would cause the different versions to use different field norms (and rather infrequently, as the majority of scores are identical, as desired)?

Thanks,
Aaron

[1] Here's a snippet of the diff (of the output from my debug.explain.structured normalizer) for one such difference; differing values are shown as "3.6.1 | 4.9":

06808040cd523a296abaf26025148c85: {
  _value: 0.839616605 | 0.854748135,
  description: product of:,
  details: [
    {
      _value: 2.623802 | 2.67108801,
      description: sum of:,
      details: [
        {
          _value: 0.0644619693 | 0.0736708307,
          description: weight(t_style:alternative,
          details: [
            {
              _value: 0.0629802298,
              description: queryWeight,
              details: [
                { _value: 4.18500798, description: idf(137871) }
              ]
            },
            {
              _value: 1.02352709 | 1.1697453,
              description: fieldWeight,
              details: [
                { _value: 2.23606799, description: tf(freq=5) },
                { _value: 4.18500798, description: idf(137871) },
                { _value: 0.109375 | 0.125, description: fieldNorm }
              ]
            }
          ]
        }
      ]
    }
  ]
}
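For context on why exactly these two values show up: in both of these releases the default lengthNorm is 1/sqrt(numTerms), quantized into a single byte via Lucene's SmallFloat 3-mantissa-bit encoding, so whole ranges of field lengths collapse onto the same stored norm (0.125 corresponds to length 64; 0.109375 covers lengths 65-83). Below is my own Python port of SmallFloat.floatToByte315/byte315ToFloat (field_norm is a made-up helper):

```python
import math
import struct

def float_to_byte315(f):
    """Encode a float into Lucene's single-byte norm format
    (3 mantissa bits, 5 exponent bits, zero-exponent 15),
    truncating any extra precision."""
    if f <= 0.0:
        return 0
    bits = struct.unpack('>I', struct.pack('>f', f))[0]
    smallfloat = bits >> 21            # keep sign/exponent/top mantissa bits
    if smallfloat <= 48 << 3:
        return 1                       # positive underflow -> smallest byte
    if smallfloat >= (48 << 3) + 0x100:
        return 255                     # overflow -> largest byte
    return smallfloat - (48 << 3)

def byte315_to_float(b):
    """Decode the single-byte norm back to a float."""
    if b == 0:
        return 0.0
    bits = (b << 21) + (48 << 24)
    return struct.unpack('>f', struct.pack('>I', bits))[0]

def field_norm(num_terms):
    """lengthNorm as stored in the index: 1/sqrt(numTerms), quantized."""
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))
```

One plausible (unconfirmed) cause of the 3.6.1-vs-4.9 difference is therefore not the encoding but the term count: Lucene 4.x's DefaultSimilarity discounts overlapping tokens (position increment 0, e.g. from synonyms or WordDelimiterFilter) by default when computing numTerms, so 4.9 can see an effective length of 64 where 3.6.1 counted 65-83 tokens for the exact same field value.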
Re: Range Queries performing differently on SortableIntField vs TrieField of type integer
Hi Upayavira,

> One small question - did you re-index in-between? The index structure will be different for each.

Yes, the Solr 1.4.1 (working) instance was built using the original schema and that Solr version. The Solr 3.6.1 (not working) instance was rebuilt using the new schema and Solr 3.6.1...

Thanks,
Aaron
Re: Range Queries performing differently on SortableIntField vs TrieField of type integer
I forgot a possibly important piece... Given the different Solr versions, the schema version (and its related different defaults) is also a change:

Solr 1.4.1 has: <schema name="ourSchema" version="1.1">
Solr 3.6.1 has: <schema name="ourSchema" version="1.5">

Solr 1.4.1 relevant schema parts - working as desired:
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
...
<field name="i_yearStartSort" type="sint" indexed="true" stored="false" required="false" multiValued="true"/>
<field name="i_yearStopSort" type="sint" indexed="true" stored="false" required="false" multiValued="true"/>

Solr 3.6.1 relevant schema parts - not working as expected:
<fieldType name="tint" class="solr.TrieField" type="integer" precisionStep="4" sortMissingLast="true" positionIncrementGap="0" omitNorms="true"/>
...
<field name="i_yearStartSort" type="tint" indexed="true" stored="false" required="false" multiValued="false"/>
<field name="i_yearStopSort" type="tint" indexed="true" stored="false" required="false" multiValued="false"/>
Re: Cannot run Solr4 from Intellij Idea
Interestingly, I have run into this same (or very similar) issue when attempting to run embedded Solr. All of the solr.* classes that were recently moved to Lucene would not work with the solr.* shorthand - I had to replace them with the fully-qualified class name. As you found, these shorthands in the same schema worked fine from within Solr proper (the webapp). Is there a workaround for this? (It would be great to have a unified schema between embedded and webapp Solr instances.)

Thanks,
Aaron

On Tue, Dec 4, 2012 at 7:37 AM, Artyom ice...@mail.ru wrote:

After 2 days I have figured out how to open Solr 4 in IntelliJ IDEA 11.1.4 on Tomcat 7. IntelliJ IDEA finds webapp/web/WEB-INF/web.xml, offers to make a facet from it, and adds this facet to the parent module, from which an artifact can be created. The problem is that Solr cannot run properly. I get this message:

SEVERE: Unable to create core: mycore
org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.StandardTokenizerFactory'
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:369)
    at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:113)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:846)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
    at org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
    at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
    at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
    at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:103)
    at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4650)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5306)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:618)
    at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:650)
    at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1582)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.StandardTokenizerFactory'
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
    at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:344)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
    ... 25 more
Caused by: org.apache.solr.common.SolrException: Error loading class 'solr.StandardTokenizerFactory'
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:436)
    at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:457)
    at org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:89)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
    ... 29 more
Caused by: java.lang.ClassNotFoundException: solr.StandardTokenizerFactory
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at
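As a point of reference for the fully-qualified workaround mentioned above: in Lucene/Solr 4.x the analysis factories live in the lucene-analyzers-common packages, so a schema.xml fieldType can spell them out like this (a generic sketch, not the schema from the thread):

```xml
<!-- fully-qualified class names instead of the solr.* shorthand -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="org.apache.lucene.analysis.standard.StandardTokenizerFactory"/>
    <filter class="org.apache.lucene.analysis.core.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```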
Preventing accepting queries while custom QueryComponent starts up?
Greetings,

I have several custom QueryComponents that have high one-time startup costs (hashing things in the index, caching things from an RDBMS, etc...). Is there a way to prevent Solr from accepting connections before all QueryComponents are ready?

Especially since many of our instances are load-balanced (and added in / removed automatically based on admin/ping responses), preventing ping from answering prior to all custom QueryComponents being ready would be ideal...

Thanks,
Aaron
Re: Preventing accepting queries while custom QueryComponent starts up?
Amit,

I am using warming firstSearcher queries to ensure this happens before any external queries are received. However, unless I am misinterpreting the logs, Solr starts responding to admin/ping requests before firstSearcher completes; the LB then puts the Solr instance back in the pool, and it starts accepting connections...

On Thu, Nov 8, 2012 at 4:24 PM, Amit Nithian anith...@gmail.com wrote:

I think Solr does this by default - are you executing warming queries in firstSearcher so that these actions are done before Solr is ready to accept real queries?

On Thu, Nov 8, 2012 at 11:54 AM, Aaron Daubman daub...@gmail.com wrote:

Greetings,

I have several custom QueryComponents that have high one-time startup costs (hashing things in the index, caching things from an RDBMS, etc...). Is there a way to prevent Solr from accepting connections before all QueryComponents are ready? Especially since many of our instances are load-balanced (and added in / removed automatically based on admin/ping responses), preventing ping from answering prior to all custom QueryComponents being ready would be ideal...

Thanks,
Aaron
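For anyone following along, the standard place to register such warming queries is a firstSearcher event listener in solrconfig.xml (the query below is a placeholder, not from the thread):

```xml
<!-- solrconfig.xml: queries run before the first searcher is exposed;
     the q value here is a placeholder -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">warming query here</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>
```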
Re: Preventing accepting queries while custom QueryComponent starts up?
> (plus when I deploy, my deploy script runs some actual simple test queries to ensure they return before enabling the ping handler to return 200s) to avoid this problem.

What are you doing to programmatically disable/enable the ping handler? This sounds like exactly what I should be doing as well...
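If I recall correctly, PingRequestHandler supports a healthcheck file for exactly this: while the file is absent, ping returns an error instead of 200, so the LB drops the node. A sketch (the file name is arbitrary):

```xml
<!-- solrconfig.xml: /admin/ping fails whenever the healthcheck file
     is missing; "server-enabled" is an arbitrary name -->
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <str name="healthcheckFile">server-enabled</str>
</requestHandler>
```

A deploy script can then create/remove that file directly, or (in Solr 4.0+, if memory serves) hit /admin/ping?action=disable before deploying and /admin/ping?action=enable once its smoke-test queries pass.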
Improving performance for use-case where large (200) number of phrase queries are used?
Greetings,

We have a Solr instance in use that gets some perhaps atypical queries and suffers from poor (2 second) QTimes. Documents (~2,350,000) in this instance are mainly comprised of various descriptive fields, such as multi-word (phrase) tags - an average document contains 200-400 phrases like this across several different multi-valued field types.

A custom QueryComponent has been built that functions somewhat like a very specific MoreLikeThis. A seed document is specified via the incoming query; its terms are retrieved, boosted both by query parameters and by fields within the document that specify term weighting, and sorted by this custom boosting. A second query is then crafted by taking the top 200 (sorted by the custom boosting) resulting field values, paired with their fields, and searching for documents matching these 200 values. For many searches, 25-50% of the documents match the query of 200 terms (so 600,000 to 1,200,000).

After doing some profiling, it seems that a majority of the QTime comes from dealing with phrases and the resulting term positions, since a majority of the search terms are actually multi-word tokenized phrases. (Processing is dominated by ExactPhraseScorer on down, particularly SegmentTermPositions and readVInt.)

I have thought of a few ways to improve performance for this use case, and am looking for feedback as to which seems best, as well as any insight into other ways to approach this problem that I haven't considered (or things to look into to help better understand the slow QTimes more fully):

1) Shard the index - since there is no key to really specify which shard queries would go to, this would only be of benefit if scoring is done in parallel. Is there documentation I have so far missed that describes distributed searching for this case? (I haven't found anything that really describes the differences in scoring for distributed vs. non-distributed indices, aside from the warnings that IDF doesn't work - which I don't think we really care about.)

2) Implement common grams as described here: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
It's not clear how many individual words in the phrases being used are, in fact, common, but given that 25-50% of the documents in the index match many queries, it seems this may be of value.

3) Try to make mm (minimum terms should match) work for the custom query. I haven't been able to figure out how exactly this parameter works, but my thinking is along the lines of: if only 2 of those 200 terms match a document, it doesn't need to get scored. What I don't currently understand is at what point failing the mm requirement short-circuits - e.g. does the doc still get scored? If it does short-circuit prior to scoring, this may help somewhat, although it's not clear it would prevent the many, many gets against term positions that are still killing QTime.

4) Set a dynamic number (rather than the currently fixed 200) of terms based on the custom boosting/weighting value - e.g. only use terms whose calculated value is above some threshold. I'm not keen on this since some documents may be dominated by many weak terms and not have any great ones, so it might break for those (finding the sweet-spot cutoff would not be straightforward).

5) *This is my current favorite*: stop tokenizing/analyzing these terms and just use KeywordTokenizer. Most of these phrases are pre-vetted, and it may be possible to clean/process any others before creating the docs. My main worry here is that, currently, if I understand correctly, a document with the phrase "brazilian pop" would still be returned as a match to a seed document containing only the phrase "brazilian" (not the other way around, but that is not necessary); with KeywordTokenizer, this would no longer be the case. If I switched from the current dubious tokenize/stem/etc... and just used Keyword, would this allow queries like "this used to be a long phrase query" to match documents that have "this used to be a long phrase query" as one of the multivalued values in the field without having to pull term positions? (And thus significantly speed up performance.)

Thanks,
Aaron
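As a sketch of option 5, a tag-style field type (the names here are made up) that indexes each multivalued value as a single token, so exact-value matches become TermQuerys with no position data needed:

```xml
<!-- hypothetical field type: one token per value, lightly normalized;
     no word splitting, no stemming, no positions consulted at query time -->
<fieldType name="tag" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is exactly the one raised above: "brazilian" no longer matches a document tagged "brazilian pop", because each value is now a single opaque term.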
Re: Improving performance for use-case where large (200) number of phrase queries are used?
Thanks for the ideas - some follow-up questions inline below:

> * use shingles e.g. to turn two-word phrases into single terms (how long is your average phrase?).

Would this be different than what I was calling common grams? (Other than shingling every two words, rather than just common ones?)

> * in addition to the above, maybe for phrases with 2 terms, consider just a boolean conjunction of the shingled phrases instead of a real phrase query: e.g. "more like this" - (more_like AND like_this). This would have some false positives.

This would definitely help, but, IIRC, we moved to phrase queries due to too many false positives; it would be an interesting experiment to see how many false positives were left when shingling and then just doing conjunctive queries.

> * use a more aggressive stopwords list for your MorePhrasesLikeThis.
> * reduce this number 200, and instead work harder to prune out which phrases are the most descriptive from the seed document, e.g. based on some heuristics like their frequency or location within that seed document, so your query isn't so massive.

This is something I've been asking for (perform some sort of PCA / feature selection on the actual terms used), but it is of questionable value and hard to do right, so it hasn't happened yet (it's not clear that there will be terms that are very common that are not also very descriptive, so the extent to which this would help is unknown).

Thanks again for the ideas!
Aaron
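For reference, the shingle suggestion above maps onto a standard analyzer chain (field/type names made up); with outputUnigrams="false", only the two-word shingles are indexed, so "more like this" yields the terms more_like and like_this:

```xml
<!-- hypothetical shingled field type: adjacent word pairs become
     single terms, turning two-word phrase queries into term queries -->
<fieldType name="shingled_tag" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="2"
            outputUnigrams="false" tokenSeparator="_"/>
  </analyzer>
</fieldType>
```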
Re: Improving performance for use-case where large (200) number of phrase queries are used?
Hi Peter,

Thanks for the recommendation - I believe we are thinking along the same lines, but wanted to check to make sure. Are you suggesting something different than my #5 (below), or are we essentially suggesting the same thing?

On Wed, Oct 24, 2012 at 1:20 PM, Peter Keegan peterlkee...@gmail.com wrote:

Could you index your 'phrase tags' as single tokens? Then your phrase queries become simple TermQuerys.

5) *This is my current favorite*: stop tokenizing/analyzing these terms and just use KeywordTokenizer. Most of these phrases are pre-vetted, and it may be possible to clean/process any others before creating the docs. My main worry here is that, currently, if I understand correctly, a document with the phrase "brazilian pop" would still be returned as a match to a seed document containing only the phrase "brazilian" (not the other way around, but that is not necessary); with KeywordTokenizer, this would no longer be the case. If I switched from the current dubious tokenize/stem/etc... and just used Keyword, would this allow queries like "this used to be a long phrase query" to match documents that have "this used to be a long phrase query" as one of the multivalued values in the field without having to pull term positions? (And thus significantly speed up performance.)

Thanks again,
Aaron
Why does SolrIndexSearcher.java enforce mutual exclusion of filter and filterList?
Greetings,

I'm wondering if somebody would please explain why SolrIndexSearcher.java enforces mutual exclusion of filter and filterList (e.g. see: https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L2039 ).

For a custom application we have been using this functionality successfully, and I have been maintaining patches against base releases from 1.4 through 3.6.1; I am now finally looking at 4.0. Since I am yet again revisiting this custom patch, I am wondering why this functionality is prevented out of the box - for two reasons, really:
1) It would be great if I didn't have to maintain a custom internal branch of Solr for this tiny little change.
2) I am worried that the purposeful prevention of this functionality implies there is a downside to doing this.

Is there a downside to utilizing both a DocSet-based filter and a query-based filterList? If not, once I migrate this patch to 4.0, what would be the best way to get this functionality incorporated into the base? For additional info, you may find the now-2-year-old issue with patches addressing this up through 3.6.1 here: https://issues.apache.org/jira/browse/SOLR-2052

Any insight appreciated, as always,
Aaron
ScorerDocQueue.java's downHeap showing up as frequent hotspot in profiling - ideas why?
Greetings,

In a recent batch of Solr 3.6.1 slow-response-time queries, the profiler highlighted downHeap (line 212) in ScorerDocQueue.java as averaging more than 60ms across the 16 calls I was looking at, and showed it spiking over 100ms - which, after looking at the code (two int comparisons?!?), I am at a loss to explain.

Here's the source: https://github.com/apache/lucene-solr/blob/6b8783bfa59351878c59e47deaa7739d95150a22/lucene/core/src/java/org/apache/lucene/util/ScorerDocQueue.java#L212

Here's the invocation trace of one of the many similar:
---snip---
Thread.run:722 (0ms self time, 416 ms total time)
QueuedThreadPool$3.run:526 (0ms self time, 416 ms total time)
QueuedThreadPool.runJob:595 (0ms self time, 416 ms total time)
ExecutorCallback$ExecutorCallbackInvoker.run:130 (0ms self time, 416 ms total time)
ExecutorCallback$ExecutorCallbackInvoker.call:124 (0ms self time, 416 ms total time)
AbstractConnection$1.onCompleted:63 (0ms self time, 416 ms total time)
AbstractConnection$1.onCompleted:71 (0ms self time, 416 ms total time)
HttpConnection.onFillable:253 (0ms self time, 416 ms total time)
HttpChannel.run:246 (0ms self time, 416 ms total time)
Server.handle:403 (0ms self time, 416 ms total time)
HandlerWrapper.handle:97 (0ms self time, 416 ms total time)
IPAccessHandler.handle:204 (0ms self time, 416 ms total time)
HandlerCollection.handle:110 (0ms self time, 416 ms total time)
ContextHandlerCollection.handle:258 (0ms self time, 416 ms total time)
ScopedHandler.handle:136 (0ms self time, 416 ms total time)
ContextHandler.doScope:973 (0ms self time, 416 ms total time)
SessionHandler.doScope:174 (0ms self time, 416 ms total time)
ServletHandler.doScope:358 (0ms self time, 416 ms total time)
ContextHandler.doHandle:1044 (0ms self time, 416 ms total time)
SessionHandler.doHandle:213 (0ms self time, 416 ms total time)
SecurityHandler.handle:540 (0ms self time, 416 ms total time)
ScopedHandler.handle:138 (0ms self time, 416 ms total time)
ServletHandler.doHandle:429 (0ms self time, 416 ms total time)
ServletHandler$CachedChain.doFilter:1274 (0ms self time, 416 ms total time)
SolrDispatchFilter.doFilter:260 (0ms self time, 416 ms total time)
SolrDispatchFilter.execute:365 (0ms self time, 416 ms total time)
SolrCore.execute:1376 (0ms self time, 416 ms total time)
RequestHandlerBase.handleRequest:129 (0ms self time, 416 ms total time)
SearchHandler.handleRequestBody:186 (0ms self time, 416 ms total time)
QueryComponent.process:394 (0ms self time, 416 ms total time)
SolrIndexSearcher.search:375 (0ms self time, 416 ms total time)
SolrIndexSearcher.getDocListC:1176 (0ms self time, 416 ms total time)
SolrIndexSearcher.getDocListNC:1296 (0ms self time, 416 ms total time)
IndexSearcher.search:364 (0ms self time, 416 ms total time)
IndexSearcher.search:581 (0ms self time, 416 ms total time)
FilteredQuery$2.score:169 (0ms self time, 416 ms total time)
BooleanScorer2.advance:320 (0ms self time, 416 ms total time)
ReqExclScorer.advance:112 (0ms self time, 416 ms total time)
DisjunctionSumScorer.advance:229 (52ms self time, 416 ms total time)
DisjunctionSumScorer.advanceAfterCurrent:171 (0ms self time, 308 ms total time)
ScorerDocQueue.topNextAndAdjustElsePop:120 (0ms self time, 308 ms total time)
ScorerDocQueue.checkAdjustElsePop:135 (0ms self time, 111 ms total time)
ScorerDocQueue.downHeap:212 (111ms self time, 111 ms total time)
---snip---

Any ideas on what is causing this seemingly inordinate amount of time in downHeap? Is this symptomatic of anything in particular?

Thanks, as always!
Aaron
Re: PriorityQueue:initialize consistently showing up as hot spot while profiling
Hi Mikhail,

On Fri, Oct 5, 2012 at 7:15 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

> okay. huge rows value is no.1 way to kill Lucene. It's not possible, absolutely. You need to rethink the logic of your component. Check Solr's FieldCollapsing code; IIRC it makes a second search to achieve a similar goal. Also check the PostFilter and DelegatingCollector classes; their approach can also be handy for your task.

This sounds like it could be a much saner way to handle the task; however, I'm not sure what I should be looking at for the 'FieldCollapsing code' you mention - can you point me to a class?

Also, is there anything more you can say about the PostFilter and DelegatingCollector classes? I reviewed them, but it was not obvious to me what they were doing that would allow me to reduce the large rows param we use to ensure all relevant docs are included in the grouping, so that limiting occurs at the group level rather than pre-grouping...

Thanks again,
Aaron
Re: PriorityQueue:initialize consistently showing up as hot spot while profiling
On Fri, Oct 5, 2012 at 4:33 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

> what's the value of the rows param http://wiki.apache.org/solr/CommonQueryParameters#rows ?

Very interesting question - for historic reasons lost to me, we pass in a huge (1000?) number for rows, and this hits our custom component, which has its own internal maximum for real rows returned. (This is a custom grouping component, so I am guessing the large number of rows had to do with trying not to limit what got grouped?) Is the value of rows what is used for that heap allocation?

Thanks,
Aaron

On Fri, Oct 5, 2012 at 6:56 AM, Aaron Daubman daub...@gmail.com wrote:

Greetings,

I've been seeing this call chain come up fairly frequently when debugging longer-QTime queries under Solr 3.6.1, but have not been able to understand from the code what is really going on - the call graph and code follow below. Would somebody please explain to me:
1) Why this would show up frequently as a hotspot
2) If it is expected to do so
3) If there is anything I should look into that may help performance where this frequently shows up as the long pole in the QTime tent
4) What the code is doing, and why heap is being allocated as an apparently giant object (which also is apparently not unheard of, given the MAX_VALUE wrapping check)

---call-graph---
Filter - SolrDispatchFilter:doFilter (method time = 12 ms, total time = 487 ms)
Filter - SolrDispatchFilter:execute:365 (method time = 0 ms, total time = 109 ms)
org.apache.solr.core.SolrCore:execute:1376 (method time = 0 ms, total time = 109 ms)
org.apache.solr.handler.RequestHandlerBase:handleRequest:129 (method time = 0 ms, total time = 109 ms)
org.apache.solr.handler.component.SearchHandler:handleRequestBody:186 (method time = 0 ms, total time = 109 ms)
com.echonest.solr.component.EchoArtistGroupingComponent:process:188 (method time = 0 ms, total time = 109 ms)
org.apache.solr.search.SolrIndexSearcher:search:375 (method time = 0 ms, total time = 96 ms)
org.apache.solr.search.SolrIndexSearcher:getDocListC:1176 (method time = 0 ms, total time = 96 ms)
org.apache.solr.search.SolrIndexSearcher:getDocListNC:1209 (method time = 0 ms, total time = 96 ms)
org.apache.solr.search.SolrIndexSearcher:getProcessedFilter:796 (method time = 0 ms, total time = 26 ms)
org.apache.solr.search.BitDocSet:andNot:185 (method time = 0 ms, total time = 13 ms)
org.apache.lucene.util.OpenBitSet:clone:732 (method time = 13 ms, total time = 13 ms)
org.apache.solr.search.BitDocSet:intersection:31 (method time = 0 ms, total time = 13 ms)
org.apache.solr.search.DocSetBase:intersection:90 (method time = 0 ms, total time = 13 ms)
org.apache.lucene.util.OpenBitSet:and:808 (method time = 13 ms, total time = 13 ms)
org.apache.lucene.search.TopFieldCollector:create:916 (method time = 0 ms, total time = 46 ms)
org.apache.lucene.search.FieldValueHitQueue:create:175 (method time = 0 ms, total time = 46 ms)
org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue:init:111 (method time = 0 ms, total time = 46 ms)
org.apache.lucene.search.SortField:getComparator:409 (method time = 0 ms, total time = 13 ms)
org.apache.lucene.search.FieldComparator$FloatComparator:init:400 (method time = 13 ms, total time = 13 ms)
org.apache.lucene.util.PriorityQueue:initialize:108 (method time = 33 ms, total time = 33 ms)
---snip---

org.apache.lucene.util.PriorityQueue:initialize - the hotspot is line 108:
heap = (T[]) new Object[heapSize]; // T is unbounded type, so this unchecked cast works always

---PriorityQueue.java---
/** Subclass constructors must call this. */
@SuppressWarnings("unchecked")
protected final void initialize(int maxSize) {
  size = 0;
  int heapSize;
  if (0 == maxSize)
    // We allocate 1 extra to avoid if statement in top()
    heapSize = 2;
  else {
    if (maxSize == Integer.MAX_VALUE) {
      // Don't wrap heapSize to -1, in this case, which
      // causes a confusing NegativeArraySizeException.
      // Note that very likely this will simply then hit
      // an OOME, but at least that's more indicative to
      // caller that this values is too big. We don't +1
      // in this case, but it's very unlikely in practice
      // one will actually insert this many objects into
      // the PQ:
      heapSize = Integer.MAX_VALUE;
    } else {
      // NOTE: we add +1 because all access to heap is
      // 1-based not 0-based. heap[0] is unused.
      heapSize = maxSize + 1;
    }
  }
  heap = (T[]) new Object[heapSize]; // T is unbounded type, so this unchecked cast works always
  this.maxSize = maxSize;

  // If sentinel objects are supported, populate the queue with them
  T sentinel
PriorityQueue:initialize consistently showing up as hot spot while profiling
Greetings, I've been seeing this call chain come up fairly frequently when debugging longer-QTime queries under Solr 3.6.1 but have not been able to understand from the code what is really going on - the call graph and code follow below. Would somebody please explain to me: 1) Why this would show up frequently as a hotspot 2) If it is expected to do so 3) If there is anything I should look in to that may help performance where this frequently shows up as the long pole in the QTime tent 4) What the code is doing and why heap is being allocated as an apparently giant object (which also is apparently not unheard of due to MAX_VALUE wrapping check) ---call-graph--- Filter - SolrDispatchFilter:doFilter (method time = 12 ms, total time = 487 ms) Filter - SolrDispatchFilter:execute:365 (method time = 0 ms, total time = 109 ms) org.apache.solr.core.SolrCore:execute:1376 (method time = 0 ms, total time = 109 ms) org.apache.solr.handler.RequestHandlerBase:handleRequest:129 (method time = 0 ms, total time = 109 ms) org.apache.solr.handler.component.SearchHandler:handleRequestBody:186 (method time = 0 ms, total time = 109 ms) com.echonest.solr.component.EchoArtistGroupingComponent:process:188 (method time = 0 ms, total time = 109 ms) org.apache.solr.search.SolrIndexSearcher:search:375 (method time = 0 ms, total time = 96 ms) org.apache.solr.search.SolrIndexSearcher:getDocListC:1176 (method time = 0 ms, total time = 96 ms) org.apache.solr.search.SolrIndexSearcher:getDocListNC:1209 (method time = 0 ms, total time = 96 ms) org.apache.solr.search.SolrIndexSearcher:getProcessedFilter:796 (method time = 0 ms, total time = 26 ms) org.apache.solr.search.BitDocSet:andNot:185 (method time = 0 ms, total time = 13 ms) org.apache.lucene.util.OpenBitSet:clone:732 (method time = 13 ms, total time = 13 ms) org.apache.solr.search.BitDocSet:intersection:31 (method time = 0 ms, total time = 13 ms) org.apache.solr.search.DocSetBase:intersection:90 (method time = 0 ms, total time = 13 ms) 
org.apache.lucene.util.OpenBitSet:and:808 (method time = 13 ms, total time = 13 ms) org.apache.lucene.search.TopFieldCollector:create:916 (method time = 0 ms, total time = 46 ms) org.apache.lucene.search.FieldValueHitQueue:create:175 (method time = 0 ms, total time = 46 ms) org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue:init:111 (method time = 0 ms, total time = 46 ms) org.apache.lucene.search.SortField:getComparator:409 (method time = 0 ms, total time = 13 ms) org.apache.lucene.search.FieldComparator$FloatComparator:init:400 (method time = 13 ms, total time = 13 ms) org.apache.lucene.util.PriorityQueue:initialize:108 (method time = 33 ms, total time = 33 ms) ---snip--- org.apache.lucene.util.PriorityQueue:initialize - hotspot is line 108: heap = (T[]) new Object[heapSize]; // T is unbounded type, so this unchecked cast works always
---PriorityQueue.java---
/** Subclass constructors must call this. */
@SuppressWarnings("unchecked")
protected final void initialize(int maxSize) {
  size = 0;
  int heapSize;
  if (0 == maxSize)
    // We allocate 1 extra to avoid if statement in top()
    heapSize = 2;
  else {
    if (maxSize == Integer.MAX_VALUE) {
      // Don't wrap heapSize to -1, in this case, which
      // causes a confusing NegativeArraySizeException.
      // Note that very likely this will simply then hit
      // an OOME, but at least that's more indicative to
      // caller that this values is too big. We don't +1
      // in this case, but it's very unlikely in practice
      // one will actually insert this many objects into
      // the PQ:
      heapSize = Integer.MAX_VALUE;
    } else {
      // NOTE: we add +1 because all access to heap is
      // 1-based not 0-based. heap[0] is unused.
      heapSize = maxSize + 1;
    }
  }
  heap = (T[]) new Object[heapSize]; // T is unbounded type, so this unchecked cast works always
  this.maxSize = maxSize;

  // If sentinel objects are supported, populate the queue with them
  T sentinel = getSentinelObject();
  if (sentinel != null) {
    heap[1] = sentinel;
    for (int i = 2; i < heap.length; i++) {
      heap[i] = getSentinelObject();
    }
    size = maxSize;
  }
}
---snip--- Thanks, as always! Aaron
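The hotspot boils down to that single array allocation, whose size is driven entirely by the collector's maxSize - which, for a top-N search, comes from the requested rows (plus start). A standalone sketch of the sizing logic (heapSizeFor is an illustrative stand-in modeled on the quoted code, not the real org.apache.lucene.util.PriorityQueue):

```java
public class PQAllocationDemo {
    // Mirrors the heapSize computation from the quoted initialize(maxSize).
    public static int heapSizeFor(int maxSize) {
        if (maxSize == 0) {
            return 2;                  // 1 extra slot so top() needs no branch
        } else if (maxSize == Integer.MAX_VALUE) {
            return Integer.MAX_VALUE;  // avoid wrapping to -1 when adding 1
        } else {
            return maxSize + 1;        // heap is 1-based; heap[0] is unused
        }
    }

    public static void main(String[] args) {
        // A rows=1000 request ends up creating a 1001-slot Object array per
        // collector, regardless of how many documents actually match.
        Object[] heap = new Object[heapSizeFor(1000)];
        System.out.println(heap.length); // 1001
    }
}
```

So a rows=1000 request pays for a ~1000-slot queue allocation (and, for queues that use sentinels, ~1000 sentinel constructions) on every query, even when far fewer hits are actually wanted.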
Re: Understanding fieldCache SUBREADER insanity
Hi Yonik, I've been attempting to fix the SUBREADER insanity in our custom component, and have made perhaps some progress (or is this worse?) - I've gone from SUBREADER to VALUEMISMATCH insanity: ---snip--- entries_count : 12 entry#0 : 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='f_normalizedTotalHotttnesss',class org.apache.lucene.search.FieldCacheImpl$DocsWithFieldCache,null=org.apache.lucene.util.FixedBitSet#1387502754 entry#1 : 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='i_track_count',class org.apache.lucene.search.FieldCacheImpl$DocsWithFieldCache,null=org.apache.lucene.util.Bits$MatchAllBits#233863705 entry#2 : 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='s_artistID',class org.apache.lucene.search.FieldCache$StringIndex,null=org.apache.lucene.search.FieldCache$StringIndex#652215925 entry#3 : 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='s_artistID',class java.lang.String,null=[Ljava.lang.String;#1036517187 entry#4 : 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='thingID',class java.lang.String,null=[Ljava.lang.String;#357017445 entry#5 : 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='f_normalizedTotalHotttnesss',float,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_FLOAT_PARSER=[F#322888397 entry#6 : 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='f_normalizedTotalHotttnesss',float,org.apache.lucene.search.FieldCache.DEFAULT_FLOAT_PARSER=org.apache.lucene.search.FieldCache$CreationPlaceholder#1229311421 entry#7 : 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='f_normalizedTotalHotttnesss',float,null=[F#322888397 entry#8 : 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='i_collapse',int,org.apache.lucene.search.FieldCache.DEFAULT_INT_PARSER=org.apache.lucene.search.FieldCache$CreationPlaceholder#92920526 entry#9 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='i_collapse',int,null=[I#494669113 entry#10 : 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='i_collapse',int,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_INT_PARSER=[I#494669113 entry#11 : 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='i_track_count',int,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_INT_PARSER=[I#994584654 insanity_count : 1 insanity#0 : VALUEMISMATCH: Multiple distinct value objects for MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)+s_artistID 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='s_artistID',class org.apache.lucene.search.FieldCache$StringIndex,null=org.apache.lucene.search.FieldCache$StringIndex#652215925 'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='s_artistID',class java.lang.String,null=[Ljava.lang.String;#1036517187 ---snip--- Any suggestions on what the cause of this VALUEMISMATCH is, if it is the normal case, or suggestions on how to fix it. For anybody else with SUBREADER insanity issues, this is the change I made to get this far (get the first leafReader, since we are using a merged/optimized index): ---snip--- SolrIndexReader reader = searcher.getReader().getLeafReaders()[0]; collapseIDs = FieldCache.DEFAULT.getInts(reader, COLLAPSE_KEY_NAME); hotnessValues = FieldCache.DEFAULT.getFloats(reader, HOTNESS_KEY_NAME); artistIDs = FieldCache.DEFAULT.getStrings(reader, ARTIST_KEY_NAME); ---snip--- Thanks, Aaron On Wed, Sep 19, 2012 at 4:54 PM, Yonik Seeley yo...@lucidworks.com wrote: already-optimized, single-segment index That part is interesting... if true, then the type of insanity you saw should be impossible, and either the insanity detection or something else is broken. -Yonik http://lucidworks.com
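A toy model of why the checker flags VALUEMISMATCH here: the field cache keys entries by (reader, field, requested type), so asking for the same field both as a String[] (getStrings) and as a StringIndex (getStringIndex) stores two independent value objects for the same underlying data - twice the memory. This is purely illustrative, not the FieldCacheImpl internals:

```java
import java.util.HashMap;
import java.util.Map;

public class ValueMismatchDemo {
    // Stand-in cache keyed by field + requested value type.
    static final Map<String, Object> cache = new HashMap<>();

    public static Object get(String field, String type) {
        // Each distinct (field, type) pair populates its own entry.
        return cache.computeIfAbsent(field + "/" + type, k -> new Object());
    }

    public static void main(String[] args) {
        Object asStrings = get("s_artistID", "String[]");
        Object asIndex   = get("s_artistID", "StringIndex");
        // Two distinct cached objects for one field: duplicated memory,
        // which the sanity checker reports as VALUEMISMATCH.
        System.out.println(asStrings == asIndex); // false
        System.out.println(cache.size());         // 2
    }
}
```

The practical fix is to request the field one way consistently (e.g. always via StringIndex) so only one entry is ever populated.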
Solr Caching - how to tune, how much to increase, and any tips on using Solr with JDK7 and G1 GC?
Greetings, I've recently moved to running some of our Solr (3.6.1) instances using JDK 7u7 with the G1 GC (playing with max pauses in the 20 to 100ms range). By and large, it has been working well (or, perhaps I should say that without requiring much tuning it works much better in general than my haphazard attempts to tune CMS). I have two instances in particular, one with a heap size of 14G and one with a heap size of 60G. I'm attempting to squeeze out additional performance by increasing Solr's cache sizes (I am still seeing the hit ratio go up as I increase max size and decrease the number of evictions), and am guessing this is the cause of some recent situations where the 14G instance especially eventually (12-24 hrs later under 100s of queries per minute) makes it to 80%-90% of the heap and then spirals into major GC with long-pause territory. I am wondering:
1) if anybody has experience tuning the G1 GC, especially for use with Solr (what are decent max-pause times to use?)
2) how to better tune Solr's cache sizes - e.g. how to even tell the actual amount of memory used by each cache (not # entries as the stats show, but # bytes)
3) if there are any guidelines on when increasing a cache's size (even if it does continue to increase the hit ratio) runs into the law of diminishing returns or even starts to hurt - e.g. if the document cache has a current maxSize of 65536 and has seen 4409275 evictions, and currently has a hit ratio of 0.74, should the max be increased further? If so, how much RAM needs to be added to the heap, and how much larger should its max size be made?
I should mention that these solr instances are read-only (so cache is probably more valuable than in other scenarios - we only invalidate the searcher every 20-24hrs or so) and are also backed with indexes (6G and 70G for the 14G and 60G heap sizes) on IODrives, so I'm not as concerned about leaving RAM for linux to cache the index files (I'd much rather actually cache the post-transformed values). Thanks as always, Aaron
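On question 2 (actual bytes per cache, not entry counts), a hedged back-of-envelope: a filterCache entry stored as a full bitset costs roughly maxDoc/8 bytes, so multiplying by the cache's maxSize gives a worst-case bound. filterCacheBytes below is a hypothetical helper, not a Solr API, and real entries can be much smaller (sorted-int DocSets for sparse filters), so treat this strictly as an upper bound:

```java
public class CacheSizeEstimate {
    // Worst-case bound: a bitset-backed filter entry costs ~maxDoc/8 bytes.
    public static long filterCacheBytes(long maxDoc, long entries) {
        return (maxDoc / 8) * entries;
    }

    public static void main(String[] args) {
        // e.g. ~53M docs and 512 cached filters bounds at roughly 3.4 GB
        System.out.println(filterCacheBytes(53_268_126L, 512));
    }
}
```

The documentCache is harder to bound this way because entry size depends on stored-field sizes; for that one, sampling average document size and multiplying by maxSize is about the best back-of-envelope available.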
How to more gracefully handle field format exceptions?
Greetings, Is there a way to configure more graceful handling of field formatting exceptions when indexing documents? Currently, there is a field being generated in some documents that I am indexing that is supposed to be a float but sometimes slips through as an empty string. (I know, fix the docs, but sometimes bad values slip through, and it would be nice to handle them in a more forgiving manner). Here's an example of the exception - when this happens, the entire doc is thrown out due to the one malformed field:
---snip---
ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: ERROR: [doc=docidstr] Error adding field 'f_floatfield'='' ...
Caused by: java.lang.NumberFormatException: empty String
00:56:46,288 [SI] WARN com.company.IndexerThread - BAD DOC: a82a2f6a6a42ad3c98a05ddb3f2c382c
01:02:12,713 [SI] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: ERROR: [doc=6ff90020f9ec0f6dd623e9879c3e024d] Error adding field 'f_afloatfield'=''
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:333)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:142)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:121)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106)
at com.company.IndexerThread.run(IndexerThread.java:55)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NumberFormatException: empty String
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1011)
at java.lang.Float.parseFloat(Float.java:452)
at org.apache.solr.schema.TrieField.createField(TrieField.java:410)
at org.apache.solr.schema.SchemaField.createField(SchemaField.java:103)
at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:203)
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:286)
... 12 more
01:02:12,713 [SI] WARN com.company.IndexerThread - BAD DOC: 6ff90020f9ec0f6dd623e9879c3e024d
---snip---
In my thinking (and for this situation), it would be much better to just ignore the malformed field and keep the doc - is there any way to configure this or enable this behavior instead? Thanks, Aaron
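One workaround in the spirit of the question is to sanitize on the client before calling add, dropping only the unparsable field rather than losing the whole document. A sketch (the "f_" float-field prefix and all names here are assumptions taken from the examples above, not a Solr API; a real version would wrap this around the SolrJ add call):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FloatFieldSanitizer {
    // Remove entries whose name marks them as floats ("f_" prefix, an
    // assumption from the examples above) but whose value won't parse,
    // so one bad value doesn't reject the whole document.
    public static Map<String, String> sanitize(Map<String, String> fields) {
        Map<String, String> clean = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (e.getKey().startsWith("f_") && !parsesAsFloat(e.getValue())) {
                continue; // drop the malformed field, keep the document
            }
            clean.put(e.getKey(), e.getValue());
        }
        return clean;
    }

    private static boolean parsesAsFloat(String v) {
        if (v == null || v.isEmpty()) return false;
        try { Float.parseFloat(v); return true; }
        catch (NumberFormatException ex) { return false; }
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "6ff90020f9ec0f6dd623e9879c3e024d");
        doc.put("f_afloatfield", ""); // malformed: empty string
        doc.put("t_tag", "soul");
        System.out.println(sanitize(doc).keySet()); // [id, t_tag]
    }
}
```

This avoids the parse-the-error-text-and-retry loop entirely, at the cost of validating every float field up front.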
Re: How to more gracefully handle field format exceptions?
Hi Otis, I was just looking at how to implement that, but was hoping for a cleaner method - it seems like I will have to actually parse the error as text to find the field that caused it, then remove/mangle that field and attempt re-adding the document - which seems less than ideal. I would think there would be a flag or an easy way to override the add method that would just drop (or set to a default value) any field that didn't meet expectations. Thanks for the suggestion, Aaron On Mon, Sep 24, 2012 at 9:24 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Aaron, You could catch the error on the client, fix/clean/remove, and retry, no? Otis -- Search Analytics - http://sematext.com/search-analytics/index.html Performance Monitoring - http://sematext.com/spm/index.html On Mon, Sep 24, 2012 at 9:21 PM, Aaron Daubman daub...@gmail.com wrote: ---quoted original post snipped; it appears in full above---
Re: Understanding fieldCache SUBREADER insanity
Yonik, et al. I believe I found the section of code pushing me into 'insanity' status: ---snip--- int[] collapseIDs = null; float[] hotnessValues = null; String[] artistIDs = null; try { collapseIDs = FieldCache.DEFAULT.getInts(searcher.getIndexReader(), COLLAPSE_KEY_NAME); hotnessValues = FieldCache.DEFAULT.getFloats(searcher.getIndexReader(), HOTNESS_KEY_NAME); artistIDs = FieldCache.DEFAULT.getStrings(searcher.getIndexReader(), ARTIST_KEY_NAME); } ... ---snip--- Since it seems like this code is using the 'old-style' pre-Lucene 2.9 top-level indexReaders, is there any example code you can point me to that could show how to convert to using the leaf level segmentReaders? If the limited information I've been able to find is correct, this could explain some of the significant memory usage I am seeing... Thanks again, Aaron On Wed, Sep 19, 2012 at 4:54 PM, Yonik Seeley yo...@lucidworks.com wrote: already-optimized, single-segment index That part is interesting... if true, then the type of insanity you saw should be impossible, and either the insanity detection or something else is broken. -Yonik http://lucidworks.com
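Lacking real example code to point at, the shape of the per-segment conversion can at least be sketched: instead of one array keyed by top-level docID from the top-level reader, you keep one cached array per leaf reader plus that leaf's starting docBase, and translate IDs at lookup time. A toy model of just that translation (no Lucene/Solr classes, purely illustrative):

```java
public class PerSegmentLookup {
    final int[][] leafValues; // one cached array per segment
    final int[] docBases;     // starting top-level docID of each segment

    public PerSegmentLookup(int[][] leafValues, int[] docBases) {
        this.leafValues = leafValues;
        this.docBases = docBases;
    }

    // Map a top-level docID to (leaf, local docID), then look up the value.
    public int get(int topLevelDocId) {
        int leaf = 0; // linear scan for clarity; real code binary-searches
        while (leaf + 1 < docBases.length && docBases[leaf + 1] <= topLevelDocId) {
            leaf++;
        }
        return leafValues[leaf][topLevelDocId - docBases[leaf]];
    }

    public static void main(String[] args) {
        int[][] values = { {10, 11, 12}, {20, 21} }; // two segments
        PerSegmentLookup lookup = new PerSegmentLookup(values, new int[]{0, 3});
        System.out.println(lookup.get(4)); // doc 4 = segment 1, local id 1 -> 21
    }
}
```

With caches keyed per leaf like this, the top-level reader never populates its own duplicate entry, which is exactly the duplication the SUBREADER insanity check flags.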
Understanding fieldCache SUBREADER insanity
Hi all, In reviewing a solr instance with somewhat variable performance, I noticed that its fieldCache stats show an insanity_count of 1 with the insanity type SUBREADER: ---snip--- insanity_count : 1 insanity#0 : SUBREADER: Found caches for descendants of ReadOnlyDirectoryReader(segments_k _6h9(3.3):C17198463)+tf_normalizedTotalHotttnesss 'ReadOnlyDirectoryReader(segments_k _6h9(3.3):C17198463)'='tf_normalizedTotalHotttnesss',float,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_FLOAT_PARSER=[F#1965982057 'ReadOnlyDirectoryReader(segments_k _6h9(3.3):C17198463)'='tf_normalizedTotalHotttnesss',float,null=[F#1965982057 'MMapIndexInput(path=/io01/p/solr/playlist/a/playlist/index/_6h9.frq)'='tf_normalizedTotalHotttnesss',float,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_FLOAT_PARSER=[F#1308116426 ---snip--- How can I decipher what this means and what, if anything, I should do to fix/improve the insanity? Thanks, Aaron
Re: Understanding fieldCache SUBREADER insanity
Hi Tomás, This probably means that you are using the same field for faceting and for sorting (tf_normalizedTotalHotttnesss), sorting uses the segment level cache and faceting uses by default the global field cache. This can be a problem because the field is duplicated in cache, and then it uses twice the memory. One way to solve this would be to change the faceting method on that field to 'fcs', which uses segment level cache (but may be a little bit slower). Thanks for explaining what the sparse wiki and javadoc mean - I had read them but had no idea what the implications were ;-) We are not doing any explicit faceting, and this index is also supposed to be a read-only, already-optimized, single-segment index - both of these seem to indicate to (very unknowledgeable about this) me that this could be more of a problem - e.g. what am I doing to cause this, since I don't think I need to be using segment-level anything (it should be a single segment if I understand optimization and RO indices) and I am not leveraging faceting? Any pointers on where else to look for what might be causing this (one issue I am currently troubleshooting is too-many-pauses caused by too-frequent GC, so preventing this double-allocation could help)? Thanks again, Aaron
Solr request/response lifecycle and logging full response time
Greetings, I'm looking to add some additional logging to a solr 3.6.0 setup to allow us to determine actual time spent by Solr responding to a request. We have a custom QueryComponent that sometimes returns 1+ MB of data, and while QTime is always on the order of ~100ms, the response time at the client can be longer than a second (as measured with JMeter running on the same server using localhost). The end goal is to be able to: 1) determine if this large variance in response time is due to Solr, and if so where (to help determine if/how it can be optimized) 2) determine if the large variance is due to how jetty handles connections, buffering, etc... (and if so, if/how we can optimize there) ...or some combination of the two. As it stands now, the second or so between when the actual query finishes (as indicated by QTime), when solr gathers all the data to be returned as requested by fl, and when the client actually receives the data (even when the client is on localhost) is completely opaque. My main question: - Is there any documentation (a diagram / flowchart would be oh so wonderful) on the lifecycle of a Solr request? So far I've attempted to modify and rebuild solr, adding logging to SolrCore's execute() method (this pretty much mirrors QTime), as well as add timing calculations and logging to various overridden methods in the QueryComponent custom extension, all to no avail so far. What I'm getting at is how to: - start a stopwatch when solr receives the request from the client - stop the stopwatch and log the elapsed time right before solr hands the response body off to Jetty to be delivered back to the client. Thanks, as always! Aaron
Re: Solr request/response lifecycle and logging full response time
I'd still love to see a query lifecycle flowchart, but, in case it helps any future users or in case this is still incorrect, here's how I'm tackling this:

1) Override the default json responseWriter with my own in solrconfig.xml:

<queryResponseWriter name="json" class="com.mydomain.solr.component.JSONResponseWriterWithTiming"/>

2) Define JSONResponseWriterWithTiming as just extending JSONResponseWriter and adding a log statement:

public class JSONResponseWriterWithTiming extends JSONResponseWriter {
    private static final Logger logger = LoggerFactory.getLogger(JSONResponseWriterWithTiming.class);

    @Override
    public void write(Writer writer, SolrQueryRequest req, SolrQueryResponse rsp) throws IOException {
        super.write(writer, req, rsp);
        if (logger.isInfoEnabled()) {
            final long st = req.getStartTime();
            logger.info(String.format("Total solr time for query with QTime: %d is: %d",
                    (int) (rsp.getEndTime() - st),
                    (int) (System.currentTimeMillis() - st)));
        }
    }
}

Please advise if: - Flowcharts for any solr/lucene-related lifecycles exist - There is a better way of doing this Thanks, Aaron On Thu, Sep 6, 2012 at 9:16 PM, Aaron Daubman daub...@gmail.com wrote: ---quoted original post snipped; it appears in full above---
Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document
Robert, I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as identically as possible (given deprecations) and indexing the same document. Why did you do this? If you want the exact same scoring, use the exact same analysis. This means specifying luceneMatchVersion = 2.9, and the exact same analysis components (even if deprecated). I have taken the field values for the example below and run them through /admin/analysis.jsp on each solr instance. Even for the problematic docs/fields, the results are almost identical. For the example below, the t_tag values for the problematic doc: 1.4.1: 162 values 3.6.0: 164 values This is why: you changed your analysis. Apologies if I didn't clearly state my goal/concern: I am not looking for the exact same scoring - I am looking to explain scoring differences. Deprecated components will eventually go away, time moves on, etc... etc... I would like to be able to run current code, and should be able to - the part that is sticking is being able to *explain* the difference in results. As you can see from my email, after running the different analysis on the input, the output does not demonstrate (in any way that I can see) why the fieldNorm values would be so different. Even with the different analysis, the results are almost identical - which *should* result in an almost identical fieldNorm??? Again, the desire is not to be the same, it is to understand the difference. Thanks, Aaron
Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document
Robert, So this is lossy: basically you can think of there being only 256 possible values. So when you increased the number of terms only slightly by changing your analysis, this happened to bump you over the edge, rounding you up to the next value. more information: http://lucene.apache.org/core/3_6_0/scoring.html http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html Thanks - this was extremely helpful! I had read both sources before but didn't grasp the magnitude of the lossiness until your pointer and mention of the edge case. Just to help out anybody else who might run into this, I hacked together a little harness to demonstrate:
---
fieldLength: 160, computeNorm: 0.07905694, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 161, computeNorm: 0.07881104, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 162, computeNorm: 0.07856742, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 163, computeNorm: 0.07832605, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 164, computeNorm: 0.07808688, floatToByte315: 108, byte315ToFloat: 0.0625
fieldLength: 165, computeNorm: 0.077849895, floatToByte315: 108, byte315ToFloat: 0.0625
fieldLength: 166, computeNorm: 0.07761505, floatToByte315: 108, byte315ToFloat: 0.0625
---
So my takeaway is that these scores that vary significantly are caused by: 1) a field with lengths right on this boundary between the two analyzer chains 2) the fact that we might be searching for matches from 50+ values to a field with 150+ values, and so the overall score is repeatedly impacted by the otherwise typically insignificant change in fieldNorm value Thanks again, Aaron
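For anyone wanting to reproduce the harness output without a Lucene dependency, the encode/decode pair can be sketched as below, modeled on Lucene's SmallFloat (a 3-bit mantissa float with zero-exponent point 15). The boundary at fieldLength 163/164 falls out directly:

```java
public class Norms315 {
    // Modeled on Lucene SmallFloat.floatToByte315: 3-bit mantissa encoding.
    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                // overflow
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Modeled on Lucene SmallFloat.byte315ToFloat: inverse decode.
    public static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // lengthNorm = 1/sqrt(numTerms); encode then decode to see the
        // quantized value actually stored in the index.
        for (int len = 162; len <= 164; len++) {
            float norm = (float) (1.0 / Math.sqrt(len));
            System.out.println(len + " -> " + byte315ToFloat(floatToByte315(norm)));
        }
    }
}
```

Dropping from the 8-bit code 109 (0.078125) to 108 (0.0625) is a 20% step, which is why a two-term difference in analysis output moved scores so visibly.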
Frustrating differences in fieldNorm between two different versions of solr indexing the same document
Greetings, I've been digging in to this for two days now and have come up short - hopefully there is some simple answer I am just not seeing: I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as identically as possible (given deprecations) and indexing the same document. For most queries the results are very close (scoring within three significant differences, almost identical positions in results). However, for certain documents, the scores are very different (causing these docs to be ranked +/- 25 positions different or more in the results) In looking at debugQuery output, it seems like this is due to fieldNorm values being lower for the 3.6.0 instance than the 1.4.1. (note that for most docs, the fieldNorms are identical) I have taken the field values for the example below and run them through /admin/analysis.jsp on each solr instance. Even for the problematic docs/fields, the results are almost identical. For the example below, the t_tag values for the problematic doc: 1.4.1: 162 values 3.6.0: 164 values note that 1/sqrt(162) = 0.07857 ~= fieldNorm for 1.4.1, however, (1/0.0625)^2 = 256, which is no where near 164 Here is a particular example from 1.4.1: 1.6263733 = (MATCH) fieldWeight(t_tag:soul in 2066419), product of: 3.8729835 = tf(termFreq(t_tag:soul)=15) 5.3750753 = idf(docFreq=27619, maxDocs=2194294) 0.078125 = fieldNorm(field=t_tag, doc=2066419) And the same from 3.6.0: 1.3042576 = (MATCH) fieldWeight(t_tag:soul in 1977957), product of: 3.8729835 = tf(termFreq(t_tag:soul)=15) 5.388126 = idf(docFreq=27740, maxDocs=2232857) 0.0625 = fieldNorm(field=t_tag, doc=1977957) Here is the 1.4.1 config for the t_tag field and text type: fieldtype name=text class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StandardFilterFactory/ filter class=solr.ISOLatin1AccentFilterFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory words=stopwords.txt 
ignoreCase=true/ filter class=solr.EnglishPorterFilterFactory protected=protwords.txt/ /analyzer /fieldtype dynamicField name=t_* type=text indexed=true stored=true required=false multiValued=true termVectors=true/ And 3.6.0 schema config for the t_tag field and text type: fieldtype name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StandardFilterFactory/ filter class=solr.ASCIIFoldingFilterFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory words=stopwords.txt ignoreCase=true/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer /fieldtype field name=t_tag type=text indexed=true stored=true required=false multiValued=true/ I at first got distracted by this change between versions: LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default. This means that terms with a position increment gap of zero do not affect the norms calculation by default. However, this doesn't appear to be causing the issue as, according to analysis.jsp, there is no overlap for t_tag... Can you point me to where these fieldNorm differences are coming from and why they'd only be happening for a select few documents for which the content doesn't stand out? Thank you, Aaron
Debugging jetty IllegalStateException errors?
Greetings, I'm wondering if anybody has experienced (and found the root cause of) errors like this. We're running Solr 3.6.0 with the latest stable Jetty 7 (7.6.4.v20120524). I know this is likely due to a client (or the server) terminating the connection unexpectedly, but we see these fairly frequently and can't determine what the impact is or why they are happening (who is closing early, and why?). Any tips/tricks on troubleshooting, or on what to do to minimize or prevent these, would be appreciated (we are using a fairly old Python client to programmatically access this Solr instance). ---snip--- 17:25:13,250 [qtp581536050-12] WARN jetty.server.Response null - Committed before 500 null org.eclipse.jetty.io.EofException at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:952) at org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:438) at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:94) at org.eclipse.jetty.server.AbstractHttpConnection$Output.flush(AbstractHttpConnection.java:1016) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:278) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212) at org.apache.solr.common.util.FastWriter.flush(FastWriter.java:115) at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:353) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1332) at org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:77) at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:247) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1332) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:477) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119) at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:225) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1031) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:406) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:186) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:965) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111) at org.eclipse.jetty.server.Server.handle(Server.java:348) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:452) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:894) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:948) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:851) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:77) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:620) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:46) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:603) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:538) at java.lang.Thread.run(Thread.java:662) Caused by: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:137) at 
sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:359) at java.nio.channels.SocketChannel.write(SocketChannel.java:360) at org.eclipse.jetty.io.nio.ChannelEndPoint.gatheringFlush(ChannelEndPoint.java:371) at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:330) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:330) at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:876) ... 37 more 17:25:13,250 [qtp581536050-12] WARN jetty.servlet.ServletHandler null - /solr/artists/select java.lang.IllegalStateException: Committed at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1087) at
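On the who-is-closing-early question: the "Caused by: ClosedChannelException" under FastWriter.flush in the trace above is what a server sees when the remote end has already gone away (client timeout or abort, or an intermediary dropping an idle connection) while the response body is still being streamed; "Committed before 500" just means the 200 headers were already sent, so Jetty could no longer replace the response with an error page. A self-contained Python sketch (hypothetical names, nothing Solr- or Jetty-specific) reproduces the server-side symptom - the handler only discovers the dead connection when a write fails mid-response:

```python
import http.server
import socket
import threading
import time

errors = []  # records what the server sees when the client hangs up early

class SlowHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(100 * 65536))
        self.end_headers()                 # headers flushed: response committed
        try:
            for _ in range(100):           # stream a large response slowly
                self.wfile.write(b"x" * 65536)
                self.wfile.flush()
                time.sleep(0.01)
        except ConnectionError as e:
            # Equivalent of Jetty's EofException: a flush to a dead channel
            errors.append(type(e).__name__)

    def log_message(self, *args):          # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
server.handle_error = lambda *args: None   # suppress traceback noise
t = threading.Thread(target=server.handle_request)
t.start()

# Impatient client: sends a request, then closes without reading the body
client = socket.create_connection(server.server_address)
client.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
time.sleep(0.1)
client.close()

t.join(timeout=10)
server.server_close()
print(errors)  # e.g. ['ConnectionResetError'] or ['BrokenPipeError']
```

If the old Python client sets a short socket timeout and gives up on slow queries, that alone would produce exactly this pattern; logging timeouts on the client side and correlating timestamps with these warnings is usually the quickest way to confirm.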
Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?
While I look into doing some refactoring, as well as creating some new UpdateRequestProcessors (and/or backporting), would you please point me to some reading material on why you say the following: "In this day and age, a custom update handler is almost never the right answer to a problem -- nor is a custom request handler that does updates (those two things are actually different) ... my advice is always to start by trying to implement what you need as an UpdateRequestProcessor, and if that doesn't work out then refactor your code to be a Request Handler instead." e.g. the benefits of an UpdateRequestProcessor over a custom update handler? Thanks again for the great pointers, Aaron
Re: What would cause: SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory
Jack, Thanks - this was indeed the issue. I still don't understand exactly why (the same local-nexus-hosted Solr jars were the ones being duplicated on the classpath: included in my custom -with-dependencies jars as well as in the solr war, which was built, distributed, and hosted from the same nexus repo used to host my jars), but shading solr out of my -with-dependencies jars fixed the issue. (If anybody could point me to reading on why this happened - e.g. the classes on the classpath were duplicated but identical, and in my naive understanding of the classloader this should have still just worked - it would be appreciated.) Thanks again, Aaron On Sat, Jun 9, 2012 at 2:40 PM, Jack Krupansky j...@basetechnology.com wrote: Make sure there are no stray jars/classes in your jar, especially any that might contain BaseTokenizerFactory or TokenizerFactory. I notice that your jar name says -with-dependencies, raising a little suspicion. The exception is as if your class was referring to a BaseTokenizerFactory, which implements TokenizerFactory, coming from your jar (or a contained jar) rather than getting resolved to Solr 3.6's own BaseTokenizerFactory and TokenizerFactory. -- Jack Krupansky -Original Message- From: Aaron Daubman Sent: Saturday, June 09, 2012 12:03 AM To: solr-user@lucene.apache.org Subject: What would cause: SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory Greetings, I am in the process of updating custom code and schema from Solr 1.4 to 3.6.0 and have run into the following issue with our two custom Tokenizer and Token Filter components. I've been banging my head against this one for far too long, especially since it must be something obvious I'm missing. I have custom Tokenizer and Token Filter components along with corresponding factories.
The code for all looks very similar to the Tokenizer and TokenFilter (and Factory) code that is standard with 3.6.0 (and I have also read through http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters). I have ensured my custom code is on the classpath; it is in ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar: ---output snip--- Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer load INFO: loading shared library: /opt/test_artists_solr/jetty-solr/lib/en Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar' to classloader Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENUtil-1.0-SNAPSHOT-jar-with-dependencies.jar' to classloader Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer create --snip--- After successfully parsing the schema and creating many fields, etc., the following is logged: ---snip--- Jun 8, 2012 10:41:00 PM org.apache.solr.util.plugin.AbstractPluginLoader load INFO: created : com.company.MyCustomTokenizerFactory Jun 8, 2012 10:41:00 PM org.apache.solr.common.SolrException log SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory at org.apache.solr.schema.IndexSchema$5.init(IndexSchema.java:966) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:148) at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:986) at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60) at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:453) at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:433) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:490) at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:123) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:481) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:335) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:219) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:161) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96) at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:102) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:748) at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:249) at org.eclipse.jetty.webapp
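On the "duplicated but identical classes should still just work" point: in the JVM, a class's identity is the pair (defining classloader, class name), not its bytes. Solr's SolrResourceLoader loads the lib jars under a child classloader; if a -with-dependencies jar bundles its own copy of the Solr analysis classes, the custom factory ends up implementing that copy of TokenizerFactory, while IndexSchema casts against the copy loaded from the war - same name, two distinct classes, hence the ClassCastException despite identical bytecode. Python's import machinery can mimic the effect (a loose analogy only, with module objects standing in for classloaders):

```python
import importlib.util
import os
import tempfile

# One source file standing in for TokenizerFactory's bytecode
source = "class TokenizerFactory:\n    pass\n"
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w") as f:
    f.write(source)

def load_as(module_name):
    """Load the same file under a distinct module name -- our 'classloader'."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

webapp = load_as("webapp_classloader")   # the copy inside the solr war
plugin = load_as("plugin_classloader")   # the copy bundled in the fat jar

factory = plugin.TokenizerFactory()
# Identical source, but two distinct class objects -> the "cast" fails
print(isinstance(factory, webapp.TokenizerFactory))  # False
os.remove(path)
```

Shading works because after package renaming only one copy of org.apache.solr.analysis.TokenizerFactory remains visible under its real name; excluding the Solr jars from the assembly (marking them "provided" in Maven) is the more conventional fix.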
Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?
Hoss, The new FieldValueSubsetUpdateProcessorFactory classes look phenomenal. I haven't looked yet, but what are the chances these will be back-ported to 3.6 (or how hard would it be to backport them?)... I'll have to check out the source in more detail. If stuck on 3.6, what would be the best way to deal with this situation? It's currently looking like it will have to be a custom update handler, but I'd hate to have to go down this route if there are more future-proof options. Thanks again, Aaron On Tue, Jun 5, 2012 at 6:53 PM, Chris Hostetter hossman_luc...@fucit.orgwrote: : The real issue here is that the docs are created externally, and the : producer won't (yet) guarantee that fields that should appear once will : actually appear once. Because of this, I don't want to declare the field as : multiValued=false as I don't want to cause indexing errors. It would be : great for me (and apparently many others after searching) if there were an : option as simple as forceSingleValued=true - where some deterministic : behavior such as use first field encountered, ignore all others, would : occur. This will be trivial in Solr 4.0, using one of the new FieldValueSubsetUpdateProcessorFactory classes that are now available -- just pick your rule... https://builds.apache.org/view/G-L/view/Lucene/job/Solr-trunk/javadoc/org/apache/solr/update/processor/FieldValueSubsetUpdateProcessorFactory.html Direct Known Subclasses: FirstFieldValueUpdateProcessorFactory, LastFieldValueUpdateProcessorFactory, MaxFieldValueUpdateProcessorFactory, MinFieldValueUpdateProcessorFactory -Hoss
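Following up for anyone finding this later: on 4.0, the processor Hoss describes would be wired into solrconfig.xml roughly like this (a sketch based on the linked javadocs - the chain name and field name are placeholders, and the chain is selected per request or per handler via the update.chain parameter; I haven't run this against a 4.0 build):

```xml
<updateRequestProcessorChain name="keep-first-value">
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">f_normalizedValue</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```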
What would cause: SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory
Greetings, I am in the process of updating custom code and schema from Solr 1.4 to 3.6.0 and have run into the following issue with our two custom Tokenizer and Token Filter components. I've been banging my head against this one for far too long, especially since it must be something obvious I'm missing. I have custom Tokenizer and Token Filter components along with corresponding factories. The code for all looks very similar to the Tokenizer and TokenFilter (and Factory) code that is standard with 3.6.0 (and I have also read through http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters I have ensured my custom code is on the classpath, it is in ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar: ---output snip--- Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer load INFO: loading shared library: /opt/test_artists_solr/jetty-solr/lib/en Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar' to classloader Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENUtil-1.0-SNAPSHOT-jar-with-dependencies.jar' to classloader Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer create --snip--- After successfully parsing the schema and creating many fields, etc.. 
the following is logged: ---snip--- Jun 8, 2012 10:41:00 PM org.apache.solr.util.plugin.AbstractPluginLoader load INFO: created : com.company.MyCustomTokenizerFactory Jun 8, 2012 10:41:00 PM org.apache.solr.common.SolrException log SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory at org.apache.solr.schema.IndexSchema$5.init(IndexSchema.java:966) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:148) at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:986) at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60) at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:453) at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:433) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:490) at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:123) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:481) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:335) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:219) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:161) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96) at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:102) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:748) at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:249) at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1222) at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:676) at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:455) at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36) at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183) at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491) at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138) at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142) at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53) at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604) at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535) at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398) at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552) at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63) at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53) at
Re: What would cause: SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory
Just in case it is helpful, here are the relevant pieces of my schema.xml:

---snip---
<fieldtype name="customfield" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.company.MyCustomTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
  </analyzer>
</fieldtype>
---snip---

and

---snip---
<fieldtype name="customterms" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="com.company.MyCustomFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="false"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\-" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="&amp;amp;" replacement="&amp;" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement="" replace="all"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldtype>
---snip---

On Sat, Jun 9, 2012 at 12:03 AM, Aaron Daubman daub...@gmail.com wrote: Greetings, I am in the process of updating custom code and schema from Solr 1.4 to 3.6.0 and have run into the following issue with our two custom Tokenizer and Token Filter components.
I've been banging my head against this one for far too long, especially since it must be something obvious I'm missing. I have custom Tokenizer and Token Filter components along with corresponding factories. The code for all looks very similar to the Tokenizer and TokenFilter (and Factory) code that is standard with 3.6.0 (and I have also read through http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters I have ensured my custom code is on the classpath, it is in ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar: ---output snip--- Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer load INFO: loading shared library: /opt/test_artists_solr/jetty-solr/lib/en Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar' to classloader Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENUtil-1.0-SNAPSHOT-jar-with-dependencies.jar' to classloader Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer create --snip--- After successfully parsing the schema and creating many fields, etc.. 
the following is logged: ---snip--- Jun 8, 2012 10:41:00 PM org.apache.solr.util.plugin.AbstractPluginLoader load INFO: created : com.company.MyCustomTokenizerFactory Jun 8, 2012 10:41:00 PM org.apache.solr.common.SolrException log SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory at org.apache.solr.schema.IndexSchema$5.init(IndexSchema.java:966) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:148) at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:986) at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60) at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:453) at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:433) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:490) at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:123) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:481) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:335) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:219
Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?
Thanks for the responses, "By saying dirty data you imply that only one of the values is good or clean and that the others can be safely discarded/ignored, as opposed to true multi-valued data where each value is there for good reason and needs to be preserved. In any case, how do you know/decide which value should be used for sorting - and did you just get lucky that Solr happened to use the right one?" I haven't gone back and checked the old version's docs where this was working; however, I suspect that either the field never ended up appearing in docs more than once, or if it did, it had the same value repeated... The real issue here is that the docs are created externally, and the producer won't (yet) guarantee that fields that should appear once will actually appear once. Because of this, I don't want to declare the field as multiValued=false, as I don't want to cause indexing errors. It would be great for me (and apparently many others, after searching) if there were an option as simple as forceSingleValued=true - where some deterministic behavior, such as "use the first value encountered, ignore all others", would occur. "The preferred technique would be to preprocess and clean the data before it is handed to Solr or SolrJ, even if the source must remain dirty. Barring that, a preprocessor or a custom update processor, certainly." I could write preprocessors (this is really what will eventually happen when the producer cleans their data), custom processors, etc.; however, for something this simple it would be great not to be producing more code that would have to be maintained. "Please clarify exactly how the data is being fed into Solr." I am using generic code to read from a key/value store and compose documents. This is another reason fixing the data at this point would not be desirable: the currently generic code would need to be made specific, to look for these particular fields and then coerce them to single values... Thanks again, Aaron
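Since the feeding code is generic, one low-maintenance middle ground is a tiny field-agnostic collapse step between the key/value read and the document submit - a sketch (field names are placeholders; "keep the first value" is the arbitrary-but-deterministic rule mentioned above):

```python
def force_single_valued(doc, fields):
    """Deterministically collapse the listed fields to their first value."""
    cleaned = dict(doc)  # leave the caller's document untouched
    for name in fields:
        value = cleaned.get(name)
        if isinstance(value, (list, tuple)) and value:
            cleaned[name] = value[0]  # "use first value encountered, ignore all others"
    return cleaned

doc = {"id": "42", "f_normalizedValue": [0.7, 0.3], "title": "ok"}
print(force_single_valued(doc, ["f_normalizedValue"]))
# {'id': '42', 'f_normalizedValue': 0.7, 'title': 'ok'}
```

The list of fields to collapse can live in configuration next to the schema, so the feeding code itself stays field-agnostic.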
Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?
Greetings, I have dirty source data where some documents being indexed, although unlikely, may contain multivalued fields that are also required for sorting. In previous versions of Solr, sorting on this field worked fine (possibly because few or no multivalued fields were ever encountered?); however, as of 3.6.0, thanks to https://issues.apache.org/jira/browse/SOLR-2339, attempting to sort on this field now throws an error: [2012-06-04 17:20:01,691] ERROR org.apache.solr.common.SolrException org.apache.solr.common.SolrException: can not sort on multivalued field: f_normalizedValue The relevant bits of the schema.xml are:

<fieldType name="sfloat" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0" sortMissingLast="true"/>
<dynamicField name="f_*" type="sfloat" indexed="true" stored="true" required="false" multiValued="true"/>

Assuming that the source documents being indexed cannot be changed (which, at least for now, they cannot), what would be the next best way to allow for both the possibility of multiple f_normalizedValue fields appearing in indexed documents, as well as being able to sort by f_normalizedValue? Thank you, Aaron
Re: Tips on creating a custom QueryCache?
Hoss, : 1) Any recommendations on which best to sub-class? I'm guessing, for this : scenario with rare batch puts and no evictions, I'd be looking for get : performance. This will also be on a box with many CPUs - so I wonder if the : older LRUCache would be preferable? "I suspect you are correct ... the entire point of the other caches is dealing with faster replacement, so you really don't care. You might even find it worthwhile to write your own NoReplacementCache from scratch backed by a HashMap (instead of the LinkedHashMap used in LRUCache)" I really like this idea (roll-your-own cache using a simple HashMap). However, as much searching as I've done, I've come up short on anything that describes concurrency in Solr. The short question is: for such a cache, do I need to worry about concurrent access? (I'm guessing that the firstSearcher QuerySenderListener process would be single-threaded/non-concurrent, and thus writes would never be an issue - is this correct?) - e.g. for my case, would I back the NoReplacementCache with a plain HashMap, or something else? The bigger question is: what are the parallel task execution paths in Solr, and under what conditions are they possible? Thanks again, Aaron
Example setup of using Solr 3.6.0 with Jetty 7 (7.6.3)?
Greetings, Has anybody gotten Solr 3.6.0 to work well with Jetty 7.6.3, and if so, would you mind sharing your config files / directory structure / other useful details? Thanks, Aaron
Generating maven artifacts for 3.6.0 build - correct -Dversion to use?
Greetings, Following the directions here: http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/maven/README.maven for building Lucene/Solr with Maven, what is the correct -Dversion to pass in to get-maven-poms? The process seems set up for building -SNAPSHOT versions; however, I would like to use Maven to build the 3.6.0 tag. If I set the version to 3.6.0, this causes issues with Lucene, which seems to really only want version 3.6 (no .0) and even causes the version check test to fail. What is the correct version to pass in to get-maven-poms for a 3.6.0 release build via Maven? Thanks, Aaron
Re: Tips on creating a custom QueryCache?
Thanks for the reply, Do you have any pointers to relevant Docs or Examples that show how this should be chained together? Thanks again, Aaron On Thu, May 24, 2012 at 3:03 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Perhaps this could be a custom SearchComponent that's run before the usual QueryComponent? This component would be responsible for loading queries, executing them, caching results, and for returning those results when these queries are encountered later on. Otis From: Aaron Daubman daub...@gmail.com Subject: Tips on creating a custom QueryCache? Greetings, I'm looking for pointers on where to start when creating a custom QueryCache. Our usage patterns are possibly a bit unique, so let me explain the desired use case: Our Solr index is read-only except for dedicated periods where it is updated and re-optimized. On startup, I would like to create a specific QueryCache that would cache the top ~20,000 (arbitrary but large) queries. This cache should never evict entries, and, after the warming process to populate, should never be added to either. The warming process would be to run through the (externally determined) list of anticipated top X (say 20,000) queries and cache these results. This cache would then be used for the duration of the solr run-time (until the period, perhaps daily, where the index is updated and re-optimized, at which point the cache would be re-created) Where should I begin looking to implement such a cache? The reason for this somewhat different approach to caching is that we may get any number of odd queries throughout the day for which performance isn't important, and we don't want any of these being added to the cache or evicting other entries from the cache. We need to ensure high performance for this pre-determined list of queries only (while still handling other arbitrary queries, if not as quickly) Thanks, Aaron
Re: Tips on creating a custom QueryCache?
Hoss, brilliant as always - many thanks! =) Subclassing the SolrCache class sounds like a good way to accomplish this. Some questions: 1) Any recommendations on which best to sub-class? I'm guessing, for this scenario with rare batch puts and no evictions, I'd be looking for get performance. This will also be on a box with many CPUs - so I wonder if the older LRUCache would be preferable? 2) Would I need to worry about auto warming at all? I'm still a little foggy on the lifecycle of firstSearcher versus newSearcher (is firstSearcher really only ever called the first time the Solr instance is started?). In any case, since the only time a commit would occur is when batch updates, re-indexing, and re-optimizing occur (once a day off-peak, perhaps), I *think* I would always want to perform the same static warming rather than attempting to auto-warm from the old cache - does this make sense? Thanks again! Aaron On Thu, May 24, 2012 at 7:38 PM, Chris Hostetter hossman_luc...@fucit.org wrote: Interesting problem. W/o making any changes to Solr, you could probably get this behavior by:

a) sizing your cache large enough
b) using a firstSearcher that generates your N queries on startup
c) configuring autowarming of 100%
d) ensuring every query you send uses cache=false

The tricky part being d. But if you don't mind writing a little Java, I think this should actually be fairly trivial to do w/o needing d at all...

1) subclass the existing SolrCache class of your choice.
2) in your subclass, make put be a No-Op if getState()==LIVE, else super.put(...)
...so during any warming phase (either static from firstSearcher/newSearcher, or because of autowarming) the cache will accept new objects, but once warming is done it will ignore requests to add new items (so it will never evict anything). Then all you need is a firstSearcher event listener that feeds you your N queries (model it after QuerySenderListener but have it read from whatever source you want instead of the solrconfig.xml). : The reason for this somewhat different approach to caching is that we may : get any number of odd queries throughout the day for which performance : isn't important, and we don't want any of these being added to the cache or : evicting other entries from the cache. We need to ensure high performance : for this pre-determined list of queries only (while still handling other : arbitrary queries, if not as quickly) FWIW: my de facto way of dealing with this in the past was to siloize my slave machines by use case. For example, in one index: I had 1 master, which replicated to 2*N slaves, as well as a repeater. The 2*N slaves were behind 2 diff load balancers (N even-numbered machines and N odd-numbered machines), and the two sets of slaves had diff static cache warming configs - even-numbered machines warmed queries common to browsing categories, odd-numbered machines warmed top-searches. If the front end was doing an arbitrary search, it was routed to the load balancer for the odd-numbered slaves. If the front end was doing a category browse, the query was routed to the even-numbered slaves. Meanwhile: the repeater was replicating out to a bunch of smaller one-off boxes with cache configs by use case, ie: the data-warehouse and analytics team had their own slave they could run their own complex queries against. The tools team had a dedicated slave that various internal tools would query via ajax to get metadata, etc... -Hoss
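Hoss's steps 1-2 are only a few lines in any language. A Python sketch of the idea (the real thing would subclass a SolrCache implementation and check getState() == State.LIVE inside put rather than track its own flag):

```python
class NoReplacementCache:
    """Fill during warming; ignore puts forever once the searcher goes live."""

    def __init__(self):
        self._entries = {}   # plain dict: no eviction, no LRU bookkeeping
        self._live = False

    def put(self, key, value):
        if not self._live:   # mirrors: no-op when getState() == LIVE
            self._entries[key] = value

    def set_live(self):
        """Called when warming finishes (SolrCache.setState in the real thing)."""
        self._live = True

    def get(self, key):
        return self._entries.get(key)

cache = NoReplacementCache()
cache.put("top-query-1", "warmed results")   # during firstSearcher warming
cache.set_live()
cache.put("odd-query", "ignored")            # arbitrary daytime query
print(cache.get("top-query-1"), cache.get("odd-query"))  # warmed results None
```

On the concurrency question raised earlier in the thread: if all puts happen on the warming thread before the searcher is registered, and everything afterwards is read-only, a plain HashMap read from many threads is fine in practice - though using ConcurrentHashMap costs little and avoids having to reason about safe publication at all.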
Tips on creating a custom QueryCache?
Greetings, I'm looking for pointers on where to start when creating a custom QueryCache. Our usage patterns are possibly a bit unique, so let me explain the desired use case: Our Solr index is read-only except for dedicated periods where it is updated and re-optimized. On startup, I would like to create a specific QueryCache that would cache the top ~20,000 (arbitrary but large) queries. This cache should never evict entries, and, after the warming process to populate, should never be added to either. The warming process would be to run through the (externally determined) list of anticipated top X (say 20,000) queries and cache these results. This cache would then be used for the duration of the solr run-time (until the period, perhaps daily, where the index is updated and re-optimized, at which point the cache would be re-created) Where should I begin looking to implement such a cache? The reason for this somewhat different approach to caching is that we may get any number of odd queries throughout the day for which performance isn't important, and we don't want any of these being added to the cache or evicting other entries from the cache. We need to ensure high performance for this pre-determined list of queries only (while still handling other arbitrary queries, if not as quickly) Thanks, Aaron