Re: Solr trunk for production
Otis Gospodnetic wrote:
> Are people using Solr trunk in serious production environments? I suspect the answer is yes, just want to see if there are any gotchas/warnings.

Yes, since it seemed the best way to get edismax with this patch[1], and to get the more update-friendly MergePolicy[2].

The main gotcha I've noticed so far is figuring out appropriate times to sync with trunk's newer patches, and whether or not we need to rebuild our kinda big (1TB) indexes when we do.

[1] the patch I needed: https://issues.apache.org/jira/browse/SOLR-2058
[2] nicer MergePolicy: https://issues.apache.org/jira/browse/LUCENE-2602
Would it be nuts to store a bunch of large attachments (images, videos) in stored-but-not-indexed fields
I have some documents with a bunch of attachments (images, thumbnails for them, audio clips, Word docs, etc.), and am currently dealing with them by just storing a filesystem path to them in Solr, and then jumping through hoops to keep the files in sync with Solr.

Would it be nuts to stick the image data itself in Solr?

More specifically: if I have a bunch of large stored fields, would it significantly impact search performance in the cases when those fields aren't fetched?

Searches are very common in this system, and it's very rare that someone actually opens up one of these attachments, so I'm not really worried about the time it takes to fetch them when someone does actually want one.
Re: Solr sorting problem
Savvas-Andreas Moysidis wrote:
> In my understanding sorting on a field for which analysis has yielded multiple terms just doesn't make sense. If you have document#1 with a field A which has the terms Epsilon, Alpha, and document#2 with field A which has the terms Beta, Delta, and request an ascending sort on field A, what order should you get and why?

In the couple of use cases where I've been asked for it, either...
 (a) returning each document only the first time it appeared: doc1 [alpha] followed by doc2 [beta]
 (b) or returning them with duplicates: doc1 [alpha], doc2 [beta], doc2 [delta], doc1 [epsilon]
... would have been an OK user experience.

The use case "show me documents relevant to things close to a location" seems like a pretty broad use case that any geospatial-aware search engine would like to handle; and I imagine in many cases a single document might refer to multiple addresses/locations.

In another case, I was asked if the application could sort the incidents by the age of rape victims. And while most incidents involved a single victim, some had 2 or more. The idea wasn't to impose some total ordering but rather to make it quick to find documents involving younger people.

I realize I can work around that one by adding a min-age column. For the spatial one, where different users might pick different center points, I can't think of any good workaround beyond Jonathan's idea of facets -- perhaps overlaying some map grid on the data and using facets for that.

> On 27 October 2010 17:56, Jonathan Rochkind rochk...@jhu.edu wrote:
>> I would suggest that trying to sort on a multi-token/multi-value value in the first place ought to always raise an exception. Are there any reasons why you'd EVER want to do this, with the way it currently works? Letting people do this and only _sometimes_ raise an exception, but never do anything that's actually reasonable, just adds confusion for newbies.
>>
>> Alternately, perhaps sorting on a multi-valued or tokenized field ought to sort only on the FIRST token found in the first value of the field, but I'm not sure how feasible that is to code.
>>
>> Ron, for your particular use case -- lucene sorting just can't really do that, I'm not sure there's a WAY to code sorting that works on multi-valued fields. A given lucene/solr search results set only includes each document ONCE. So where would that document appear in your sort on a multi-valued field? A different solution is required.
>>
>> I too sometimes have similar use cases, and my best ideas about how to solve them involve using faceting -- you can facet on a multi-valued field, and you can sort facets -- but you can only sort facets by index order, a strict byte-by-byte sort. Which doesn't always work for me either. I haven't quite figured out the solution to this sort of problem.
>>
>> Ron Mayer wrote:
>>> Lance Norskog wrote:
>>>> You may not sort on a tokenized field. You may not sort on a multiValued field. You can only have one term in a field. If there are more search terms than documents, A) sorting doesn't mean anything and B) Lucene will throw an exception.
>>>
>>> Is that considered a feature, or an annoyance/bug?
>>>
>>> One of the things I'm using Solr for is to store a whole bunch of documents about crime events that contain information roughly like this: "the gang member ran the red light at 100 main st, and continued driving to 500 main street where he hit a car. He fled his car and ran to 789 second avenue where he hijacked another car and drove to his house at 654 someother st"
>>>
>>> If I do a search for the name of that gang member's gang, I'd really really like to be able to sort my documents by distance from a location -- for example to quickly find any documents referring to gang activity in a neighborhood. And I'd really like to see this document near the top of my search results whether the user chose 100 main, 500 main, 790 second, or 650 someother street as his center point for sorting his search.
>>>
>>> If I wanted that so badly that I'd be willing to try coding it so you _could_ sort on multiValued fields, would people want that feature? If so, would someone know off the top of their head where I should get started looking in the code? Or is it considered a feature that Solr currently disallows it?
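
For anyone trying to picture the two orderings (a) and (b) described at the top of this message, here is a minimal sketch in plain Python. It is only an illustration of the desired result order, not Solr or Lucene code; the function names and the `docs` dictionary are made up for the example.

```python
# Illustration only: plain Python, not Solr/Lucene code.
# Each document carries a multi-valued field, as in the example above.
docs = {
    "doc1": ["Epsilon", "Alpha"],
    "doc2": ["Beta", "Delta"],
}


def sort_with_duplicates(docs):
    """Option (b): one (doc, term) entry per value, ordered by term."""
    pairs = [(term, doc_id) for doc_id, terms in docs.items() for term in terms]
    return [(doc_id, term) for term, doc_id in sorted(pairs)]


def sort_first_appearance(docs):
    """Option (a): keep each document only at its first (smallest) term."""
    seen = set()
    result = []
    for doc_id, term in sort_with_duplicates(docs):
        if doc_id not in seen:
            seen.add(doc_id)
            result.append((doc_id, term))
    return result


print(sort_first_appearance(docs))
print(sort_with_duplicates(docs))
```

Option (a) amounts to sorting each document by the smallest value in its field, which is the same thing the min-age column workaround precomputes at index time.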
If I want to move a core from one physical machine to another....
If I want to move a core from one physical machine to another, is it as simple as just

  scp -r core5 otherserver:/path/on/other/server/

and then adding

  <core name="core5name" instanceDir="core5" />

to that other server's solr.xml file and restarting the server there?

PS: Should I have been able to figure out the answer to that by RTFM somewhere?
Re: a bug of solr distributed search
Andrzej Bialecki wrote:
> On 2010-10-25 11:22, Toke Eskildsen wrote:
>> On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote:
>>> But it shows a problem of distributed search without common idf. A doc will get different score in different shard.
>>
>> Bingo. I really don't understand why this fundamental problem with sharding isn't mentioned more often. Every time the advice "use sharding" is given, it should be followed with a "but be aware that it will make relevance ranking unreliable".
>
> The reason is twofold, I think:

And a third potential reason - it's arguably a feature instead of a bug for some applications. Depending on how I organize my shards, "give me the most relevant document from each shard for this search" seems like it could be useful.

> * there is an exact solution to this problem, namely to make two distributed calls instead of one (first call to collect per-shard IDFs for given query terms, second call to submit a query rewritten with the global IDF-s). This solution is implemented in SOLR-1632, with some caching to reduce the cost for common queries. However, this means that now for every query you need to make two calls instead of one, which potentially doubles the time to return results (for simple common queries - for rare complex queries the time will be still dominated by the query runtime on shard servers).
>
> * another reason is that in many many cases the difference between using exact global IDF and per-shard IDFs is not that significant. If shards are more or less homogeneous (e.g. you assign documents to shards by hash(docId)) then term distributions will be also similar. So then the question is whether you can accept an N% variance in scores across shards, or whether you want to bear the cost of an additional distributed RPC for every query...
>
> To summarize, I would qualify your statement with: "...if the composition of your shards is drastically different". Otherwise the cost of using global IDF is not worth it, IMHO.
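
To make the per-shard versus global IDF point concrete, here is a rough sketch in plain Python. It uses the classic Lucene-style formula 1 + ln(numDocs / (docFreq + 1)); the shard names and document counts are invented for the example, and nothing here reflects how SOLR-1632 is actually implemented.

```python
import math


def idf(num_docs, doc_freq):
    """Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def compare(shards):
    """Return (per-shard idf dict, global idf) for one query term."""
    per_shard = {name: idf(n, df) for name, (n, df) in shards.items()}
    total_docs = sum(n for n, _ in shards.values())
    total_df = sum(df for _, df in shards.values())
    return per_shard, idf(total_docs, total_df)


# Hypothetical statistics: (documents in shard, documents containing the term).
skewed = {"shard_a": (1000, 10), "shard_b": (1000, 400)}   # shards split by document type
hashed = {"shard_a": (1000, 200), "shard_b": (1000, 200)}  # shards split by hash(docId)

for label, shards in (("skewed", skewed), ("hashed", hashed)):
    per_shard, global_idf = compare(shards)
    print(label, {k: round(v, 3) for k, v in per_shard.items()}, round(global_idf, 3))
```

With the skewed shards the same term gets a very different weight depending on which shard scores the document; with the hash-distributed shards the per-shard values are nearly identical to the global one, which is the case where the extra round trip buys little.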
Re: Solr sorting problem
Erick Erickson wrote:
> In general, the behavior when sorting is not predictable when sorting on a tokenized field, which "text" is. What would it mean to sort on a field with "erick" "Moazzam" as tokens in a single document? Should it be in the "e"s or the "m"s?

Might it be possible or reasonable to have it show up under both "e" and "m"? Or if not, just at the first one it finds?

I've recently been asked a similar question where we wanted to sort documents by a victim's age. I have a victim_age field, but since there can be multiple victims in an incident it wasn't a unique field. As a workaround, I added a victim_age_min field; but it would have been easier if I didn't need to do that.

> That said, you probably want to watch out for case
>
> Best
> Erick
>
> On Fri, Oct 22, 2010 at 10:02 AM, Moazzam Khan moazz...@gmail.com wrote:
>> For anyone who faced the same problem, changing the field to string from text worked!
>>
>> -Moazzam
>>
>> On Fri, Oct 22, 2010 at 8:50 AM, Moazzam Khan moazz...@gmail.com wrote:
>>> The field type of the first name and last name is text. Could that be why it's not sorting properly? I just changed it to string and started a full-import. Hopefully that will work.
>>>
>>> Thanks,
>>> Moazzam
>>>
>>> On Thu, Oct 21, 2010 at 7:42 PM, Jayendra Patil jayendra.patil@gmail.com wrote:
>>>> need additional information.
>>>> Sorting is easy in Solr just by passing the sort parameter.
>>>> However, when it comes to text sorting it depends on how you analyse and tokenize your fields.
>>>> Sorting does not work on fields with multiple tokens.
>>>> http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F
>>>>
>>>> On Thu, Oct 21, 2010 at 7:24 PM, Moazzam Khan moazz...@gmail.com wrote:
>>>>> Hey guys,
>>>>>
>>>>> I have a list of people indexed in Solr. I am trying to sort by their first names but I keep getting results that are not alphabetically sorted (I see the names starting with W before the names starting with A). I have a feeling that the results are first being sorted by relevancy then sorted by first name.
>>>>>
>>>>> Is there a way I can get the results to be sorted alphabetically?
>>>>>
>>>>> Thanks,
>>>>> Moazzam
Re: Prioritizing adjectives in solr search
Erick Erickson wrote:
> You can do some interesting things with payloads. You could index a particular value as the payload that identified the "kind" of word it was, where "kind" is something you define. Then at query time, you could boost depending on what kind of word you identified it as in both the query and at indexing time.
>
> But I can't even imagine how one would go about supporting this in a general search engine. This kind of thing seems far too domain specific.

Well, the pf2 and pf3 parameters in edismax come pretty close.

For example, for the search query "red baseball cap black leather jacket", a pf2 with no phrase slop, combined with a pf2 with a phrase slop of 3, will do a pretty good job at finding "red caps" and "black jackets" and "baseball caps" and "leather jackets" before it'll find "red baseball jackets" and "leather caps". All it depends on is the convention that in English someone will probably put adjectives before nouns in both the query and the document's text.

The one annoyance is that I think the phrase slop doesn't care much about the order of words.

> On Sun, Oct 10, 2010 at 8:50 PM, Ron Mayer r...@0ape.com wrote:
>> Walter Underwood wrote:
>>> I think this is a bad idea. The tf.idf algorithm will already put a higher weight on "hammers" than on "blue", because "hammers" will be more rare than "blue". Plus, you are making huge assumptions about the queries. In a search for "Canon camera", "Canon" is an adjective, but it is the important part of the query.
>>>
>>> Have you looked at your query logs and which queries are successful and which are not? Don't make radical changes like this unless you can justify them from the logs.
>>
>> The one radical change I'd like in the area of adjectives in noun clauses is if more weight were put when the adjectives apply to the appropriate noun.
>>
>> For example, a search for 'red baseball cap black leather jacket' should find a doc with "the guy wore a red cap, blue jeans, and a leather jacket" before one that says "the guy wore a black cap, leather pants, and a red jacket".
>>
>> The closest I've come at doing this was to use a variety of phrase slop boosts simultaneously - so that "red [any_few_words] cap", "baseball cap", "leather jacket", "black [any_few_words] jacket" all add boosts to the score.
>>
>>> wunder
>>>
>>> On Oct 4, 2010, at 8:38 PM, Otis Gospodnetic wrote:
>>>> Hi,
>>>>
>>>> If you want "blue" to be used in search, then you should not treat it as a stopword.
>>>>
>>>> Re payloads: http://search-lucene.com/?q=payload+score and http://search-lucene.com/?q=payload+score&fc_type=wiki (even better, look at hit #1)
>>>>
>>>> Otis
>>>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>>>> Lucene ecosystem search :: http://search-lucene.com/
>>>>
>>>> - Original Message
>>>> From: Hasnain hasn...@hotmail.com
>>>> To: solr-user@lucene.apache.org
>>>> Sent: Mon, October 4, 2010 9:50:46 AM
>>>> Subject: Re: Prioritizing advectives in solr search
>>>>
>>>>> Hi Otis,
>>>>>
>>>>> Thank you for replying, unfortunately I'm unable to fully grasp what you are trying to say, can you please elaborate what is payload with adjective terms?
>>>>>
>>>>> Also I'm using stopwords.txt to stop adjectives, adverbs and verbs. Now when I search for "Blue hammers", solr searches for "blue hammers" and "hammers" but not "blue", but the problem here is the user can also search for just "Blue", and then it won't search for anything... any suggestions on this??
>>>>>
>>>>> --
>>>>> View this message in context: http://lucene.472066.n3.nabble.com/Prioritizing-adjectives-in-solr-search-tp1613029p1629725.html
>>>>> Sent from the Solr - User mailing list archive at Nabble.com.
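
A small plain-Python sketch of the idea in this thread, for illustration only. `word_pairs` shows the kind of adjacent word pairs that pf2 turns into phrase queries, and `ordered_within` is a made-up helper showing the order-sensitive "adjective [any_few_words] noun" match I'd ideally like; neither function is Solr code, and the scoring Solr actually does is more involved.

```python
def word_pairs(query):
    """Adjacent word pairs, the kind of shingles pf2 builds phrase queries from."""
    words = query.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def ordered_within(text, first, second, max_gap):
    """True if `second` follows `first` with at most `max_gap` words in between."""
    words = text.lower().replace(",", " ").split()
    for i, word in enumerate(words):
        if word == first and second in words[i + 1 : i + 2 + max_gap]:
            return True
    return False


good = "the guy wore a red cap, blue jeans, and a leather jacket"
bad = "the guy wore a black cap, leather pants, and a red jacket"

print(word_pairs("red baseball cap black leather jacket"))
for doc in (good, bad):
    print(ordered_within(doc, "red", "cap", 3), ordered_within(doc, "leather", "jacket", 0))
```

The first document matches both "red ... cap" and "leather jacket" in order; the second matches neither, which is the ranking difference the phrase boosts are meant to approximate.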
Can I tell Solr to merge segments more slowly on an I/O starved system?
My system, which has documents being added pretty much continually, seems pretty well behaved except, it seems, when large segments get merged. During that time the system starts really dragging, and queries that took only a couple of seconds are taking dozens.

Some other I/O bound servers seem to have features that let you throttle how much I/O they take for administrative background tasks -- for example PostgreSQL's vacuum_cost_delay and related parameters[1], which are described as:

  "The intent of this feature is to allow administrators to reduce the I/O impact of these commands on concurrent database activity. There are many situations in which it is not very important that maintenance commands like VACUUM and ANALYZE finish quickly; however, it is usually very important that these commands do not significantly interfere with the ability of the system to perform other database operations. Cost-based vacuum delay provides a way for administrators to achieve this."

Are there any similar features for Solr, where it can sacrifice the speed of doing a commit in favor of leaving more I/O bandwidth for users performing searches?

If not, where in the code might I look to add such a feature?

Ron

[1] http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html
Re: Inconsistent search results with multiple keywords
Stéphane Corlosquet wrote:
> Hi all,
>
> I'm new to solr so please let me know if there is a more appropriate place for my question below.
>
> I'm noticing a rather unexpected number of results when I add more keywords to a search. I'm listing below an example (where I replaced the real keywords with placeholders):
>
> keyword1                              851 hits
> keyword1 keyword2                      90 hits
> keyword1 keyword2 keyword3            269 hits
> keyword1 keyword2 keyword3 keyword4    47 hits
>
> As you can see, adding k2 narrows down the number of results (as I would expect), but adding k3 to k1 and k2 suddenly increases the number of results. With 4 keywords, the results have been narrowed down again.

My guess - you might have it configured so that at least 60% of keywords have to hit. For 1 or 2 keywords, that means they all need to hit. For 3 keywords, that means 2 of the 3 need to match. For 4 keywords, that means 3 of the 4 need to hit.

Or you might have a more complicated expression with effectively the same results.

It might be this mm parameter you're looking for:
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
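
The arithmetic behind that guess can be sketched as below. The function name and the `round_up` switch are made up for illustration. The numbers in the reply (1, 2, 2, 3) correspond to rounding the percentage up; the Solr documentation for mm says a plain percentage is rounded down, so a real configuration that behaves exactly as described may well be one of the more complicated conditional mm expressions rather than a bare "60%".

```python
def required_matches(num_clauses, percent, round_up=False):
    """How many of `num_clauses` optional keywords must match for a given percentage.

    Integer arithmetic only, to avoid floating-point surprises.
    """
    scaled = num_clauses * percent
    if round_up:
        return -(-scaled // 100)  # ceiling division
    return scaled // 100          # floor division


for n in range(1, 5):
    print(n, "keywords:",
          "rounded up ->", required_matches(n, 60, round_up=True),
          "| rounded down ->", required_matches(n, 60))
```

Either way the effect is the same in kind: once a third keyword is added, fewer than all of the keywords are required, so the hit count can go up instead of down.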
Re: Null Pointer Exception with shards&facets where some shards have no values for some facets.
Ron Mayer wrote:
> Yonik Seeley wrote:
>> I just checked in the last part of those changes that should eliminate any restriction on key. But, that last part dealt with escaping keys that contained whitespace or }
>> Your example really should have worked after my previous 2 commits. Perhaps not all of the servers got successfully upgraded?
>
> Yes, quite possible.
>
>> Can you try trunk again now?
>
> Will check sometime tomorrow.

Yes, looks good now. Thanks!
Re: Null Pointer Exception with shards&facets where some shards have no values for some facets.
Yonik Seeley wrote:
> On Tue, Sep 7, 2010 at 8:31 PM, Ron Mayer r...@0ape.com wrote:
>> Short summary:
>>  * Mixing Facets and Shards give me a NullPointerException when not all docs have all facets.
>
> https://issues.apache.org/jira/browse/SOLR-2110
>
> I believe the underlying real issue stemmed from your use of a complex key "involvement/race_facet".

Thanks! Yes - that looks like the actual reason, rather than what I was guessing.

I spent a while this morning trying to reproduce the problem with a simpler example, and wasn't able to - probably because I overlooked that part.

I see changes have been made (based on comments in SOLR-2110 and SOLR-2111), so I'll try with the current trunk..

[trying now with trunk as of a few minutes ago]

Looking much better. I'm seeing this in the log files:

SEVERE: Exception during facet.field of {!terms=$involvement/gender_facet__terms}involvement/gender_facet:org.apache.lucene.queryParser.ParseException: Expected identifier at pos 20 str='{!terms=$involvement/gender_facet__terms}involvement/gender_facet'
    at org.apache.solr.search.QueryParsing$StrParser.getId(QueryParsing.java:718)
    at org.apache.solr.search.QueryParsing.parseLocalParams(QueryParsing.java:165)
    ...

but at least I'm getting results, and results that look right for both the body of the document and for most of the facets.

Perhaps the next thing I try will be simplifying my keys, for my own sanity as much as for Solr's.
Re: Null Pointer Exception with shards&facets where some shards have no values for some facets.
Yonik Seeley wrote:
> I just checked in the last part of those changes that should eliminate any restriction on key. But, that last part dealt with escaping keys that contained whitespace or }
> Your example really should have worked after my previous 2 commits. Perhaps not all of the servers got successfully upgraded?

Yes, quite possible.

> Can you try trunk again now?

Will check sometime tomorrow.
Re: Null pointer exception when mixing highlighter shards q.alt
Marc Sturlese wrote:
> I noticed that long ago. Fixed it doing in HighlightComponent finishStage:
> ...
> public void finishStage(ResponseBuilder rb) { ... }

Thanks! I'll try that.

I also seem to have a similar problem with shards + facets -- in particular it seems like the error occurs when some of the shards have no values for some of the facets.

Any chance you (or anyone else) have a fix for that one too?

Here's the backtrace I'm getting from a few-day-old svn trunk.

Sep 7, 2010 6:03:58 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
    at org.apache.solr.handler.component.FacetComponent.refineFacets(FacetComponent.java:340)
    at org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:232)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:301)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Null Pointer Exception with shards&facets where some shards have no values for some facets.
Short summary:
 * Mixing Facets and Shards gives me a NullPointerException when not all docs have all facets.
 * Attached patch improves the failure mode, but still spews errors in the log file
 * Suggestions how to fix that would be appreciated.

In my system, I tried separating out a couple of similar but different types of documents into a couple of different shards.

Both shards have the identical schema, with the facets defined as a dynamicField:

  <dynamicField name="*_facet" type="string" indexed="true" stored="false" multiValued="true" />

Some facets only have documents with a value for them in the first shard; other facets only have documents with a value for them in the second shard.

When I try to do a query that asks for a facet.field that only has values in the first shard, and for a different facet.field that only has values in the second shard, I'm getting this exception:

Sep 7, 2010 4:55:38 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
    at org.apache.solr.handler.component.FacetComponent.refineFacets(FacetComponent.java:340)
    at org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:232)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:301)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

I don't have a really simple test case yet, but could work on one if it'd make it easier to track down. Also, I could post the schema and solrconfig if that'd help.
The attached patch seems to mostly work for me, in that it's returning valid search results and at least some facet information, but with that patch I'm then getting this exception showing up:

Sep 7, 2010 5:28:30 PM org.apache.solr.common.SolrException log
SEVERE: Exception during facet counts:org.apache.lucene.queryParser.ParseException: Expected identifier at pos 20 str='{!terms=$involvement/race_facet__terms}involvement/race_facet'
    at org.apache.solr.search.QueryParsing$StrParser.getId(QueryParsing.java:718)
    at org.apache.solr.search.QueryParsing.parseLocalParams(QueryParsing.java:165)
    at org.apache.solr.search.QueryParsing.getLocalParams(QueryParsing.java:221)
    at org.apache.solr.request.SimpleFacets.parseParams(SimpleFacets.java:102)
    at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:327)
    at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:188)
    at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
Re: Null Pointer Exception with shards&facets where some shards have no values for some facets.
Yonik Seeley wrote:
> Thanks for the report Ron, can you open a JIRA issue?

Sure. I'll do it at work tomorrow morning, hopefully after I try to verify with a standalone test case.

> What version of Solr is this?

This is trunk as of a few days ago. I can update to the latest trunk and check there too.

> -Yonik
> http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
>
> On Tue, Sep 7, 2010 at 8:31 PM, Ron Mayer r...@0ape.com wrote:
>> Short summary:
>>  * Mixing Facets and Shards give me a NullPointerException when not all docs have all facets.
>>  * Attached patch improves the failure mode, but still spews errors in the log file
>>  * Suggestions how to fix that would be appreciated.
>> [...]
Many sparse facets?
Is there a good way of handling a large number of facets that are quite sparse (most documents not having any value for most facets)?

In my system I have quite a few documents (a few million, soon growing to the mid tens of millions), and our users are requesting an ever-increasing number of facets (currently 80, and growing). Many of the facets are not present in the vast majority of the documents (often a facet is only present in under 100K or so docs).

Am I right in understanding that lines in the log file like this:

INFO: UnInverted multi-valued field {field=cvroffsn_facet,memSize=55532332,tindexSize=132,time=2296,phase1=2257,nTerms=729,bigTerms=0,termInstances=5422,uses=0}

suggest that even when a facet only appears in a few thousand docs, it still takes considerable memory?

Is there anything clever I can do to tell it to handle such sparsely used facets in a more memory-friendly way?

Perhaps I should be setting up a bunch of shards? Perhaps small ones dedicated to holding documents with the rare facets, and large ones with the documents without the rare facets?

Lines like

INFO: UnInverted multi-valued field {field=property_manufacturer_facet,memSize=4224,tindexSize=32,time=66,phase1=66,nTerms=0,bigTerms=0,termInstances=0,uses=0}

suggest to me that in the special case of termInstances=0, unused facets don't take up much memory. Would that suggest that I might be able to write a different uninverter that has a more compact representation even for facets that show up a few times? Where might I look to do so?
Re: Many sparse facets?
Jonathan Rochkind wrote: What matters isn't how many documents have a value, so much as how many unique values there are in the field total. If there aren't that many, faceting can be done fairly quickly and fairly efficiently.

Really? Don't these 2 log file lines:

INFO: UnInverted multi-valued field {field=vehicle_vin_facet,memSize=39513151,tindexSize=208256,time=138382,phase1=138356,nTerms=638642,bigTerms=0,termInstances=739169,uses=0}
INFO: UnInverted multi-valued field {field=specialassignyn_facet,memSize=36336696,tindexSize=44,time=1458,phase1=1438,nTerms=5,bigTerms=0,termInstances=138046,uses=0}

suggest that whether I have a facet with a half million unique values or a half dozen, they use roughly the same amount of memory? At first glance they both seem similarly efficient to filter on. Certainly the one with many unique values takes longer to uninvert -- but that's just computer time that's hidden from users, no?

... 50 megs times 80 is still only around 4 gigs, not entirely out of the question to simply supply enough RAM for all those caches.

Yup - that's what I'm doing for now (just moved to a 24 gig RAM machine); but I expect we'll have 10X as many documents, and maybe 2x as many facets by spring. Still not undoable, but I may need to start forecasting RAM budgets.
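For the RAM forecasting mentioned above, one rough approach is to total the memSize values straight out of the log. The sketch below assumes that memSize is reported in bytes (which matches the "50 megs" estimate in the thread) and that each uninverted field logs one such line per warm-up; the function names are made up for illustration.

```python
import re

# Captures the key=value payload of an "UnInverted multi-valued field {...}" line.
UNINVERTED = re.compile(r"UnInverted multi-valued field \{([^}]*)\}")


def parse_uninverted(line):
    """Return the key=value pairs of one log line as a dict, or None if no match."""
    match = UNINVERTED.search(line)
    if match is None:
        return None
    fields = {}
    for pair in match.group(1).split(","):
        key, _, value = pair.partition("=")
        fields[key] = int(value) if value.isdigit() else value
    return fields


def total_mem_size(lines):
    """Sum memSize over all matching lines, keeping only the latest line per field."""
    latest = {}
    for line in lines:
        parsed = parse_uninverted(line)
        if parsed is not None:
            latest[parsed["field"]] = parsed["memSize"]
    return sum(latest.values())


LOG = [
    "INFO: UnInverted multi-valued field {field=vehicle_vin_facet,memSize=39513151,"
    "tindexSize=208256,time=138382,phase1=138356,nTerms=638642,bigTerms=0,"
    "termInstances=739169,uses=0}",
    "INFO: UnInverted multi-valued field {field=specialassignyn_facet,memSize=36336696,"
    "tindexSize=44,time=1458,phase1=1438,nTerms=5,bigTerms=0,"
    "termInstances=138046,uses=0}",
]

if __name__ == "__main__":
    print("%.1f MiB for %d fields" % (total_mem_size(LOG) / 2**20, len(LOG)))
```

Scaling that total by the expected growth in documents and facets would only give a first guess; how memSize actually grows with document count is something to measure rather than assume.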
Re: Many sparse facets?
Jonathan Rochkind wrote: I could certainly be wrong. If you have a facet with a LOT fewer unique values than documents in the query, I'd be curious what happens if you try facet.method=enum. Cool. I'll be trying that later. I'm definitely not an expert, just trying to help figure it out based on what I do know. Thanks! That was great for pointing me in directions I can look further. Why would 'the computer time' be hidden from users? Ah, because the uninverted field is created once (until the next commit), not per query, I guess? Yup - and it appears to be done by the post-commit search warmer; so users keep getting their queries handled by the old index until after all those uninverted fields are done uninverting.
Null pointer exception when mixing highlighter, shards, and q.alt
Short summary:
* Using highlighting, shards, and q.alt together is giving me a null pointer exception.
* Really easy to work around; but since the similar cases without shards work, perhaps this should too.
* If you think it should be fixed, point me in the right direction and I can code up a patch.

With Solr trunk from a few days ago, I'm seeing a NullPointerException [full backtrace below] in HighlightComponent when I use shards, the highlighter, and q.alt together. I think this more minimal example shows the exception. Using the example install from 'ant example' with the example data loaded, this request:

http://localhost:8983/solr/select?q.alt=*:*&hl=on&defType=edismax&shards=localhost:8983/solr

is giving me a null pointer exception in HighlightComponent.finishStage(HighlightComponent.java:158).

If I don't specify shards, like this:

http://localhost:8983/solr/select?q.alt=*:*&hl=on&defType=edismax

it works fine. Also, if I still use shards, but add a harmless q parameter like this:

http://localhost:8983/solr/select?q.alt=*:*&hl=on&defType=edismax&shards=localhost:8983/solr&q=*:*

it works fine too.
Ron

[1] java.lang.NullPointerException
    at org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:158)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
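For anyone wanting to reproduce this, the three requests from the report can be rebuilt from one shared parameter set, which makes it easy to see that they differ only in shards and q. This is just a sketch against the stock example install on localhost:8983; the variable names are mine, and the script only builds the URLs, it does not send them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://localhost:8983/solr/select"

# Parameters shared by all three requests in the report.
COMMON = {"q.alt": "*:*", "hl": "on", "defType": "edismax"}
SHARDS = {"shards": "localhost:8983/solr"}


def select_url(params):
    """Build a /select URL, percent-encoding each parameter value."""
    return BASE + "?" + urlencode(params)


# Reported as raising the NullPointerException in HighlightComponent.finishStage.
npe_case = select_url({**COMMON, **SHARDS})
# Reported as working: no shards parameter.
no_shards = select_url(COMMON)
# Reported as working: shards plus an explicit, harmless q.
with_q = select_url({**COMMON, **SHARDS, "q": "*:*"})

if __name__ == "__main__":
    for label, url in [("npe", npe_case), ("no shards", no_shards), ("with q", with_q)]:
        print(label, url, parse_qs(urlsplit(url).query))
```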
Re: edismax pf2 and ps
Short summary:
* Multiple simultaneous phrase boosts with different ps2 parameters are working very nicely for me on a few million doc QA system.
* I've submitted an updated patch to Jira incorporating feedback from the Jira comments. Will be testing it more this week. https://issues.apache.org/jira/browse/SOLR-2058

On 2010-08-19 Ron Mayer wrote: Chris Hostetter wrote: [Yonik Seeley wrote] : Perhaps fold it into the pf/pf2 syntax? : pf=text~1^2 // proposed syntax... Big +1 to this idea ... ... I added a ticket here: https://issues.apache.org/jira/browse/SOLR-2058 and attached my patch to that ticket.

Just wanted to comment that it has been working extremely well for me, with multiple simultaneous phrase boosts with different slops at the same time. I also cleaned up the patch based on comments in Jira and submitted a newer version.

In particular, I find if I use the following:
* a high boost (500) on pf with a slop of 0
* a moderate boost (50) on pf with a slop of 50
* a moderate boost (50) on pf2 with a slop of 0
* a low boost (10) on pf2 with a slop of 10
it's doing a great job of getting the most relevant document in the #1 spot (thanks to the slop=0 boosts), and a very good job at getting the entire first page of results filled with highly relevant documents (thanks to the shingles and more liberal phrase-slop boosts).
I'm even having some luck with a whole bunch of those clauses like the following that has a variety of phrase slops on a variety of fields:

http://app2.fli:28983/solr/core0/select?pf=source_doc~1^500+text_stem~1^500+source_doc~50^50+text_stem~20^50&defType=edismax&hl.maxAnalyzedChars=50&q.alt=*%3A*&ps=1&qt=fliqs&pf2=text_stem^50+text_stem~10^10+text_unstem~10^10&start=0&q=red+baseball+cap+black+leather+jacket&mm=100%25&debugQuery=on&fl=id,score

which seems to be returning quickly enough on my collection of 4 million documents:

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">287</int>
  <lst name="params">
    <str name="mm">100%</str>
    <str name="pf2">text_stem^50 text_stem~10^10 text_unstem~10^10</str>
    <str name="q.alt">*:*</str>
    <str name="hl.maxAnalyzedChars">50</str>
    <str name="defType">edismax</str>
    <str name="pf">source_doc~1^500 text_stem~1^500 source_doc~50^50 text_stem~20^50</str>
    <str name="debugQuery">on</str>
    <str name="fl">id,score</str>
    <str name="start">0</str>
    <str name="q">red baseball cap black leather jacket</str>
    <str name="qt">fliqs</str>
    <str name="ps">1</str>
  </lst>
</lst>
...
<str name="parsedquery">
+((DisjunctionMaxQuery((text_stem:red^0.5)~0.01)
   DisjunctionMaxQuery((text_stem:basebal^0.5)~0.01)
   DisjunctionMaxQuery((text_stem:cap^0.5)~0.01)
   DisjunctionMaxQuery((text_stem:black^0.5)~0.01)
   DisjunctionMaxQuery((text_stem:leather^0.5)~0.01)
   DisjunctionMaxQuery((text_stem:jacket^0.5)~0.01)
  )~6)
DisjunctionMaxQuery((source_doc:"red baseball cap black leather jacket"~50^50.0)~0.01)
DisjunctionMaxQuery((source_doc:"red baseball cap black leather jacket"~1^500.0 | text_stem:"red basebal cap black leather jacket"~1^500.0)~0.01)
DisjunctionMaxQuery((text_stem:"red basebal cap black leather jacket"~20^50.0)~0.01)
(DisjunctionMaxQuery((text_stem:"red basebal"~1^50.0)~0.01)
 DisjunctionMaxQuery((text_stem:"basebal cap"~1^50.0)~0.01)
 DisjunctionMaxQuery((text_stem:"cap black"~1^50.0)~0.01)
 DisjunctionMaxQuery((text_stem:"black leather"~1^50.0)~0.01)
 DisjunctionMaxQuery((text_stem:"leather jacket"~1^50.0)~0.01)
)
(DisjunctionMaxQuery((text_unstem:"red baseball"~10^10.0 | text_stem:"red basebal"~10^10.0)~0.01)
 DisjunctionMaxQuery((text_unstem:"baseball cap"~10^10.0 | text_stem:"basebal cap"~10^10.0)~0.01)
 DisjunctionMaxQuery((text_unstem:"cap black"~10^10.0 | text_stem:"cap black"~10^10.0)~0.01)
 DisjunctionMaxQuery((text_unstem:"black leather"~10^10.0 | text_stem:"black leather"~10^10.0)~0.01)
 DisjunctionMaxQuery((text_unstem:"leather jacket"~10^10.0 | text_stem:"leather jacket"~10^10.0)~0.01)
)
</str>
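With this many clauses, assembling the pf and pf2 strings by hand gets error-prone. A small helper along these lines can generate them from (field, slop, boost) tuples. It is only a sketch: it assumes the field~slop^boost syntax from the SOLR-2058 patch discussed in this thread, and the helper names are made up. The tuples below are the ones from the request above.

```python
def phrase_field(field, slop=None, boost=None):
    """Format one entry as field, field~slop, field^boost or field~slop^boost."""
    spec = field
    if slop is not None:
        spec += "~%d" % slop
    if boost is not None:
        spec += "^%d" % boost
    return spec


def phrase_param(entries):
    """Join several (field, slop, boost) tuples into one pf/pf2 parameter value."""
    return " ".join(phrase_field(*entry) for entry in entries)


# Whole-query phrase boosts: tight slop with a high boost, loose slop with a lower one.
PF = [
    ("source_doc", 1, 500),
    ("text_stem", 1, 500),
    ("source_doc", 50, 50),
    ("text_stem", 20, 50),
]

# Word-pair (shingle) boosts; slop=None leaves the slop to the ps default.
PF2 = [
    ("text_stem", None, 50),
    ("text_stem", 10, 10),
    ("text_unstem", 10, 10),
]

if __name__ == "__main__":
    print("pf  =", phrase_param(PF))
    print("pf2 =", phrase_param(PF2))
```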
Re: Create a new index while Solr is running
mraible wrote: We're starting to use Solr for our application. The data that we'll be indexing will change often and not accumulate over time. This means that we want to blow away our index and re-create it every hour or so. What's the easier way to do this while Solr is running and not give users a no data found while we're doing it? In other words, keep the existing index in place until the new one is done being created. I searched the docs a bit, but couldn't find the answer I was looking for.

The multi-core feature seems to work pretty well for me in a similar case where I rebuild indexes while a system's still live: http://wiki.apache.org/solr/CoreAdmin In particular, the SWAP core command on that page seems to be what you want.

The one thing I haven't figured out yet is why, while my new index is building and it decides to merge segments, everything (even connecting to the admin page) on the other core is annoyingly slow. Not sure if the machine's just too I/O constrained, or if something else is happening. I guess a common solution is to build the new index on a different machine and use replication to move it over.

Oh - or wouldn't everything just magically work if you do your deletes and adds and make sure you don't commit until all the adds are done?
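To make the SWAP suggestion concrete, the request is a single CoreAdmin call issued once the rebuilt core is fully indexed and committed. The sketch below only builds the URL. The core names "live" and "rebuild" are hypothetical, and the /admin/cores path is an assumption that depends on how the admin path is configured in your solr.xml.

```python
from urllib.parse import urlencode


def swap_url(solr_base, live_core, rebuilt_core):
    """Build a CoreAdmin SWAP request that exchanges the two named cores."""
    query = urlencode({"action": "SWAP", "core": live_core, "other": rebuilt_core})
    return solr_base.rstrip("/") + "/admin/cores?" + query


if __name__ == "__main__":
    # Hypothetical core names: "live" serves users, "rebuild" holds the fresh index.
    print(swap_url("http://localhost:8983/solr", "live", "rebuild"))
```

After the swap, queries sent to the "live" core name are answered from the freshly built index, and the old index sits under the "rebuild" name, ready to be wiped for the next hourly rebuild.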
Re: edismax pf2 and ps
Chris Hostetter wrote:
: Perhaps fold it into the pf/pf2 syntax?
:
: pf=text^2    // current syntax... makes phrases with a boost of 2
: pf=text~1^2  // proposed syntax... makes phrases with a slop of 1 and
:              // a boost of 2
:
: That actually seems pretty natural given the lucene query syntax - an
: actual boosted sloppy phrase query already looks like
: text:"foo bar"~1^2
Big +1 to this idea ... the existing ps param can stick around as the default for any field that doesn't specify its own slop in the pf/pf2/pf3 fields using the ~ syntax.

I think I have a decent first draft of a patch that implements this. Hopefully I'm figuring out the right way to submit patches to this community. I added a ticket here: https://issues.apache.org/jira/browse/SOLR-2058 and attached my patch to that ticket. Any feedback, either on the patch or on how best to submit things to this community, would be appreciated.

This patch seems to happily turn a query like

http://localhost:8983/solr/select?defType=edismax&fl=id,text,score&q=enterprise+search+foobar&ps=5&qf=text&debugQuery=true&pf2=name~0^&pf2=name^12+name~10

into what I believe is the desired parsed query:

+((text:enterpris) (text:search) (text:foobar)) ((name:enterprise search~5^12.0) (name:search foobar~5^12.0)) ((name:enterprise search^.0) (name:search foobar^.0)) ((name:enterprise search~10) (name:search foobar~10))

which looks like it should give a high boost to docs where both words appear right next to each other, but still substantial boosts to docs where the pairs of words are a few words apart. I'll start testing it with real data today.

One question:
* Where might I find documentation and/or test cases for the pf2, pf3 parameters? A quick grep of the sources from the tree I got from git://git.apache.org/lucene-solr.git didn't reveal any obvious docs or tests with those parameters.
$ git grep pf2 | grep -v 'Binary file'
solr/src/java/org/apache/solr/search/ExtendedDismaxQParserPlugin.java: U.parseFieldBoostsAndSlop(solrParams.getParams("pf2"));

Am I on the right track?

Ron
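To spell out the behaviour I'm aiming for, here is a rough Python model of it: each pf2 entry is split into field, slop and boost, with ps as the fallback slop, and the query words are turned into word pairs. This is not the patch's Java code, just an illustration; the function names are made up, and a boost of 1.0 when none is given is an assumption.

```python
def parse_spec(spec, default_slop):
    """Split 'field~slop^boost' into (field, slop, boost).

    A missing slop falls back to default_slop (the ps parameter);
    a missing boost is assumed to be 1.0.
    """
    boost = 1.0
    if "^" in spec:
        spec, boost_text = spec.split("^", 1)
        boost = float(boost_text)
    slop = default_slop
    if "~" in spec:
        spec, slop_text = spec.split("~", 1)
        slop = int(slop_text)
    return spec, slop, boost


def shingles(words, size=2):
    """Return consecutive word groups: size=2 models pf2, size=3 models pf3."""
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


if __name__ == "__main__":
    ps = 5
    for spec in ["name^12", "name~10"]:
        field, slop, boost = parse_spec(spec, ps)
        for pair in shingles("enterprise search foobar".split()):
            print('%s:"%s"~%d^%s' % (field, pair, slop, boost))
```

With ps=5, name^12 picks up the default slop of 5 and name~10 keeps its own slop, which matches the ~5^12.0 and ~10 clauses in the parsed query above.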
Re: edismax pf2 and ps
Jayendra Patil wrote: We pretty much had the same issue, ended up customizing the ExtendedDismax code. In your case its just a change of a single line addShingledPhraseQueries(query, normalClauses, phraseFields2, 2, tiebreaker, pslop); to addShingledPhraseQueries(query, normalClauses, phraseFields2, 2, tiebreaker, 0); Thanks!!! Indeed it seems to be providing better results for me (at first glance on a test system). Is there any way of lobbying to make this change in the official releases? On Thu, Aug 12, 2010 at 1:04 PM, Ron Mayer r...@0ape.com wrote: Short summary: Is there any way I can specify that I want a lot of phrase slop for the pf parameter, but none at all for the pf2 parameter? I find the 'pf' parameter with a pretty large 'ps' to do a very nice job for providing a modest boost to many documents that are quite well related to many queries in my system. In contrast, I find the 'pf2' parameter with zero 'ps' does extremely well at providing a high boost to documents that are often exactly what someone's searching for. Is there any way I can get both effects? Edismax's pf2 parameter is really nice for boosting exact [sub]phrases in queries like 'black jacket red cap white shoes'. But as soon as even a little phrase slop (ps) is added, it seems like it starts boosting documents with red jackets and white caps just as much as those with black jackets and red caps. My gut feeling is that if I could have pf with a large phrase slop and the pf2 with zero phrase slop, it'd give me better overall results than any single phrase slop setting that gets applied to both. Is there any good way for me to test that? Thanks, Ron
Can I tell Solr to merge *oldest* rather than smallest segments - if so I think I wouldn't need optimize anymore.
Short summary:
* If I could make Solr merge the oldest segments (or the ones with the most deleted docs) rather than the smallest segments, I think I'd almost never need optimize.
* Can I tell Solr to do this? Or if not, can someone point me in the right direction regarding where I might patch it to try this myself?

I have a system where documents are refreshed and/or expired pretty much in a FIFO manner. In particular, no document in the system can live for over 1 month. Without frequent optimizes, ISTM my indexes tend to get bloated with mostly deleted content. I've included an ls -l listing below, showing that the largest segments in my index are all from July. A query of timestamp:([1999-01-01T00:00:00Z TO 2010-08-01T23:59:59Z]) returns no documents, so it appears to me the first 2 segments are entirely filled with deleted documents.

I imagine this is not too uncommon a situation -- for example a web-crawler that periodically updates web pages that contain some dynamic content. Perhaps another good criterion would be to merge the segments with the largest number of deleted documents. In my case it'd be the same; but I could imagine non-FIFO update-heavy systems where that would work better.
$ ls -lrt *.fdt
-rw-rw-r-- 1 ramayer ramayer 291490823897 Jul 20 21:34 _u63.fdt
-rw-rw-r-- 1 ramayer ramayer  78251326159 Jul 29 18:15 _xkh.fdt
-rw-rw-r-- 1 ramayer ramayer  69295141685 Aug  8 01:29 _10f5.fdt
-rw-rw-r-- 1 ramayer ramayer   5406369697 Aug 10 21:14 _13fv.fdt
-rw-rw-r-- 1 ramayer ramayer  66210508029 Aug 10 21:44 _13g1.fdt
-rw-rw-r-- 1 ramayer ramayer   2001873014 Aug 10 23:05 _13io.fdt
-rw-rw-r-- 1 ramayer ramayer   1578531820 Aug 11 14:10 _13m8.fdt
-rw-rw-r-- 1 ramayer ramayer   2254917604 Aug 12 03:49 _13p3.fdt
-rw-rw-r-- 1 ramayer ramayer   2890967852 Aug 12 06:49 _13s6.fdt
-rw-rw-r-- 1 ramayer ramayer   2820285238 Aug 12 09:49 _13v9.fdt
-rw-rw-r-- 1 ramayer ramayer   2905550377 Aug 12 12:52 _13yc.fdt
-rw-rw-r-- 1 ramayer ramayer   2776837514 Aug 12 15:54 _141f.fdt
-rw-rw-r-- 1 ramayer ramayer    259698816 Aug 12 16:15 _141p.fdt
-rw-rw-r-- 1 ramayer ramayer    290083173 Aug 12 16:34 _1420.fdt
-rw-rw-r-- 1 ramayer ramayer    279500106 Aug 12 16:54 _142b.fdt
-rw-rw-r-- 1 ramayer ramayer    277156197 Aug 12 17:17 _142m.fdt
-rw-rw-r-- 1 ramayer ramayer     91360010 Aug 13 00:27 _142x.fdt
-rw-rw-r-- 1 ramayer ramayer      7351514 Aug 13 00:37 _142y.fdt
-rw-rw-r-- 1 ramayer ramayer         7286 Aug 13 00:38 _142z.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 01:07 _1430.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 02:07 _1431.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 03:07 _1432.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 04:07 _1433.fdt
-rw-rw-r-- 1 ramayer ramayer      2388369 Aug 13 04:35 _1434.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 05:07 _1435.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 06:07 _1436.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 07:07 _1437.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 08:07 _1438.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 09:07 _1439.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 10:07 _143a.fdt
-rw-rw-r-- 1 ramayer ramayer       198581 Aug 13 11:04 _143b.fdt
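To illustrate the selection rule I have in mind (this is not Lucene's MergePolicy API, just the criterion written out), the sketch below ranks segments by the fraction of their documents that are deleted and picks the worst ones. The segment names are taken from the listing above, but the document counts are made-up placeholders, since I only know that the first two segments appear to be entirely deleted.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    max_doc: int   # documents ever written to the segment
    deleted: int   # how many of those are now marked deleted

    @property
    def deleted_fraction(self):
        """Share of the segment that is dead weight (0.0 for an empty segment)."""
        return self.deleted / self.max_doc if self.max_doc else 0.0


def pick_for_merge(segments, max_segments=2, min_fraction=0.5):
    """Pick up to max_segments segments whose deleted fraction is at least min_fraction,
    most-deleted first."""
    candidates = [s for s in segments if s.deleted_fraction >= min_fraction]
    candidates.sort(key=lambda s: s.deleted_fraction, reverse=True)
    return candidates[:max_segments]


# Placeholder counts: only the names come from the real index listing.
SEGMENTS = [
    Segment("_u63", 1000, 1000),
    Segment("_xkh", 500, 500),
    Segment("_10f5", 400, 100),
    Segment("_13g1", 300, 30),
]

if __name__ == "__main__":
    for seg in pick_for_merge(SEGMENTS):
        print(seg.name, "%.0f%% deleted" % (100 * seg.deleted_fraction))
```

In a strictly FIFO system this picks the same segments as "merge the oldest first"; the two rules only diverge when updates hit documents of arbitrary age.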
Re: edismax pf2 and ps
Yonik Seeley wrote: Perhaps a ps2 parameter to match pf2?

That might be nice. I could try to put together such a patch if people were interested. One more thing I've been contemplating is whether my results might be even better if I had a couple of different pf2s with different ps's at the same time. In particular, one with ps=0 to put a high boost on docs that have the right ordering of words, for example ensuring that: red hat black jacket boosts only red hats and not black hats. And another pf2 with a more modest boost with ps=5 or so, so that the query above also boosts docs with red baseball hat. Not sure of a good way to express that in config options, tho.

-Yonik http://www.lucidimagination.com

On Fri, Aug 13, 2010 at 2:11 PM, Ron Mayer r...@0ape.com wrote: Jayendra Patil wrote: We pretty much had the same issue, ended up customizing the ExtendedDismax code. In your case its just a change of a single line addShingledPhraseQueries(query, normalClauses, phraseFields2, 2, tiebreaker, pslop); to addShingledPhraseQueries(query, normalClauses, phraseFields2, 2, tiebreaker, 0); Thanks!!! Indeed it seems to be providing better results for me (at first glance on a test system). Is there any way of lobbying to make this change in the official releases? On Thu, Aug 12, 2010 at 1:04 PM, Ron Mayer r...@0ape.com wrote: Short summary: Is there any way I can specify that I want a lot of phrase slop for the pf parameter, but none at all for the pf2 parameter? I find the 'pf' parameter with a pretty large 'ps' to do a very nice job for providing a modest boost to many documents that are quite well related to many queries in my system. In contrast, I find the 'pf2' parameter with zero 'ps' does extremely well at providing a high boost to documents that are often exactly what someone's searching for. Is there any way I can get both effects? Edismax's pf2 parameter is really nice for boosting exact [sub]phrases in queries like 'black jacket red cap white shoes'.
But as soon as even a little phrase slop (ps) is added, it seems like it starts boosting documents with red jackets and white caps just as much as those with black jackets and red caps. My gut feeling is that if I could have pf with a large phrase slop and the pf2 with zero phrase slop, it'd give me better overall results than any single phrase slop setting that gets applied to both. Is there any good way for me to test that? Thanks, Ron
edismax pf2 and ps
Short summary: Is there any way I can specify that I want a lot of phrase slop for the pf parameter, but none at all for the pf2 parameter? I find the 'pf' parameter with a pretty large 'ps' to do a very nice job for providing a modest boost to many documents that are quite well related to many queries in my system. In contrast, I find the 'pf2' parameter with zero 'ps' does extremely well at providing a high boost to documents that are often exactly what someone's searching for. Is there any way I can get both effects? Edismax's pf2 parameter is really nice for boosting exact phrases in queries like 'black jacket red cap white shoes'. But as soon as even a little phrase slop (ps) is added, it seems like it starts boosting documents with red jackets and white caps just as much as those with black jackets and red caps. My gut feeling is that if I could have pf with a large phrase slop and the pf2 with zero phrase slop, it'd give me better overall results than any single phrase slop setting that gets applied to both. Is there any good way for me to test that? Thanks, Ron
Re: If you could have one feature in Solr...
Erik Hatcher wrote: Ron - I think SOLR-792 meets the need you describe. What do you think? It's tree faceting, allowing you to facet down 2 levels deep arbitrarily on any two fields. Ideally we'd enhance it to be of arbitrary depth too.

Nice! It certainly handles my main use case. There are still a couple of cases that would benefit from a more flexible function returning data along with the facets. In this app, each document represents a crime report describing, for example, an auto theft. Those documents have fields such as the make and model of the car stolen. In some cases the users would like to see the number of incidents involving cars of those types (which I think is what Solr returns easily). Sometimes, instead of the number of documents, they'd rather see the number of cars involved - for example, if a single theft from a dealership involved multiple cars. And other times, they'd rather see the value of the cars returned. In SQL I can do a select sum(value) from incidents join vehicles..., and haven't (yet) found anything similar for facets in Solr. Then again, maybe I should be using the database for that part.

On Feb 24, 2010, at 6:40 PM, Ron Mayer wrote: Make FORD GM Honda TOYOTA [] MONDAY 17 23 4 2 TUESDAY11 9174 5 WEDNESDAY 3 69 1
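To pin down the three numbers I mean, here is how they differ when computed in the application over documents already fetched. This is a sketch over made-up data: the "vehicles", "make" and "value" field names are hypothetical, not my real schema.

```python
from collections import defaultdict


def summarize_by_make(incidents):
    """For each make, count incidents and vehicles and sum the vehicle values."""
    summary = defaultdict(lambda: {"incidents": 0, "vehicles": 0, "value": 0})
    for incident in incidents:
        makes_seen = set()
        for vehicle in incident["vehicles"]:
            row = summary[vehicle["make"]]
            row["vehicles"] += 1
            row["value"] += vehicle["value"]
            makes_seen.add(vehicle["make"])
        # An incident counts once per make, however many cars of that make it involved.
        for make in makes_seen:
            summary[make]["incidents"] += 1
    return dict(summary)


# Made-up incidents: the first one models a dealership theft involving two cars.
INCIDENTS = [
    {"vehicles": [{"make": "FORD", "value": 12000}, {"make": "FORD", "value": 8000}]},
    {"vehicles": [{"make": "TOYOTA", "value": 15000}]},
    {"vehicles": [{"make": "FORD", "value": 5000}, {"make": "TOYOTA", "value": 9000}]},
]

if __name__ == "__main__":
    for make, row in sorted(summarize_by_make(INCIDENTS).items()):
        print(make, row)
```

The "incidents" column is what a plain facet count gives; "vehicles" and "value" are the ones that need per-document data, which is why the SQL join looks attractive.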
Re: If you could have one feature in Solr...
Grant Ingersoll wrote: What would it be?

* Run a MapReduce-like job on all docs matching the results of a search?

I'm currently working on an app where I hope to be able to do a query (hopefully using Solr) and generate a map where every state (or county or zip-code or school district or police beat) is colored based on some attribute derived from some fields in the documents. Interestingly, it seems pleasantly easy to do if I'm just basing it on the count of documents, since I can set up states, etc. as facets. But it'd be neat if, instead of just getting the count from the facets, I could run more arbitrary math on the documents without having to suck them into the application.

Another use for this is that I'd like to make a quicker way of drilling down on my documents than going one facet at a time, by showing the user a 2-dimensional table that combines 2 facets. For example, showing a table like this on the page:

Make FORD GM Honda TOYOTA [] MONDAY 17 23 4 2 TUESDAY11 9174 5 WEDNESDAY 3 69 1 ...

and when the user clicks the 174 it automatically adds the Vehicle Make = Toyota and Day of week = Tuesday facets to the query.

(I'm a Solr newbie, so my apologies if this already exists, or if it's just a bad idea, or if I should just be using another tool for that, possibly in conjunction with Solr.)
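As a stopgap, the 2-dimensional table can be built in the application from the matching documents, and a click on a cell can be mapped to two filter queries. The sketch below assumes hypothetical field names (day_of_week, vehicle_make) and uses made-up documents; the filter strings are meant to be passed as fq parameters.

```python
from collections import Counter


def cross_tab(docs, row_field, col_field):
    """Count documents for every (row value, column value) pair."""
    return Counter((doc[row_field], doc[col_field]) for doc in docs)


def drill_down(row_field, row_value, col_field, col_value):
    """Return the two filter queries to add when a table cell is clicked."""
    return ['%s:"%s"' % (row_field, row_value), '%s:"%s"' % (col_field, col_value)]


# Made-up documents; the field names are hypothetical.
DOCS = [
    {"day_of_week": "TUESDAY", "vehicle_make": "TOYOTA"},
    {"day_of_week": "TUESDAY", "vehicle_make": "TOYOTA"},
    {"day_of_week": "TUESDAY", "vehicle_make": "FORD"},
    {"day_of_week": "MONDAY", "vehicle_make": "FORD"},
]

if __name__ == "__main__":
    table = cross_tab(DOCS, "day_of_week", "vehicle_make")
    for (day, make), count in sorted(table.items()):
        print(day, make, count)
    print(drill_down("day_of_week", "TUESDAY", "vehicle_make", "TOYOTA"))
```

This pulls every matching document into the application, which is exactly the cost I'd like to avoid on large result sets, so it is only practical for small ones.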