Re: Standard vs. DisMaxQueryHandler
: Is the main difference between the StandardQueryHandler and : DisMaxQueryHandler the supported query syntax (and different query : parser used in each of them), and the fact that the latter creates : DisjunctionMaxQueries, while the former just creates vanilla : BooleanQueries? Are there any other differences? the main difference is the query string, yes: Standard expects to get Lucene QueryParser formatted queries, while DisMax expects to get raw user input strings ... Standard builds queries (whether they be prefix or boolean or wildcard) using the QueryParser as is, while DisMax does a cross product of the user input across many different fields and builds up a very specific query structure -- which can then be augmented with additional query clauses like the bq boost query and the bf boost function. there's no reason the StandardRequestHandler can't construct DisMaxQueries (once QueryParser has some syntax for them) and DisMaxRequestHandler does (at the outermost level) generate a BooleanQuery (with a custom minShouldMatch value set on it) but the main difference is really the use case: if you want the client to specify the exact query structure that they want, use StandardRequestHandler. if you want the client to just propagate the raw search string typed by the user, without any structure or escaping, and get the nice complex DisMax style query across the configured fields, the DisMax handler was written to fill that niche. (load up the example configs, and take a look at the query toString from this url to see what i mean about the complex structure... http://localhost:8983/solr/select/?qt=dismax&q=how+now+brown+cow&debugQuery=1 ) -Hoss
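The cross-product structure Hoss describes can be sketched in plain Java. This is only an illustration of the *shape* of the query dismax builds (one disjunction-max clause per user term, tried across every configured field), not Solr's actual code: the real handler builds Lucene Query objects rather than strings, and the field names and boosts below are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DisMaxSketch {
    // Crosses each raw user term against every configured field; the "|"
    // marks a disjunction-max group, where the best-matching field scores.
    static String buildQuery(String rawInput, Map<String, Float> fieldBoosts) {
        StringBuilder sb = new StringBuilder("+(");
        String[] terms = rawInput.trim().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            if (i > 0) sb.append(' ');
            List<String> clauses = new ArrayList<>();
            for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
                clauses.add(e.getKey() + ":" + terms[i] + "^" + e.getValue());
            }
            sb.append('(').append(String.join(" | ", clauses)).append(')');
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("title", 2.0f); // illustrative fields/boosts only
        boosts.put("body", 1.0f);
        System.out.println(buildQuery("brown cow", boosts));
    }
}
```

Comparing this output against the query toString from the debugQuery URL above shows the same nesting: a top-level boolean of per-term maximum-over-fields groups.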
Default XML Output Schema
Perhaps a silly question, but I'm wondering if anyone can tell me why solr outputs XML like this: <doc> <int name="id">201038</int> <int name="siteId">31</int> <date name="modified">2006-09-15T21:36:39.000Z</date> </doc> rather than like this: <doc> <id type="int">201038</id> <siteId type="int">31</siteId> <modified type="date">2006-09-15T21:36:39.000Z</modified> </doc> A front-end PHP developer I know is having trouble parsing the default Solr output because of that format and mentioned it would be much easier in the latter format... so I was curious if there was a reason it is the way it is. -Sangraal
Re: Default XML Output Schema
On 9/21/06, sangraal aiken [EMAIL PROTECTED] wrote: Perhaps a silly question, but I'm wondering if anyone can tell me why solr outputs XML like this: During the initial development of Solr (2004), I remember throwing up both options, and most developers preferred to have a limited number of well defined tags. It allows you to have rather arbitrary field names, which you couldn't have if you used the field name as the tag. It also allows consistency with custom data. For example, here is the representation of an array of integers: <arr><int>1</int><int>2</int></arr> If field names were used as tags, we would have to either make up a dummy name, or we wouldn't be able to use the same style. <doc> <int name="id">201038</int> <int name="siteId">31</int> <date name="modified">2006-09-15T21:36:39.000Z</date> </doc> rather than like this: <doc> <id type="int">201038</id> <siteId type="int">31</siteId> <modified type="date">2006-09-15T21:36:39.000Z</modified> </doc> A front-end PHP developer I know is having trouble parsing the default Solr output because of that format and mentioned it would be much easier in the latter format... so I was curious if there was a reason it is the way it is. There are a number of options for you. You could write your own QueryResponseWriter to output XML just as you like it, or use an XSLT stylesheet in conjunction with http://issues.apache.org/jira/browse/SOLR-49 or use another format such as JSON. -Yonik
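One practical upside of the fixed tag set is that a client can parse any response without knowing field names in advance. A small sketch with stock JAXP (plain Java standing in for whatever the PHP side would do; the sample document is the one from this thread):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class NamedListParser {
    // One generic pass over <doc> children: the tag name carries the type,
    // the "name" attribute carries the field name, the text is the value.
    static Map<String, String> parseDoc(String xml) throws Exception {
        Element doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        Map<String, String> fields = new LinkedHashMap<>();
        NodeList children = doc.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                fields.put(((Element) n).getAttribute("name"), n.getTextContent());
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><int name=\"id\">201038</int>"
                   + "<int name=\"siteId\">31</int>"
                   + "<date name=\"modified\">2006-09-15T21:36:39.000Z</date></doc>";
        System.out.println(parseDoc(xml));
    }
}
```

With field-names-as-tags, the same loop would need a schema (or a dummy element name for arrays) to know what it was looking at.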
Extending Solr's Admin functionality
Hello, I may need to add functionality to Solr's admin pages. The functionality that I'm looking to add is the ability to trigger certain indexing functions and monitor their progress. I'm wondering if people have thoughts about the best way to do this. Here are my initial ideas: 1. Add additional admin screens/JSPs, make them call custom classes that trigger indexing (e.g. go to a DB, retrieve some data, index it, maybe optimize when done), have that execute in a separate thread, and have these classes call Solr via custom HTTP requests that report progress, so that this progress/status can be viewed through another admin page for monitoring of this stuff. 2. Forget about triggering things from the UI. Write generic/command-line-type classes, have them invoked independently of Solr, but still have them call Solr via custom HTTP requests that report progress, so that this progress/status can be viewed through another admin page for monitoring of this stuff. I like 1, because everything is contained in Solr, but I fear it may be hard to maintain this extended version with Solr, unless the stuff I write ends up being generic enough that I can contribute it back. I guess 2 would have some of these problems because I'd still need an admin page for monitoring. Any thoughts? Has anyone already envisioned a good way to extend Solr's functionality with custom admin screens? Thanks, Otis
Re: relational design in solr?
While it's certainly possible to join the results of multiple indexes, i would do so only when absolutely necessary -- in my experience the only time i've found that it makes sense, is when one aspect of the data changes extremely rapidly compared to everything else, making complex reindexing a pain, but reindexing just the changed data in its own index is a lot more feasible. As a rule of thumb, when building paginated style search applications, I would advise people to try and flatten their index as much as possible, so that the application can do one user query (based on the user's input) to get a single page of results, and then use the uniqueKeys from that page of results to look up ancillary data from any other indexes (or databases) that you need -- the key being that all the data you want to search on, and all the data you need to sort on, is in the index, but other data you need to return to the user can come from other sources. If you find yourself wanting to join two indexes for the purposes of matching or sorting, the amount of work you wind up doing tends to be prohibitive on really large indexes -- and if your indexes aren't that large, it would probably just be easier to put everything in one index and rebuild it frequently. : I am trying to integrate solr search results with results from a rdbms : query. It's working ok, but fairly complicated due to large size of : the results from the database, and many different sort requirements. : : I know that solr/lucene was not designed to intelligently handle : multiple document types in the same collection, i.e. provide join : features, but I'm wondering if anyone on this list has any thoughts on : how to do it in lucene, and how it might be integrated into a custom : solr deployment. I can't see going back to vanilla lucene after solr! : : My basic idea is to add an objType field that would be used to define a : table. 
There would be one main objType, any related objTypes would : have a field pointing back to the main objs via id, like a foreign key. : : I'd run multiple parallel searches and merge the results based on : foreign keys, either using a Filter or just using custom code. I'm : anticipating that iterating through the results to retrieve the foreign : key values will be too slow. : : Our data is highly textual, temporal and spatial, which pretty much : correspond to the 3 tables I would have. I can de-normalize a lot of : the data, but the combination of times, locations and textual : representations would be way too large to fully flatten. : : I'm about to start experimenting with different strategies, and I would : appreciate any insight anyone can provide. Would the faceting code help : here somehow? -Hoss
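The flatten-then-lookup pattern Hoss recommends can be sketched with plain maps standing in for the Solr index and the external store (all names and data here are made up for illustration; in practice step 1 is one paginated Solr query over the flattened index):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PageThenLookup {
    // Returns uniqueKeys of indexed docs matching the term. The map stands
    // in for the flattened index that holds all searchable/sortable data.
    static List<String> searchPage(Map<String, String> index, String term) {
        List<String> page = new ArrayList<>();
        for (Map.Entry<String, String> e : index.entrySet())
            if (e.getValue().contains(term)) page.add(e.getKey());
        return page;
    }

    public static void main(String[] args) {
        Map<String, String> index = new LinkedHashMap<>();
        index.put("doc1", "brown cow");
        index.put("doc2", "brown fox");
        index.put("doc3", "white cow");
        // Ancillary store: display-only data that never needs searching.
        Map<String, String> store = new HashMap<>();
        store.put("doc1", "thumb1.jpg");
        store.put("doc2", "thumb2.jpg");
        store.put("doc3", "thumb3.jpg");

        // Step 1: one query yields a page of keys.
        // Step 2: those keys fetch the extra data to return to the user.
        for (String key : searchPage(index, "cow"))
            System.out.println(key + " -> " + store.get(key));
    }
}
```

The point of the pattern: the join happens only on the handful of keys in the current page, never across the full result sets.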
http error
I'm getting the following error when I try and hit the admin console: HTTP ERROR: 500 dr01142: dr01142 RequestURI=/solr/admin/stats.jsp Powered by Jetty:// has anyone seen this error before? The queries to this server seem to work just fine, only the admin console is not working. -- Jeff McCormick [EMAIL PROTECTED]
Re: wana use CJKAnalyzer
On 9/21/06, Chris Hostetter [EMAIL PROTECTED] wrote: : i just wanna say: no your help,maybe i will give up.thk u again. : : http://www.flickr.com/photos/[EMAIL PROTECTED]/248815068/ : thk Hoss,Nick Snels,Koji,Mike and everybody who helped me and wanna help : me.. : : i can use solr with Chinese Word. I'm sorry, i'm really confused now ... it seems like you got things working, but you also say maybe i will give up ... ? I read that as without your help, maybe I would have given up. -Yonik
Re: http error
This error was caused by my machine's hostname being changed by DHCP and it not resolving to localhost! Apparently for JSPs, Jetty requires some kind of hostname resolution, so if it won't resolve, you get a nice HTTP 500 error with this rather vague error message. cheers On Thursday 21 September 2006 2:16 pm, Yonik Seeley wrote: On 9/21/06, Jeff McCormick [EMAIL PROTECTED] wrote: I'm getting the following error when I try and hit the admin console: HTTP ERROR: 500 dr01142: dr01142 RequestURI=/solr/admin/stats.jsp Powered by Jetty:// has anyone seen this error before? The queries to this server seem to work just fine, only the admin console is not working. I haven't seen that problem. If you are using the bundled version of Jetty, try making sure that the JVM you are starting it with is from a JDK and not a JRE (javac is needed to compile the JSPs). -Yonik -- Jeff McCormick Rackspace x4596
Reloading solrconfig.xml
Hi, What's the best way to dynamically change solrconfig.xml and have the changes take effect? It looks like one could just regenerate the file and call SolrConfig.initConfig(String file). Is that the proper/best way to do it? Thanks, Otis
Re: Reloading solrconfig.xml
On 9/21/06, Otis Gospodnetic [EMAIL PROTECTED] wrote: What's the best way to dynamically change solrconfig.xml and have the changes take effect? Everything would need to be designed for that, and it's currently not. You might be able to reload the config, but all the classes that looked at the config and configured themselves would need to be re-initialized as well. At CNET, we are always in a load-balanced environment for scalability and HA. In that environment, you simply change the config and bounce the server, letting the remaining servers handle requests. -Yonik
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Michael Imbeault [EMAIL PROTECTED] wrote: It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has 400 000. I don't think this will ever yield good performance, so i might only do journal_name facets. Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html -Yonik
Re: Default XML Output Schema
Thanks for the great explanation Yonik, I passed it on to my colleagues for reference... I knew there was a good reason. -Sangraal On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote: During the initial development of Solr (2004), I remember throwing up both options, and most developers preferred to have a limited number of well defined tags. [...]
Re: Reloading solrconfig.xml
Thanks, that's actually simpler and it will work for me. Since I'm thinking of only changing mergeFactor and friends on the fly, I suppose I'd only need to modify the Master's solrconfig.xml. Otis - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: solr-user@lucene.apache.org; Otis Gospodnetic [EMAIL PROTECTED] Sent: Thursday, September 21, 2006 4:08:58 PM Subject: Re: Reloading solrconfig.xml Everything would need to be designed for that, and it's currently not. [...]
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Michael Imbeault [EMAIL PROTECTED] wrote: Btw, Any plans for a facets cache? Maybe a partial one (like caching top terms to implement some other optimizations). My general philosophy on caching in Solr has been to cache things the client can't: elemental things, or *parts* of requests to make many different requests faster (most bang-for-the-buck). Caching complete requests/responses is generally less useful since it requires even more memory, has a worse hit ratio, and can be done anyway by the client or a separate process like squid. -Yonik
Re: Reloading solrconfig.xml
On 9/21/06, Otis Gospodnetic [EMAIL PROTECTED] wrote: Thanks, that's actually simpler and it will work for me. Since I'm thinking of only changing mergeFactor and friends on the fly, I suppose I'd only need to modify Master's solrconfig.xml. Is this for testing or something? I could think of use cases where it might make sense to somehow allow changing mergeFactor in add requests (complete index builds vs incremental adds, etc). -Yonik
Fixed first hits - custom RequestHandler?
Hello, I have a situation where I want certain documents to appear at the top of the hit list for certain searches, regardless of their score. One can think of it as the ads right on top of Google's search results (but I'm not dealing with ads). Example: If I'm searching books in a bookstore, and a person is searching for lucene, the owner of the bookstore may want to promote the recently published Lucene in Action instead of some other book about Lucene, so he wants any search for lucene or java search to put the link to Lucene in Action on top. Is there a good way to accomplish this in Solr? My initial thoughts are that it would be best to have an external store, maybe even a Lucene index. This store would host the data to display on top of hits, as well as keywords/phrases that would have to match user's search terms. A custom RequestHandler would then perform a regular search (a la any of the existing RequestHandlers), plus pull the data from this side store, and stick those in the response. Is this a good candidate for a custom RequestHandler? Thanks, Otis
Re: Re: Default XML Output Schema
This structure was off-putting to me at first too, using ColdFusion. However, I was able to create a function that dynamically creates a query recordset for both facets and search results and will accommodate new/additional fields at any time. If I can do it, any reasonable programmer can handle it. On 9/21/06, sangraal aiken [EMAIL PROTECTED] wrote: Thanks for the great explanation Yonik, I passed it on to my colleagues for reference... I knew there was a good reason. -Sangraal [...]
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote: Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): OK, the optimization has been checked in. You can checkout from svn and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT). I'd be interested in hearing your results with it. The first facet request on a field will take longer than subsequent ones because the FieldCache entry is loaded on demand. You can use a firstSearcher/newSearcher hook in solrconfig.xml to send a facet request so that a real user would never see this slower query. -Yonik
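The firstSearcher/newSearcher hook Yonik mentions is a listener in solrconfig.xml that fires a warming query whenever a new searcher is opened. A sketch along these lines (the facet field is the one from this thread; the exact query params should be whichever request you want pre-warmed):

```xml
<!-- In solrconfig.xml: send a facet request on each new searcher so the
     FieldCache entry for first_author is loaded before real users hit it. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">solr</str>
      <str name="facet">true</str>
      <str name="facet.field">first_author</str>
    </lst>
  </arr>
</listener>
```

A matching `firstSearcher` listener covers the very first searcher after startup, which has no previous searcher to warm from.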
Re: Fixed first hits - custom RequestHandler?
Otis, I'm curious as to what you find out here. I'm looking at setting up a second Solr instance to handle keyword advertising and the first instance to handle the site search for our newspaper website. Never thought of your question. Thanks, Tim On 9/21/06, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hello, I have a situation where I want certain documents to appear at the top of the hit list for certain searches, regardless of their score. [...]
Re: Fixed first hits - custom RequestHandler?
On 9/21/06, Otis Gospodnetic [EMAIL PROTECTED] wrote: I have a situation where I want certain documents to appear at the top of the hit list for certain searches, regardless of their score. One can think of it as the ads right on top of Google's search results (but I'm not dealing with ads). You could make anything with an isSpecial boolean field appear first: search_field:java; score desc, special desc The special field could even be an int field so you could control the order that the special docs appeared. You could also do something with boosting: +(search_terms:java) special:true^100 If you have special search terms you want to associate with a doc, you can have another field for that and boost it highly... that would give you a measure of relevancy among special documents: normal_search_field:java special_search_field:java^100 Is this a good candidate for a custom RequestHandler? Hopefully all the tools are already there to do this w/o extra code. -Yonik
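As a sketch, the boosted query plus a sort can be assembled into a request URL for the example server. The helper below is made up for illustration (host/port are from the example configs, and it uses the old semicolon sort syntax shown in Yonik's examples); the point is just that the raw query must be URL-encoded before it goes on the querystring:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class BoostUrl {
    // Appends the sort spec with the "; " separator and URL-encodes the
    // whole q value (the +, parens, :, and ^ all need escaping).
    static String buildUrl(String query, String sort) {
        String q = URLEncoder.encode(query + "; " + sort, StandardCharsets.UTF_8);
        return "http://localhost:8983/solr/select/?q=" + q;
    }

    public static void main(String[] args) {
        // Boolean boost: special docs score higher but stay in the normal
        // paginated flow of results.
        System.out.println(buildUrl("+(search_field:java) special:true^100",
                                    "special desc, score desc"));
    }
}
```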
Re: Fixed first hits - custom RequestHandler?
On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote: You could make anything with an isSpecial boolean field appear first: search_field:java; score desc, special desc Oops, that should be search_field:java; special desc, score desc score desc should be the secondary sort, or whatever you normally want to sort by. -Yonik
dismax and facets with constraints
I'm experimenting with dismax to do faceted browsing, and when I add a constraint with dismax on that facet, I no longer get the entire facet-count. i.e. q=blah&qt=dismax&fq=type_id:1&hl=true&hl.fl=title+summary&hl.snippets=3&facet=true&facet.limit=-1&facet.zeros=false&facet.field=type_id <lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="type_id"> <int name="2">95</int> </lst> </lst> </lst> I understand why this is happening, but is there another way to add a constraint via querystring (instead of 'fq=type_id:1') and still get the full facet-counts list? I can do it with standardrequest, but then it doesn't appear like I can sort the results.
RE: dismax and facets with constraints
Just to clarify on this point, I am using highlighting in standardquery. When I add a constraint and sort by a field, the highlighting function no longer works. Possible bug or user error? I can do it with standardrequest, but then it doesn't appear like I can sort the results.
RE: dismax and facets with constraints
: Just to clarify on this point, I am using highlighting in standardquery. : When I add a constraint and sort by a field, the highlighting function no : longer works. Possible bug or user error? : I can do it with standardrequest, but then it doesn't appear like I can sort : the results. this sounds like it's unrelated to the facet counts issue ... but i'm having trouble following what you mean, can you give us an example URL and the output that you are getting from it (preferably using the example schema/docs -- but we might be able to help even if it's your own custom schema/data) -Hoss
Re: Fixed first hits - custom RequestHandler?
: On 9/21/06, Otis Gospodnetic [EMAIL PROTECTED] wrote: : I have a situation where I want certain documents to appear at the top : of the hit list for certain searches, regardless of their score. One : can think of it as the ads right on top of Google's search results (but : I'm not dealing with ads). the kind of approach Yonik described works well when you really want the boosted documents (ie: ads, promoted products, etc) to be included in the main paginated flow of results, regardless of how many there are. if you want them to be broken out (like the ads google shows in the right nav of their pages) so that they aren't affected by pagination or sorting changes; or if you want a limited number to appear (ie: bubble the 3 highest scoring promoted products up to the top, but leave the rest of the promoted products where they are in the normally sorted list) then i don't know any way around this except executing two searches. I've typically done it by making two Solr requests from the client, but you could also do this with a custom request handler that included two DocLists in the results. (now that you can programmatically modify/override the params of a SolrQueryRequest, it would be really easy to write a subclass of any existing request handler that first did the promo search, and then delegated to the super class with fq params telling it to ignore the results you've already included) -Hoss
Re: Facet performance with heterogeneous 'facets'?
I upgraded to the most recent Solr build (9-22) and sadly it's still really slow: an 800 second query with a single facet on first_author, 15 million documents total, and the query returns 180 results. Maybe I'm doing something wrong? Also, this is on my personal desktop, not on a server. Still, I'm getting 0.1 second queries without facets, so I don't think that's the cause. In the admin panel I can still see the filterCache doing millions of lookups (and tons of evictions once it hits the maxSize). Here's the field I'm using in schema.xml: <field name="first_author" type="string" indexed="true" stored="true"/> This is the query: q=hiv+red+blood&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false I'll do more testing on the weekend, Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: OK, the optimization has been checked in. You can checkout from svn and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT). I'd be interested in hearing your results with it. [...]