Re: DateFormatTransformer issue with value 0000-00-00T00:00:00Z
While the year zero exists, month zero and day zero don't. And while APIs often accept those values (e.g. day zero meaning the last day of the previous month), the ISO 8601 spec does not accept them as far as I know.

On 11/18/2010 4:26 AM, Dennis Gearon wrote:
I thought that value was a perfectly valid one for ISO 8601 time? http://en.wikipedia.org/wiki/Year_zero
Dennis Gearon

----- Original Message -----
From: gwk
To: solr-user@lucene.apache.org
Sent: Wed, November 17, 2010 2:12:16 AM
Subject: Re: DateFormatTransformer issue with value 0000-00-00T00:00:00Z

On 11/16/2010 1:41 PM, Shanmugavel SRD wrote:
Hi, I have a field in my feed with the value 0000-00-00T00:00:00Z, and I have configured the field in data-config.xml. But after indexing, the field value becomes 0002-11-30T00:00:00Z. I want the value to still be '0000-00-00T00:00:00Z' after indexing. Could anyone help with this? PS: I am using Solr 1.4.1

As 0000-00-00T00:00:00Z isn't a valid date, I don't think Solr's date field will accept it. Assuming this is MySQL, you can use the zeroDateTimeBehavior connection string option, i.e. mysql://user:passw...@mysqlhost/database?zeroDateTimeBehavior=convertToNull This will make the MySQL driver return those values as NULL instead of all-zero dates.

Regards,
gwk
Re: DateFormatTransformer issue with value 0000-00-00T00:00:00Z
On 11/16/2010 1:41 PM, Shanmugavel SRD wrote:
Hi, I have a field in my feed with the value 0000-00-00T00:00:00Z, and I have configured the field in data-config.xml. But after indexing, the field value becomes 0002-11-30T00:00:00Z. I want the value to still be '0000-00-00T00:00:00Z' after indexing. Could anyone help with this? PS: I am using Solr 1.4.1

As 0000-00-00T00:00:00Z isn't a valid date, I don't think Solr's date field will accept it. Assuming this is MySQL, you can use the zeroDateTimeBehavior connection string option, i.e. mysql://user:passw...@mysqlhost/database?zeroDateTimeBehavior=convertToNull This will make the MySQL driver return those values as NULL instead of all-zero dates.

Regards,
gwk
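For reference, the connection-string option gwk mentions would sit on the DataImportHandler dataSource. A minimal sketch of data-config.xml; the driver class, host, database name and credentials are placeholders, not taken from the original post:

```xml
<dataConfig>
  <!-- zeroDateTimeBehavior=convertToNull makes Connector/J return NULL
       for all-zero dates instead of rolling them over -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://mysqlhost/database?zeroDateTimeBehavior=convertToNull"
              user="user"
              password="password"/>
  <!-- document/entity/field definitions as before -->
</dataConfig>
```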
Re: How to Facet on a price range
On 11/9/2010 7:32 PM, Geert-Jan Brits wrote:
When you drag the sliders, an update of how many results would match is immediately shown. I really like this. How did you do this? Is this available out of the box with the suggested Facet_by_range patch?

Hi,

With the range facets you get the facet counts for every discrete step of the slider. These values are requested in the AJAX request whenever the search criteria change, and when someone uses the sliders we simply check the range that is selected and add up the discrete values within that range to get the expected number of results. So yes, it is available, but as Solr is just the search backend, the frontend you'll have to write yourself.

Regards,
gwk
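The client-side summing gwk describes can be sketched in a few lines (the original frontend is JavaScript and isn't shown in the thread; the data layout here is an assumption):

```python
def count_in_range(interval_counts, lower, upper):
    """Sum the per-interval facet counts whose interval start falls in
    [lower, upper) -- the count shown while the user drags the slider."""
    return sum(count for start, count in interval_counts.items()
               if lower <= start < upper)

# Facet counts per 100-unit price bucket, as returned by one AJAX request:
counts = {0: 12, 100: 30, 200: 18, 300: 5}
print(count_in_range(counts, 100, 300))  # 48
```

No round trip to Solr is needed while dragging, which is what makes the live update cheap.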
Re: How to Facet on a price range
Hi,

Instead of all the facet queries, you can also make use of range facets (http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range), which is in trunk as far as I know. It should also be patchable into older versions of Solr, although that should not be necessary. We make use of it (http://www.mysecondhome.co.uk/search.html) to create the nice sliders Geert-Jan describes. We've also used it to add the sparklines above the sliders, which give a nice indication of how the current selection is spread out.

Regards,
gwk

On 11/9/2010 3:33 PM, Geert-Jan Brits wrote:
Just to add to this: if you want to allow the user more choice in selecting ranges, perhaps by using a two-sided JavaScript slider for the price range (a la kayak.com), it may be very worthwhile to discretize the allowed values for the slider (e.g. steps of 5 dollars). Most JS slider implementations allow for this easily. This has the advantages of:
- having far fewer possible facet queries, and thus a far greater chance of these facet queries hitting the cache.
- a better user experience, although that's debatable.

Just to be clear: for this the Solr side would still use &facet=on&facet.query=price:[50 TO *]&facet.query=price:[* TO 100] and not the optimized pre-computed variant suggested above.

Geert-Jan

2010/11/9 jayant
That was very well thought of and a clever solution. Thanks.
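Discretizing the slider maps directly onto a fixed set of facet.query parameters, which is what makes the cache hits likely. A sketch; the field name and step size are illustrative:

```python
from urllib.parse import urlencode

def range_facet_params(field, lo, hi, step):
    """Build one facet.query per discrete slider step, so identical
    requests hit Solr's filter cache instead of recomputing."""
    queries = [f"{field}:[{v} TO {v + step}]" for v in range(lo, hi, step)]
    return urlencode([("facet", "on")] + [("facet.query", q) for q in queries])

print(range_facet_params("price", 0, 20, 5))
```

Because the set of possible queries is small and fixed, most requests after warm-up are answered from the cache.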
Re: Geographic clustering
Hi Charlie,

I think I understand what you mean; I had a similar requirement and this is what we made: http://www.mysecondhome.co.uk/search.html?view=map It allows full faceting on all fields the site allows in normal list search. Some information about my implementation is in my original thread about this issue (http://lucene.472066.n3.nabble.com/Geographic-clustering-td502559.html). Unfortunately, in a fit of madness I didn't add my component to version control and have since lost the source of my little geo-clustering component (and yes, I'm still hitting myself over the head for that). If you want more information I'd be happy to help.

Regards,
gwk

On 9/14/2010 8:14 PM, Charlie DeTar wrote:
Hi, I'm interested in using geographic clustering of records in a Solr search index. Specifically, I want to be able to efficiently produce a map with clustered bubbles that represent the number of documents that are indexed with points in that general area. I'd like to combine this with other facets and search constraints, so it can't be entirely pre-computed. It looks to me as though LocalSolr (http://www.gissearch.com/localsolr) is focused on simply constraining search results to a given radius, and not on facets/clustering of the entire index. Searching the archives of this list, last year there was some talk about writing custom geographic clustering components, but I couldn't find code examples. Does anyone have a working implementation of a geographic clustering component, or can anyone point to resources that would help in building one?

best, Charlie
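gwk's component is lost, but one common way to produce such clustered bubbles is plain grid bucketing of the matching documents' coordinates. A sketch under that assumption (this is not the original implementation):

```python
from collections import defaultdict

def grid_cluster(points, cell_size):
    """Bucket (lat, lon) points into fixed-size grid cells; each
    non-empty cell becomes one map bubble labelled with its count."""
    cells = defaultdict(int)
    for lat, lon in points:
        cells[(int(lat // cell_size), int(lon // cell_size))] += 1
    return dict(cells)

print(grid_cluster([(52.3, 4.9), (52.4, 4.8), (40.7, -74.0)], 1.0))
```

Run server-side over the current result set, this composes naturally with other facets and filters, since it clusters whatever the query matched rather than a precomputed set.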
Re: Autosuggest on PART of cityname
On 8/20/2010 7:04 PM, PeterKerk wrote:
@Markus: thanks, will try to work with that. @Gijs: I've looked at the site and the search function on your homepage is EXACTLY what I need! Do you have some Solr code samples for me to study perhaps? (I just need the relevant fields in the schema.xml and the query url) It would help me a lot! :) Thanks to you both!

The fields in our schema are:
- An id (required) based on type, depth and a number; not important.
- type (required): either "buy" or "rent", as our sections have separate autocompleters.
- A field storing the kind of this document, since you can search by country, region or city (well, since we use geonames.org geographical data we actually have 4 region levels).
- name: the canonical name of the country/region/city.
- name_*: the name of the country/region/city in various languages.
- parent: the name of the country/region/city with any of its parents, comma separated. This is used for phrase searches, so if you enter "Amsterdam, Netherlands" the Dutch Amsterdam will match before any of the Amsterdams in other countries.
- parent_*: the same as parent, but in different languages.
- Some internal data used to create the correct filters when this particular suggestion is selected, and the same in different languages, as our filters are on the actual names of countries/regions/cities.
- The number of documents, i.e. the number on the right of the suggestions.
- names: a multivalued field which is copyfield-ed from name and name_*.
- parents: a multivalued field which is copyfield-ed from parent and parent_*.

The "text" fieldtype's analyzers include a WordDelimiterFilter (generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"), an EdgeNGramFilter (maxGramSize="30"), a SynonymFilter (ignoreCase="true" expand="true"), a StopFilter (words="stopwords.txt") and a second WordDelimiterFilter (generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1").

Our autocompletion requests are dismax requests where the most important parameters are:
- q = the text the user has entered into the searchbox so far
- fq = type:sale (or rent)
- qf = name_<lang>^4 name^4 names (where <lang> is the currently selected language on the website)
- pf = name_<lang>^4 name^4 names parents

Honestly, those parameters were basically just tweaked, without quite understanding their meaning, until I got something that worked adequately. Hope this helps.

Regards,
gwk
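Put together, one autocomplete request along those lines could be built like this; the field names follow gwk's description above, but the exact schema was lost in the archive, so treat them as assumptions:

```python
from urllib.parse import urlencode

def autocomplete_params(prefix, section, lang):
    """Dismax autocomplete request in the shape gwk describes; the qf/pf
    field names are reconstructed from the post, not verbatim schema."""
    return urlencode({
        "q": prefix,                                   # what the user typed so far
        "qt": "dismax",
        "fq": f"type:{section}",                       # separate buy/rent completers
        "qf": f"name_{lang}^4 name^4 names",
        "pf": f"name_{lang}^4 name^4 names parents",   # phrase boost on full paths
    })

print(autocomplete_params("amst", "sale", "en"))
```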
Re: Autosuggest on PART of cityname
On 8/19/2010 4:45 PM, PeterKerk wrote:
I want to have a Google-like autosuggest function on city names. So when the user types some characters, I want to show cities that match those characters, but ALSO the number of locations that are in that city. Now with Solr I have the parameter "&fq=title:Bost", but the result doesn't show the city Boston. So the fq parameter now seems to be an exact match, where I want it to be a partial match as well, more like this in SQL: WHERE title LIKE 'Bost%'. How can I do this?

Hi,

We do something similar (http://www.mysecondhome.co.uk); our solution is quite similar to the one proposed by Markus, however we use a separate core for the auto-completion data which is updated hourly. This is because you can complete on multiple levels of geography, which would be quite hard to do with faceting.

Regards,
gwk
Re: Solr 1.4.1 and 3x: Grouping of query changes results
On 8/9/2010 12:01 AM, David Benson wrote: I'm seeing what I believe to be a logic error in the processing of a query. Returns document 1234 as expected: id:1234 AND -indexid:1 AND -indexid:2 AND -indexid:3 Does not return document as expected: id:1234 AND (-indexid:1 AND -indexid:2) AND -indexid:3 Has anyone else experienced this? The exact placement of the parens isn't key, just adding a level of nesting changes the query results. Thanks, David Hi, I could be wrong but I think this has to do with Solr's lack of support for purely negative queries, try the following and see if it behaves correctly: id:1234 AND (*:* AND -indexid:1 AND -indexid:2) AND -indexid:3 Regards, gwk
Re: Sites with Innovative Presentation of Tags and Facets
On 5/31/2010 4:24 PM, gwk wrote:
On 5/31/2010 11:50 AM, gwk wrote:
On 5/31/2010 11:29 AM, Geert-Jan Brits wrote:
May I ask how you implemented getting the facet counts for each interval? Do you use a facet query per interval? And perhaps, for inspiration, a link to the site you implemented this on. Thanks, Geert-Jan

I love the idea of sparklines on range sliders. I think if I have time, I might add them to the range sliders on our site. I already have all the data, since I show the count for a range while the user is dragging by storing the facet counts for each interval in JavaScript.

Hi,

Sorry, it seems I pressed send halfway through my mail and forgot about it. The site I implemented my numerical range faceting on is http://www.mysecondhome.co.uk/search.html and I got the facets by making a small patch for Solr (https://issues.apache.org/jira/browse/SOLR-1240) which does the same thing for numbers that date faceting does for dates. The biggest issue with range faceting is the double counting of edges (which also happens in date faceting, see https://issues.apache.org/jira/browse/SOLR-397). My patch deals with that by adding an extra parameter which allows you to specify which end of the range query should be exclusive. A secondary issue is that you can't do filter queries with one end inclusive and one end exclusive (i.e. price:[500 TO 1000}). You can get around this by doing "price:({500 TO 1000} OR 500)". I've looked into the JavaCC code of Lucene to see if I could fix it so you could mix [] and {}, but unfortunately I'm not familiar enough with it to get it to work.

Regards,
gwk

Hi,

I was supposed to work on something else but I just couldn't resist, and just implemented some bar graphs for the range sliders, and I really like it. In my case it was really easy: all the data was already right there in JavaScript, so it's not causing additional server-side load. It's also really nice to see the graph update when a facet is selected/changed.

Regards,
gwk

(Tried attaching an image, but it didn't work, so here it is: http://img249.imageshack.us/img249/7766/faceting.png)
Re: Sites with Innovative Presentation of Tags and Facets
On 5/31/2010 11:50 AM, gwk wrote:
On 5/31/2010 11:29 AM, Geert-Jan Brits wrote:
May I ask how you implemented getting the facet counts for each interval? Do you use a facet query per interval? And perhaps, for inspiration, a link to the site you implemented this on. Thanks, Geert-Jan

I love the idea of sparklines on range sliders. I think if I have time, I might add them to the range sliders on our site. I already have all the data, since I show the count for a range while the user is dragging by storing the facet counts for each interval in JavaScript.

Hi,

Sorry, it seems I pressed send halfway through my mail and forgot about it. The site I implemented my numerical range faceting on is http://www.mysecondhome.co.uk/search.html and I got the facets by making a small patch for Solr (https://issues.apache.org/jira/browse/SOLR-1240) which does the same thing for numbers that date faceting does for dates. The biggest issue with range faceting is the double counting of edges (which also happens in date faceting, see https://issues.apache.org/jira/browse/SOLR-397). My patch deals with that by adding an extra parameter which allows you to specify which end of the range query should be exclusive. A secondary issue is that you can't do filter queries with one end inclusive and one end exclusive (i.e. price:[500 TO 1000}). You can get around this by doing "price:({500 TO 1000} OR 500)". I've looked into the JavaCC code of Lucene to see if I could fix it so you could mix [] and {}, but unfortunately I'm not familiar enough with it to get it to work.

Regards,
gwk

Hi,

I was supposed to work on something else but I just couldn't resist, and just implemented some bar graphs for the range sliders, and I really like it. In my case it was really easy: all the data was already right there in JavaScript, so it's not causing additional server-side load. It's also really nice to see the graph update when a facet is selected/changed.

Regards,
gwk
Re: Sites with Innovative Presentation of Tags and Facets
On 5/31/2010 11:29 AM, Geert-Jan Brits wrote:
May I ask how you implemented getting the facet counts for each interval? Do you use a facet query per interval? And perhaps, for inspiration, a link to the site you implemented this on. Thanks, Geert-Jan

I love the idea of sparklines on range sliders. I think if I have time, I might add them to the range sliders on our site. I already have all the data, since I show the count for a range while the user is dragging by storing the facet counts for each interval in JavaScript.

Hi,

Sorry, it seems I pressed send halfway through my mail and forgot about it. The site I implemented my numerical range faceting on is http://www.mysecondhome.co.uk/search.html and I got the facets by making a small patch for Solr (https://issues.apache.org/jira/browse/SOLR-1240) which does the same thing for numbers that date faceting does for dates. The biggest issue with range faceting is the double counting of edges (which also happens in date faceting, see https://issues.apache.org/jira/browse/SOLR-397). My patch deals with that by adding an extra parameter which allows you to specify which end of the range query should be exclusive. A secondary issue is that you can't do filter queries with one end inclusive and one end exclusive (i.e. price:[500 TO 1000}). You can get around this by doing "price:({500 TO 1000} OR 500)". I've looked into the JavaCC code of Lucene to see if I could fix it so you could mix [] and {}, but unfortunately I'm not familiar enough with it to get it to work.

Regards,
gwk

Hi,

I was supposed to work on something else but I just couldn't resist, and just implemented some bar graphs for the range sliders, and I really like it. In my case it was really easy: all the data was already right there in JavaScript, so it's not causing additional server-side load. It's also really nice to see the graph update when a facet is selected/changed.

Regards,
gwk
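The edge double counting gwk mentions is easy to see in miniature; with both range ends inclusive, a value sitting exactly on an interior edge lands in two buckets (values here are illustrative, not from the thread):

```python
def bucket_counts(values, edges):
    """Count values per [lo, hi] bucket with BOTH ends inclusive,
    mimicking the behaviour described above."""
    return [sum(lo <= v <= hi for v in values)
            for lo, hi in zip(edges, edges[1:])]

prices = [500, 750, 1000, 1200]
print(bucket_counts(prices, [0, 500, 1000, 1500]))  # [1, 3, 2]
```

Four documents, but the bucket counts sum to six: 500 and 1000 are each counted twice. Making one end exclusive per bucket, as SOLR-1240's extra parameter allows, removes the overlap.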
Re: date slider
Hi,

I'm not sure if this applies to your use case, but when I was building our faceted search (see http://www.mysecondhome.co.uk/search.html), at first I wanted to do the same and retrieve the minimum and maximum values. But when I did, the few values that were a lot higher than the others made it almost impossible to select a reasonable range. That's why I switched to a fixed range of reasonable values, with the last option being "anything higher". This way the result set is spread out pretty evenly over the length of the slider. If the values over which you want to do range selection don't vary a lot, I think this is the best option; otherwise I guess you'll have to use another solution. Maybe if the values do change a lot, but not very often, you could generate new fixed range values after updating Solr. If you think something like what I've made is useful to you, I'll be happy to answer any questions about how I implemented it.

Regards,
gwk

On 5/16/2010 10:07 PM, Lukas Kahwe Smith wrote:
On 16.05.2010, at 21:01, Ahmet Arslan wrote:
http://wiki.apache.org/solr/StatsComponent can give you min and max values.
Sorry, my bad, I just tested StatsComponent with a tdate field and it is not working for date-typed fields. The wiki says it is for numeric fields.
OK, thanks for checking. Is my use case really so unusual? I guess I could store a unix timestamp, or I could just do a fixed range. Hmm, if I use facets with a really large gap, will it always give me at least the min and max maybe? Will try it out when I get home.

regards, Lukas
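gwk's fixed-stops-plus-overflow approach can be sketched like this (the step size and cap are made up; choose them from your data's distribution):

```python
def slider_buckets(step, top):
    """Fixed slider stops up to `top`, plus an open-ended final bucket
    so a few extreme outliers can't stretch the whole slider."""
    stops = list(range(0, top + 1, step))
    labels = [f"[{a} TO {b}]" for a, b in zip(stops, stops[1:])]
    labels.append(f"[{top} TO *]")  # the "anything higher" option
    return labels

print(slider_buckets(250, 1000))
```

Each label maps straight onto a Solr range query, and the open-ended last bucket absorbs the outliers that would otherwise dominate a min/max-derived scale.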
Re: date facets without intersections
Hi,

Several possible solutions are discussed in http://lucene.472066.n3.nabble.com/Date-Faceting-and-Double-Counting-td502014.html

Regards,
gwk

On 4/27/2010 10:02 PM, Király Péter wrote:
Dear Solr users, I am interested in whether it is possible to get date facets without intersecting ranges. Right now, documents which stand on the boundaries of ranges are covered by both ranges. An example facet result (from Solr) gives counts of 3, 3 and 12. Translated into queries, it means that the number of documents matching the query date_fc:[1000-01-01T00:00:00Z TO 1100-01-01T00:00:00Z] is 3, and the number of documents matching the query date_fc:[1100-01-01T00:00:00Z TO 1200-01-01T00:00:00Z] is 3 as well. I have a document with date 1100-01-01T00:00:00Z, and it matches both queries. I haven't found such parameters for date facets, but maybe you know a Solr secret which prevents this intersection. I can do it with query facets, but that seems more complicated than the very comfortable date facet parameters.

Thanks,
Péter
Re: Bucketing a price field
Oops, the new patch only works on Trie fields, other stuff I said should still be valid. (One extra thing to be aware of is double counting, see http://n3.nabble.com/Date-Faceting-and-Double-Counting-td502014.html for example) Regards, gwk On 4/7/2010 4:03 PM, gwk wrote: Hi, A while back I created a patch for Solr (http://issues.apache.org/jira/browse/SOLR-1240) to do range faceting on numbers. I haven't uploaded an updated patch for Solr 1.4 yet, I'll try to do that shortly. I haven't tested it on a floating point field but in theory it should work on most numerical field types. Regards, gwk On 4/7/2010 2:44 AM, Blargy wrote: What would be the best way to do range bucketing on a price field? I'm sort of taking the example from the Solr 1.4 book and I was thinking about using a PatternTokenizerFactory with a SynonymFilterFactory. Is there a better way? Thanks
Re: Bucketing a price field
Hi, A while back I created a patch for Solr (http://issues.apache.org/jira/browse/SOLR-1240) to do range faceting on numbers. I haven't uploaded an updated patch for Solr 1.4 yet, I'll try to do that shortly. I haven't tested it on a floating point field but in theory it should work on most numerical field types. Regards, gwk On 4/7/2010 2:44 AM, Blargy wrote: What would be the best way to do range bucketing on a price field? I'm sort of taking the example from the Solr 1.4 book and I was thinking about using a PatternTokenizerFactory with a SynonymFilterFactory. Is there a better way? Thanks
Re: Drill down a solr result set by facets
Hi,

You are using the dismax request handler, which only accepts a simple string in the q parameter; you can't specify other fields in it that way. In any case, using filter queries (fq), as suggested by Indika Tantrigoda, is a better option, as these are cached separately, which is quite useful for faceting.

Regards,
gwk

On 3/29/2010 6:07 PM, Dhanushka Samarakoon wrote:
Thanks for the reply. I was just giving the above as an example. Something as simple as the following is also not working: /select/?q=france+fDepartmentName:History&version=2.2& So it looks like the query parameter syntax I'm using is wrong. This is the params array I'm getting back in the response: 10, 0, on, "kansas fDepartmentName:History", dismax, 2.2

On Mon, Mar 29, 2010 at 10:59 AM, Tommy Chheng wrote:
Try adding quotes to your query: DepartmentName:Chemistry+fSponsor:\"US Cancer/Diabetic Research Institute\" The parser will split on whitespace.
Tommy Chheng, Programmer and UC Irvine Graduate Student, Twitter @tommychheng, http://tommy.chheng.com

On 3/29/10 8:49 AM, Dhanushka Samarakoon wrote:
Hi, I'm trying to perform a search based on keywords and then reduce the result set based on the facets that the user selects. The first query for a search would look like this: http://localhost:8983/solr/select/?q=cancer+stem&version=2.2&wt=php&start=&rows=10&indent=on&qt=dismax&facet=on&facet.mincount=1&facet.field=fDepartmentName&facet.field=fInvestigatorName&facet.field=fSponsor&facet.date=DateAwarded&facet.date.start=2009-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1MONTH In the above query (as per dismax in the Solr config file), it searches multiple fields such as GrantTitle, DepartmentName, InvestigatorName, etc. Then if the user selects 'Chemistry' from the facet field 'fDepartmentName' and 'US Cancer/Diabetic Research Institute' from 'fSponsor', I need to reduce the result set above to only records where fDepartmentName is 'Chemistry' and fSponsor is 'US Cancer/Diabetic Research Institute'. The following query is not working: select/?q=cancer+stem+fDepartmentName:Chemistry+fSponsor:US Cancer/Diabetic Research Institute&version=2.2& Fields starting with 'f' are defined in the schema.xml as copy fields. Any ideas on the correct syntax?

Thanks,
Dhanushka.
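Following gwk's advice, the drill-down selections move out of q and into repeated fq parameters, each of which Solr caches on its own. A sketch using the field names from the thread:

```python
from urllib.parse import urlencode

# Keywords stay in q (handled by dismax); each facet selection becomes
# its own filter query, quoted because the values contain spaces.
params = urlencode([
    ("q", "cancer stem"),
    ("qt", "dismax"),
    ("fq", 'fDepartmentName:"Chemistry"'),
    ("fq", 'fSponsor:"US Cancer/Diabetic Research Institute"'),
])
print(params)
```

Adding or removing a selection just adds or drops one fq; the cached filters for the remaining selections are reused.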
Re: How do I create a solr core with the data from an existing one?
Hi,

I'm not sure if it's the best option, but you could use replication to copy the index (http://wiki.apache.org/solr/SolrReplication). As long as your core is configured as a master, you can use the fetchindex command to do a one-time replication into the new core (see the HTTP API section on the wiki page).

Regards,
gwk

On 3/24/2010 5:31 PM, Steve Dupree wrote:
*Solr 1.4 Enterprise Search Server* recommends doing large updates on a copy of the core, and then swapping it in for the main core. I tried following these steps:
1. Create prep core: http://localhost:8983/solr/admin/cores?action=CREATE&name=prep&instanceDir=main
2. Perform index update, then commit/optimize on prep core.
3. Swap main and prep core: http://localhost:8983/solr/admin/cores?action=SWAP&core=main&other=prep
4. Unload prep core: http://localhost:8983/solr/admin/cores?action=UNLOAD&core=prep
The problem I am having is that the core created in step 1 doesn't have any data in it. If I am going to do a full index of everything and the kitchen sink, that would be fine, but if I just want to update a (large) subset of the documents, that's obviously not going to work. (I could merge the cores, but part of what I'm trying to do is get rid of any deleted documents without trying to make a list of them.) Is there some flag to the CREATE action that I'm missing? The Solr wiki page for CoreAdmin <http://wiki.apache.org/solr/CoreAdmin> is a little sparse on details. Is this approach wrong? I found at least one message on this list stating that performing updates in a separate core on the same machine won't help, given that they're both using the same CPU. Is that true?

thanks in advance
~stannius
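The one-time fetchindex call gwk suggests would slot in between steps 1 and 2; a sketch with placeholder host/core names (verify the exact replication handler path against your solrconfig.xml):

```python
import urllib.request

# Pull the main core's index into the freshly created prep core via the
# replication handler, so the prep core starts with the current data:
url = ("http://localhost:8983/solr/prep/replication"
       "?command=fetchindex"
       "&masterUrl=http://localhost:8983/solr/main/replication")
# urllib.request.urlopen(url)  # uncomment to run against a live Solr
print(url)
```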
Re: distinct on my result
Hi,

Try replacing the KeywordTokenizerFactory with a WhitespaceTokenizerFactory so it'll create separate terms per word. After a reindex it should work.

Regards,
gwk

On 3/11/2010 4:33 PM, stocki wrote:
hey, okay, i'll show you my settings ;) i use an extra core with the standard request handler. (schema.xml) i copy my names to the field suggest and use the EdgeNGramFilter and some others. with this config i get the results above ... maybe i have too many filters ;) ?!

gwk-4 wrote:
Hi, I'm no expert on the full-text search features of Solr, but I guess that has something to do with your fieldtype, or your query. Are you using the standard request handler or dismax for your queries? And what analyzers are you using on your product name field? Regards, gwk

On 3/11/2010 3:24 PM, stocki wrote:
okay. we have a lot of products and i just imported the name of each product into a core, made an edgengram on it, and my autocompletion runs. but i want auto-suggestion. example: autocompletion --> input: "harry", output: "harry potter...". but when the input is "potter", the output is empty. so what i want is to get "harry potter ..." when i type "potter" into my search field! any idea? i think the solution is a mix of TermsComponent and EdgeNGram, or not? i am a little bit desperate, and in this forum there is too much scattered information about it =(

gwk-4 wrote:
Hi, The autosuggest core is filled by a simple script (written in PHP) which requests facet values for all the possible strings one can search for and adds them one by one as documents. Our case has some special issues due to the fact that we search in multiple languages (typing "España" will suggest "Spain" and the other way around when on the Spanish site). We have about 97500 documents yielding approximately 12500 distinct documents in our autosuggest core, and the autosuggest-update script takes about 5 minutes to do a full re-index (all this is done on a separate server and replicated, so the indexing has no impact on the performance of the site).

Regards,
gwk

On 3/10/2010 3:09 PM, stocki wrote:
okay. thx. my suggestions run in another core ;) do you distinct during the import with DIH?
Re: distinct on my result
Hi,

I'm no expert on the full-text search features of Solr, but I guess that has something to do with your fieldtype, or your query. Are you using the standard request handler or dismax for your queries? And what analyzers are you using on your product name field?

Regards,
gwk

On 3/11/2010 3:24 PM, stocki wrote:
okay. we have a lot of products and i just imported the name of each product into a core, made an edgengram on it, and my autocompletion runs. but i want auto-suggestion. example: autocompletion --> input: "harry", output: "harry potter...". but when the input is "potter", the output is empty. so what i want is to get "harry potter ..." when i type "potter" into my search field! any idea? i think the solution is a mix of TermsComponent and EdgeNGram, or not? i am a little bit desperate, and in this forum there is too much scattered information about it =(

gwk-4 wrote:
Hi, The autosuggest core is filled by a simple script (written in PHP) which requests facet values for all the possible strings one can search for and adds them one by one as documents. Our case has some special issues due to the fact that we search in multiple languages (typing "España" will suggest "Spain" and the other way around when on the Spanish site). We have about 97500 documents yielding approximately 12500 distinct documents in our autosuggest core, and the autosuggest-update script takes about 5 minutes to do a full re-index (all this is done on a separate server and replicated, so the indexing has no impact on the performance of the site).

Regards,
gwk

On 3/10/2010 3:09 PM, stocki wrote:
okay. thx. my suggestions run in another core ;) do you distinct during the import with DIH?
Re: distinct on my result
Hi,

The autosuggest core is filled by a simple script (written in PHP) which requests facet values for all the possible strings one can search for and adds them one by one as documents. Our case has some special issues due to the fact that we search in multiple languages (typing "España" will suggest "Spain" and the other way around when on the Spanish site). We have about 97500 documents yielding approximately 12500 distinct documents in our autosuggest core, and the autosuggest-update script takes about 5 minutes to do a full re-index (all this is done on a separate server and replicated, so the indexing has no impact on the performance of the site).

Regards,
gwk

On 3/10/2010 3:09 PM, stocki wrote:
okay. thx. my suggestions run in another core ;) do you distinct during the import with DIH?
Re: distinct on my result
Hi,

I ran into the same issue, and what I did (at http://www.mysecondhome.co.uk/) was to create a separate core just for autosuggest, which is fully updated once an hour and contains the distinct values of the items I want to look for, including the count, so I can display the approximate number of results in the suggest dropdown. This might not be a good solution when your data is updated frequently, but for us it has worked very well so far. Maybe you could also use clustering so you won't have to create a separate core, but I'm thinking my solution performs better (although I haven't tested it, so I could be horribly, horribly wrong).

Regards,
gwk

On 3/10/2010 2:55 PM, stocki wrote:
hello. i implemented my suggest function with the EdgeNGramFilter. now when i get my result, the result is not distinct; often the name appears twice or more. is it possible for solr to give me only distinct results?
"response":{"numFound":172,"start":0,"docs":[ { "name":"Halloween"}, { "name":"Hallo Taxi"}, { "name":"Halloween"}, { "name":"Hallstatt"}, { "name":"Hallo Mary"}, { "name":"Halloween"}, { "name":"Halloween"}, { "name":"Halloween"}, { "name":"Halleluja"}, { "name":"Halloween"}]
so how can i remove the duplicate "Halloween" results? i don't want to deduplicate client-side. thx
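The core idea of gwk's separate autosuggest core, collapsing duplicate values into one suggestion document with a count, can be sketched like this (gwk's actual script is PHP and fed from facet values; this is a simplified illustration):

```python
from collections import Counter

def distinct_suggestions(docs, field="name"):
    """Collapse duplicate field values into (value, count) suggestion
    docs, most frequent first -- one document per distinct value."""
    counts = Counter(d[field] for d in docs)
    return [{"name": v, "count": c} for v, c in counts.most_common()]

docs = [{"name": "Halloween"}] * 3 + [{"name": "Hallo Taxi"}]
print(distinct_suggestions(docs))
```

Indexing the output of such a pass into a dedicated core means every suggestion is unique by construction, so no client-side deduplication is needed.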
Re: Date Facets
Hi Liam,

This happens because the range searches for date faceting are inclusive on both ends, so values on the exact edges of the intervals are counted twice. You can see some solutions at http://old.nabble.com/Date-Faceting-and-Double-Counting-td25227846.html

Regards,
gwk

On 2/24/2010 6:54 AM, Liam O'Boyle wrote:
Afternoon, I have a strange problem occurring with my date faceting: I seem to have more results in my facets than in my actual result set. The query filters by date to show results for one year, i.e. ib_date:[2000-01-01T00:00:00Z TO 2000-12-31T23:59:59Z], then uses date faceting to break up the dates by month, using the following parameters:
facet=true
facet.date=ib_date
facet.date.start=2000-01-01T00:00:00Z
facet.date.end=2000-12-31T23:59:59Z
facet.date.gap=+1MONTH
However, I end up with more numbers in the facets than there are documents in the response, including facets for dates that aren't matched. See below for a summary of the results pulled out through /solr/select. Is there something I'm missing here?

Thanks,
Liam
Re: Question regarding wildcards and dismax
Have a look at the q.alt parameter (http://wiki.apache.org/solr/DisMaxRequestHandler#q.alt), which is used for exactly this issue. Basically, putting q.alt=*:* in your query means you can leave out the q parameter if you want all documents to be selected.

Regards,
gwk

On 2/19/2010 11:28 AM, Roland Villemoes wrote:
Hi all, We have a web application built on top of Solr, and we are using a lot of facets; everything works just fine. When the user first hits the search page, we would like to do a "get all" query to get a result, and thereby get all facets, so we can build up the user interface from this result and its facets. So I would like to do a q=*:* on the search. But since I have switched to the dismax request handler this does not work anymore. My request/url looks like this:
a) /solr/da/mysearcher/?q=*:* Does not work
b) /solr/da/select?q=*:* Does work
But I really need to use a), since I control boosting/ranking in its definition. Furthermore, when the user drills down the search result by selecting from the facets, I still need to get the full search result, like: /solr/da/mysearcher/?q=*:*&fq=color:red Does not work.
Re: How does one sort facet queries?
On 2/19/2010 2:15 AM, Kelly Taylor wrote: All sorting of facets works great at the field level (count/index)...all good there...but how is sorting accomplished with range queries? The solrj response doesn't seem to maintain the order the queries are sent in, and the order is in neither index nor count order. What's the trick?

http://localhost:8983/solr/select?q=someterm
&rows=0
&facet=true
&facet.limit=-1
&facet.query=price:[* TO 100]
&facet.query=price:[100 TO 200]
&facet.query=price:[200 TO 300]
&facet.query=price:[300 TO 400]
&facet.query=price:[400 TO 500]
&facet.query=price:[500 TO 600]
&facet.query=price:[600 TO 700]
&facet.query=price:[700 TO *]
&facet.mincount=1
&collapse.field=dedupe_hash
&collapse.threshold=1
&collapse.type=normal
&collapse.facet=before

The "trick" I use is LocalParams to give each facet query a well-defined name; afterwards you can loop through the names in whatever order you want. So basically facet.query={!key=price_0}price:[* TO 100] etc. N.B. the facet queries in your example will lead to some documents being counted twice (i.e. when the price is exactly 100, 200, 300). Regards, gwk
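A sketch of the {!key=...} naming trick gwk describes, generating the facet.query values (the field name and bucket edges are illustrative; note these inclusive bounds still double-count a price sitting exactly on an edge):

```python
# Sketch: generate facet.query parameters tagged with {!key=...} so the
# client can render the buckets in a fixed order regardless of the
# order Solr returns them in.
def named_price_facets(edges):
    bounds = ["*"] + [str(e) for e in edges] + ["*"]
    return [
        "{!key=price_%d}price:[%s TO %s]" % (i, lo, hi)
        for i, (lo, hi) in enumerate(zip(bounds, bounds[1:]))
    ]

for q in named_price_facets([100, 200]):
    print(q)
# {!key=price_0}price:[* TO 100]
# {!key=price_1}price:[100 TO 200]
# {!key=price_2}price:[200 TO *]
```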
Re: labeling facets and highlighting question
There's a ! missing in there; try {!key=label}. Regards, gwk On 2/18/2010 5:01 AM, adeelmahmood wrote: Okay, so if I don't want to do any excludes then I'm assuming I should just put in {key=label}field. I tried that and it doesn't work; it says undefined field {key=label}field. Lance Norskog-2 wrote: Here's the problem: the wiki page is confusing: http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters The line: q=mainquery&fq=status:public&fq={!tag=dt}doctype:pdf&facet=on&facet.field={!ex=dt}doctype is standalone, but the later line: facet.field={!ex=dt key=mylabel}doctype means 'change the long query from {!ex=dt}docType to {!ex=dt key=mylabel}docType'. 'tag=dt' creates a tag (name) for a filter query, and 'ex=dt' means 'exclude this filter query'. On Wed, Feb 17, 2010 at 4:30 PM, adeelmahmood wrote: Simple question: I want to give a label to my facet queries instead of the name of the facet field. I found documentation on the Solr site saying I can do that by specifying the key local param, with syntax something like facet.field={!ex=dt%20key='By%20Owner'}owner. I am just not sure what the ex=dt part does; if I take it out it throws an error, so it seems to be important, but what for? Also, I tried turning on highlighting and I can see that it adds the highlighting list to the XML at the end, but it only lists the ids of the matching results; it doesn't actually show the text data that it's matching against. So I am getting something like this back ... instead of the actual text that's being matched. Isn't it supposed to do that and wrap the search terms in an em tag? How come it's not doing that in my case? Here is my schema -- View this message in context: http://old.nabble.com/labeling-facets-and-highlighting-question-tp27632747p27632747.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
Re: Autosuggest and highlighting
On 2/9/2010 2:57 PM, Ahmet Arslan wrote: I'm trying to improve the search box on our website by adding an autosuggest field. The dataset is a set of properties in the world (mostly Europe) and the searchbox is intended to be filled with a country-, region- or city name. To do this I've created a separate, simple core with one document per geographic location, for example the document for the country "France" contains several fields including the number of properties (so we can show the approximate amount of results in the autosuggest box) and the name of the country France in several languages and some other bookkeeping information. The name of the property is stored in two fields: "name", which simply contains the canonical name of the country, region or city, and "names", which is a multivalued field containing the name in several different languages. Both fields use an EdgeNGramFilter during analysis so the query "Fr" can match "France". This all seems to work; the autosuggest box gives appropriate suggestions. But when I turn on highlighting the results are less than desirable, for example the query "rho" using dismax (and hl.snippets=5) returns the following: Région Rhône-Alpes Rhône-Alpes Rhône-Alpes Rhône-Alpes Rhône-Alpes Région Rhône-Alpes Département du Rhône Département du Rhône Rhône Département du Rhône Rhône Département du Rhône As you can see, no matter where the match is, the first 3 characters are highlighted. Obviously not correct for many of the fields. Is this because of the NGramFilterFactory or am I doing something wrong? I used https://issues.apache.org/jira/browse/SOLR-357 for this some time ago. It was giving correct highlights. I just ran a test with the NGramFilter removed (and reindexed), which did give correct highlighting results, but I had to query using the whole word.
I'll try the PrefixingFilterFactory next, although according to the comments it's nothing but a subset of the EdgeNGramFilterFactory, so unless I'm configuring it wrong it should yield the same results... However, we are now using http://www.ajaxupdates.com/mootools-autocomplete-ajax-script/ which automatically bolds matching characters without using Solr highlighting. A purely JavaScript-based solution isn't really an option for us though, as it wouldn't work for the diacritical marks without a lot of transliteration brouhaha. Regards, gwk
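The highlighting behaviour gwk sees can be reproduced with a toy model of what an EdgeNGramFilter emits at index time (the gram sizes and the accent-folded token are assumptions):

```python
# Sketch: an edge n-gram filter indexes every prefix of a token, so the
# query "rho" matches the 3-character gram of a token like "rhone".
# The highlighter then marks the first three characters of the stored
# value, wherever the real match is.
def edge_ngrams(term, min_gram=1, max_gram=20):
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]

print(edge_ngrams("rhone", max_gram=5))
# ['r', 'rh', 'rho', 'rhon', 'rhone']
```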
Autosuggest and highlighting
Hi, I'm trying to improve the search box on our website by adding an autosuggest field. The dataset is a set of properties in the world (mostly Europe) and the searchbox is intended to be filled with a country-, region- or city name. To do this I've created a separate, simple core with one document per geographic location, for example the document for the country "France" contains several fields including the number of properties (so we can show the approximate amount of results in the autosuggest box) and the name of the country France in several languages and some other bookkeeping information. The name of the property is stored in two fields: "name", which simply contains the canonical name of the country, region or city, and "names", which is a multivalued field containing the name in several different languages. Both fields use an EdgeNGramFilter during analysis so the query "Fr" can match "France". This all seems to work; the autosuggest box gives appropriate suggestions. But when I turn on highlighting the results are less than desirable, for example the query "rho" using dismax (and hl.snippets=5) returns the following: Région Rhône-Alpes Rhône-Alpes Rhône-Alpes Rhône-Alpes Rhône-Alpes Région Rhône-Alpes Département du Rhône Département du Rhône Rhône Département du Rhône Rhône Département du Rhône As you can see, no matter where the match is, the first 3 characters are highlighted. Obviously not correct for many of the fields. Is this because of the NGramFilterFactory or am I doing something wrong? The field definition for 'name' and 'names' is: [the analyzer XML was mangled by the list archive; the surviving attributes show a WordDelimiterFilter (generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"), an EdgeNGramFilter (maxGramSize="20"), a synonym filter (ignoreCase="true" expand="true"), a stopword filter (words="stopwords.txt") and a second WordDelimiterFilter (catenateWords="0" catenateNumbers="0")] Regards, gwk
Re: trouble with DTD
On 2/8/2010 3:15 PM, Jens Kapitza wrote: hi @all, using solr and dataimport stuff to import ends up in RuntimeException. Caused by: java.lang.RuntimeException: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "eacute" at [row,col {unknown-source}]: [49,23] &eacute; is an entity defined for (X)HTML. XML only predefines &quot; &amp; &apos; &lt; &gt; and numeric character references (&#...;). So if you want to use the é character you'll have to either use the character itself or a numeric reference like &#233;. Regards, gwk
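gwk's point can be checked with any standards-compliant XML parser; a quick sketch with Python's stdlib parser:

```python
# Sketch: &eacute; is undefined in plain XML and is rejected, while a
# numeric character reference (or the literal character) parses fine.
import xml.etree.ElementTree as ET

print(ET.fromstring("<field>caf&#233;</field>").text)  # café
try:
    ET.fromstring("<field>caf&eacute;</field>")
except ET.ParseError as err:
    print("rejected:", err)  # undefined entity
```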
Solr and Geoserver/Mapserver
Hello, While my current implementation of searching on a map works, rendering hundreds of markers in an embedded Google map tends to slow browsers on slower computers (or fast computers running Internet Explorer :\) down to a crawl. I'm looking into generating tiles with the markers rendered on them on the server to improve performance (GTileLayerOverlay). Does anyone have any experience using GeoServer, MapServer or a similar application in combination with Solr, so that the application can generate tiles from a Solr query and tile position/zoom level? Regards, gwk
Re: Stop solr without losing documents
Michael wrote: I've got a process external to Solr that is constantly feeding it new documents, retrying if Solr is nonresponding. What's the right way to stop Solr (running in Tomcat) so no documents are lost? Currently I'm committing all cores and then running catalina's stop script, but between my commit and the stop, more documents can come in that would need *another* commit... Lots of people must have had this problem already, so I know the answer is simple; I just can't find it! Thanks. Michael I don't know if this is the best solution, or even if it's applicable to your situation, but we do incremental updates from a database based on a timestamp (kept in a simple separate SQL table filled by triggers, so deletes are captured correctly as well). We store this timestamp in Solr too. Our index script first does a simple Solr request for the newest indexed timestamp, then selects the documents to update with "SELECT * FROM document_updates WHERE timestamp >= X", where X is the timestamp returned from Solr. (We use >= for the hopefully extremely rare case where two updates happen at the same time and the index script runs at that same moment, picking up only one of them; this may cause some documents to be updated multiple times, but as document updates are idempotent that's no real problem.) Regards, gwk
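A sketch of the incremental-update query gwk describes (the table, column and timestamp format are hypothetical):

```python
# Sketch: ">=" rather than ">" so an update sharing the last-indexed
# timestamp is never skipped; re-indexing a document twice is harmless
# because document updates are idempotent.
def incremental_update_sql(last_indexed_ts):
    return ("SELECT * FROM document_updates "
            "WHERE timestamp >= '%s'" % last_indexed_ts)

print(incremental_update_sql("2010-02-24 06:54:00"))
# SELECT * FROM document_updates WHERE timestamp >= '2010-02-24 06:54:00'
```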
Re: Geographic clustering
Hi all, I've just got my geographic clustering component working (somewhat); I've attached a sample resultset to this mail. It seems to work pretty well and it's pretty fast. I have one issue I need help with concerning the API though. At the moment my Hilbert field is a sortable integer, and I do the following call to get the count for a specific cluster:

Query rangeQ = new TermRangeQuery("geo_hilbert", lowI, highI, true, true);
searcher.numDocs(rangeQ, docs);

But I'd like to further reduce the DocSet by the longitude and latitude bounds given in the geocluster arguments (swlat, swlng, nelat and nelng), but only for the purposes of clustering. I don't want to just add fq arguments to the query, as I want my non-geocluster results (like facet counts and numFound) to be unaffected by the selected range. So how would I achieve the effect of filter queries (including the awesome caching) by manipulating either rangeQ or docs? And since the snippet above is called multiple times with different rangeQ but the same (filtered) DocSet, I guess manipulating docs would be faster (I think). Regards, gwk gwk wrote: Hi Joe, Thanks for the link, I'll check it out, I'm not sure it'll help in my situation though since the clustering should happen at runtime due to faceted browsing (unless I'm mistaken at what the preprocessing does). More on my progress though, I thought some more about using Hilbert curve mapping and it seems really suited for what I want. I've just added a Hilbert field to my schema (Trie Integer field) with latitude and longitude at 15 bits precision (didn't use 16 bits to avoid the sign bit) so I have a 30 bit number in said field.
Getting facet counts for 0 to (2^30 - 1) should get me the entire map while getting counts for 0 to (2^28 - 1), 2^28 to (2^29 - 1), 2^29 to (2^29 + 2^28 - 1) and (2^29 + 2^28) to (2^30 - 1) should give me counts for four equal quadrants, all the way down to 0 to 3, 4 to 7, 8 to 11 ... (2^30 - 4 to 2^30 - 1) and of course faceting on every separate term. Of course if you're zoomed in far enough to need such fine-grained clustering you'll be looking at a small portion of the map and only a part of the whole range should be counted, but that should be doable by calculating the Hilbert number for the lower and upper bounds. The only problem is the location of the clusters: if I use this method I'll only have the Hilbert number and the number of items in that part of what is essentially a quadtree. But I suppose I can calculate the facet counts at one precision finer than the requested precision and use a weighted average of the four parts of the cluster; I'll have to see if that is accurate enough. Hopefully I'll have the time to complete this today or tomorrow. I'll report back if it has worked. Regards, gwk Joe Calderon wrote: there are clustering libraries like http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/, that have bindings to perl/python, you can preprocess your results and create clusters for each zoom level On Tue, Sep 8, 2009 at 8:08 AM, gwk wrote: Hi, I just completed a simple proof-of-concept clusterer component which naively clusters with a specified bounding box around each position, similar to what the javascript MarkerClusterer does. It's currently very slow as I loop over the entire docset and request the longitude and latitude of each document (not to mention that my unfamiliarity with Lucene/Solr isn't helping the implementation's performance any; most code is copied from grep-ing the Solr source). Clustering a set of about 80.000 documents takes about 5-6 seconds.
I'm currently looking into storing the Hilbert curve mapping in Solr and clustering using facet counts on numerical ranges of that mapping, but I'm not sure it will pan out. Regards, gwk Grant Ingersoll wrote: Not directly related to geo clustering, but http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable interface to clustering implementations. It currently has Carrot2 implemented, but the APIs are marked as experimental. I would definitely be interested in hearing your experience with implementing your clustering algorithm in it. -Grant On Sep 8, 2009, at 4:00 AM, gwk wrote: Hi, I'm working on a search-on-map interface for our website. I've created a little proof of concept which uses the MarkerClusterer (http://code.google.com/p/gmaps-utility-library-dev/) which clusters the markers nicely. But because sending tens of thousands of markers over Ajax is not quite as fast as I would like it to be, I'd prefer to do the clustering on the server side. I've considered a few options like storing the morton-order and throwing away precision to cluster, assigning all locations to a grid position, or simply clustering based on country/region/city depending on zoom level.
Re: slow response
Hi Elaine, You can page your resultset with the rows and start parameters (http://wiki.apache.org/solr/CommonQueryParameters). So for example to get the first 100 results one would use the parameters rows=100&start=0 and the second 100 results with rows=100&start=100, etc. Regards, gwk Elaine Li wrote: gwk, Sorry for the confusion. I am doing simple phrase search among the sentences, which could be in English or another language. Each doc has only several id numbers and the sentence itself. I did not know about paging. Sounds like it is what I need. How do I achieve paging with Solr? I also need to store all the results into my own tables in javascript to use for connecting with other applications. Elaine On Wed, Sep 9, 2009 at 10:37 AM, gwk wrote: Hi Elaine, I think you need to provide us with some more information on what exactly you are trying to achieve. From your question I also assumed you wanted paging (getting the first 10 results, then the next 10, etc.) But reading it again, "slice my docs into pieces", I now think you might've meant that you only want to retrieve certain fields from each document. For that you can use the fl parameter (http://wiki.apache.org/solr/CommonQueryParameters#head-db2785986af2355759faaaca53dc8fd0b012d1ab). Hope this helps. Regards, gwk Elaine Li wrote: I want to get the 10K results, not just the top 10. The fields are regular language sentences, they are not large. Is clustering the technique for what I am doing? On Wed, Sep 9, 2009 at 10:16 AM, Grant Ingersoll wrote: Do you need 10K results at a time or are you just getting the top 10 or so in a set of 10K? Also, are you retrieving really large stored fields? If you add &debugQuery=true to your request, Solr will return timing information for the various components. On Sep 9, 2009, at 10:10 AM, Elaine Li wrote: Hi, I have 20 million docs on solr. If my query would return more than 10,000 results, the response time will be very very long. How to resolve such problem?
Can I slice my docs into pieces and let the query operate within one piece at a time so the response time and response data will be more manageable? Thanks. Elaine -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
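The rows/start arithmetic from gwk's reply above, as a tiny helper (the page size is an assumption):

```python
# Sketch: translate a zero-based page number into Solr's rows/start
# parameters so a 10K result set can be fetched in manageable chunks.
def page_params(page, rows=100):
    return {"rows": rows, "start": page * rows}

print(page_params(0))  # {'rows': 100, 'start': 0}
print(page_params(1))  # {'rows': 100, 'start': 100}
```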
Re: slow response
Hi Elaine, I think you need to provide us with some more information on what exactly you are trying to achieve. From your question I also assumed you wanted paging (getting the first 10 results, then the next 10, etc.) But reading it again, "slice my docs into pieces", I now think you might've meant that you only want to retrieve certain fields from each document. For that you can use the fl parameter (http://wiki.apache.org/solr/CommonQueryParameters#head-db2785986af2355759faaaca53dc8fd0b012d1ab). Hope this helps. Regards, gwk Elaine Li wrote: I want to get the 10K results, not just the top 10. The fields are regular language sentences, they are not large. Is clustering the technique for what I am doing? On Wed, Sep 9, 2009 at 10:16 AM, Grant Ingersoll wrote: Do you need 10K results at a time or are you just getting the top 10 or so in a set of 10K? Also, are you retrieving really large stored fields? If you add &debugQuery=true to your request, Solr will return timing information for the various components. On Sep 9, 2009, at 10:10 AM, Elaine Li wrote: Hi, I have 20 million docs on solr. If my query would return more than 10,000 results, the response time will be very very long. How to resolve such problem? Can I slice my docs into pieces and let the query operate within one piece at a time so the response time and response data will be more manageable? Thanks. Elaine -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: Geographic clustering
Hi Joe, Thanks for the link, I'll check it out, I'm not sure it'll help in my situation though since the clustering should happen at runtime due to faceted browsing (unless I'm mistaken at what the preprocessing does). More on my progress though, I thought some more about using Hilbert curve mapping and it seems really suited for what I want. I've just added a Hilbert field to my schema (Trie Integer field) with latitude and longitude at 15 bits precision (didn't use 16 bits to avoid the sign bit) so I have a 30 bit number in said field. Getting facet counts for 0 to (2^30 - 1) should get me the entire map while getting counts for 0 to (2^28 - 1), 2^28 to (2^29 - 1), 2^29 to (2^29 + 2^28 - 1) and (2^29 + 2^28) to (2^30 - 1) should give me counts for four equal quadrants, all the way down to 0 to 3, 4 to 7, 8 to 11 ... (2^30 - 4 to 2^30 - 1) and of course faceting on every separate term. Of course if you're zoomed in far enough to need such fine-grained clustering you'll be looking at a small portion of the map and only a part of the whole range should be counted, but that should be doable by calculating the Hilbert number for the lower and upper bounds. The only problem is the location of the clusters: if I use this method I'll only have the Hilbert number and the number of items in that part of what is essentially a quadtree. But I suppose I can calculate the facet counts at one precision finer than the requested precision and use a weighted average of the four parts of the cluster; I'll have to see if that is accurate enough. Hopefully I'll have the time to complete this today or tomorrow. I'll report back if it has worked.
Regards, gwk Joe Calderon wrote: there are clustering libraries like http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/, that have bindings to perl/python, you can preprocess your results and create clusters for each zoom level On Tue, Sep 8, 2009 at 8:08 AM, gwk wrote: Hi, I just completed a simple proof-of-concept clusterer component which naively clusters with a specified bounding box around each position, similar to what the javascript MarkerClusterer does. It's currently very slow as I loop over the entire docset and request the longitude and latitude of each document (not to mention that my unfamiliarity with Lucene/Solr isn't helping the implementation's performance any; most code is copied from grep-ing the Solr source). Clustering a set of about 80.000 documents takes about 5-6 seconds. I'm currently looking into storing the Hilbert curve mapping in Solr and clustering using facet counts on numerical ranges of that mapping but I'm not sure it will pan out. Regards, gwk Grant Ingersoll wrote: Not directly related to geo clustering, but http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable interface to clustering implementations. It currently has Carrot2 implemented, but the APIs are marked as experimental. I would definitely be interested in hearing your experience with implementing your clustering algorithm in it. -Grant On Sep 8, 2009, at 4:00 AM, gwk wrote: Hi, I'm working on a search-on-map interface for our website. I've created a little proof of concept which uses the MarkerClusterer (http://code.google.com/p/gmaps-utility-library-dev/) which clusters the markers nicely. But because sending tens of thousands of markers over Ajax is not quite as fast as I would like it to be, I'd prefer to do the clustering on the server side. I've considered a few options like storing the morton-order and throwing away precision to cluster, assigning all locations to a grid position.
Or simply cluster based on country/region/city depending on zoom level by adding latitude and longitude fields for each zoom level (so that for smaller countries you have to be zoomed in further to get the next level of clustering). I was wondering if anybody else has worked on something similar and if so what their solutions are. Regards, gwk -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
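The quadrant arithmetic gwk describes in this thread (a 30-bit Hilbert index split into four equal sub-ranges per zoom level) can be sketched as:

```python
# Sketch: at depth d a 30-bit Hilbert index splits into 4**d equal
# cells, one quadtree level per depth step; each (lo, hi) pair is the
# inclusive facet range for one cell.
def cluster_ranges(depth, total_bits=30):
    width = 1 << (total_bits - 2 * depth)  # cell size at this depth
    return [(k * width, (k + 1) * width - 1) for k in range(4 ** depth)]

print(cluster_ranges(1))
# [(0, 268435455), (268435456, 536870911),
#  (536870912, 805306367), (805306368, 1073741823)]
```

At depth 1 these are exactly the four ranges 0 to 2^28 - 1, 2^28 to 2^29 - 1, 2^29 to 2^29 + 2^28 - 1 and 2^29 + 2^28 to 2^30 - 1 from the message above.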
Re: Geographic clustering
Hi, I just completed a simple proof-of-concept clusterer component which naively clusters with a specified bounding box around each position, similar to what the javascript MarkerClusterer does. It's currently very slow as I loop over the entire docset and request the longitude and latitude of each document (not to mention that my unfamiliarity with Lucene/Solr isn't helping the implementation's performance any; most code is copied from grep-ing the Solr source). Clustering a set of about 80.000 documents takes about 5-6 seconds. I'm currently looking into storing the Hilbert curve mapping in Solr and clustering using facet counts on numerical ranges of that mapping but I'm not sure it will pan out. Regards, gwk Grant Ingersoll wrote: Not directly related to geo clustering, but http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable interface to clustering implementations. It currently has Carrot2 implemented, but the APIs are marked as experimental. I would definitely be interested in hearing your experience with implementing your clustering algorithm in it. -Grant On Sep 8, 2009, at 4:00 AM, gwk wrote: Hi, I'm working on a search-on-map interface for our website. I've created a little proof of concept which uses the MarkerClusterer (http://code.google.com/p/gmaps-utility-library-dev/) which clusters the markers nicely. But because sending tens of thousands of markers over Ajax is not quite as fast as I would like it to be, I'd prefer to do the clustering on the server side. I've considered a few options like storing the morton-order and throwing away precision to cluster, assigning all locations to a grid position. Or simply cluster based on country/region/city depending on zoom level by adding latitude and longitude fields for each zoom level (so that for smaller countries you have to be zoomed in further to get the next level of clustering). I was wondering if anybody else has worked on something similar and if so what their solutions are.
Regards, gwk -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: LocalParams for faceting in nightly
Hi Gareth, Try removing the space between the closing bracket } and the field name; I think that should work. Regards, gwk gareth rushgrove wrote: Hi All, Hoping someone might be able to help me with a problem. I downloaded and got up and running with the latest nightly release of Solr (http://people.apache.org/builds/lucene/solr/nightly/solr-2009-09-08.zip) in order to try out the tagging and excluding filters, which have a note saying they are only available in 1.4: http://wiki.apache.org/solr/SimpleFacetParameters#head-4ba81c89b265c3b5992e3292718a0d100f7251ef I have a working index that I can query against, for instance the following returns what I would expect: http://172.16.142.130:8983/solr/products/select/?q=material:metal&fq={!tag=cl}colour:Red&start=24&rows=25&indent=on&wt=json&facet=on&facet.sort=false&facet.field=colour&facet.field=material&sort=popularity%20desc However, once I add the {!ex part it throws an exception: http://172.16.142.130:8983/solr/products/select/?q=material:metal&fq={!tag=colour}colour:Red&start=24&rows=25&indent=on&wt=json&facet=on&facet.sort=false&facet.field=colour&facet.field={!ex=colour}%20material&sort=popularity%20desc specifically "exception":"org.apache.solr.common.SolrException: undefined field {!ex=colour} material\n\tat The schema I'm using was copied from a working Solr 1.3 install and, as mentioned, works great with 1.4, except for this issue I'm having. So: * Do I have to enable this feature somewhere? * Is the feature working in the latest release? * Is my syntax correct? * Do you have to define the tag name somewhere other than in the query? Any help much appreciated. Thanks Gareth
Geographic clustering
Hi, I'm working on a search-on-map interface for our website. I've created a little proof of concept which uses the MarkerClusterer (http://code.google.com/p/gmaps-utility-library-dev/) which clusters the markers nicely. But because sending tens of thousands of markers over Ajax is not quite as fast as I would like it to be, I'd prefer to do the clustering on the server side. I've considered a few options like storing the morton-order and throwing away precision to cluster, assigning all locations to a grid position. Or simply cluster based on country/region/city depending on zoom level by adding latitude and longitude fields for each zoom level (so that for smaller countries you have to be zoomed in further to get the next level of clustering). I was wondering if anybody else has worked on something similar and if so what their solutions are. Regards, gwk
Re: A very complex search problem.
Hello Rajan, I might be mistaken, but isn't CouchDB or a similar map/reduce database ideal for situations like this? Regards, gwk rajan chandi wrote: Hi All, We are dealing with a very complex problem of person-specific search. We're building a social network where people will post stuff and other users should be able to see content only from their contacts. E.g. there are 10,000 users in the system and only 150 users in my network; I should search across only those 150 users' content. Is there an easy way to approach this problem? We've come up with different approaches: - Storing the relationship in each document. - A huge ORed query with all the IDs of the people that need to be searched. - Creating a query and filtering the results based on the list of contacts. None of these approaches seems plausible. We have already gone through the recently released book on Solr 1.4 Enterprise Search; it doesn't seem to have any pointers either. Any good approaches/pointers will help. Thanks and regards Rajan Chandi
Re: SOLR vs SQL
Fuad Efendi wrote: "No results found for 'surface area 377', displaying all properties." - why do we need SOLR then... Hi Fuad, The search box is only used for geographical search, i.e. country/region/city searches. The watermark on the homepage indicates this but the "search again" box on the search results page does not; I'll see if we can fix that. We use Solr not so much for the search box, which to be honest was an afterthought, but for faceting. Honestly, the thought of writing an SQL query which calculates all these facet counts every time a search parameter changes gives me a headache; I don't think it's possible to do in one query (or maybe it is, but I don't think anybody would want to maintain it). As for performance, every nontrivial database/search engine is affected by dataset size for all but the simplest queries, and in my tests Solr trumps MySQL by a huge margin for our use case. We use a database to store our data in a somewhat normalized way, which is good for data consistency but not so good for retrieval speed. This is what makes Solr so useful for us: we can index everything in denormalized form, with all data for a property in one record, while the (SQL) database remains authoritative. Full-text search is only one part of Solr; while an important part, it isn't the only reason for using Solr. In our case, since we support multiple languages, we try to store not textual descriptions but every facet a property can have. This gives us exactly the data needed to perform faceting, but not so much for full-text search (which is used, mind you, to find suggestions when you use the search box). Regards, gwk
Re: Date Faceting and Double Counting
Chris Hostetter wrote: : When I added numerical faceting to my checkout of solr (solr-1240) I basically : copied date faceting and modified it to work with numbers instead of dates. : With numbers I got a lot of double-counted values as well. So to fix my : problem I added an extra parameter to number faceting where you can specify if : either end of each range should be inclusive or exclusive. I just ported it gwk: 1) would you mind opening a Jira issue for your date faceting improvements as well (email attachments tend to get lost, and there are legal headaches with committing them that Jira solves by asking you explicitly if you license them to the ASF) Sure, I've added it to Jira: https://issues.apache.org/jira/browse/SOLR-1402. 2) I haven't looked at your patch, but one of the reasons I never implemented an option like this with date faceting is that the query parser doesn't have any way of letting you write a query that is inclusive on one end and exclusive on the other end -- so you might get accurate facet counts for ranges A-B and B-C (inclusive of the lower, exclusive of the upper), but if you try to filter by one of those ranges, your counts will be off. Did you find a nice solution for this? I ran into that problem as well, but the solution was provided to me by this very list :) See http://www.nabble.com/Range-queries-td24057317.html It's not the cleanest solution, but as long as you know what you're doing it's not that bad. The reason I created 1240 was exactly because my counts were off; with date faceting exact matches are a rarity, or at least you can make them one. But since with numbers (in my case, prices) being off by 1 cent is not acceptable, I needed this exclusivity. The only real reason for all of this was the geek candy that is the price slider on our website: the counts are sent via Ajax and the range slider can simply sum the counts for the selected range to get the exact count for that range without having to query Solr for more data.
Regards, gwk
Re: Date Faceting and Double Counting
Hi Stephen, When I added numerical faceting to my checkout of Solr (SOLR-1240) I basically copied date faceting and modified it to work with numbers instead of dates. With numbers I got a lot of double-counted values as well. So to fix my problem I added an extra parameter to number faceting where you can specify whether either end of each range should be inclusive or exclusive. I just ported it back to date faceting (disclaimer: completely untested) and it should be attached to my post. The following parameter is added: facet.date.exclusive. Valid values for the parameter are: start, end, both and neither. To maintain compatibility with Solr without the patch, the default is neither. I hope the meaning of the values is self-explanatory. Regards, gwk

Stephen Duncan Jr wrote: If we do date faceting and start at 2009-01-01T00:00:00Z, end at 2009-01-03T00:00:00Z, with a gap of +1DAY, then documents that occur at exactly 2009-01-02T00:00:00Z will be included in both the returned counts (2009-01-01T00:00:00Z and 2009-01-02T00:00:00Z). At the moment, this is quite bad for us, as we only index at the day level, so all of our documents are exactly on the line between each facet range. Because we know our data is indexed as being exactly at midnight each day, I think we can simply always start from 1 second prior and get the results we want (start=2008-12-31T23:59:59Z, end=2009-01-02T23:59:59Z), but I think this problem would affect everyone, even if usually more subtly (instead of all documents being counted twice, only a few on the fencepost between ranges). Is this a known behavior people are happy with, or should I file an issue asking for ranges in date facets to be constructed to subtract one second from the end of each range (so that the effective range queries for my case would be: [2009-01-01T00:00:00Z TO 2009-01-01T23:59:59Z] & [2009-01-02T00:00:00Z TO 2009-01-02T23:59:59Z])? Alternatively, is there some other suggested way of using date faceting to avoid this problem?
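A request using the new parameter might look like the following sketch (the facet field and dates are hypothetical; the parameter name and its valid values are the ones the patch defines):

```python
from urllib.parse import urlencode

# Build a date-faceting request where each range is inclusive at its
# start and exclusive at its end, so fencepost documents count once.
params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.date", "publication_date"),
    ("f.publication_date.facet.date.start", "2009-01-01T00:00:00Z"),
    ("f.publication_date.facet.date.end", "2009-01-03T00:00:00Z"),
    ("f.publication_date.facet.date.gap", "+1DAY"),
    # Parameter added by the patch: start, end, both or neither
    ("f.publication_date.facet.date.exclusive", "end"),
]
query_string = urlencode(params)
```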
Index: src/java/org/apache/solr/request/SimpleFacets.java
===================================================================
--- src/java/org/apache/solr/request/SimpleFacets.java	(revision 809880)
+++ src/java/org/apache/solr/request/SimpleFacets.java	(working copy)
@@ -29,6 +29,7 @@
 import org.apache.solr.common.params.SolrParams;
 import org.apache.solr.common.params.CommonParams;
 import org.apache.solr.common.params.FacetParams.FacetDateOther;
+import org.apache.solr.common.params.FacetParams.FacetDateExclusive;
 import org.apache.solr.common.util.NamedList;
 import org.apache.solr.common.util.SimpleOrderedMap;
 import org.apache.solr.common.util.StrUtils;
@@ -586,6 +587,32 @@
             "date facet 'end' comes before 'start': "+endS+" < "+startS);
       }
 
+      boolean startInclusive = true;
+      boolean endInclusive = true;
+      final String[] exclusiveP =
+        params.getFieldParams(f,FacetParams.FACET_DATE_EXCLUSIVE);
+      if (null != exclusiveP && 0 < exclusiveP.length) {
+        Set<FacetDateExclusive> exclusives
+          = EnumSet.noneOf(FacetDateExclusive.class);
+
+        for (final String e : exclusiveP) {
+          exclusives.add(FacetDateExclusive.get(e));
+        }
+
+        if(! exclusives.contains(FacetDateExclusive.NEITHER) ) {
+          boolean both = exclusives.contains(FacetDateExclusive.BOTH);
+
+          if(both || exclusives.contains(FacetDateExclusive.START)) {
+            startInclusive = false;
+          }
+
+          if(both || exclusives.contains(FacetDateExclusive.END)) {
+            endInclusive = false;
+          }
+        }
+      }
+
       final String gap = required.getFieldParam(f,FacetParams.FACET_DATE_GAP);
       final DateMathParser dmp = new DateMathParser(ft.UTC, Locale.US);
       dmp.setNow(NOW);
@@ -610,7 +637,7 @@
               (SolrException.ErrorCode.BAD_REQUEST,
                "date facet infinite loop (is gap negative?)");
           }
-          resInner.add(label, rangeCount(sf,low,high,true,true));
+          resInner.add(label, rangeCount(sf,low,high,startInclusive,endInclusive));
           low = high;
         }
       } catch (java.text.ParseException e) {
@@ -639,15 +666,15 @@
         if (all || others.contains(FacetDateOther.BEFORE)) {
           resInner.add(FacetDateOther.BEFORE.toString(),
-                       rangeCount(sf,null,start,false,false));
+                       rangeCount(sf,null,start,false,!startInclusive));
         }
         if (all || others.contains(FacetDateOther.AFTER)) {
           resInner.add(FacetDateOther.AFTER.toString(),
-                       rangeCount(sf,end,null,false,false));
+                       rangeCount(sf,end,null,!endInclusive,false));
         }
         if (all || others.contains(FacetDateOther.BETWEEN)) {
           resInner.add(Fac
Re: Thanks
Dave Searle wrote: Hi Gwk, It's a nice clean site, easy to use and seems very fast, well done! How well does it do in regards to SEO though? I noticed there's a lot of ajax going on in the background to help speed things up for the user (love the sliders), but seems to be lacking structure for the search engines. I'm not sure if this is your intention or not, but you could massively increase the number of pages the crawlers see by extending your url rewrites to be a bit more static Hi Dave, Thanks for the reply, actually, we did think about SEO, turn off javascript in your browser and you'll see the site still works (at least, it's supposed to). We've added all AJAXy-interaction after we implemented the functionality to work without Javascript. So you'll get no nice fancy sliders but two drop-downs to select a range. Regards, gwk
Thanks
Hello, Earlier this year our company decided to (finally :)) upgrade our website to something a little faster/prettier/maintainable-er. After some research we decided on using Solr, and after indexing our data for the first time and trying some manual queries we were all amazed at the speed. This summer we started developing the new site, and today we've gone live. You can see the site running at http://www.mysecondhome.eu (I don't mean to advertise, so feel free not to buy a house). I'd like to thank the people here for their help with lifting me from Solr-ignorance to Solr-seems-to-know-a-little-bit. We're running a nightly build of Solr 1.4 with SOLR-1240 applied for the dynamic facet count updates when using the sliders in the search screen. Again, thank you, and if you have any suggestions or questions regarding our implementation, feel free to ask. Regards, gwk
Re: debugQuery=true issue
Hi, Thanks for your response. I'm still developing, so the schema is still in flux; I guess that explains it. Oh, and regarding the NPE: I updated my checkout and recompiled and now it's gone, so I guess somewhere between revision 787997 and 798482 it was already fixed. Regards, gwk Robert Petersen wrote: I had something similar happen where optimize fixed an odd sorting/scoring problem. As I understand it, the optimize will clear out index 'lint' from old schemas/documents and thus could affect result scores, since all the term vectors or something similar are refreshed etc etc
Re: debugQuery=true issue
Hi, Hoping this was completely my fault, I changed my Solr to a nightly build from June (I run Solr patched with SOLR-1240), but the same problems occur. After reindexing a single always_on_top document it suddenly appeared far down the resultset with a score around 5.311 (where it would be if always_on_top were not true), yet the debugQuery output shows the score for that one item to be 10.28 while the rest of the documents score from 5.305 to 5.315. Restarting Solr or reindexing the document again seemed to have no effect, but as a last resort I tried optimize, which did work. I may have misunderstood the purpose of optimize, but that shouldn't have any effect on scoring, should it? For what it's worth, I'm using dismax with the function query in bf. Regards, gwk

Oops, it seems it's due to an fq in the same query; there's a range query on price: fq=price:({0 TO *} OR 0). Removing this filter makes debugQuery work, however strange things happen. I took my original query, took the first result and the last result, and performing the query (on unique id) without the fq and debugQuery=true yields: 10.288208 195500.0 2009-06-12T12:07:11Z true 695658 5.1031165 68.0 true 147563 while the debug part of the response contains:

10.287015 = (MATCH) sum of:
  0.09950372 = (MATCH) MatchAllDocsQuery, product of:
    0.09950372 = queryNorm
  10.187511 = (MATCH) FunctionQuery(sum(product(ord(homepage_teaser),const(5.0)),1000.0/(1.0*float(top(rord(first_publication_date)))+1000.0))), product of:
    10.238322 = sum(product(ord(homepage_teaser)=2,const(5.0)),1000.0/(1.0*float(rord(first_publication_date)=3196)+1000.0))
    10.0 = boost
    0.09950372 = queryNorm
10.078215 = (MATCH) sum of:
  0.09950372 = (MATCH) MatchAllDocsQuery, product of:
    0.09950372 = queryNorm
  9.978711 = (MATCH) FunctionQuery(sum(product(ord(homepage_teaser),const(5.0)),1000.0/(1.0*float(top(rord(first_publication_date)))+1000.0))), product of:
    10.028481 = sum(product(ord(homepage_teaser)=2,const(5.0)),1000.0/(1.0*float(rord(first_publication_date)=34112)+1000.0))
    10.0 = boost
    0.09950372 = queryNorm

So the score in the response doesn't match the score in debugQuery's output. Does this have something to do with SOLR-947? I'm currently using Solr 1.4 trunk (revision 787997, which is about a month old iirc). Regards, gwk
Re: debugQuery=true issue
Grant Ingersoll wrote: What's the line number that is giving the NPE? Can you paste in a stack trace?

Here it is:

java.lang.NullPointerException: value cannot be null
java.lang.RuntimeException: java.lang.NullPointerException: value cannot be null
	at org.apache.solr.search.QueryParsing.toString(QueryParsing.java:469)
	at org.apache.solr.handler.component.DebugComponent.process(DebugComponent.java:75)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:203)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1290)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
	at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
	at org.mortbay.jetty.Server.handle(Server.java:324)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Caused by: java.lang.NullPointerException: value cannot be null
	at org.apache.lucene.document.Field.<init>(Field.java:323)
	at org.apache.lucene.document.Field.<init>(Field.java:298)
	at org.apache.lucene.document.Field.<init>(Field.java:277)
	at org.apache.solr.search.QueryParsing.writeFieldVal(QueryParsing.java:306)
	at org.apache.solr.search.QueryParsing.toString(QueryParsing.java:360)
	at org.apache.solr.search.QueryParsing.toString(QueryParsing.java:401)
	at org.apache.solr.search.QueryParsing.toString(QueryParsing.java:466)
	... 23 more

-Grant

On Jul 27, 2009, at 10:59 AM, gwk wrote: gwk wrote: Hi, I'm playing around with sorting via function queries, and I've set _val_ to the following: sum(product(always_on_top,5),recip(rord(publication_date),1,1000,1000)) where the field always_on_top is a simple boolean field; documents with always_on_top:true should always be on top. I ran into a problem where one of the documents with always_on_top = true was all the way at the bottom instead of on top. So I extracted the query out of my system, copied it to my browser and added &debugQuery=true, which gave a NullPointerException. After some searching I found out the document in question had no publication_date field set (which is totally my fault), however it took quite a while to discover this since I couldn't turn on debugQuery. Is this a bug or expected behaviour?
Regards, gwk
Re: debugQuery=true issue
gwk wrote: Hi, I'm playing around with sorting via function queries, and I've set _val_ to the following: sum(product(always_on_top,5),recip(rord(publication_date),1,1000,1000)) where the field always_on_top is a simple boolean field; documents with always_on_top:true should always be on top. I ran into a problem where one of the documents with always_on_top = true was all the way at the bottom instead of on top. So I extracted the query out of my system, copied it to my browser and added &debugQuery=true, which gave a NullPointerException. After some searching I found out the document in question had no publication_date field set (which is totally my fault), however it took quite a while to discover this since I couldn't turn on debugQuery. Is this a bug or expected behaviour? Regards, gwk

Oops, it seems it's due to an fq in the same query; there's a range query on price: fq=price:({0 TO *} OR 0). Removing this filter makes debugQuery work, however strange things happen. I took my original query, took the first result and the last result, and performing the query (on unique id) without the fq and debugQuery=true yields: 10.288208 195500.0 2009-06-12T12:07:11Z true 695658 5.1031165 68.0 true 147563 while the debug part of the response contains:

10.287015 = (MATCH) sum of:
  0.09950372 = (MATCH) MatchAllDocsQuery, product of:
    0.09950372 = queryNorm
  10.187511 = (MATCH) FunctionQuery(sum(product(ord(homepage_teaser),const(5.0)),1000.0/(1.0*float(top(rord(first_publication_date)))+1000.0))), product of:
    10.238322 = sum(product(ord(homepage_teaser)=2,const(5.0)),1000.0/(1.0*float(rord(first_publication_date)=3196)+1000.0))
    10.0 = boost
    0.09950372 = queryNorm
10.078215 = (MATCH) sum of:
  0.09950372 = (MATCH) MatchAllDocsQuery, product of:
    0.09950372 = queryNorm
  9.978711 = (MATCH) FunctionQuery(sum(product(ord(homepage_teaser),const(5.0)),1000.0/(1.0*float(top(rord(first_publication_date)))+1000.0))), product of:
    10.028481 = sum(product(ord(homepage_teaser)=2,const(5.0)),1000.0/(1.0*float(rord(first_publication_date)=34112)+1000.0))
    10.0 = boost
    0.09950372 = queryNorm

So the score in the response doesn't match the score in debugQuery's output. Does this have something to do with SOLR-947? I'm currently using Solr 1.4 trunk (revision 787997, which is about a month old iirc). Regards, gwk
debugQuery=true issue
Hi, I'm playing around with sorting via function queries, and I've set _val_ to the following: sum(product(always_on_top,5),recip(rord(publication_date),1,1000,1000)) where the field always_on_top is a simple boolean field; documents with always_on_top:true should always be on top. I ran into a problem where one of the documents with always_on_top = true was all the way at the bottom instead of on top. So I extracted the query out of my system, copied it to my browser and added &debugQuery=true, which gave a NullPointerException. After some searching I found out the document in question had no publication_date field set (which is totally my fault), however it took quite a while to discover this since I couldn't turn on debugQuery. Is this a bug or expected behaviour? Regards, gwk
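The intended scoring can be checked with a quick sketch of the arithmetic the function query is meant to perform. Solr's recip(x,m,a,b) computes a/(m*x+b); treating rord(publication_date) as a reverse ordinal (1 = newest) and always_on_top as 0 or 1 (the ordinal values below are made up):

```python
def recip(x, m, a, b):
    # Solr's recip(x,m,a,b) = a / (m*x + b)
    return a / (m * x + b)

def score(always_on_top, reverse_ordinal):
    """Sketch of sum(product(always_on_top,5),
    recip(rord(publication_date),1,1000,1000)): the pin flag adds 5.0,
    while the recency term stays in (0, 1], so any pinned document
    should outrank any unpinned one."""
    return always_on_top * 5 + recip(reverse_ordinal, 1, 1000, 1000)

pinned_old = score(1, 5000)   # always_on_top, but old
fresh      = score(0, 1)      # newest unpinned document
```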
Re: Faceting
Well, I had a bit of a facepalm moment when thinking about it a little more; I'll just show a "more countries [Y selected]" link where Y is the number of selected countries which are not in the top X. If you want a nice concise interface you'll just have to enable javascript. With my earlier adventures in numerical range selection (SOLR-1240) I became wary of just adding facet.query parameters, as Solr seemed to crash when adding a lot of facet.queries of the form facet.query=price:[* TO 10]&facet.query=price:[10 TO 20] etc. Thanks for your help, Regards, Gijs Shalin Shekhar Mangar wrote: On Mon, Jul 13, 2009 at 7:56 PM, gwk wrote: Is there a good way to select the top X facets and also include some terms you want, something like facet.field=country&f.country.facet.limit=X&f.country.facet.includeterms=Narnia,Guilder, or is there some other way to achieve this? You can use facet.query for each of the terms you want to include. You may need to remove such terms from appearing in the facet.field=country results in the client. e.g. facet.field=country&f.country.facet.limit=X&facet.query=country:Narnia&facet.query=country:Guilder
Faceting
Hi, I'm in the process of making a javascriptless web interface to Solr (the nice ajax-version will be built on top of it unobtrusively). Our database has a lot of fields and so I've grouped those with similar characteristics to make several different 'widgets' (like a numerical type which get a min-max selector or an enumerated type with checkboxes) but I've run into a slight problem with fields which contain a lot of terms. One of those fields is country, what I'd like to do is display the top X countries, which is easily done with facet.field=country&f.country.facet.limit=X and display a more link which will redirect to a new page with all countries (and other query parameters in hidden fields) which posts back to the search page. All this is no problem, but once a person has selected some countries which are not in the top X (say 'Narnia' and 'Guilder') I want to list that country below the X top countries with a checked checkbox. Is there a good way to select the top X facets and include some terms you want to include as well something like facet.field=country&f.country.facet.limit=X&f.country.facet.includeterms=Narnia,Guilder or is there some other way to achieve this? Regards, Gijs Kunze
Re: Numerical range faceting
Shalin Shekhar Mangar wrote: On Tue, Jun 23, 2009 at 4:55 PM, gwk wrote: I was wondering if someone is interested in a patch file and if so, where should I post it? This seems useful. Please open an issue and submit a patch. I'm sure there will be interest. Hi, I cleaned up the code a bit, added some javadoc (I hope I did it correctly) and created a ticket: http://issues.apache.org/jira/browse/SOLR-1240 Regards, gwk
Re: Numerical range faceting
gwk wrote: Hi, I'm currently using facet.query to do my numerical range faceting. I basically use a fixed price range of €0 to €10,000 in steps of €500, which means 20 facet.queries plus an extra facet.query for anything above €10,000. I use the inclusive/exclusive query as per my question two days ago, so the facets add up to the total number of products. This is done so that the javascript on my search page can accurately show the number of products returned for a specified range before submitting it to the server, by adding up the facet counts for the selected range. I'm a bit concerned about the amount and size of my requests to the server, especially because there are other numerical values which might be interesting to facet on, and I've noticed the server won't respond correctly if I add (many) more facet.queries by decreasing the step size. I was really hoping for faceting options for numerical ranges similar to the date faceting options. The functionality would be practically identical as far as I can tell (which isn't very far, as I know very little about the internals of Solr), so I was wondering if such options are planned or if I'm overlooking something. Regards, gwk

Hello, Well, since I got no response, I flexed my severely atrophied Java muscles (last time I used the language, Swing was new) and dove straight into the Solr code. Well, not really; mostly I did some copy-pasting, and with some assistance from the API reference I was able to add numerical faceting on sortable numerical fields (it seems to work for both integers and floating point numbers) with a similar syntax to the date faceting. I also added an extra parameter for whether the ranges should be inclusive or exclusive (on either end). And it seems to work, although the quality of my code is not of the same grade as the rest of the Solr code (I was amazed how easy it was for me to add this feature). I was wondering if someone is interested in a patch file and if so, where should I post it?
Regards, gwk

As an example, the following query:

http://localhost:8080/select/?q=*%3A*&echoParams=none&rows=0&indent=on&facet=true&facet.number=price&f.price.facet.number.start=0&f.price.facet.number.end=100&f.price.facet.number.gap=1&f.price.facet.number.other=all&f.price.facet.number.exclusive=end

yields the following results: 0 3 1820 2697 2588 2622 2459 2455 2597 2530 2518 2389 18 54 19 23 43 67 1.0 100.0 0 2733 60974
Numerical range faceting
Hi, I'm currently using facet.query to do my numerical range faceting. I basically use a fixed price range of €0 to €10,000 in steps of €500, which means 20 facet.queries plus an extra facet.query for anything above €10,000. I use the inclusive/exclusive query as per my question two days ago, so the facets add up to the total number of products. This is done so that the javascript on my search page can accurately show the number of products returned for a specified range before submitting it to the server, by adding up the facet counts for the selected range. I'm a bit concerned about the amount and size of my requests to the server, especially because there are other numerical values which might be interesting to facet on, and I've noticed the server won't respond correctly if I add (many) more facet.queries by decreasing the step size. I was really hoping for faceting options for numerical ranges similar to the date faceting options. The functionality would be practically identical as far as I can tell (which isn't very far, as I know very little about the internals of Solr), so I was wondering if such options are planned or if I'm overlooking something. Regards, gwk
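The setup described above can be generated programmatically; a sketch (field name and step size are from the message, the helper itself is made up):

```python
def price_facet_queries(step=500, top=10000):
    """One inclusive-start/exclusive-end facet.query per step-sized
    bucket, plus a final open-ended bucket for everything from top up.
    The {lo TO hi} OR lo form makes the lower bound inclusive while the
    upper bound stays exclusive, so bucket counts sum to the total."""
    queries = [
        "price:({%d TO %d} OR %d)" % (lo, lo + step, lo)
        for lo in range(0, top, step)
    ]
    queries.append("price:[%d TO *]" % top)
    return queries

qs = price_facet_queries()  # 20 bucket queries + 1 open-ended query
```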
Re: Range queries
Yes, this works perfectly, guess the "Never use equality comparison for floating point numbers"-rule was so strong in my mind I didn't even think to consider this possibility. Thanks, gwk Avlesh Singh wrote: Really sorry, this is what I meant: x:{5 TO 8} OR x:5 Cheers Avlesh On Wed, Jun 17, 2009 at 9:36 AM, Avlesh Singh wrote: And how about this - x:{5 TO 8} AND x:5 Cheers Avlesh On Wed, Jun 17, 2009 at 1:57 AM, Peter Keegan wrote: How about this: x:[5 TO 8] AND x:{0 TO 8} On Tue, Jun 16, 2009 at 1:16 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: Hi, I think the square brackets/curly braces need to be balanced, so this is currently not doable with existing query parsers. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message ---- From: gwk To: solr-user@lucene.apache.org Sent: Tuesday, June 16, 2009 11:52:12 AM Subject: Range queries Hi, When doing range queries it seems the query is either x:[5 TO 8] which means 5 <= x <= 8 or x:{5 TO 8} which means 5 < x < 8. But how do you get one half exclusive, the other inclusive for double fields the following: 5 <= x < 8? Is this possible? Regards, gwk
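The workaround can be expressed as a tiny helper; a sketch (the helper name is made up, the query form is the one from the thread):

```python
def half_open_range(field, lo, hi):
    """Build a Solr query for lo <= field < hi: the exclusive range
    {lo TO hi} is OR-ed with an exact match on lo, making the lower
    bound inclusive while the upper bound stays exclusive."""
    return "%s:({%s TO %s} OR %s)" % (field, lo, hi, lo)

q = half_open_range("x", 5, 8)
```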
Range queries
Hi, When doing range queries it seems the query is either x:[5 TO 8] which means 5 <= x <= 8 or x:{5 TO 8} which means 5 < x < 8. But how do you get one half exclusive, the other inclusive for double fields the following: 5 <= x < 8? Is this possible? Regards, gwk
Re: How to combine facets count from multiple query into one query
Hi, Not sure if this is what you want, but would this do what you need? fq={!tag=p1}publisher_name:publisher1&fq={!tag=p2}publisher_name:publisher2&q=abstract:philosophy&facet=true&facet.mincount=1&facet.field={!ex=p1 key=p2_book_title}book_title&facet.field={!ex=p2 key=p1_book_title}book_title

or separated by newlines instead of & for readability:

fq={!tag=p1}publisher_name:publisher1
fq={!tag=p2}publisher_name:publisher2
q=abstract:philosophy
facet=true
facet.mincount=1
facet.field={!ex=p1 key=p2_book_title}book_title
facet.field={!ex=p2 key=p1_book_title}book_title

Of course, this uses a 1.4 feature (tagging and excluding). Regards, gwk Jeffrey Tiong wrote: Hi, I have a schema that has the following fields: publisher_name, book_title, year, abstract. Currently if I do a facet count when I have a query "q=abstract:philosophy AND publisher_name:publisher1", it can give me results like below, abstract:philosophy AND publisher_name:publisher1 70 60 20 78 62 19 Likewise for "q=abstract:philosophy AND publisher_name:publisher2" - abstract:philosophy AND publisher_name:publisher2 3 1 1 3 1 1 However I have to do the queries separately and get the facet count for each of them separately. Is there a way for me to combine all these into one query and get the facet count for each of them in one query? Because sometimes it may go up to 20 queries in order to get all the separate counts. Thanks! Jef
Re: Distributed Search
Otis Gospodnetic wrote: Yes, that's the standard trick. :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: gwk To: solr-user@lucene.apache.org Sent: Wednesday, February 25, 2009 5:18:47 AM Subject: Re: Distributed Search Koji Sekiguchi wrote: gwk wrote: Hello, The wiki states 'When duplicate doc IDs are received, Solr chooses the first doc and discards subsequent ones', I was wondering whether "the first doc" is the doc of the shard which responds first or the doc in the first shard in the shards GET parameter? Regards, gwk It is the doc of the shard which responds first, if my memory is correct... Koji Ok, so it wouldn't be possible to have a smaller, faster authoritative shard for near-real-time updates while keeping the entire dataset in a second shard which is updated less frequently? Regards, gwk

Ok, now I'm confused: if the shard the document comes from is non-deterministic, how can you use this 'trick'? (Except that since the response time of the first shard, which is smaller, is usually better, it would work most of the time (BAD!).) Or was Koji's memory incorrect, and is the shard listed first always the authoritative one when duplicate keys are encountered? Regards, gwk
Re: Distributed Search
Koji Sekiguchi wrote: gwk wrote: Hello, The wiki states 'When duplicate doc IDs are received, Solr chooses the first doc and discards subsequent ones', I was wondering whether "the first doc" is the doc of the shard which responds first or the doc in the first shard in the shards GET parameter? Regards, gwk It is the doc of the shard which responds first, if my memory is correct... Koji Ok, so it wouldn't be possible to have a smaller, faster authoritative shard for near-real-time updates while keeping the entire dataset in a second shard which is updated less frequently? Regards, gwk
Distributed Search
Hello, The wiki states 'When duplicate doc IDs are received, Solr chooses the first doc and discards subsequent ones', I was wondering whether "the first doc" is the doc of the shard which responds first or the doc in the first shard in the shards GET parameter? Regards, gwk
Facet Paging
Hi, The faceting parameters include an option for paging through a large number of facet values. But to build proper paging it would be helpful if the response contained the total number of facet values (i.e. the count you would get with facet.limit set to a negative value), similar to an ordinary query response's numFound attribute, so you can determine how many pages there should be. Is it possible to request this information somehow in the same response, and if so, how much does it impact performance? Regards, gwk
Re: DataImportHandler: UTF-8 and Mysql
Shalin Shekhar Mangar wrote: On Mon, Jan 12, 2009 at 3:48 PM, gwk wrote: 1. Posting UTF-8 data through the example post script works and I get the proper results back when I query using the admin page. However, for data imported through the DataImportHandler from a MySQL database (the database contains correct data; it's a copy of a production db and selecting through the client gives the correct characters) I get "Ã³" instead of "ó". I've tried several combinations of arguments to my datasource url (useUnicode=true&characterEncoding=UTF-8) but it does not seem to help. How do I get this to work correctly?

DataImportHandler does not change any encoding. It receives a Java string object from the driver and adds it to Solr. So I'm guessing the problem is in the database or in the driver. Did you create the tables with UTF-8 encoding? Try looking in the MySQL driver configuration parameters to force UTF-8. Sorry, I can't be of much help here.

I checked again and you were right: while the columns contained UTF-8-encoded strings, the actual encoding of the columns was set to latin1. I've fixed the database and now it's working correctly.

2. On the wiki page for DataImportHandler, the deletedPkQuery has no real description; am I correct in assuming it should contain a query which returns the ids of items which should be removed from the index?

Yes, you are right. It should return the primary keys of the rows to be deleted.

3. Another question concerning the DataImportHandler wiki page: I'm not sure about the exact way the field tag works. From the first data-config.xml example for the full-import I can infer that the "column" attribute represents the column from the SQL query and the "name" attribute represents the name of the field in the schema the column should map to. However, further on in the RegexTransformer section there are column attributes which do not correspond to the SQL query's result set, and it's the "sourceColName" attribute which actually represents that data. It comes from the RegexTransformer, I understand, but why then is the "column" attribute used instead of the "name" attribute? This has confused me somewhat; any clarification would be greatly appreciated.

DataImportHandler reads by "column" from the resultset and writes by "name" to Solr (or, if name is unspecified, by "column"). So "column" is compulsory but "name" is optional. The typical use-case for a RegexTransformer is when you want to read a field (say "a"), process it (save it as "b") and then add it to Solr (by name "c"). So you read by "sourceColName", process and save it as "column", and write to Solr as "name". If "name" is unspecified, it will be written to Solr as "column". The reason we use "column" and not "name" is because the user may want to do something more with it, for example use that field in a template and save that template to Solr. I know it is a bit confusing but it helps us to keep DIH general enough. Hope that helps.

Ok, that explains it for me, thanks for the clarification.
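A minimal data-config sketch of that read/process/write chain (every table, column, and field name here is hypothetical; only the column/sourceColName/name attributes themselves come from the wiki):

```xml
<entity name="book" transformer="RegexTransformer"
        query="SELECT id, full_title FROM books">
  <!-- read resultset column "full_title" (sourceColName), apply the
       regex and save the capture as column "subtitle", then write it
       to the Solr field "subtitle_s" (name) -->
  <field column="subtitle" sourceColName="full_title"
         regex=".*: (.*)" name="subtitle_s"/>
</entity>
```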
Re: Index is not created if my database table is large
Hi,

I'm not sure that this is the same issue, but I had a similar problem importing a large table from MySQL. On the DataImportHandler FAQ (http://wiki.apache.org/solr/DataImportHandlerFaq) the first issue mentions memory problems. Try adding the batchSize="-1" attribute to your datasource; it fixed the problem for me.

Regards,

gwk
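For reference, the attribute goes on the dataSource element in data-config.xml (the driver class, URL, and credentials below are placeholders). With the MySQL Connector/J driver, batchSize="-1" makes DIH request a streaming result set rather than buffering the whole table in memory:

```xml
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/mydb"
            user="solr" password="********"
            batchSize="-1"/>
```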
DataImportHandler: UTF-8 and Mysql
Hello,

First of all, thanks to Jacob Singh for his reply to my mail last week; I completely forgot to reply. Multicore is perfect for my needs. I've got Solr running now with my new schema partially implemented, and I've started to test importing data with DIH. I've run into a number of issues, though, and I hope someone here can help:

1. Posting UTF-8 data through the example post-script works, and I get the proper results back when I query using the admin page. However, for data imported through the DataImportHandler from a MySQL database (the database contains correct data; it's a copy of a production db, and selecting through the client gives the correct characters) I get "ó" instead of "ó". I've tried several combinations of arguments to my datasource url (useUnicode=true&characterEncoding=UTF-8) but it does not seem to help. How do I get this to work correctly?

2. On the wiki page for DataImportHandler, the deletedPkQuery has no real description. Am I correct in assuming it should contain a query which returns the ids of items that should be removed from the index?

3. Another question concerning the DataImportHandler wiki page: I'm not sure about the exact way the field tag works. From the first data-config.xml example for the full-import I can infer that the "column" attribute represents the column from the SQL query and the "name" attribute represents the name of the field in the schema the column should map to. However, further on in the RegexTransformer section there are "column" attributes which do not correspond to the SQL query result set, and it's the "sourceColName" attribute which actually refers to that data. I understand it comes from the RegexTransformer, but why then is the "column" attribute used instead of the "name" attribute? This has confused me somewhat; any clarification would be greatly appreciated.

Regards,

gwk
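For point 1, the connection arguments mentioned go on the dataSource URL in data-config.xml; note that inside the XML attribute the ampersand must be escaped as &amp;amp; (host, database, and credentials below are placeholders):

```xml
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://dbhost/mydb?useUnicode=true&amp;characterEncoding=UTF-8"
            user="solr" password="********"/>
```

As noted elsewhere in this thread, these parameters only help if the columns themselves are actually declared with a UTF-8 encoding in the database.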
Solr 1.3.0 with Jetty 6.1.14
Hello,

I'm trying to get multiple instances of Solr running with Jetty as per the instructions on http://wiki.apache.org/solr/SolrJetty, but I've run into a snag. According to that page, you set the solr/home parameter with an EnvEntry whose name is solr/home and whose value is your Solr home dir. However, as MattKangas mentions on the wiki, using this method to set the JNDI parameter makes it global to the JVM, which is bad for running multiple instances. Reading the 6.1.14 documentation for the EnvEntry class constructors shows that with this version of Jetty you can supply a scope, so I've tried a four-argument configuration: a scope reference as the first argument, then the name /solr/home, the value /my/solr/home/dir, and the overwrite flag true. Unfortunately this doesn't seem to work. If I set the first argument to null, it works for one instance (as it's in JVM scope), but when I set it to the WebAppContext scope, Solr logs:

org.apache.solr.core.SolrResourceLoader locateInstanceDir
INFO: No /solr/home in JNDI
org.apache.solr.core.SolrResourceLoader locateInstanceDir
INFO: solr home defaulted to 'solr/' (could not find system property or JNDI)

Am I doing something wrong here? Any help will be appreciated.

Regards,

gwk
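A sketch of the scoped EnvEntry approach, using one context deployment file per Solr instance. This assumes Jetty 6.x class names and made-up paths and context names, and I haven't verified it against 6.1.14 specifically:

```xml
<!-- contexts/solr1.xml -->
<Configure id="wac" class="org.mortbay.jetty.webapp.WebAppContext">
  <Set name="contextPath">/solr1</Set>
  <Set name="war"><SystemProperty name="jetty.home"/>/webapps/solr.war</Set>
  <!-- scope the JNDI entry to this webapp only by passing the
       WebAppContext itself as the first constructor argument -->
  <New class="org.mortbay.jetty.plus.naming.EnvEntry">
    <Arg><Ref id="wac"/></Arg>
    <Arg>solr/home</Arg>
    <Arg type="java.lang.String">/my/solr/home/dir</Arg>
    <Arg type="boolean">true</Arg>
  </New>
</Configure>
```

One detail worth checking: the entry name here is solr/home with no leading slash, since Solr looks it up as java:comp/env/solr/home; a leading slash in the configured name might be one reason the lookup fails, though I can't say for certain that's the cause here.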