query expansion à la dismax
Hello list, the dismax query type has one feature that is particularly nice: the ability to expand the tokens of a query to many fields. This is really useful for jobs such as preferring a match in title, or preferring exact matches over stemmed matches over phonetic matches. My problem: I wish to do the same with the normal Lucene query type, because I want to let power users use some syntax if they wish, but I would still like to expand searches on the default field that are at the top level. So I wrote my own code that filters the top-level queries and expands them, using a dismax-like instruction within a particular query component. Question 1: doesn't such code already exist? (I haven't found it.) Question 2: should I rather make a QParserPlugin? (The javadoc is not very helpful.) Thanks in advance, paul
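For reference, a minimal sketch of the dismax behaviour being reimplemented here, as it would appear in solrconfig.xml — the handler name, field names, and boosts are illustrative, not taken from the original post:

```xml
<!-- Sketch only: a dismax handler that expands the user's terms across
     several variants of the same content, preferring title, then exact,
     then stemmed, then phonetic matches. Field names are hypothetical. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">title^10 body_exact^4 body_stemmed^2 body_phonetic^0.5</str>
  </lst>
</requestHandler>
```

Worth noting: the extended dismax parser (edismax, in trunk at the time of this thread) aims to combine full Lucene query syntax with dismax-style qf expansion, which is close to what Question 1 asks for.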
Re: noobie question: sorting
AWESOME, thanks for your time! Regards, James On Wed, Mar 16, 2011 at 6:14 PM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: Hi. Where did you find such an obtuse example? Recently, Solr supports sorting by function query. One such function is named query, which takes a query and uses the score of the result of that query as the function's result. Due to constraints on where this query is placed within a function query, it is necessary to use the local-params syntax (e.g. {!v=...}), since you can't simply state category:445. Or, there could have been a parameter dereference like $sortQ, where sortQ is another parameter holding category:445. Anyway, the net effect is that documents are score-sorted based on the query category:445 instead of the user query (q param). I'd expect category:445 docs to come up top and all others to appear randomly afterwards. It would be nice if the sort query could simply be category:445 desc, but that's not supported. Complicated? You bet! But fear not; this is about as complicated as it gets. References: http://wiki.apache.org/solr/SolrQuerySyntax http://wiki.apache.org/solr/CommonQueryParameters#sort http://wiki.apache.org/solr/FunctionQuery#query ~ David Smiley Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/noobie-question-sorting-tp2685250p2685617.html Sent from the Solr - User mailing list archive at Nabble.com.
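The two syntaxes David describes can be sketched as request parameters (field and value are from his example; URL-escaping omitted for readability):

```
# sort by the score of an embedded query, via local-params:
sort=query({!v='category:445'}) desc

# or the same thing with parameter dereferencing:
sort=query($sortQ) desc
sortQ=category:445
```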
Re: SOLR DIH importing MySQL text column as a BLOB
Kaushik, I just remembered an ML post from a few weeks ago with the same problem while importing geo-data (http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2254395.html) - the solution was: CAST( CONCAT( lat, ',', lng ) AS CHAR ). At the time I searched a little for the reason, and AFAIK there was a bug in mysql/jdbc which produces that binary output under certain conditions. Regards, Stefan

On Wed, Mar 16, 2011 at 4:57 AM, Kaushik Chakraborty kaych...@gmail.com wrote: I've a column for posts in MySQL of type `text`. I've tried the corresponding `field-type`s for it in Solr's `schema.xml`, e.g. `string`, `text`, `text-ws`. But whenever I import it using the DIH, it gets imported as a BLOB object. I checked: this happens only for columns of type `text`, not for `varchar` (those get indexed as strings). Hence the posts field is not becoming searchable. I found out about this issue, after repeated search failures, when I did a `*:*` query on Solr. A sample response:

<result name="response" numFound="223" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <str name="solr_post_bio">[B@10a33ce2</str>
    <date name="solr_post_created_at">2011-02-21T07:02:55Z</date>
    <str name="solr_post_email">test.acco...@gmail.com</str>
    <str name="solr_post_first_name">Test</str>
    <str name="solr_post_last_name">Account</str>
    <str name="solr_post_message">[B@2c93c4f1</str>
    <str name="solr_post_status_message_id">1</str>
  </doc>
</result>

The `data-config.xml`:

<document>
  <entity name="posts" dataSource="jdbc" query="select p.person_id as solr_post_person_id, pr.first_name as solr_post_first_name, pr.last_name as solr_post_last_name, u.email as solr_post_email, p.message as solr_post_message, p.id as solr_post_status_message_id, p.created_at as solr_post_created_at, pr.bio as solr_post_bio from posts p, users u, profiles pr where p.person_id = u.id and p.person_id = pr.person_id and p.type='StatusMessage'">
    <field column="solr_post_person_id" />
    <field column="solr_post_first_name" />
    <field column="solr_post_last_name" />
    <field column="solr_post_email" />
    <field column="solr_post_message" />
    <field column="solr_post_status_message_id" />
    <field column="solr_post_created_at" />
    <field column="solr_post_bio" />
  </entity>
</document>

The `schema.xml`:

<fields>
  <field name="solr_post_status_message_id" type="string" indexed="true" stored="true" required="true" />
  <field name="solr_post_message" type="text_ws" indexed="true" stored="true" required="true" />
  <field name="solr_post_bio" type="text" indexed="false" stored="true" />
  <field name="solr_post_first_name" type="string" indexed="false" stored="true" />
  <field name="solr_post_last_name" type="string" indexed="false" stored="true" />
  <field name="solr_post_email" type="string" indexed="false" stored="true" />
  <field name="solr_post_created_at" type="date" indexed="false" stored="true" />
</fields>
<uniqueKey>solr_post_status_message_id</uniqueKey>
<defaultSearchField>solr_post_message</defaultSearchField>

Thanks, Kaushik
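If the mysql/jdbc bug Stefan mentions is the cause, one hedged workaround is to apply the same CAST(... AS CHAR) trick to the `text` columns inside the DIH query — a sketch based on the SELECT from this thread (other columns omitted for brevity):

```sql
-- Only the text-typed columns need the cast; varchar columns were fine.
SELECT p.id AS solr_post_status_message_id,
       CAST(p.message AS CHAR) AS solr_post_message,
       CAST(pr.bio    AS CHAR) AS solr_post_bio
FROM posts p
JOIN users u     ON p.person_id = u.id
JOIN profiles pr ON p.person_id = pr.person_id
WHERE p.type = 'StatusMessage';
```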
RE: Faceting help
Hi Upayavira, I use the term constraint to define additional options for a user to refine a search with under each facet. If we think of them as sub-facets, then maybe that explains it in slightly better terms. I didn't add additional document source types in my original email, but if I knew that there would be xls and doc contained within the Solr index, then these would also be added as sub-facets, allowing a user to select them prior to entering a search query. Can you point me towards documentation or something similar in order to implement the above? I am aware that I have a lot more to learn on faceted search, namely how to properly implement it! Thank you, Lewis

From: Upayavira [u...@odoko.co.uk] Sent: 15 March 2011 22:42 To: solr-user@lucene.apache.org Subject: Re: Faceting help

I'm not sure if I get what you are trying to achieve. What do you mean by constraint? Are you saying that you effectively want to filter the facets that are returned? E.g. for the source field, you want to show html/pdf/email, but not, say, xls or doc? Upayavira

Topics field: Legislation (constraint), Guidance/Policies (constraint), Customer Service information/complaints procedure (constraint), financial information (constraint), etc.
Source field: html (constraint), pdf (constraint), email (constraint), etc.
Date field: (constraint)

Basically I need resources to understand how to implement the above instead of the example I currently have. Some guidance would be great. Thank you kindly, Lewis

Glasgow Caledonian University is a registered Scottish charity, number SC021474. Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009. http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html Winner: Times Higher Education’s Outstanding Support for Early Career Researchers of the Year 2010, GCU as a lead with Universities Scotland partners. http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html

--- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
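As a starting point, plain field faceting over the fields described in this thread might look like the request below; the field names are assumptions based on the examples given, and each returned facet value with its count is the "constraint" the user clicks to refine:

```
q=*:*&facet=true&facet.field=topics&facet.field=source&facet.mincount=1

# selecting the "pdf" constraint under Source then becomes a filter query:
q=*:*&fq=source:pdf&facet=true&facet.field=topics&facet.field=source
```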
Maven : Specifying SNAPSHOT Artifacts and the Hudson Repository
Hi all, does anyone have a successfull setup (=pom.xml) that specifies the Hudson snapshot repository : https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastStableBuild/artifact/maven_artifacts (or that for trunk) and entries for any solr snapshot artifacts which are then found by Maven in this repository? I have specified the repository in my pom.xml as : repositories repository idsolr-snapshot-3.x/id urlhttps://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastSuccessfulBuild/artifact/maven_artifacts/url releases enabledfalse/enabled /releases snapshots enabledtrue/enabled /snapshots /repository /repositories And the dependencies: dependency groupIdorg.apache.solr/groupId artifactIdsolr-core/artifactId version3.2-SNAPSHOT/version /dependency dependency groupIdorg.apache.solr/groupId artifactIdsolr-dataimporthandler/artifactId version3.2-SNAPSHOT/version /dependency Maven's output is (for solr-core): Downloading: http://192.168.2.40:8081/nexus/content/groups/public/org/apache/solr/solr-core/3.2-SNAPSHOT/solr-core-3.2-SNAPSHOT.jar [INFO] Unable to find resource 'org.apache.solr:solr-core:jar:3.2-SNAPSHOT' in repository solr-snapshot-3.x (https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastSuccessfulBuild/artifact/maven_artifacts) I'm also trying around with specifying the exact name of the jar, but no success so far, and it also seems wrong as it will be constantly changing. Also, searching hasn't returned anything helpful, so far. I'd really appreciate if someone could point me into the right direction! Thanks! Chantal
Multiple spellchecker
Hello, I have a problem with the Solr spellchecker component. This is the problem: Indexed terms = Company: American today, City: London (two fields, copyField'ed into one: Spell). User search = American tuday, Londen. What I want is a collation of: American today London. Solr returns with the q parameter: American - Correction: American today; tuday - Correction: American today; londen - Correction: London; Collation: American today American today London. Solr returns with the spellcheck.q parameter: American tuday londen - Correction: American today. The index of Spell looks like this: American today, London, google, France, etc. I want Solr to split the input into two parts: (American today) and (London). Both parts have to be checked for spelling, not as one term and not as three terms. Can somebody help me?
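For reference, the collation behaviour discussed above is driven by request parameters like the following — a sketch of a typical spellcheck request, not a fix for the term-grouping problem itself:

```
q=American tuday Londen
spellcheck=true
spellcheck.collate=true
spellcheck.count=5
# alternatively, hand the raw input to the checker directly:
spellcheck.q=American tuday londen
```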
Re: Solrj performance bottleneck
Hi, thanks for your information. One simple question, please clarify: in our setup, we have the Solr index on one machine and the SolrJ client part (Java code) on another machine. As you suggest, if the problem may be 'not enough free RAM for the OS to cache', do I need to increase the RAM on the machine where the SolrJ query part runs, or increase the RAM of the Solr instance for the OS cache? Since both systems are in the local Amazon network (Linux EC2 small instances), I believe the network won't be an issue. Another thing: in your reply you mentioned 'client not reading fast enough'. Is that related to the network or to SolrJ? Thanks in advance for your info.
Re: Solr admin page timed out and index updating issues
Yes, due to warmup queries Solr may run out of heap space at start up. On Monday 14 March 2011 16:52:15 Ranma wrote: I am still stuck at the same point. Looking here and there, I read that the memory limit (heap space) may need to be increased to -Xms512M -Xmx512M when launching the java -jar start.jar command. But on my VPS I've been forced to set the Xmx limit to a maximum of -Xmx400M, since at a higher value it returns a VM initialization error and won't run. My first question is: could this be the reason I'm not able to access the Solr admin page? Please...! Thanks! - loredanaebook.it -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: Stemming question
Hmm, I'm not sure if it's supposed to stem that way, but if it doesn't and you insist, then you might be able to abuse the PatternReplaceFilterFactory. On Wednesday 16 March 2011 06:02:32 Bill Bell wrote: When I use the Porter stemmer in Solr, it appears to take words that are stemmed and replace them with the root word in the index. I verified this by looking at analysis.jsp. Is there an option to expand the stemmer to include all combinations of the word? Like including 's, ly, etc.? Other options besides protection? Bill -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
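A hypothetical sketch of the PatternReplaceFilterFactory "abuse" suggested above — normalising a possessive 's before the stemmer runs. The fieldType name and the pattern are assumptions, not a tested recipe:

```xml
<fieldType name="text_en_custom" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- strip trailing 's so "company's" and "company" meet at the same stem -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="'s$" replacement="" replace="all"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```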
faceting over ngrams
Hello guys. We are using sharded Solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. Would running on a cloud with Hadoop make it faster (down to seconds), as faceting seems to be a natural map-reduce task? Are there any other options to look into before stepping into the cloud? Please let me know if you need specific details on the schema / solrconfig setup or the like. -- Regards, Dmitry Kan
Re: Dismax: field not returned unless in sort clause?
No, not setting those options in the query or schema.xml file. I'll try what you said, however. Thanks. Chris Hostetter-3 wrote: : We have a D field (string, indexed, stored, not required) that is returned : * when we search with the standard request handler : * when we search with the dismax request handler _and the field is specified in : the sort parameter_ : : but is not returned when using the dismax handler and the field is not : specified in the sort param. Are you using one of the sortMissing options on D or its fieldType? I'm guessing you have sortMissingLast=true for D, so any time you sort on it, the docs that do have a value appear first. But when you don't sort on it, other factors probably lead docs that don't have a value for the D field to appear first -- Solr doesn't include fields in docs that don't have any value for that field. If my guess is correct, adding fq=D:[* TO *] to any of your queries will cause the total number of results to shrink, but the first page of results for your requests that don't sort on D will look exactly the same. The LukeRequestHandler will help you see how many docs in your index don't have any values indexed in the D field. -Hoss
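Hoss's two checks can be written out as requests — D is the field name from the thread, the host and path are placeholders:

```
# shrinkage test: restrict results to docs that actually have a value in D
q=your query&fq=D:[* TO *]

# Luke: inspect the D field, including how many docs carry it
http://localhost:8983/solr/admin/luke?fl=D
```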
Re: Maven : Specifying SNAPSHOT Artifacts and the Hudson Repository
does anyone have a successful setup (= pom.xml) that specifies the Hudson snapshot repository https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastStableBuild/artifact/maven_artifacts (or the one for trunk) and entries for Solr snapshot artifacts which are then found by Maven in this repository? This is what I use successfully:

<repository>
  <id>trunk</id>
  <url>https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-trunk/lastSuccessfulBuild/artifact/maven_artifacts/</url>
</repository>

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>4.0-SNAPSHOT</version>
  <scope>compile</scope>
  <type>jar</type>
</dependency>
Re: Stemming question
When I use the Porter stemmer in Solr, it appears to take words that are stemmed and replace them with the root word in the index. I verified this by looking at analysis.jsp. Is there an option to expand the stemmer to include all combinations of the word? Like including 's, ly, etc.? So you want expansion stemming (currently not supported), which expands the query and does not require re-indexing, as described here: http://www.slideshare.net/otisg/finite-state-queries-in-lucene Maybe you can extract stemming collisions from your index and use them in a huge synonyms.txt file? Other options besides protection? What do you mean by protection?
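The synonyms idea above might be wired up like this in the query analyzer — a sketch; the file name is invented, and the groups in it would be the stemming collisions extracted from your index, not these made-up examples:

```xml
<!-- query-time expansion of a stem group back to its surface forms -->
<filter class="solr.SynonymFilterFactory" synonyms="stem-expansions.txt"
        ignoreCase="true" expand="true"/>
```

where stem-expansions.txt would hold comma-separated groups such as `friend, friends, friend's`, one group per line.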
Multicore
Hi all, I am setting up multicore, and the schema.xml file in the core0 folder says not to use that one because it's very stripped down. So I copied the schema from example/solr/conf, but now I am getting a bunch of class-not-found exceptions, for example: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.KeywordMarkerFilterFactory' I also copied over the solrconfig.xml from example/solr/conf and changed all the <lib dir="xxx"/> paths to go up one directory higher (<lib dir="../xxx" /> instead). I've found that when I use my solrconfig file with the stripped-down schema.xml file, it runs correctly. But when I use the full schema.xml file, I get those errors. Now this says to me that I am not loading a library or two somewhere, but I've looked through the configuration files and cannot see any place other than solrconfig.xml where that would be set. So what am I doing incorrectly? Thanks, Brian Lamb
Re: Multicore
Which Solr version are you using? That filter is not in pre-3.1 releases. On Wednesday 16 March 2011 13:55:21 Brian Lamb wrote: Hi all, I am setting up multicore, and the schema.xml file in the core0 folder says not to use that one because it's very stripped down. So I copied the schema from example/solr/conf, but now I am getting a bunch of class-not-found exceptions, for example: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.KeywordMarkerFilterFactory' I also copied over the solrconfig.xml from example/solr/conf and changed all the <lib dir="xxx"/> paths to go up one directory higher (<lib dir="../xxx" /> instead). I've found that when I use my solrconfig file with the stripped-down schema.xml file, it runs correctly. But when I use the full schema.xml file, I get those errors. Now this says to me that I am not loading a library or two somewhere, but I've looked through the configuration files and cannot see any place other than solrconfig.xml where that would be set. So what am I doing incorrectly? Thanks, Brian Lamb -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
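If it turns out to be a library-visibility problem rather than a version mismatch, one hedged option is a sharedLib directory in solr.xml, so every core sees the same analysis jars — the path and core names below are assumptions:

```xml
<!-- solr.xml: "lib" is resolved relative to solr home and shared by all cores -->
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>
```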
SSL and connection pooling
We are unsure whether we should use SSL in order to communicate with our Solr server since it will increase the cost of creating http connections. If we go for SSL, is it advisable to do some additional settings for the HttpClient in order to reduce the connection costs? After reading the Commons Http Client documentation, it is not clear to me whether a connection pooling mechanism is enabled by default since the documentation differs between version 4.1 and 3.1 (Solr uses the latter). Solr will run on Resin 4 with Apache 2.2, so perhaps we need to do some additional adjustments in the httpd.conf file as well in order to prevent Apache from closing the connections. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Solrj performance bottleneck
On Wed, Mar 16, 2011 at 7:25 AM, rahul asharud...@gmail.com wrote: In our setup, we are having Solr index in one machine. And Solrj client part (java code) in another machine. Currently as you suggest, if it may be a 'not enough free RAM for the OS to cache' then whether I need to increase the RAM in the machine in which Solrj query part is there.??? Or need to increase RAM for Solr instance for the OS cache? That would be RAM for the Solr instance. If there is not enough free memory for the OS to cache, then each document retrieved will be a disk seek + read. Since both the system are in local Amazon network (Linux EC2 small instances), I believe the network wont be a issue. Ah, how big is your index? Another thing, in the reply you have mentioned 'client not reading fast enough'. Whether it is related to network or Solrj. That was a general issue - it *can* be the client, but since you're using SolrJ it would be the network. -Yonik http://lucidimagination.com
Re: SSL and connection pooling
On 16.03.2011 14:12, Erlend Garåsen wrote: We are unsure whether we should use SSL in order to communicate with our Solr server since it will increase the cost of creating http connections. If we go for SSL, is it advisable to do some additional settings for the HttpClient in order to reduce the connection costs? After reading the Commons Http Client documentation, it is not clear to me whether a connection pooling mechanism is enabled by default since the documentation differs between version 4.1 and 3.1 (Solr uses the latter). Solr will run on Resin 4 with Apache 2.2, so perhaps we need to do some additional adjustments in the httpd.conf file as well in order to prevent Apache from closing the connections. Erlend First: you have to use SSL when you have to. If you can live with the fact that someone could watch your internal clear-text data streams, then do not use SSL. On the other hand, if you cannot, then you definitely have to use SSL. That should be the main point for your technical decision, not performance. Second: in my last checkout of the Solr repository (a few weeks ago), CommonsHttpSolrServer uses a multi-threaded connection manager with 32 connections per host and 128 total connections. Hope this helps. Regards, Em
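If the 3.1-style defaults need tuning, here is a sketch of handing SolrJ a custom pooled HttpClient. It assumes solrj and commons-httpclient 3.1 on the classpath; the URL and the pool sizes are placeholders, not recommendations:

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PooledSolrClient {
    public static CommonsHttpSolrServer create() throws Exception {
        // HttpClient 3.1 pooling lives in the connection manager
        MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
        mgr.getParams().setDefaultMaxConnectionsPerHost(32);
        mgr.getParams().setMaxTotalConnections(128);
        HttpClient client = new HttpClient(mgr);
        // Re-use this server instance across threads; each instance keeps its own pool,
        // so SSL handshakes are paid once per pooled connection, not per request.
        return new CommonsHttpSolrServer("https://solr.example.com/solr", client);
    }
}
```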
Replication slows down massively during high load
Hi everyone, I have Solr running on one master and two slaves (load balanced) via Solr 1.4.1 native replication. If the load is low, both slaves replicate at around 100MB/s from the master. But when I use Solrmeter (100-400 queries/min) for load tests (over the load balancer), the replication slows down to an unacceptable speed, around 100KB/s (at least that's what the replication page on /solr/admin says). Going to a slave directly without the load balancer yields the same result for the slave under test: Slave 1 gets hammered with Solrmeter and its replication slows down to 100KB/s. At the same time, Slave 2, with only 20-50 queries/min and no load test, has no problems. It replicates at 100MB/s, and its index version is 5-10 versions ahead of Slave 1. The replication stays in the 100KB/s range even after the load test is over, until the application server is restarted. The same issue comes up under both Tomcat and Jetty. The setup looks like this:
- Same hardware for all servers: physical machines with quad core CPUs, 24GB RAM (JVM starts up with -XX:+UseConcMarkSweepGC -Xms10G -Xmx10G)
- Index size is about 100GB with 40M docs
- Master commits every 10 min/10k docs
- Slaves poll every minute
I checked this:
- Changed network interface; same behavior
- Increased thread pool size from 200 to 500 and queue size from 100 to 500 in Tomcat; same behavior
- Both disk and network I/O are not bottlenecked. Disk I/O went down to almost zero after every query in the load test got cached. The network isn't doing much and can put through almost a GBit/s with iPerf (network throughput tester) while Solrmeter is running.
Any ideas what could be wrong? Best Regards Vadim
Re: Sorting on multiValued fields via function query
Hi David, It did seem to work correctly for me - we had it running on our production indexes for some time and we never noticed any strange sorting behavior. However, many of our multiValued fields are single valued for the majority of documents in our index so we may not have noticed the incorrect sorting behaviors. Regardless, I understand the reasoning behind the restriction, I'm interested in getting around it by using a functionQuery to reduce multiValued fields to a single value. It sounds like this isn't possible, is that correct? Ideally I'd like to sort by the maximum value on descending sorts and the minimum value on ascending sorts. Is there any movement towards implementing this sort of behavior? Best, -Harish
Re: Sorting on multiValued fields via function query
Heh heh, you say it "worked correctly" for you, yet you didn't actually have multi-valued data ;-) Funny. The only solution right now is to store the max and min in indexed single-valued fields at index time. This is pretty straightforward to do. Even if/when Solr supports sorting on a multi-valued field, I doubt it would perform as well as what I suggest. ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/ On Mar 16, 2011, at 10:16 AM, harish.agarwal wrote: Hi David, It did seem to work correctly for me - we had it running on our production indexes for some time and we never noticed any strange sorting behavior. However, many of our multiValued fields are single valued for the majority of documents in our index so we may not have noticed the incorrect sorting behaviors. Regardless, I understand the reasoning behind the restriction, I'm interested in getting around it by using a functionQuery to reduce multiValued fields to a single value. It sounds like this isn't possible, is that correct? Ideally I'd like to sort by the maximum value on descending sorts and the minimum value on ascending sorts. Is there any movement towards implementing this sort of behavior? Best, -Harish
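David's suggestion might look like this in schema.xml — the field names and type are invented for illustration, and the min/max values would be computed by your indexing code when it builds each document:

```xml
<!-- the searchable multi-valued field, plus precomputed sort companions -->
<field name="price"     type="tfloat" indexed="true" stored="true" multiValued="true"/>
<field name="price_min" type="tfloat" indexed="true" stored="false"/>
<field name="price_max" type="tfloat" indexed="true" stored="false"/>
```

Queries would then use `sort=price_max desc` for descending sorts and `sort=price_min asc` for ascending ones.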
Re: faceting over ngrams
On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote: Hello guys. We are using sharded Solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. I tried creating an index with 1M documents, each with 100 unique terms in a field. A search for *:* with a facet request for the first 1M entries in the field took about 20 seconds for the first call and about 1-1½ seconds for each subsequent call. This was with Solr trunk. My setup is no doubt a lot simpler and lighter than yours, but 5 minutes sounds excessive. My guess is that your performance problem is due to the merging process. Could you try measuring the performance of a direct request to a single shard? If that is satisfactory, going to the cloud would not solve your problem. If you really need 1M entries in your result set, you would be better off investigating whether your index can fit in a single instance.
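To separate per-shard cost from merge cost as Toke suggests, the two measurements could look like this — hosts and the facet field name are placeholders:

```
# direct request to one shard (no merging involved):
http://shard1:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=trigrams

# distributed request (adds the merge step across shards):
http://shard1:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=trigrams&shards=shard1:8983/solr,shard2:8983/solr
```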
Re: SOLR DIH importing MySQL text column as a BLOB
On Wed, Mar 16, 2011 at 2:29 PM, Stefan Matheis matheis.ste...@googlemail.com wrote: Kaushik, i just remembered an ML-Post few weeks ago .. same problem while importing geo-data (http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2254395.html) - the solution was: CAST( CONCAT( lat, ',', lng ) AS CHAR ) at that time i search a little bit for the reason and afaik there was a bug in mysql/jdbc which produces that binary output under certain conditions [...] As Stefan mentions, there might be a way to solve this. Could you show us the query in DIH that you are using when you get this BLOB, i.e., the SELECT statement that goes to the database? It might also be instructive for you to try that same SELECT directly in a mysql interface. Regards, Gora
Re: SOLR DIH importing MySQL text column as a BLOB
The query's there in the data-config.xml. And the query's fetching as expected from the database. Thanks, Kaushik On Wed, Mar 16, 2011 at 9:21 PM, Gora Mohanty g...@mimirtech.com wrote: On Wed, Mar 16, 2011 at 2:29 PM, Stefan Matheis matheis.ste...@googlemail.com wrote: Kaushik, i just remembered an ML-Post few weeks ago .. same problem while importing geo-data ( http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2254395.html ) - the solution was: CAST( CONCAT( lat, ',', lng ) AS CHAR ) at that time i search a little bit for the reason and afaik there was a bug in mysql/jdbc which produces that binary output under certain conditions [...] As Stefan mentions, there might be a way to solve this. Could you show us the query in DIH that you are using when you get this BLOB, i.e., the SELECT statement that goes to the database? It might also be instructive for you to try that same SELECT directly in a mysql interface. Regards, Gora
Re: faceting over ngrams
I don't know anything about trying to use map-reduce with Solr. But I can tell you that with about 6 million entries in the result set, and around 10 million values to facet on (faceting on a multi-valued field), I still get fine performance in my application. In the worst case it can take maybe 800ms for my complete query when nothing useful is in the caches, which isn't great, but is FAR from 5 minutes! Now, 100 million values is an order of magnitude more than 10 million -- but it still seems like it ought not to be that slow. Not sure what's making it so slow for you. Could you need more RAM allocated to the JVM? I have found that faceting sometimes gets pathologically slow when I don't have enough RAM -- even though I'm not getting any OOM errors or anything. Of course, I'm not sure exactly what "enough RAM" is for your use case -- in my case I'm giving my JVM about 5G of heap. I also make sure to use facet.method=fc for these high-cardinality fields (I forget if that's the default in 1.4.1 or not). I also do some warming queries at startup to try and fill the various caches that might be involved in faceting -- but I don't entirely understand what I'm doing there, and that isn't your problem anyway, because it would only affect the first such faceting query, and you're getting the pathological 5-minute result times on subsequent queries too. I am definitely not an expert in the internals of Solr that affect this stuff; I'm just reporting my experience, and your experience does not match mine. Jonathan On 3/16/2011 8:05 AM, Dmitry Kan wrote: Hello guys. We are using sharded Solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. Would running on a cloud with Hadoop make it faster (down to seconds), as faceting seems to be a natural map-reduce task? Are there any other options to look into before stepping into the cloud? Please let me know if you need specific details on the schema / solrconfig setup or the like.
Re: faceting over ngrams
Ah, wait, you're doing sharding? Yeah, I am NOT doing sharding, so that could explain our different experiences. It seems like sharding definitely has trade-offs: it makes some things faster and other things slower. So far I've managed to avoid it, in the interest of keeping things simpler and easier to understand (for me, the developer/Solr manager), thinking that sharding is also a somewhat less mature feature. With only 1M documents, are you sure you need sharding at all? You could still use replication to scale out for query volume; sharding seems more about scaling for the number of documents (or total bytes) in your index. 1M documents is not very large for Solr, in general. Jonathan On 3/16/2011 11:51 AM, Toke Eskildsen wrote: On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote: Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over the trigrams field with about 1 million of entries in the result set and more than 100 million of entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. I tried creating an index with 1M documents, each with 100 unique terms in a field. A search for *:* with a facet request for the first 1M entries in the field took about 20 seconds for the first call and about 1-1½ seconds for each subsequent call. This was with Solr trunk. My setup is no doubt a lot simpler and lighter than yours, but 5 minutes sounds excessive. My guess is that your performance problem is due to the merging process. Could you try measuring the performance of a direct request to a single shard? If that is satisfactory, going to the cloud would not solve your problem. If you really need 1M entries in your result set, you would be better off investigating whether your index can fit in a single instance.
Re: SOLR DIH importing MySQL text column as a BLOB
On Wed, Mar 16, 2011 at 9:50 PM, Kaushik Chakraborty kaych...@gmail.com wrote: The query's there in the data-config.xml. And the query's fetching as expected from the database. [...] Doh! Sorry, had missed that somehow. So, the relevant part is: SELECT ... p.message as solr_post_message, ... What is the field type for p.message in MySQL? I cannot remember off the top of my head for MySQL, but if it is a TEXT field, you might want to look into the ClobTransformer: http://wiki.apache.org/solr/DataImportHandler#ClobTransformer Regards, Gora
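Gora's suggestion would look roughly like this in data-config.xml -- a sketch only, with the entity and column names taken from the thread; the transformer attribute and the per-field clob="true" flag are the documented ClobTransformer hooks:

```xml
<entity name="posts" dataSource="jdbc" transformer="ClobTransformer"
        query="select ... p.message as solr_post_message ...">
  <!-- clob="true" tells ClobTransformer to convert this column's
       CLOB value into a plain String before it is indexed -->
  <field column="solr_post_message" clob="true"/>
</entity>
```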
RE: hierarchical faceting, SOLR-792 - confused on config
Hi, This is also where I am having problems. I have not been able to understand very much from the wiki, and I do not understand how to configure the faceting we are referring to. Although I know very little about this, I can't help but think that the wiki is quite clearly inaccurate in places! Any comments please. Lewis From: kmf [kfole...@gmail.com] Sent: 23 February 2011 17:10 To: solr-user@lucene.apache.org Subject: Re: hierarchical faceting, SOLR-792 - confused on config I'm really confused now. Is this page completely out of date - http://wiki.apache.org/solr/HierarchicalFaceting - as it seems to imply that SOLR-792 is a form of hierarchical faceting: There are currently two similar, non-competing, approaches to generating tree/hierarchical facets from Solr: SOLR-64 and SOLR-792. To achieve hierarchical faceting, is the rule then that you form the hierarchical facets using a transformer in the DIH and do nothing in schema.xml or solrconfig.xml? I seem to recall reading somewhere that creating a copyField is needed. Sorry for the entry-level question, but I'm still trying to understand how to configure Solr to do hierarchical faceting. Thanks, kmf -- View this message in context: http://lucene.472066.n3.nabble.com/hierarchical-faceting-SOLR-792-confused-on-config-tp2556394p2561445.html Sent from the Solr - User mailing list archive at Nabble.com. Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems Glasgow Caledonian University is a registered Scottish charity, number SC021474 Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009. http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html Winner: Times Higher Education’s Outstanding Support for Early Career Researchers of the Year 2010, GCU as a lead with Universities Scotland partners.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
Re: Solrj performance bottleneck
Hi, Thanks for your info. Currently my index size is around 4GB. Normally in small instances the total available memory is 1.6GB. In my setup, I allocated around 1GB as the heap size for Tomcat, hence I believe the remaining 600 MB will be used for the OS cache. I believe I need to migrate my Solr instance from a small instance to a large one, so that more memory is available for the OS cache. But initially I suspected that, since I call the SolrJ code from another instance, I need to increase the memory on the instance from which I run SolrJ. But you said I need to increase the memory on the Solr instance only. Here I just want to double-check this case, sorry for that. Once again thanks for your replies. Regards, On Wed, Mar 16, 2011 at 7:02 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Wed, Mar 16, 2011 at 7:25 AM, rahul asharud...@gmail.com wrote: In our setup, we have the Solr index on one machine and the SolrJ client part (Java code) on another machine. Currently, as you suggest, if it may be a case of 'not enough free RAM for the OS to cache', do I need to increase the RAM on the machine where the SolrJ query part runs, or increase the RAM on the Solr instance for the OS cache? That would be RAM for the Solr instance. If there is not enough free memory for the OS to cache, then each document retrieved will be a disk seek + read. Since both systems are on the local Amazon network (Linux EC2 small instances), I believe the network won't be an issue. Ah, how big is your index? Another thing: in your reply you mentioned 'client not reading fast enough'. Is that related to the network or to SolrJ? That was a general issue - it *can* be the client, but since you're using SolrJ it would be the network. -Yonik http://lucidimagination.com
Re: Solrj performance bottleneck
On Wed, Mar 16, 2011 at 12:56 PM, Asharudeen asharud...@gmail.com wrote: Currently my index size is around 4GB. Normally in small instances total available memory will be 1.6GB. In my setup, I allocated around 1GB as a heap size for tomcat. Hence I believe, remaining 600 MB will be used for OS cache. Actually, even less. A JVM with a 1GB heap size will take up even more memory than that (since the heap setting does not count things that are not on the heap, like the JVM code itself). This is definitely your problem. I believe, I need to migrate my Solr instance from small instance to large. So that some more memory will be allotted for OS cache. But initially I suspect, since I call Solrj code from another instance, I need to increase the memory in the instance from where I run the Solrj. But you said I need to increase the memory in Solr instance only. Here, just I want to double check this case only. sorry for that. SolrJ itself won't take up much memory. It depends on what else your client app is doing, but a small instance may be fine. -Yonik http://lucidimagination.com
Error: Unbuffered entity enclosing request can not be repeated.
Hi all! I created a SolrJ project to test Solr. I am inserting batches of 7000 records, each with 200 attributes, which adds up to approximately 13.77 MB per batch. I am measuring the time it takes to add and commit each set of 7000 records to an instantiation of CommonsHttpSolrServer. Each of the first 6 batches takes approximately 17 to 21 seconds. The 7th batch takes 42 seconds and the 8th takes 1 minute. And when it adds the 9th batch to the server, it generates this error:

Mar 16, 2011 4:56:20 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: I/O exception (java.net.SocketException) caught when processing request: Connection reset
Mar 16, 2011 4:56:21 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: Retrying request
Exception in thread "main" org.apache.solr.client.solrj.SolrServerException: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)

I googled this error and one of the suggestions is to reduce the number of records per batch. But I want to achieve a solution with at least 7000 records per batch. Any help would be appreciated. André
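If smaller batches do turn out to be necessary, the splitting itself is simple. This is only a sketch, not André's code: the actual CommonsHttpSolrServer add/commit calls are left out because they need a running Solr, and the point is just that a failed sub-batch can be rebuilt and re-sent, sidestepping the unrepeatable-entity retry problem:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplitter {
    // Split one large list of documents into fixed-size sub-batches.
    // Each sub-batch becomes its own HTTP POST; if one fails with
    // "Unbuffered entity enclosing request can not be repeated",
    // only that sub-batch needs to be built and sent again.
    public static <T> List<List<T>> split(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<List<T>>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(new ArrayList<T>(
                    docs.subList(i, Math.min(i + batchSize, docs.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<Integer>();
        for (int i = 0; i < 7000; i++) docs.add(i);
        // 7000 docs in sub-batches of 1000 -> prints 7
        System.out.println(split(docs, 1000).size());
    }
}
```

Each sub-batch would then go through server.add(...) and commit in its own try/catch, re-sending just that sub-batch on a SolrServerException.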
Re: hierarchical faceting, SOLR-792 - confused on config
Sorry, I missed the original mail on this thread. I put together that hierarchical faceting wiki page a couple of years ago when helping a customer evaluate SOLR-64 vs. SOLR-792 vs. other approaches. Since then, SOLR-792 morphed and is committed as pivot faceting. SOLR-64 spawned a PathTokenizer which is part of Solr now too. Recently Toke updated that page with some additional info. It's definitely not a how-to page, and perhaps should get renamed/moved/revamped? Toke? Erik On Mar 16, 2011, at 12:39, McGibbney, Lewis John wrote: Hi, This is also where I am having problems. I have not been able to understand very much from the wiki, and I do not understand how to configure the faceting we are referring to. Although I know very little about this, I can't help but think that the wiki is quite clearly inaccurate in places! Any comments please. Lewis [...]
RE: Different options for autocomplete/autosuggestion
I take raw user search-term data, 'collapse' it into a form where I have only unique terms, per store, ordered by frequency of searches over some time period. The suggestions are then grouped and presented with store breakouts. That sounds kind of like what this page is talking about, but I could be using the wrong terminology: http://wiki.apache.org/solr/FieldCollapsing -----Original Message----- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Tuesday, March 15, 2011 9:00 PM To: solr-user@lucene.apache.org Subject: Re: Different options for autocomplete/autosuggestion Hi, I actually don't follow how field collapsing helps with autocompletion...? Over at http://search-lucene.com we eat our own autocomplete dog food: http://sematext.com/products/autocomplete/index.html . Tasty stuff. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Kai Schlamp schl...@gmx.de To: solr-user@lucene.apache.org Sent: Mon, March 14, 2011 11:52:48 PM Subject: Re: Different options for autocomplete/autosuggestion @Robert: That sounds interesting and very flexible, but also like a lot of work. This approach also doesn't seem to allow querying Solr directly via Ajax ... one of the big benefits, in my opinion, of using Solr. @Bill: There are some things I don't like about the Suggester component. It doesn't seem to allow infix searches (at least it is not mentioned in the wiki or elsewhere). It also uses a separate index that has to be rebuilt independently of the main index. And it doesn't support any filter queries. The Lucid Imagination blog also describes a further autosuggest approach (http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/). The disadvantage here is that the source documents must have distinct fields (that is, the DIH selects must provide distinct data).
Otherwise duplications would come up in the Solr query result, because of the document-oriented nature of Solr. In my opinion field collapsing seems the most promising route to a full-featured autosuggestion solution. Unfortunately it is not available for Solr 1.4.x or 3.x (I tried patching those branches several times without success). 2011/3/15 Bill Bell billnb...@gmail.com: http://lucidworks.lucidimagination.com/display/LWEUG/Spell+Checking+and+Automatic+Completion+of+User+Queries For Auto-Complete, find the following section in the solrconfig.xml file for the collection:

<!-- Auto-Complete component -->
<searchComponent name="autocomplete" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">autocomplete</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
    <str name="field">autocomplete</str>
    <str name="buildOnCommit">true</str>
    <!-- <str name="sourceLocation">american-english</str> -->
  </lst>
</searchComponent>

On 3/14/11 8:16 PM, Andy angelf...@yahoo.com wrote: Can you provide more details? Or a link? --- On Mon, 3/14/11, Bill Bell billnb...@gmail.com wrote: See how Lucid Enterprise does it... A bit differently. On 3/14/11 12:14 AM, Kai Schlamp kai.schl...@googlemail.com wrote: Hi. There seem to be several options for implementing an autocomplete/autosuggestion feature with Solr. I am trying to summarize those possibilities together with their advantages and disadvantages. It would be really nice to read some of your opinions.

* Using N-Gram filter + text field query
  + available in stable 1.4.x
  + results can be boosted
  + sorted by best matches
  - may return duplicate results
* Facets
  + available in stable 1.4.x
  + no duplicate entries
  - sorted by count
  - may need an extra N-Gram field for infix queries
* Terms
  + available in stable 1.4.x
  + infix query by using regex in 3.x
  - only prefix query in 1.4.x
  - regexp may be slow (just a guess)
* Suggestions
  ? Did not try that yet. Does it allow infix queries?
* Field Collapsing
  + no duplications
  - only available in 4.x branch
  ? Does it work together with highlighting? That would be a big plus.

What are your experiences regarding autocomplete/autosuggestion with Solr? Any additions, suggestions or corrections? What do you prefer? Kai -- Dr. med. Kai Schlamp Am Fort Elisabeth 17 55131 Mainz Germany Phone +49-177-7402778 Email: schl...@gmx.de
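For the edge-n-gram approach Kai lists first (and which the Lucid blog post describes), the usual field type looks roughly like this -- a sketch only; the type name and gram sizes are invented, though the analyzer factories themselves ship with Solr 1.4:

```xml
<fieldType name="autocomplete_edge" class="solr.TextField" positionIncrementGap="100">
  <!-- index side: store every prefix of the term as its own token -->
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
  <!-- query side: the user's partial input is matched as-is against the prefixes -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

This gives prefix matching only; infix matching would need a plain NGramFilterFactory instead, at the cost of a much larger index.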
Re: faceting over ngrams
Hi Jonathan, Thanks for sharing useful bits. Each shard has 16G of heap. Unless I am doing something fundamentally wrong in the SOLR configuration, I have to admit that counting ngrams up to trigrams across the whole set of a shard's documents is a pretty intensive task, as each ngram can occur anywhere in the index and SOLR most probably doesn't precompute its cumulative count. I'll try querying with facet.method=fc, thanks for that. By the way, the trigrams are defined like this:

<fieldType name="shingle_text_trigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>

As for the sharding -- I decided to go with it when the index size approached half a terabyte and the doc count went over 100M; I thought it would help us scale better. I also maintain a good level of caching, and so far faceting over normal string fields (no ngrams) has performed really well (around 1 sec). On Wed, Mar 16, 2011 at 6:23 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Ah, wait, you're doing sharding? Yeah, I am NOT doing sharding, so that could explain our different experiences. [...] -- Regards, Dmitry Kan
Using Solr 1.4.1 on most recent Tomcat 7.0.11
Hello list, Is anyone running Solr (in my case 1.4.1) on the above Tomcat dist? In the past I have followed the guidance at http://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat but having upgraded from Tomcat 7.0.8 to 7.0.11 I am having problems, e.g.:

INFO: Deploying configuration descriptor wombra.xml
16-Mar-2011 16:57:36 org.apache.tomcat.util.digester.Digester fatalError
SEVERE: Parse Fatal Error at line 4 column 6: The processing instruction target matching [xX][mM][lL] is not allowed.
org.xml.sax.SAXParseException: The processing instruction target matching [xX][mM][lL] is not allowed.
...
16-Mar-2011 16:57:36 org.apache.catalina.startup.HostConfig deployDescriptor
SEVERE: Error deploying configuration descriptor wombra.xml
org.xml.sax.SAXParseException: The processing instruction target matching [xX][mM][lL] is not allowed.
... some more ...

This is my context fragment from /home/lewis/Downloads/apache-tomcat-7.0.11/conf/Catalina/localhost:

<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/home/lewis/Downloads/wombra/wombra.war" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/home/lewis/Downloads/wombra" override="true"/>
</Context>

Preferably I would deploy a WAR file, but I have been working well with this configuration up until now, therefore I didn't question change. I am unfamiliar with the above errors. Can anyone please point me in the right direction? Thank you Lewis
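For what it's worth, that particular SAXParseException usually means something precedes the <?xml ...?> declaration -- typically a UTF-8 byte-order mark or stray whitespace saved by an editor. A quick way to check (a sketch; descriptor.xml here is a demo file created with a deliberate BOM, not Lewis's actual wombra.xml):

```shell
# Create a demo descriptor that starts with a UTF-8 BOM (bytes ef bb bf,
# written in octal below), then dump its first bytes. Anything before the
# "3c 3f" (<?) bytes triggers the "[xX][mM][lL] is not allowed" error.
printf '\357\273\277<?xml version="1.0" encoding="utf-8"?>\n' > descriptor.xml
head -c 8 descriptor.xml | od -An -tx1
# prints: ef bb bf 3c 3f 78 6d 6c
# The leading three bytes are the BOM; resave the file as UTF-8 without
# a BOM (or strip those bytes) and Tomcat should parse it again.
```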
Re: faceting over ngrams
Oh, a doc count over 100M is a very different thing than a doc count of about 1M. In your original message you said I tried creating an index with 1M documents, each with 100 unique terms in a field. If you instead have 100M documents, your use is a couple orders of magnitude larger than mine. It also occurs to me that while I have around 3 million documents, and probably up to 50 million or so unique values in the multi-valued faceted field -- each document only has 3-10 values, not 100 each. So that may also be a difference that affects the faceting algorithm to your detriment, not sure. Prior to Solr 1.4, it was pretty much impossible to facet over 1 million+ unique values at all; now it works wonderfully in many use cases, but you may have found one that's still too much for it. It also raises my curiosity as to why you'd want to facet over an n-grammed field to begin with -- that's definitely not an ordinary use case. Perhaps there is some way to do what you need without faceting? But you probably know what you're doing. Jonathan On 3/16/2011 2:25 PM, Dmitry Kan wrote: Hi Jonathan, Thanks for sharing useful bits. Each shard has 16G of heap. [...]
Re: hierarchical faceting, SOLR-792 - confused on config
Interesting -- any documentation on the PathTokenizer anywhere? Or do I just have to find and look at the source? That's something I hadn't known about, which may be useful for some stuff I've been working on, depending on how it works. If nothing else, in the meantime, I'm going to take that exact message from Erik and add it to the top of the wiki page, to avoid other people getting confused (I've been confused by that page too) until someone spends the time to rewrite it to be more up to date and accurate, or clearer about its topicality. On 3/16/2011 1:36 PM, Erik Hatcher wrote: Sorry, I missed the original mail on this thread. I put together that hierarchical faceting wiki page a couple of years ago when helping a customer evaluate SOLR-64 vs. SOLR-792 vs. other approaches. Since then, SOLR-792 morphed and is committed as pivot faceting. SOLR-64 spawned a PathTokenizer which is part of Solr now too. Recently Toke updated that page with some additional info. It's definitely not a how-to page, and perhaps should get renamed/moved/revamped? Toke? Erik On Mar 16, 2011, at 12:39, McGibbney, Lewis John wrote: Hi, This is also where I am having problems. I have not been able to understand very much from the wiki, and I do not understand how to configure the faceting we are referring to. Although I know very little about this, I can't help but think that the wiki is quite clearly inaccurate in places! Any comments please. Lewis [...]
RE: hierarchical faceting, SOLR-792 - confused on config
Hi Erik, I have been reading about the progression of SOLR-792 into pivot faceting; however, can you expand on where it is committed? Are you referring to trunk? The reason I am asking is that I have been using 1.4.1 for some time now and have been thinking of upgrading to trunk... or branch. Thank you Lewis From: Erik Hatcher [erik.hatc...@gmail.com] Sent: 16 March 2011 17:36 To: solr-user@lucene.apache.org Subject: Re: hierarchical faceting, SOLR-792 - confused on config Sorry, I missed the original mail on this thread. I put together that hierarchical faceting wiki page a couple of years ago when helping a customer evaluate SOLR-64 vs. SOLR-792 vs. other approaches. Since then, SOLR-792 morphed and is committed as pivot faceting. SOLR-64 spawned a PathTokenizer which is part of Solr now too. Recently Toke updated that page with some additional info. It's definitely not a how-to page, and perhaps should get renamed/moved/revamped? Toke? Erik
Re: SOLR DIH importing MySQL text column as a BLOB
Hi Kaushik, If the field is being treated as a blob, you can try using the FieldStreamDataSource mapping. It handles blob objects and extracts the contents from them. This feature is available only from Solr 3.1 onwards, I believe. http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/FieldStreamDataSource.html Regards, Jayendra On Tue, Mar 15, 2011 at 11:57 PM, Kaushik Chakraborty kaych...@gmail.com wrote: I've a column for posts in MySQL of type `text`. I've tried the corresponding field types for it in Solr's schema.xml, e.g. string, text, text_ws, but whenever I import it using the DIH, it gets imported as a BLOB object. I checked: this happens only for columns of type `text` and not for `varchar` (those get indexed as strings). Hence the posts field is not searchable. I found out about this issue, after repeated search failures, when I did a *:* query on Solr. A sample response:

<result name="response" numFound="223" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <str name="solr_post_bio">[B@10a33ce2</str>
    <date name="solr_post_created_at">2011-02-21T07:02:55Z</date>
    <str name="solr_post_email">test.acco...@gmail.com</str>
    <str name="solr_post_first_name">Test</str>
    <str name="solr_post_last_name">Account</str>
    <str name="solr_post_message">[B@2c93c4f1</str>
    <str name="solr_post_status_message_id">1</str>
  </doc>
</result>

The data-config.xml:

<document>
  <entity name="posts" dataSource="jdbc"
          query="select p.person_id as solr_post_person_id, pr.first_name as solr_post_first_name, pr.last_name as solr_post_last_name, u.email as solr_post_email, p.message as solr_post_message, p.id as solr_post_status_message_id, p.created_at as solr_post_created_at, pr.bio as solr_post_bio from posts p, users u, profiles pr where p.person_id = u.id and p.person_id = pr.person_id and p.type='StatusMessage'">
    <field column="solr_post_person_id" />
    <field column="solr_post_first_name" />
    <field column="solr_post_last_name" />
    <field column="solr_post_email" />
    <field column="solr_post_message" />
    <field column="solr_post_status_message_id" />
    <field column="solr_post_created_at" />
    <field column="solr_post_bio" />
  </entity>
</document>

The schema.xml:

<fields>
  <field name="solr_post_status_message_id" type="string" indexed="true" stored="true" required="true" />
  <field name="solr_post_message" type="text_ws" indexed="true" stored="true" required="true" />
  <field name="solr_post_bio" type="text" indexed="false" stored="true" />
  <field name="solr_post_first_name" type="string" indexed="false" stored="true" />
  <field name="solr_post_last_name" type="string" indexed="false" stored="true" />
  <field name="solr_post_email" type="string" indexed="false" stored="true" />
  <field name="solr_post_created_at" type="date" indexed="false" stored="true" />
</fields>
<uniqueKey>solr_post_status_message_id</uniqueKey>
<defaultSearchField>solr_post_message</defaultSearchField>

Thanks, Kaushik
Re: faceting over ngrams
Hi Toke, Thanks a lot for trying this out. I should mention that the faceted search hits only one specific shard by design, so in general the time to query a shard directly and through the proxy Solr should be comparable. Would it be feasible for you to make that field ngram'ed, or is it too much trouble for you? I'll check out the direct query and let you know. On Wed, Mar 16, 2011 at 5:51 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote: Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. I tried creating an index with 1M documents, each with 100 unique terms in a field. A search for *:* with a facet request for the first 1M entries in the field took about 20 seconds for the first call and about 1-1½ seconds for each subsequent call. This was with Solr trunk. My setup is no doubt a lot simpler and lighter than yours, but 5 minutes sounds excessive. My guess is that your performance problem is due to the merging process. Could you try measuring the performance of a direct request to a single shard? If that is satisfactory, going to the cloud would not solve your problem. If you really need 1M entries in your result set, you would be better off investigating whether your index can fit in a single instance. -- Regards, Dmitry Kan
Re: faceting over ngrams
On Wed, Mar 16, 2011 at 8:05 AM, Dmitry Kan dmitry@gmail.com wrote: Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. Would running on a cloud with Hadoop make it faster (to seconds), as faceting seems to be a natural map-reduce task? How many indexed tokens does each document have (for the field you are faceting on) on average? How many unique tokens are indexed in that field over the complete index? Or you could go to the admin/stats page and cut-n-paste the fieldValueCache entry after your faceting request - it should contain most of the info to further analyze this. -Yonik http://lucidimagination.com
Re: Using Solr 1.4.1 on most recent Tomcat 7.0.11
Lewis Quick response: I am currently using Tomcat 7.0.8 with Solr (with no issues). I will upgrade to 7.0.11 tonight and see if I run into the same issues. Stay tuned, as they say. Cheers François

On Mar 16, 2011, at 2:38 PM, McGibbney, Lewis John wrote:

Hello list, Is anyone running Solr (in my case 1.4.1) on the above Tomcat dist? In the past I have been following the guidance at http://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat but having upgraded from Tomcat 7.0.8 to 7.0.11 I am having problems, e.g.:

INFO: Deploying configuration descriptor wombra.xml
16-Mar-2011 16:57:36 org.apache.tomcat.util.digester.Digester fatalError
SEVERE: Parse Fatal Error at line 4 column 6: The processing instruction target matching [xX][mM][lL] is not allowed.
org.xml.sax.SAXParseException: The processing instruction target matching [xX][mM][lL] is not allowed.
...
16-Mar-2011 16:57:36 org.apache.catalina.startup.HostConfig deployDescriptor
SEVERE: Error deploying configuration descriptor wombra.xml
org.xml.sax.SAXParseException: The processing instruction target matching [xX][mM][lL] is not allowed.
... some more ...

My configuration descriptor (the context fragment from /home/lewis/Downloads/apache-tomcat-7.0.11/conf/Catalina/localhost) is as follows:

<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/home/lewis/Downloads/wombra/wombra.war" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/home/lewis/Downloads/wombra" override="true"/>
</Context>

Preferably I would upload a WAR file, but I have been working well with this configuration up until now, therefore I didn't question changing it. I am unfamiliar with the above errors. Can anyone please point me in the right direction? Thank you Lewis
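For context on Lewis's error: that SAXParseException ("The processing instruction target matching [xX][mM][lL] is not allowed") is the XML parser's way of saying the <?xml ...?> declaration is not the very first thing in the file. It usually means a UTF-8 BOM, blank lines, or stray characters precede the declaration; note the error points at line 4, which suggests the declaration is not on line 1 of the deployed file. A descriptor that parses must start with the declaration at byte zero:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- nothing (not even a BOM, space, or newline) may precede the line above -->
<Context docBase="/home/lewis/Downloads/wombra/wombra.war" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/home/lewis/Downloads/wombra" override="true"/>
</Context>
```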
Re: 'Registering' a query / Percolation
: I.E. Instruct Solr that you are interested in documents that match a : given query and then have Solr notify you (through whatever callback : mechanism is specified) if and when a document appears that matches the : query. : : We are planning on writing some software that will effectively grind : Solr to give us the same behaviour, but if Solr has this registration : built in, it would be very useful and much easier on our resources...

it does not, but there are typically two ways people deal with this, depending on the balance of your variables:

* max latency of notifications after a doc is added/updated
* rate of churn of documents in the index
* number of registered queries for notification

1) if you have a heavy churn of documents, and the max latency allowed for notification is large, then periodic polling at a frequency matching that latency can be preferable, to minimize the amount of redundant work

2) if the churn on documents is going to be relatively small and/or the number of registered queries is going to be relatively large, you can invert the problem and build an index where each document represents a query; as documents are added/updated you use the terms in those documents to query your query index (this could even be done as an UpdateProcessor on your doc core, querying over to some other notifications core)

(disclaimer: i've never implemented any of these ideas personally, this is just what i've picked up over the years on the mailing lists) -Hoss
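Hoss's second option (invert the problem: index the queries, match incoming documents against them) can be sketched in miniature. This is a toy illustration in plain Python, not Lucene/Solr code: the class and method names are invented, tokenization is naive whitespace splitting, and a real implementation would run actual queries against a query index (e.g. a separate notifications core) instead of doing set arithmetic in memory.

```python
from collections import defaultdict

class QueryRegistry:
    """Toy 'percolator': store registered queries as term sets, and for
    each incoming document find the queries whose terms all appear in it."""

    def __init__(self):
        self.queries = {}                    # query_id -> set of required terms
        self.term_index = defaultdict(set)   # term -> ids of queries using it

    def register(self, query_id, terms):
        required = set(t.lower() for t in terms)
        self.queries[query_id] = required
        for term in required:
            self.term_index[term].add(query_id)

    def match_document(self, text):
        doc_terms = set(text.lower().split())
        # Candidate queries share at least one term with the document...
        candidates = set()
        for term in doc_terms:
            candidates |= self.term_index.get(term, set())
        # ...and match only if all of their required terms are present.
        return sorted(q for q in candidates
                      if self.queries[q] <= doc_terms)

reg = QueryRegistry()
reg.register("q1", ["solr", "faceting"])
reg.register("q2", ["tomcat"])
print(reg.match_document("slow faceting in solr 1.4"))  # ['q1']
```

On each add/update you would look up matching registered queries and fire your notification callback for each hit.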
Re: Error during auto-warming of key
that is odd... can you let us know exactly what version of Solr/Lucene you are using (if it's not an official release, can you let us know exactly what the version details on the admin info page say, i'm curious about the svn revision)

Of course, that's the stable 1.4.1.

can you also please let us know what types of queries you are generating? ... that's the toString output of a query and it's not entirely clear what the original looked like. If you can recognize what the original query was, it would also be helpful to know if you can consistently reproduce this error on autowarming after executing that query (or queries like it with a slightly different date value)

It's extremely difficult to reproduce. It happened on a multinode system that's being prepared for production. It has been under heavy load for a long time already, with updates and queries. It is continuously being updated with real user input and receives real user queries from a source that's being replayed from logs. Solr is about to replace an existing search solution. It is impossible to reproduce because of these uncontrollable variables; i tried but failed. The error, however, did occur at least a couple of times after i started this thread. It hasn't reappeared after i reduced precision from milliseconds to an hour, see my other thread for more information: http://web.archiveorange.com/archive/v/AAfXfFuqjPhU4tdq53Tv

One of the things that particularly boggles me is this...

: org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
:   at org.apache.solr.search.SolrIndexSearcher.cacheDocSet(SolrIndexSearcher.java:520)
[...]
: Well, i use Dismax' bf parameter to boost very recent documents. I'm not
: using the queryResultCache or documentCache, only filterCache and Lucene
: fieldCache.

... that cache warming stack trace seems to be coming from the filterCache, but that contradicts your statement that you don't use the filterCache.
independent of your comments, that's an odd looking query to be cached in the filter cache anyway, since it includes a mandatory matchalldocs clause, and seems to only exist for boosting on that function.

But i am using the filterCache and fieldCache (forgot to mention the obvious fieldValueCache as well). If you have any methods that may help to reproduce it, i'm of course willing to take the time and see if i can. It may prove really hard because several weird errors were not reproducible in a more controlled but similar environment (load and config) and i can't mess with the soon-to-be production cluster. Thanks! -Hoss
Re: faceting over ngrams
Hi Yonik, I have run the queries against a single-index Solr with only 16M documents. After attaching facet.method=fc the results seemed to come back faster (first two queries below), but still not fast enough. Here are the fieldValueCache stats:

(facet.limit=100&facet.mincount=5&facet.method=fc, 542094 hits, 1 min) -- smallest result set

name: fieldValueCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=10000, initialSize=10, minSize=9000, acceptableSize=9500, cleanupThread=false)
stats:
  lookups : 400
  hits : 396
  hitratio : 0.99
  inserts : 1
  evictions : 0
  size : 1
  warmupTime : 0
  cumulative_lookups : 400
  cumulative_hits : 396
  cumulative_hitratio : 0.99
  cumulative_inserts : 1
  cumulative_evictions : 0
  item_shingleContent_trigram : {field=shingleContent_trigram,memSize=1786355392,tindexSize=17977426,time=662387,phase1=654707,nTerms=53492050,bigTerms=38,termInstances=602090958,uses=397}

(facet.limit=100&facet.mincount=5&facet.method=fc, 2837589 hits, 3 min 8 s) -- largest result set

name: fieldValueCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=10000, initialSize=10, minSize=9000, acceptableSize=9500, cleanupThread=false)
stats:
  lookups : 401
  hits : 397
  hitratio : 0.99
  inserts : 1
  evictions : 0
  size : 1
  warmupTime : 0
  cumulative_lookups : 401
  cumulative_hits : 397
  cumulative_hitratio : 0.99
  cumulative_inserts : 1
  cumulative_evictions : 0
  item_shingleContent_trigram : {field=shingleContent_trigram,memSize=1786355392,tindexSize=17977426,time=662387,phase1=654707,nTerms=53492050,bigTerms=38,termInstances=602090958,uses=398}

On Wed, Mar 16, 2011 at 9:46 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Wed, Mar 16, 2011 at 8:05 AM, Dmitry Kan dmitry@gmail.com wrote: Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. Would running on a cloud with Hadoop make it faster (to seconds), as faceting seems to be a natural map-reduce task? How many indexed tokens does each document have (for the field you are faceting on) on average? How many unique tokens are indexed in that field over the complete index? Or you could go to the admin/stats page and cut-n-paste the fieldValueCache entry after your faceting request - it should contain most of the info to further analyze this. -Yonik http://lucidimagination.com -- Regards, Dmitry Kan
i don't get why my index didn't grow more...
OK, I have a 30 GB index where there are lots of sparsely populated int fields, one title field, and one catchall field with the title and everything else we want as keywords. I figure the catchall field is the biggest field in our documents, which as I mentioned are otherwise composed of a variety of int fields and a title. So my puzzlement is this: my biggest field is already copied into a double metaphone field, and now I added another copyField to also copy the catchall field into a newly created soundex field, as an experiment to compare the effectiveness of the two. I expected the index to grow by at least 25% to 30%, but it barely grew at all. Can someone explain this to me? Thanks! J
Re: FunctionQueries and FieldCache and OOM
: Alright, i can now confirm the issue has been resolved by reducing precision. : The garbage collector on nodes without reduced precision has a real hard time : keeping up and clearly shows a very different graph of heap consumption. : : Consider using MINUTE, HOUR or DAY as precision in case you suffer from : excessive memory consumption: : : recip(ms(NOW/PRECISION,DATE_FIELD),TIME_FRACTION,1,1)

FWIW: it sounds like your problem wasn't actually related to your fieldCache, but probably instead it was because of how big your queryResultCache is

: Am i correct when i assume that Lucene FieldCache entries are added for : each unique function query? In that case, every query is a unique cache

...no, the FieldCache has one entry per field name, and the value of that cache is an array keyed off of the internal docId for every doc in the index, with the corresponding value (it's an uninverted version of lucene's inverted index, for doing fast value lookups by document). changes in the *values* used in your function queries won't affect FieldCache usage -- only changing the *fields* used in your functions would impact that.

: each unique function query? In that case, every query is a unique cache : entry because it operates on milliseconds. If all doesn't work i might be

what you describe is correct, but not in the FieldCache -- the queryResultCache is where queries that deal with the main result set (ie: paginated and/or sorted) wind up .. having lots of distinct queries in the bq (or q) param will make the number of unique items in that cache grow significantly (just like having lots of distinct queries in the fq will cause your filterCache to grow significantly)

you should definitely check out what max size you have configured for your queryResultCache ... it sounds like it's probably too big, if you were getting OOM errors from having high precision dates in your boost queries.
while i think using less precision is a wise choice, you should still consider dialing that max size down, so that if some other usage pattern still causes lots of unique queries in a short time period (a bot crawling your site map, perhaps) it doesn't fill up and cause another OOM -Hoss
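The knob Hoss is pointing at lives in solrconfig.xml. A sketch with illustrative, deliberately modest sizes (not recommendations; size it against your own query mix):

```xml
<!-- caches results of q + sort + pagination; one entry per distinct query,
     so high-precision dates in bq can make it balloon -->
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>
```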
Re: i don't get why my index didn't grow more...
On Wed, Mar 16, 2011 at 5:10 PM, Robert Petersen rober...@buy.com wrote: OK, I have a 30 GB index where there are lots of sparsely populated int fields, one title field, and one catchall field with the title and everything else we want as keywords. I figure the catchall field is the biggest field in our documents, which as I mentioned are otherwise composed of a variety of int fields and a title. So my puzzlement is this: my biggest field is already copied into a double metaphone field, and now I added another copyField to also copy the catchall field into a newly created soundex field, as an experiment to compare the effectiveness of the two. I expected the index to grow by at least 25% to 30%, but it barely grew at all. Can someone explain this to me? Thanks! J

I assume you reindexed everything? Anyway, the size of indexed fields generally grows sub-linearly (as opposed to stored fields, which grow exactly linearly). But if it really barely grew at all, this could point to other parts of the index taking up much more space than you realize. If you could do an ls -l of your index directory, we might be able to see what parts of the index are using up the most space. -Yonik http://lucidimagination.com
Re: Error during auto-warming of key
Actually, i dug in the logs again and, surprise, it sometimes still occurs with `random` queries. Here are a few snippets from the error log. Somewhere during that time there might be OOM errors, but older logs are unfortunately rotated away.

2011-03-14 00:25:32,152 ERROR [solr.search.SolrCache] - [pool-1-thread-1] - : Error during auto-warming of key:f_sp_eigenschappen:geo:java.lang.ArrayIndexOutOfBoundsException: 431733
  at org.apache.lucene.util.BitVector.get(BitVector.java:102)
  at org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:152)
  at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:642)
  at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
  at org.apache.solr.search.SolrIndexSearcher.cacheDocSet(SolrIndexSearcher.java:520)
  at org.apache.solr.search.SolrIndexSearcher$2.regenerateItem(SolrIndexSearcher.java:296)
  at org.apache.solr.search.FastLRUCache.warm(FastLRUCache.java:168)
  at org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1481)
  at org.apache.solr.core.SolrCore$2.call(SolrCore.java:1131)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:619)

2011-03-14 00:25:32,795 ERROR [solr.search.SolrCache] - [pool-1-thread-1] - : Error during auto-warming of key:+(titel_i:touareg^5.0 | f_advertentietype:touareg^2.0 | f_automodel_j:touareg^8.0 | facets:touareg^2.0 | omschrijving_i:touareg | catlevel1_i:touareg^2.0 | catlevel2_i:touareg^4.0)~0.1 () (10.0/(7.71E-8*float(ms(const(130003560),date(sort_date)))+1.0))^10.0:java.lang.ArrayIndexOutOfBoundsException: 468554
  at org.apache.lucene.util.BitVector.get(BitVector.java:102)
  at org.apache.lucene.index.SegmentTermDocs.readNoTf(SegmentTermDocs.java:169)
  at org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:139)
  at org.apache.lucene.search.TermScorer.nextDoc(TermScorer.java:130)
  at org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:145)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:246)
  at org.apache.lucene.search.Searcher.search(Searcher.java:171)
  at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:651)
  at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
  at org.apache.solr.search.SolrIndexSearcher.cacheDocSet(SolrIndexSearcher.java:520)
  at org.apache.solr.search.SolrIndexSearcher$2.regenerateItem(SolrIndexSearcher.java:296)
  at org.apache.solr.search.FastLRUCache.warm(FastLRUCache.java:168)
  at org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1481)
  at org.apache.solr.core.SolrCore$2.call(SolrCore.java:1131)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:619)

2011-03-14 00:25:33,051 ERROR [solr.search.SolrCache] - [pool-1-thread-1] - : Error during auto-warming of key:+*:* (10.0/(7.71E-8*float(ms(const(130003560),date(sort_date)))+1.0))^10.0:java.lang.ArrayIndexOutOfBoundsException: 489479
  at org.apache.lucene.util.BitVector.get(BitVector.java:102)
  at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
  at org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:562)
  at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:208)
  at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:525)
  at org.apache.solr.search.function.LongFieldSource.getValues(LongFieldSource.java:57)
  at org.apache.solr.search.function.DualFloatFunction.getValues(DualFloatFunction.java:48)
  at org.apache.solr.search.function.ReciprocalFloatFunction.getValues(ReciprocalFloatFunction.java:61)
  at org.apache.solr.search.function.FunctionQuery$AllScorer.init(FunctionQuery.java:123)
  at org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:93)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
  at
dismax parser, parens, what do they do exactly
It looks like the dismax query parser can somehow handle parens, used for applying, for instance, + or - to a group, distributing it. But I'm not sure what effect they have on the overall query. For instance, if I give dismax this: book (dog +( cat -frog)) debugQuery shows: +((DisjunctionMaxQuery((text:book)~0.01) +DisjunctionMaxQuery((text:dog)~0.01) DisjunctionMaxQuery((text:cat)~0.01) -DisjunctionMaxQuery((text:frog)~0.01))~2) () How will that be treated by mm? Let's say I have an mm of 50%. Does that apply at the top level, like either book needs to match or +(dog +( cat -frog)) needs to match? And for +(dog +( cat -frog)) to match, does just 50% of that subquery need to match... or is mm ignored there? Or something else entirely? Can anyone clear this up?

Continuing to try experimentally to clear it up... it _looks_ like the mm actually applies to each _individual_ low-level query. So even though the semantics of book (dog +( cat -frog)) are respected, if mm is 50%, the nesting is irrelevant; exactly 50% of book, dog, +cat, and +-frog (distributing the operators through, I guess?) are required. I think. I'm getting confused even talking about it.
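For reference, this is roughly how dismax turns an mm spec into the concrete number that appears as the trailing ~N on the top-level BooleanQuery in the debug output (mm counts only the optional clauses, not the + and - ones). Below is a simplified Python rendering of the logic in Solr's SolrPluginUtils.calculateMinShouldMatch; it handles only a single integer or percentage spec (the conditional "N<M%" form is omitted), so treat it as a sketch rather than the exact implementation.

```python
def min_should_match(optional_clauses, spec):
    """Compute the minimum number of optional clauses that must match,
    given an mm spec like '2', '-1', '50%' or '-25%'."""
    result = optional_clauses
    if spec.endswith('%'):
        percent = int(spec[:-1])
        calc = int(result * percent / 100)  # truncates toward zero
        result = result + calc if calc < 0 else calc
    else:
        calc = int(spec)
        result = result + calc if calc < 0 else calc
    # clamp into the range [0, optional_clauses]
    return min(optional_clauses, max(result, 0))

print(min_should_match(4, "50%"))   # 2
print(min_should_match(3, "50%"))   # 1
print(min_should_match(5, "-1"))    # 4
```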
Re: Sorting on multiValued fields via function query
: However, many of our multiValued fields are single valued for the majority : of documents in our index so we may not have noticed the incorrect sorting : behaviors.

that would make sense ... if you use a multiValued field as if it were single valued, you would never encounter a problem. if you had *some* multivalued fields your results would be sorted extremely arbitrarily for those docs that did have multiple values, unless you had more distinct values than you had documents -- at which point you would get a hard crash at query time.

: Regardless, I understand the reasoning behind the restriction, I'm : interested in getting around it by using a functionQuery to reduce : multiValued fields to a single value. It sounds like this isn't possible,

I don't think we have any functions that do that -- functions are composed of ValueSources, which may be composed of other ValueSources, but ultimately the data comes from somewhere, and in every case i can think of (except for constant values) that data comes from the FieldCache -- the same FieldCache used for sorting. I don't think there are any ValueSources that will let you specify a multiValued field and then pick one of those values based on a rule/function ... even the PolyFields used for spatial search work by using multiple field names under the covers (N distinct field names for an N-dimensional space)

: is that correct? Ideally I'd like to sort by the maximum value on : descending sorts and the minimum value on ascending sorts. Is there any : movement towards implementing this sort of behavior?

this is a fairly classic use case for just having multiple fields.
even if the logic was implemented to support this at query time, it could never be faster than sorting on a single valued field that you populate with the min/max at indexing time -- the mantra of fast IR is that if you can precompute it independently of the individual search criteria, you should (it's the whole foundation for why the inverted index exists) -Hoss
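Hoss's closing suggestion (precompute min/max into companion single-valued fields at indexing time) is trivial on the indexing-client side. A sketch, with invented field names; you would then sort ascending on price_min and descending on price_max, while keeping the multiValued field for searching and display:

```python
def add_sort_fields(doc, field, values):
    """At indexing time, flatten a multiValued field into single-valued
    companion fields usable for asc/desc sorting."""
    doc[field] = values                 # the multiValued field itself
    doc[field + "_min"] = min(values)   # sort asc on this one
    doc[field + "_max"] = max(values)   # sort desc on this one
    return doc

doc = add_sort_fields({"id": "42"}, "price", [9.99, 4.50, 12.00])
print(doc["price_min"], doc["price_max"])  # 4.5 12.0
```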
Re: Version Incompatibility(Invalid version (expected 2, but 1) or the data in not in 'javabin' format)
I am using the Solr 4.0 api to search an index made using the solr 1.4 version. I am getting the error Invalid version (expected 2, but 1) or the data in not in 'javabin' format. Can anyone help me fix this problem?

You need to use solrj version 1.4, whose javabin wire format is compatible with your server version. Actually there exists another solution: using XMLResponseParser instead of BinaryResponseParser, which is the default:

new CommonsHttpSolrServer(new URL("http://solr1.4.0Instance:8080/solr"), null, new XMLResponseParser(), false);
Re: Sorting on multiValued fields via function query
On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : However, many of our multiValued fields are single valued for the majority : of documents in our index so we may not have noticed the incorrect sorting : behaviors. that would make sense ... if you use a multiValued field as if it were single valued, you would never encounter a problem. if you had *some* multivalued fields your results would be sorted extremely arbitrarily for those docs that did have multiple values, unless you had more distinct values than you had documents -- at which point you would get a hard crash at query time.

AFAIK, not any more. Since that behavior was very unreliable, it has been removed, and you can reliably sort by any multi-valued field in lucene (with the sort order being defined by the largest value if there are multiple). -Yonik http://lucidimagination.com
Re: Sorting on multiValued fields via function query
Huh, so lucene is actually doing what has been commonly described as impossible in Solr? But is Solr trunk, as the OP seemed to report, still not aware of this, raising an error on a sort on a multi-valued field instead of just saying, okay, we'll pass it to lucene anyway and go with lucene's approach to sorting on a multi-valued field (that is, apparently, using the largest value)? If so... that kind of sounds like a bug/misfeature, yes, no?

Also... lucene is already capable of sorting on a multi-valued field by choosing the largest value (largest vs. smallest is presumably just arbitrary there; there is presumably no performance implication to choosing the smallest instead of the largest, it just chooses the largest, according to Yonik). So... if someone patched lucene so that whether it chose the largest or smallest was a parameter passed in -- probably not a large patch, since lucene, says Yonik, has already been enhanced to always choose the largest -- and then patched Solr to take a param and pass it to Lucene for this purpose, which presumably also wouldn't be a large patch, then we'd have the feature the OP asked for.

Based on Yonik's description (assuming I understand correctly and he's correct), it doesn't sound like a lot of code. But it's still beyond my unfamiliar-with-lucene-code-not-so-great-at-java abilities, nor do I have the interest for my own app needs at the moment. But if the OP or someone else has both, it sounds like a plausible feature?

On 3/16/2011 6:00 PM, Yonik Seeley wrote: On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : However, many of our multiValued fields are single valued for the majority : of documents in our index so we may not have noticed the incorrect sorting : behaviors. that would make sense ... if you use a multiValued field as if it were single valued, you would never encounter a problem.
if you had *some* multivalued fields your results would be sorted extremely arbitrarily for those docs that did have multiple values, unless you had more distinct values than you had documents -- at which point you would get a hard crash at query time.

AFAIK, not any more. Since that behavior was very unreliable, it has been removed, and you can reliably sort by any multi-valued field in lucene (with the sort order being defined by the largest value if there are multiple). -Yonik http://lucidimagination.com
Re: hierarchical faceting, SOLR-792 - confused on config
(11/03/17 3:53), Jonathan Rochkind wrote: Interesting, any documentation on the PathTokenizer anywhere? It is PathHierarchyTokenizer: https://hudson.apache.org/hudson/job/Solr-trunk/javadoc/org/apache/solr/analysis/PathHierarchyTokenizerFactory.html Koji -- http://www.rondhuit.com/en/
Re: Sorting on multiValued fields via function query
I agree with this, and it is even needed for function sorting on multivalued fields. See the geohash patch for one way to deal with multivalued fields on distance. Not ideal, but it works efficiently. Bill Bell Sent from mobile

On Mar 16, 2011, at 4:08 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Huh, so lucene is actually doing what has been commonly described as impossible in Solr? But is Solr trunk, as the OP seemed to report, still not aware of this, raising an error on a sort on a multi-valued field instead of just saying, okay, we'll pass it to lucene anyway and go with lucene's approach to sorting on a multi-valued field (that is, apparently, using the largest value)? If so... that kind of sounds like a bug/misfeature, yes, no? Also... lucene is already capable of sorting on a multi-valued field by choosing the largest value (largest vs. smallest is presumably just arbitrary there; there is presumably no performance implication to choosing the smallest instead of the largest, it just chooses the largest, according to Yonik). So... if someone patched lucene so that whether it chose the largest or smallest was a parameter passed in -- probably not a large patch, since lucene, says Yonik, has already been enhanced to always choose the largest -- and then patched Solr to take a param and pass it to Lucene for this purpose, which presumably also wouldn't be a large patch, then we'd have the feature the OP asked for. Based on Yonik's description (assuming I understand correctly and he's correct), it doesn't sound like a lot of code. But it's still beyond my unfamiliar-with-lucene-code-not-so-great-at-java abilities, nor do I have the interest for my own app needs at the moment. But if the OP or someone else has both, it sounds like a plausible feature?
On 3/16/2011 6:00 PM, Yonik Seeley wrote: On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : However, many of our multiValued fields are single valued for the majority : of documents in our index so we may not have noticed the incorrect sorting : behaviors. that would make sense ... if you use a multiValued field as if it were single valued, you would never encounter a problem. if you had *some* multivalued fields your results would be sorted extremely arbitrarily for those docs that did have multiple values, unless you had more distinct values than you had documents -- at which point you would get a hard crash at query time. AFAIK, not any more. Since that behavior was very unreliable, it has been removed, and you can reliably sort by any multi-valued field in lucene (with the sort order being defined by the largest value if there are multiple). -Yonik http://lucidimagination.com
Re: Replication slows down massively during high load
On 3/16/2011 7:56 AM, Vadim Kisselmann wrote: If the load is low, both slaves replicate at around 100MB/s from the master. But when I use Solrmeter (100-400 queries/min) for load tests (over the load balancer), replication slows down to an unacceptable speed, around 100KB/s (at least that's what the replication page on /solr/admin says). snip - Same hardware for all servers: physical machines with quad core CPUs, 24GB RAM (JVM starts up with -XX:+UseConcMarkSweepGC -Xms10G -Xmx10G) - Index size is about 100GB with 40M docs

Primary assumption: you have a 64-bit OS and a 64-bit JVM. It sounds to me like you're I/O bound, because your machine cannot keep enough of your index in RAM. Relative to your 100GB index, you only have a maximum of 14GB of RAM available to the OS disk cache, since Java's heap size is 10GB. How much disk space do all of the index files that end in x take up? I would venture a guess that it's significantly more than 14GB. On Linux, you could do this command to tally it quickly: du -hc *x

If you installed enough RAM so the disk cache could be much larger than the total size of those files ending in x, you'd probably stop having these performance issues. Alternatively, you could take steps to reduce the size of your index, or perhaps add more machines and go distributed. My own index is distributed and replicated. I've got nearly 53 million documents and a total index size of 95GB. This is split into six shards that each are nearly 16GB. Running that du command I gave you above, the total on one shard is 2.5GB, and there is 7GB of RAM available for the OS cache. NB: I could be completely wrong about the source of the problem. Thanks, Shawn
Re: Replication slows down massively during high load
On 3/16/2011 6:09 PM, Shawn Heisey wrote:
du -hc *x

I was looking over the files in an index and I think it needs to include more of the files for a true picture of RAM needs. I get 5.9GB running the following command against a 16GB index. It excludes *.fdt (stored field data) and *.tvf (term vector fields), but includes everything else:

du -hc `ls | egrep -v 'tvf|fdt'`

(Note the quotes around the pattern -- without them, the shell would treat the | as a pipe.)

If any of the experts have a better handle on which files are consulted on virtually all queries, that would help narrow down the OS cache requirements.

Thanks,
Shawn
Re: Faceting help
: I'm not sure if I get what you are trying to achieve. What do you mean
: by constraint?

"constraint" is fairly standard terminology when referring to facets; it's used extensively in our facet docs and is even listed on Solr's glossary page (although not specifically in the context of faceting, since it can be used more broadly than that)...

http://wiki.apache.org/solr/SolrTerminology

In a nutshell:
* A facet is a way of classifying objects.
* A constraint is a viable way of limiting a set of objects.
* Faceted search is a search where feedback on viable constraints (usually in the form of counts) is provided for each facet. (ie: "facet counts" or "constraint counts" ... the terms are both used relatively loosely)

: I'm trying to use facets via widgets within Ajax-Solr. I have tried the
: wiki for general help on configuring facets and constraints and also
: attended the recent Lucidworks webinar on faceted search. Can anyone
: please direct me to some reading on how to formally configure facets for
: searching.

the beauty of faceting in solr is that it doesn't have to be formally configured -- you can specify it all at query time using request params, as long as the data is indexed...

http://wiki.apache.org/solr/SolrFacetingOverview
http://wiki.apache.org/solr/SimpleFacetParameters

: Topics field
: Legislation constraint
: Guidance/Policies constraint
: Customer Service information/complaints procedure constraint
: financial information constraint

if you index a Topics field, and the field contains those values as indexed terms, then you will get those constraints back using facet.field=Topics

-Hoss
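A toy illustration (plain Python, not Solr) of what facet.field=Topics hands back: for each indexed value of the field, the count of documents carrying it -- the "constraint counts" Hoss describes. The documents here are made up.

```python
# Compute facet (constraint) counts for a multiValued "Topics" field.
from collections import Counter

docs = [
    {"id": 1, "Topics": ["Legislation", "financial information"]},
    {"id": 2, "Topics": ["Legislation"]},
    {"id": 3, "Topics": ["Guidance/Policies"]},
]

# One count per distinct indexed term, across all documents.
facet_counts = Counter(t for d in docs for t in d["Topics"])
print(facet_counts["Legislation"])  # 2
```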
Re: Solrj performance bottleneck
Try giving Solr around 1.5GB by setting the Java heap params. Solr is usually CPU bound, so medium or large instances are good.

Bill Bell
Sent from mobile

On Mar 16, 2011, at 10:56 AM, Asharudeen asharud...@gmail.com wrote:

Hi,

Thanks for your info. Currently my index size is around 4GB. Normally in small instances the total available memory is 1.6GB. In my setup, I allocated around 1GB as the heap size for Tomcat, so I believe the remaining 600MB will be used for the OS cache. I think I need to migrate my Solr instance from a small instance to a large one, so that more memory is available for the OS cache.

But initially I suspected that, since I call the SolrJ code from another instance, I needed to increase the memory on the instance where SolrJ runs. You said I need to increase the memory on the Solr instance only. I just wanted to double-check that case, sorry.

Once again, thanks for your replies.

Regards,

On Wed, Mar 16, 2011 at 7:02 PM, Yonik Seeley yo...@lucidimagination.com wrote:

On Wed, Mar 16, 2011 at 7:25 AM, rahul asharud...@gmail.com wrote:
In our setup, we have the Solr index on one machine and the SolrJ client part (Java code) on another. Currently, as you suggest, if the problem is "not enough free RAM for the OS to cache", do I need to increase the RAM on the machine where the SolrJ query code runs, or the RAM on the Solr instance for the OS cache?

That would be RAM for the Solr instance. If there is not enough free memory for the OS to cache, then each document retrieved will be a disk seek + read.

Since both systems are on the local Amazon network (Linux EC2 small instances), I believe the network won't be an issue.

Ah, how big is your index?

Another thing: in your reply you mentioned "client not reading fast enough". Is that related to the network or to SolrJ?

That was a general issue - it *can* be the client, but since you're using SolrJ it would be the network.

-Yonik
http://lucidimagination.com
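The memory arithmetic running through this thread (and Shawn's replication thread above) is simple but easy to get backwards, so here is a back-of-the-envelope sketch using the figures quoted in the emails. This ignores memory used by other processes, so it is an upper bound on the cache budget.

```python
# RAM left for the OS disk cache is roughly total RAM minus the JVM heap.
def os_cache_budget_gb(total_ram_gb, jvm_heap_gb):
    return total_ram_gb - jvm_heap_gb

# EC2 small instance from this thread: 1.6GB total, 1GB Tomcat heap.
small = os_cache_budget_gb(1.6, 1.0)
print(round(small, 1))  # 0.6 -- far smaller than the ~4GB index

# Shawn's replication case: 24GB total, 10GB heap -> at most 14GB of cache
# against a 100GB index.
print(os_cache_budget_gb(24, 10))  # 14
```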
Parent-child options
Hi,

The dreaded parent-child without denormalization question. What are one's options for the following example:

parent: shoes
3 children, each with 2 attributes/fields, color and size:
* color: red, black, orange
* size: 10, 11, 12

The goal is to be able to search for:
1) color:red AND size:10 and get 1 hit for the above
2) color:red AND size:12 and get *no* matches, because there are no red shoes of size 12, only size 10.

What's the best thing to do without denormalizing?

* Are Poly fields designed for this?
* Should one use JSONKeyValueTokenizerFactory from SOLR-1690, as suggested by Ryan in http://search-lucene.com/m/I8VaDeusnJ1 ?
* Should one use SIREn, as suggested by Renaud in http://search-lucene.com/m/qoQWMVk3w91 ?
* Should one use SpanMaskingQuery and SpanNearQuery, as suggested by Hoss in http://search-lucene.com/m/AEvbbeusnJ1 ?
* Should one use JOIN from https://issues.apache.org/jira/browse/SOLR-2272 ?
* Should one use Nested Document query support from LUCENE-2454 (not in trunk, not in Solr)?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
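A toy illustration (plain Python, not Solr) of the false-positive problem Otis is trying to avoid: if the three children are denormalized into two multiValued fields on one parent document, the pairing between color and size is lost, so query 2 wrongly matches.

```python
# Denormalized parent doc: colors and sizes flattened into separate
# multiValued fields, losing which color goes with which size.
flattened = {"color": ["red", "black", "orange"], "size": [10, 11, 12]}

def matches_flattened(doc, color, size):
    # color:X AND size:Y -- each clause checks its own field independently.
    return color in doc["color"] and size in doc["size"]

# Query 2 (color:red AND size:12) should find nothing, since the only red
# shoe is size 10 -- but the flattened doc matches anyway.
print(matches_flattened(flattened, "red", 12))  # True -- a false positive

# Keeping the children as discrete (color, size) pairs preserves the pairing.
children = [("red", 10), ("black", 11), ("orange", 12)]
print(("red", 12) in children)  # False -- correct: no match
```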