Re: Updating a single field in a Solr document
Is this feature planned for any future release? I ask because it will help me plan my system architecture accordingly.

Thanks,
Raghu

On Tue, Jan 19, 2010 at 7:28 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
> On Mon, Jan 18, 2010 at 5:11 PM, Raghuveer Kancherla <raghuveer.kanche...@aplopio.com> wrote:
>> Hi, I have two fields: one captures the category of the document, and the other holds the pre-processed text of the document. The text is fairly large. The category changes often while the text stays the same. Search happens on both fields. The problem is that I have to index both the text and the category each time the category changes; the text being large obviously makes this suboptimal. Is there a patch or a trick to avoid indexing the text field every time?
>
> Sure: make the text field stored, read the old document, and create the new one. Sorry, there is no way to update an indexed document in Solr (yet).
>
> --
> Regards,
> Shalin Shekhar Mangar.
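Shalin's suggestion (store every field, fetch the old document, and re-add it with the changed category) can be sketched in Python. Solr 1.4 has no partial-update API, so the whole document is re-posted; the field names and the update URL below are hypothetical examples:

```python
import xml.sax.saxutils as sax

def build_readd_xml(stored_doc, **changes):
    """Build an <add> XML body that re-posts a previously stored document
    with some field values replaced (e.g. a new category)."""
    doc = dict(stored_doc)
    doc.update(changes)
    fields = "".join(
        "<field name=%s>%s</field>" % (sax.quoteattr(name), sax.escape(str(value)))
        for name, value in sorted(doc.items())
    )
    return "<add><doc>%s</doc></add>" % fields

# A document as previously fetched from Solr with all fields stored
# (field names here are made up for illustration):
old = {"id": "42", "category": "engineering", "text": "large pre-processed text ..."}
xml_body = build_readd_xml(old, category="sales")
# POST xml_body to http://localhost:8983/solr/update (Content-Type: text/xml)
# and commit; the unchanged text field is simply indexed again.
```

The cost of re-indexing the large text field is unavoidable this way, which is exactly the overhead the thread is asking to eliminate.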
Updating a single field in a Solr document
Hi,

I have two fields: one captures the category of the document, and the other holds the pre-processed text of the document. The text is fairly large. The category changes often while the text stays the same, and search happens on both fields.

The problem is that I have to index both the text and the category each time the category changes; the text being large obviously makes this suboptimal. Is there a patch or a trick to avoid indexing the text field every time?

Thanks,
Raghu
Re: Configuring Solr to use RAMDirectory
Hi Dipti,

Just out of curiosity, are you trying to use RAMDirectory for an improvement in speed? I tried doing that and did not see any significant improvement. It would be nice to know what your experiment shows.

- Raghu

On Thu, Dec 31, 2009 at 4:17 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
> It's possible, but it requires a custom DirectoryFactory implementation. There isn't a built-in factory to construct a RAMDirectory. You wire it into solrconfig.xml this way:
>
>   <directoryFactory name="DirectoryFactory" class="[fully.qualified.classname]">
>     <!-- Parameters as required by the implementation -->
>   </directoryFactory>
>
> On Dec 31, 2009, at 5:06 AM, dipti khullar wrote:
>> Hi, can somebody let me know if it's possible to configure RAMDirectory from solrconfig.xml? Although it's clearly mentioned in https://issues.apache.org/jira/browse/SOLR-465 by Mark that he has worked on it, I still couldn't find any such property in the config file in the latest Solr 1.4 download. Maybe I am overlooking some simple property. Any help would be appreciated.
>> Thanks, Dipti
>>
>> On Fri, Nov 20, 2009 at 2:27 PM, Andrey Klochkov <akloch...@griddynamics.com> wrote:
>>> I thought that SOLR-465 does just what is asked, i.e. one can use any Directory implementation including RAMDirectory. Thomas, take a look at it.
>>>
>>> On Thu, Nov 12, 2009 at 7:55 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
>>>> I think not out of the box, but look at the SOLR-243 issue in JIRA. You could also put your index on a RAM disk (tmpfs), but it would be useless for writing to it. Note that when people ask about loading the whole index in memory explicitly, it's often a premature optimization attempt.
>>>>
>>>> Otis
>>>> --
>>>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>>>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>>>
>>>> ----- Original Message -----
>>>> From: Thomas Nguyen <thngu...@ign.com>
>>>> To: solr-user@lucene.apache.org
>>>> Sent: Wed, November 11, 2009 8:46:11 PM
>>>> Subject: Configuring Solr to use RAMDirectory
>>>>
>>>> Is it possible to configure Solr to fully load indexes in memory? I wasn't able to find any documentation about this on either their site or in the Solr 1.4 Enterprise Search Server book.

--
Andrew Klochkov
Senior Software Engineer, Grid Dynamics
Re: Multi Solr
Based on your need, you can choose one of the options listed at http://wiki.apache.org/solr/MultipleIndexes

- Raghu

On Tue, Dec 22, 2009 at 10:46 AM, Olala <hthie...@gmail.com> wrote:
> Hi all! I have deployed Solr on Tomcat, but now I want to run many Solr instances on only one Tomcat server. Can that be done or not?
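One of the options on that wiki page is multi-core: several Solr cores in a single web application, declared in solr.xml. A sketch of what that looks like in Solr 1.4 (the core names and instanceDir values here are just examples):

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>
```

Each core gets its own conf/ and data/ under its instanceDir and is addressed as /solr/core0/select, /solr/core1/select, and so on. The alternative on the same wiki page is deploying the Solr WAR multiple times under different context paths in Tomcat.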
Re: payload queries running slow
Hi Grant,

My queries are about 5 times slower when using payloads compared to queries that don't use payloads on the same index. I have not done any profiling yet; I am trying out LucidGaze now. I do all the load testing after warming up. Since my index is small (~1 GB), I was wondering if a RAMDirectory would help instead of the default Directory implementation for the IndexReader?

Thanks,
Raghu

On Thu, Dec 17, 2009 at 6:58 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> On Dec 17, 2009, at 4:52 AM, Raghuveer Kancherla wrote:
>> Hi, with help from the group here, I have been able to set up a search application with payloads enabled. However, there is a noticeable increase in query response times with payloads compared to the same queries without payloads. I am also seeing a lot more disk I/O (I have a 7200 rpm disk) and comparatively less CPU usage. I am guessing this is because of the use of PayloadTermQuery and PayloadNearQuery, both of which extend SpanQuery. SpanQueries read the positions index, which will be much larger than the index accessed by a simple TermQuery. Is there any way of making this system faster without having to distribute the index? My index size is hardly 1 GB (~200k documents and only one field to search in). I am experiencing query times as high as 2 seconds (average). Any indication of directions I could experiment in would also be very helpful.
>
> Yeah, payloads are going to be slower, but how much slower are they for you? Are you warming up those queries? Also, have you done any profiling?
>
>> I looked at the HathiTrust digital library articles. The methods described there are about avoiding reading the positions index (converting PhraseQueries to TermQueries). That will not work in my case because I still have to read the positions index to get the payload information during scoring. Let me know if my understanding is incorrect.
>> Thanks, -Raghu
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
payload queries running slow
Hi,

With help from the group here, I have been able to set up a search application with payloads enabled. However, there is a noticeable increase in query response times with payloads compared to the same queries without payloads. I am also seeing a lot more disk I/O (I have a 7200 rpm disk) and comparatively less CPU usage. I am guessing this is because of the use of PayloadTermQuery and PayloadNearQuery, both of which extend SpanQuery. SpanQueries read the positions index, which will be much larger than the index accessed by a simple TermQuery.

Is there any way of making this system faster without having to distribute the index? My index size is hardly 1 GB (~200k documents and only one field to search in). I am experiencing query times as high as 2 seconds (average). Any indication of directions I could experiment in would also be very helpful.

I looked at the HathiTrust digital library articles. The methods described there are about avoiding reading the positions index (converting PhraseQueries to TermQueries). That will not work in my case because I still have to read the positions index to get the payload information during scoring. Let me know if my understanding is incorrect.

Thanks,
-Raghu
Re: parsedquery becomes PhraseQuery
It's likely that your analyzer has WordDelimiterFilterFactory (look at your schema for the field in question). If a single token is split into more tokens during the analysis phase, Solr will do a phrase query instead of a term query. In your case "disk/1.0" is being analyzed into "disk 1 0" (three tokens), hence the phrase query.

-Raghu

On Thu, Dec 17, 2009 at 3:40 AM, Jibo John <jiboj...@mac.com> wrote:
> Hello, I have a question on how Solr determines whether the q value needs to be analyzed as a regular query or as a phrase query.
>
> Let's say I have a text 'jibojohn info disk/1.0'. If I query for 'jibojohn info', I get the results. The query is parsed as:
>
>   <str name="rawquerystring">jibojohn info</str>
>   <str name="querystring">jibojohn info</str>
>   <str name="parsedquery">+data:jibojohn +data:info</str>
>   <str name="parsedquery_toString">+data:jibojohn +data:info</str>
>
> However, if I query for 'disk/1.0', I get nothing. The query is parsed as:
>
>   <str name="rawquerystring">disk/1.0</str>
>   <str name="querystring">disk/1.0</str>
>   <str name="parsedquery">PhraseQuery(data:disk 1 0)</str>
>   <str name="parsedquery_toString">data:disk 1 0</str>
>
> I was expecting this to be treated as a regular query instead of a phrase query, and I was wondering why. Appreciate your input.
>
> -Jibo
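A rough illustration of why 'disk/1.0' turns into a phrase query: splitting on non-alphanumeric characters (a deliberate simplification of WordDelimiterFilterFactory, which has many more options for case changes, catenation, and so on) yields several tokens, and the query parser treats multi-token input as a phrase:

```python
import re

def split_like_wdf(token):
    """Very simplified stand-in for WordDelimiterFilterFactory:
    break a token on any run of non-alphanumeric characters."""
    return [part for part in re.split(r"[^0-9A-Za-z]+", token) if part]

print(split_like_wdf("jibojohn"))   # single token -> plain term query
print(split_like_wdf("disk/1.0"))   # three tokens -> phrase query on "disk 1 0"
```

If the splitting is unwanted for a field, the usual fixes are removing WordDelimiterFilterFactory from that field type's query analyzer or adjusting its options so the token survives intact.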
Re: Payloads with Phrase queries
The interesting thing I am noticing is that the scoring works fine for a phrase query like "solr rocks". This led me to look at what query I am using in the case of a single term. It turns out that I am using PayloadTermQuery, taking a cue from the SOLR-1485 patch. I changed this to BoostingTermQuery (I read somewhere that this is deprecated, but I was just experimenting) and the scoring now seems to work as expected for a single term.

Now, the important question is: what is the payload version of a TermQuery?

Regards,
Raghu

On Tue, Dec 15, 2009 at 12:45 PM, Raghuveer Kancherla <raghuveer.kanche...@aplopio.com> wrote:
> Hi, thanks everyone for the responses. I am now able to get both phrase queries and term queries to use payloads. However, the score value for each document (and consequently the ordering of documents) is coming out wrong. In the Solr output appended below, document 4 has a score higher than document 2 (look at the debug part). The results section shows a wrong score (which is the payload value I am returning from my custom similarity class), and the ordering is also wrong because of this. Can someone explain this? My custom query parser is pasted here: http://pastebin.com/m9f21565 In the similarity class, I return 10.0 if the payload is 1 and 20.0 if the payload is 2. For everything else I return 1.0.
> {'responseHeader':{'status':0, 'QTime':2,
>    'params':{'fl':'*,score', 'debugQuery':'on', 'indent':'on', 'start':'0', 'q':'solr', 'qt':'aplopio', 'wt':'python', 'fq':'', 'rows':'10'}},
>  'response':{'numFound':5, 'start':0, 'maxScore':20.0, 'docs':[
>    {'payloadTest':'solr|2 rocks|1', 'id':'2', 'score':20.0},
>    {'payloadTest':'solr|2', 'id':'4', 'score':20.0},
>    {'payloadTest':'solr|1 rocks|2', 'id':'1', 'score':10.0},
>    {'payloadTest':'solr|1 rocks|1', 'id':'3', 'score':10.0},
>    {'payloadTest':'solr', 'id':'5', 'score':1.0}]},
>  'debug':{
>    'rawquerystring':'solr', 'querystring':'solr',
>    'parsedquery':'PayloadTermQuery(payloadTest:solr)',
>    'parsedquery_toString':'payloadTest:solr',
>    'explain':{
>      '2':'\n7.227325 = (MATCH) fieldWeight(payloadTest:solr in 1), product of:\n 14.142136 = (MATCH) btq, product of:\n 0.70710677 = tf(phraseFreq=0.5)\n 20.0 = scorePayload(...)\n 0.81767845 = idf(payloadTest: solr=5)\n 0.625 = fieldNorm(field=payloadTest, doc=1)\n',
>      '4':'\n11.56372 = (MATCH) fieldWeight(payloadTest:solr in 3), product of:\n 14.142136 = (MATCH) btq, product of:\n 0.70710677 = tf(phraseFreq=0.5)\n 20.0 = scorePayload(...)\n 0.81767845 = idf(payloadTest: solr=5)\n 1.0 = fieldNorm(field=payloadTest, doc=3)\n',
>      '1':'\n3.6136625 = (MATCH) fieldWeight(payloadTest:solr in 0), product of:\n 7.071068 = (MATCH) btq, product of:\n 0.70710677 = tf(phraseFreq=0.5)\n 10.0 = scorePayload(...)\n 0.81767845 = idf(payloadTest: solr=5)\n 0.625 = fieldNorm(field=payloadTest, doc=0)\n',
>      '3':'\n3.6136625 = (MATCH) fieldWeight(payloadTest:solr in 2), product of:\n 7.071068 = (MATCH) btq, product of:\n 0.70710677 = tf(phraseFreq=0.5)\n 10.0 = scorePayload(...)\n 0.81767845 = idf(payloadTest: solr=5)\n 0.625 = fieldNorm(field=payloadTest, doc=2)\n',
>      '5':'\n0.578186 = (MATCH) fieldWeight(payloadTest:solr in 4), product of:\n 0.70710677 = (MATCH) btq, product of:\n 0.70710677 = tf(phraseFreq=0.5)\n 1.0 = scorePayload(...)\n 0.81767845 = idf(payloadTest: solr=5)\n 1.0 = fieldNorm(field=payloadTest, doc=4)\n'},
>    'QParser':'BoostingTermQParser',
>    'filter_queries':[''],
>    'parsed_filter_queries':[],
>    'timing':{'time':2.0,
>      'prepare':{'time':1.0,
>        'org.apache.solr.handler.component.QueryComponent':{'time':1.0},
>        'org.apache.solr.handler.component.FacetComponent':{'time':0.0},
>        'org.apache.solr.handler.component.MoreLikeThisComponent':{'time':0.0},
>        'org.apache.solr.handler.component.HighlightComponent':{'time':0.0},
>        'org.apache.solr.handler.component.StatsComponent':{'time':0.0},
>        'org.apache.solr.handler.component.DebugComponent':{'time':0.0}},
>      'process':{'time':1.0,
>        'org.apache.solr.handler.component.QueryComponent':{'time':0.0},
>        'org.apache.solr.handler.component.FacetComponent':{'time':0.0},
>        'org.apache.solr.handler.component.MoreLikeThisComponent':{'time':0.0},
>        'org.apache.solr.handler.component.HighlightComponent':{'time':0.0},
>        'org.apache.solr.handler.component.StatsComponent':{'time':0.0},
>        'org.apache.solr.handler.component.DebugComponent':{'time':1.0}
>
> On Thu, Dec
Re: Payloads with Phrase queries
Hi,

Thanks everyone for the responses. I am now able to get both phrase queries and term queries to use payloads. However, the score value for each document (and consequently the ordering of documents) is coming out wrong. In the Solr output appended below, document 4 has a score higher than document 2 (look at the debug part). The results section shows a wrong score (which is the payload value I am returning from my custom similarity class), and the ordering is also wrong because of this. Can someone explain this?

My custom query parser is pasted here: http://pastebin.com/m9f21565 In the similarity class, I return 10.0 if the payload is 1 and 20.0 if the payload is 2. For everything else I return 1.0.

{'responseHeader':{'status':0, 'QTime':2,
   'params':{'fl':'*,score', 'debugQuery':'on', 'indent':'on', 'start':'0', 'q':'solr', 'qt':'aplopio', 'wt':'python', 'fq':'', 'rows':'10'}},
 'response':{'numFound':5, 'start':0, 'maxScore':20.0, 'docs':[
   {'payloadTest':'solr|2 rocks|1', 'id':'2', 'score':20.0},
   {'payloadTest':'solr|2', 'id':'4', 'score':20.0},
   {'payloadTest':'solr|1 rocks|2', 'id':'1', 'score':10.0},
   {'payloadTest':'solr|1 rocks|1', 'id':'3', 'score':10.0},
   {'payloadTest':'solr', 'id':'5', 'score':1.0}]},
 'debug':{
   'rawquerystring':'solr', 'querystring':'solr',
   'parsedquery':'PayloadTermQuery(payloadTest:solr)',
   'parsedquery_toString':'payloadTest:solr',
   'explain':{
     '2':'\n7.227325 = (MATCH) fieldWeight(payloadTest:solr in 1), product of:\n 14.142136 = (MATCH) btq, product of:\n 0.70710677 = tf(phraseFreq=0.5)\n 20.0 = scorePayload(...)\n 0.81767845 = idf(payloadTest: solr=5)\n 0.625 = fieldNorm(field=payloadTest, doc=1)\n',
     '4':'\n11.56372 = (MATCH) fieldWeight(payloadTest:solr in 3), product of:\n 14.142136 = (MATCH) btq, product of:\n 0.70710677 = tf(phraseFreq=0.5)\n 20.0 = scorePayload(...)\n 0.81767845 = idf(payloadTest: solr=5)\n 1.0 = fieldNorm(field=payloadTest, doc=3)\n',
     '1':'\n3.6136625 = (MATCH) fieldWeight(payloadTest:solr in 0), product of:\n 7.071068 = (MATCH) btq, product of:\n 0.70710677 = tf(phraseFreq=0.5)\n 10.0 = scorePayload(...)\n 0.81767845 = idf(payloadTest: solr=5)\n 0.625 = fieldNorm(field=payloadTest, doc=0)\n',
     '3':'\n3.6136625 = (MATCH) fieldWeight(payloadTest:solr in 2), product of:\n 7.071068 = (MATCH) btq, product of:\n 0.70710677 = tf(phraseFreq=0.5)\n 10.0 = scorePayload(...)\n 0.81767845 = idf(payloadTest: solr=5)\n 0.625 = fieldNorm(field=payloadTest, doc=2)\n',
     '5':'\n0.578186 = (MATCH) fieldWeight(payloadTest:solr in 4), product of:\n 0.70710677 = (MATCH) btq, product of:\n 0.70710677 = tf(phraseFreq=0.5)\n 1.0 = scorePayload(...)\n 0.81767845 = idf(payloadTest: solr=5)\n 1.0 = fieldNorm(field=payloadTest, doc=4)\n'},
   'QParser':'BoostingTermQParser',
   'filter_queries':[''],
   'parsed_filter_queries':[],
   'timing':{'time':2.0,
     'prepare':{'time':1.0,
       'org.apache.solr.handler.component.QueryComponent':{'time':1.0},
       'org.apache.solr.handler.component.FacetComponent':{'time':0.0},
       'org.apache.solr.handler.component.MoreLikeThisComponent':{'time':0.0},
       'org.apache.solr.handler.component.HighlightComponent':{'time':0.0},
       'org.apache.solr.handler.component.StatsComponent':{'time':0.0},
       'org.apache.solr.handler.component.DebugComponent':{'time':0.0}},
     'process':{'time':1.0,
       'org.apache.solr.handler.component.QueryComponent':{'time':0.0},
       'org.apache.solr.handler.component.FacetComponent':{'time':0.0},
       'org.apache.solr.handler.component.MoreLikeThisComponent':{'time':0.0},
       'org.apache.solr.handler.component.HighlightComponent':{'time':0.0},
       'org.apache.solr.handler.component.StatsComponent':{'time':0.0},
       'org.apache.solr.handler.component.DebugComponent':{'time':1.0}

On Thu, Dec 10, 2009 at 5:48 PM, AHMET ARSLAN <iori...@yahoo.com> wrote:
>> I was looking through some Lucene source code and found the following class: org.apache.lucene.search.payloads.PayloadSpanUtil. There is a function named queryToSpanQuery in this class.
>> Is this the preferred way to convert a PhraseQuery to PayloadNearQuery?
>
> The queryToSpanQuery method does not return a PayloadNearQuery. You need to override getFieldQuery(String field, String queryText, int slop) of SolrQueryParser or QueryParser. This code is modified from Lucene in Action (2nd edition), Chapter 6.3.4, "Allowing ordered phrase queries".
Payloads with Phrase queries
Hi,

I am looking for a way to use payloads in my search application. Indexing data with payloads into Solr is pretty straightforward; however, using the payloads at search time is a bit confusing. Can anyone point me in the right direction to enable payloads on a *PhraseQuery*? I looked at the following resources and got payloads on a TermQuery working:

1. http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
2. http://www.mail-archive.com/solr-user@lucene.apache.org/msg24863.html
3. There is also a JIRA issue (SOLR-1485) that provides a patch for using payloads.
4. Lucene in Action

I am guessing that I should return a payload version of PhraseQuery in QueryParser's (org.apache.lucene.queryParser.QueryParser) newPhraseQuery function. If yes, what type should this query be?

Thanks,
Raghu
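For reference, the indexing side mentioned above is handled by a field type along these lines, adapted from the Lucid Imagination payloads article linked in the message (the type name is arbitrary; the delimiter matches values like "solr|2 rocks|1" used later in this thread):

```xml
<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- token|2.0 -> token with a float payload of 2.0 -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float" delimiter="|"/>
  </analyzer>
</fieldtype>
```

The search-time half (a query type that actually reads those payloads, plus a Similarity whose scorePayload uses them) is what the rest of this thread works out.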
Re: Payloads with Phrase queries
I was looking through some Lucene source code and found the class org.apache.lucene.search.payloads.PayloadSpanUtil. There is a function named queryToSpanQuery in this class. Is this the preferred way to convert a PhraseQuery to a PayloadNearQuery? Also, are there any performance considerations when using a PayloadNearQuery instead of a PhraseQuery?

Thanks,
Raghu

On Thu, Dec 10, 2009 at 4:40 PM, AHMET ARSLAN <iori...@yahoo.com> wrote:
>> Hi, I am looking for a way to use payloads in my search application. Indexing data with payloads into Solr is pretty straightforward; however, using the payloads at search time is a bit confusing. Can anyone point me in the right direction to enable payloads on a *PhraseQuery*? I looked at the following resources and got payloads on a TermQuery working:
>> 1. http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
>> 2. http://www.mail-archive.com/solr-user@lucene.apache.org/msg24863.html
>> 3. There is also a JIRA issue (SOLR-1485) that provides a patch for using payloads.
>> 4. Lucene in Action
>> I am guessing that I should return a payload version of PhraseQuery in QueryParser's newPhraseQuery function. If yes, what type should this query be?
>
> Yes: PayloadNearQuery [1]
>
> [1] http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/search/payloads/PayloadNearQuery.html
Re: Retrieving large num of docs
Hi Otis,

I think my experiments are not conclusive about the reduction in search time. I was playing around with various configurations to reduce the time to retrieve documents from Solr. I am sure that after changing the two multi-valued text fields from stored to un-stored, retrieval (query time + time to load the stored fields) became very fast. I was expecting the lazy field loading setting in solrconfig to take care of this, but apparently it is not working as expected.

Out of curiosity, I removed these 2 fields from the index entirely (this time I am not even indexing them) and my search time got better (10 times better). However, I am still trying to isolate the reason for the search time reduction. It may be because there are 2 fewer fields to search in, because of the reduction in the size of the index, or something else. I am not sure if lazy field loading plays any part in explaining this.

- Raghu

On Fri, Dec 4, 2009 at 3:07 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> Hm, hm, interesting. I was looking into something like this the other day (BIG indexed+stored text fields). After seeing enableLazyFieldLoading=true in solrconfig, and after seeing that fl didn't include those big fields, I thought: so Lucene/Solr will not be pulling those large fields from disk, OK. You are saying that this may not be true, based on your experiment? And what I'm calling "your experiment" means that you reindexed the same data, but without the 2 multi-valued text fields... and that was the only change you made, and you got roughly a 10x search performance improvement? Sorry for repeating your words, just trying to confirm and understand.
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
> ----- Original Message -----
> From: Raghuveer Kancherla <raghuveer.kanche...@aplopio.com>
> To: solr-user@lucene.apache.org
> Sent: Thu, December 3, 2009 8:43:16 AM
> Subject: Re: Retrieving large num of docs
>
>> Hi Hoss,
>> I was experimenting with various queries to solve this problem, and in one such test I remember that requesting only the ID did not change the retrieval time. To be sure, I tested it again using the curl command today, and it confirms my previous observation. Also, enableLazyFieldLoading is set to true in my solrconfig.
>> Another general observation (off topic) is that having a moderately large multi-valued text field (~200 entries) in the index seems to slow down the search significantly. I removed the 2 multi-valued text fields from my index and my search got ~10 times faster. :)
>> - Raghu
>>
>> On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter wrote:
>>> : I think I solved the problem of retrieving 300 docs per request for now. The
>>> : problem was that I was storing 2 moderately large multivalued text fields
>>> : though I was not retrieving them during search time. I reindexed all my
>>> : data without storing these fields. Now the response time (time for Solr to
>>> : return the http response) is very close to the QTime Solr is showing in the
>>>
>>> Hmmm, two comments:
>>>
>>> 1) the example URL from your previous mail...
>>> : http://localhost:1212/solr/select/?rows=300q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29start=0wt=python
>>> ...doesn't match your earlier statement that you are only returning the id field (there is no fl param in that URL). Are you certain you weren't returning those large stored fields in the response?
>>>
>>> 2) assuming you were actually using an fl param to limit the fields, make sure you have this setting in your solrconfig.xml...
>>>
>>>   <enableLazyFieldLoading>true</enableLazyFieldLoading>
>>>
>>> ...that should make it pretty fast to return only a few fields of each document, even if you do have some jumbo stored fields that aren't being returned.
>>>
>>> -Hoss
Re: WELCOME to solr-user@lucene.apache.org
Two ways I can think of:

- ExtractingRequestHandler (this is what I am guessing you are using now). Set extractOnly=true while making a request to the ExtractingRequestHandler and get the parsed content back. Then make a post request to the update request handler with whatever fields and field values you want.
- Use HTMLStripWhitespaceTokenizerFactory. This article may help explain what I mean: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripWhitespaceTokenizerFactory

- Raghu

On Sat, Dec 5, 2009 at 3:44 AM, khalid y <kern...@gmail.com> wrote:
> Hi, I have a problem with Solr. I'm indexing some HTML content and Solr crashes because my id field is multivalued. I found that Tika reads the HTML and extracts metadata like <meta name="id" content="12"/> from my HTML files, but my documents already have an id set by literal.id=10. I tried to map the id from Tika with fmap.id=ignored_, but it also ignores my literal.id. I'm using Solr 1.4 and Tika 0.5. Can someone explain how I can ignore the Tika id metadata?
> Thanks
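The first option (extract-only, then post the fields yourself) keeps Tika's metadata out of the document entirely. A small sketch of the request parameters involved; the parameters mirror the ones discussed above, while the handler path and any surrounding client code are assumptions:

```python
from urllib.parse import urlencode

def extract_only_params(doc_id):
    """Query parameters for step one of a two-step flow: ask
    ExtractingRequestHandler only to parse the document (extractOnly=true)
    instead of indexing it directly."""
    return urlencode({"extractOnly": "true", "literal.id": doc_id, "wt": "json"})

params = extract_only_params("10")
# POST the HTML file to something like /solr/update/extract?<params>, read the
# parsed text out of the response, then build your own add document with only
# the fields you want -- Tika's <meta name="id"> never reaches the index.
```

The second option sidesteps Tika altogether by indexing the raw HTML through an analyzer that strips markup.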
Re: Retrieving large num of docs
Hi Hoss,

I was experimenting with various queries to solve this problem, and in one such test I remember that requesting only the ID did not change the retrieval time. To be sure, I tested it again using the curl command today, and it confirms my previous observation. Also, enableLazyFieldLoading is set to true in my solrconfig.

Another general observation (off topic) is that having a moderately large multi-valued text field (~200 entries) in the index seems to slow down the search significantly. I removed the 2 multi-valued text fields from my index and my search got ~10 times faster. :)

- Raghu

On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> : I think I solved the problem of retrieving 300 docs per request for now. The
> : problem was that I was storing 2 moderately large multivalued text fields
> : though I was not retrieving them during search time. I reindexed all my
> : data without storing these fields. Now the response time (time for Solr to
> : return the http response) is very close to the QTime Solr is showing in the
>
> Hmmm, two comments:
>
> 1) the example URL from your previous mail...
> : http://localhost:1212/solr/select/?rows=300q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29start=0wt=python
> ...doesn't match your earlier statement that you are only returning the id field (there is no fl param in that URL). Are you certain you weren't returning those large stored fields in the response?
>
> 2) assuming you were actually using an fl param to limit the fields, make sure you have this setting in your solrconfig.xml...
>
>   <enableLazyFieldLoading>true</enableLazyFieldLoading>
>
> ...that should make it pretty fast to return only a few fields of each document, even if you do have some jumbo stored fields that aren't being returned.
>
> -Hoss
Re: Retrieving large num of docs
Hi Hoss/Andrew,

I think I solved the problem of retrieving 300 docs per request for now. The problem was that I was storing 2 moderately large multivalued text fields even though I was not retrieving them at search time. I reindexed all my data without storing these fields. Now the response time (time for Solr to return the HTTP response) is very close to the QTime Solr is showing in the logs.

Thanks for all the help,
Raghu

On Mon, Nov 30, 2009 at 11:37 AM, Raghuveer Kancherla <raghuveer.kanche...@aplopio.com> wrote:
> Thanks Hoss,
> In my previous mail, I was measuring the system time difference between sending an (HTTP) request and receiving a response. This was being run on a (different) client machine.
> Like you suggested, I tried to time the response on the server itself as follows:
>
>   $ /usr/bin/time -p curl -sS -o solr.out http://localhost:1212/solr/select/?rows=300q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29start=0wt=python
>   real 3.49
>   user 0.00
>   sys 0.00
>
> The query time in the Solr log shows me QTime=600; the size of solr.out is 843 kB. As you've mentioned, Solr shouldn't give these kinds of numbers for 300 docs, and we're quite perplexed as to what's going on.
> Thanks, Raghu
>
> On Mon, Nov 30, 2009 at 6:00 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>> : I am using Solr 1.4 for searching through half a million documents. The
>> : problem is, I want to retrieve nearly 200 documents for each search query.
>> : The query time in Solr logs is showing 0.02 seconds and I am fairly happy
>> : with that. However Solr is taking a long time (4 to 5 secs) to return the
>> : results (I think it is because of the number of docs I am requesting). I
>> : tried returning only the id's (unique key) without any other stored fields,
>> : but it is not helping me improve the response times (time to return the id's
>> : of matching documents).
>>
>> What exactly does your request URL look like, and how exactly are you timing the total response time?
>>
>> 200 isn't a very big number for the rows param -- people who want to get 100K documents back in their response at a time may have problems, but 200 is not that big. So like I said: how exactly are you timing things? My guess: it's more likely that network overhead or the performance of your client code (reading the data off the wire) is causing your timing code to seem slow, than it is that Solr is taking 5 seconds to write out those document IDs.
>>
>> I suspect if you try hitting the same exact URL using curl via localhost, you'll see the total response time be a lot less than 5 seconds. Here's an example of a query that asks Solr to return *every* field from 500 documents, in the XML format. And these are not small documents...
>>
>>   $ /usr/bin/time -p curl -sS -o /tmp/solr.out http://localhost:5051/solr/select/?q=doctype:productversion=2.2start=0rows=500indent=on
>>   real 0.07
>>   user 0.00
>>   sys 0.00
>>   [chr...@c18-ssa-so-dfll-qry1 ~]$ du -sh /tmp/solr.out
>>   1.6M    /tmp/solr.out
>>
>> ...that's 1.6 MB of 500 Solr documents with all of their fields in verbose XML format (including indenting), fetched in 70 ms. If it's taking 5 seconds for you to get just the ids of 200 docs, you've got a problem somewhere, and I'm 99% certain it's not in Solr. What does a similar timed curl command for your URL look like when you run it on your Solr server?
>>
>> -Hoss
Re: Retrieving large num of docs
Thanks Hoss,

In my previous mail, I was measuring the system time difference between sending an (HTTP) request and receiving a response. This was being run on a (different) client machine.

Like you suggested, I tried to time the response on the server itself as follows:

  $ /usr/bin/time -p curl -sS -o solr.out http://localhost:1212/solr/select/?rows=300q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29start=0wt=python
  real 3.49
  user 0.00
  sys 0.00

The query time in the Solr log shows me QTime=600; the size of solr.out is 843 kB. As you've mentioned, Solr shouldn't give these kinds of numbers for 300 docs, and we're quite perplexed as to what's going on.

Thanks,
Raghu

On Mon, Nov 30, 2009 at 6:00 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> : I am using Solr 1.4 for searching through half a million documents. The
> : problem is, I want to retrieve nearly 200 documents for each search query.
> : The query time in Solr logs is showing 0.02 seconds and I am fairly happy
> : with that. However Solr is taking a long time (4 to 5 secs) to return the
> : results (I think it is because of the number of docs I am requesting). I
> : tried returning only the id's (unique key) without any other stored fields,
> : but it is not helping me improve the response times (time to return the id's
> : of matching documents).
>
> What exactly does your request URL look like, and how exactly are you timing the total response time?
>
> 200 isn't a very big number for the rows param -- people who want to get 100K documents back in their response at a time may have problems, but 200 is not that big. So like I said: how exactly are you timing things? My guess: it's more likely that network overhead or the performance of your client code (reading the data off the wire) is causing your timing code to seem slow, than it is that Solr is taking 5 seconds to write out those document IDs.
>
> I suspect if you try hitting the same exact URL using curl via localhost, you'll see the total response time be a lot less than 5 seconds. Here's an example of a query that asks Solr to return *every* field from 500 documents, in the XML format. And these are not small documents...
>
>   $ /usr/bin/time -p curl -sS -o /tmp/solr.out http://localhost:5051/solr/select/?q=doctype:productversion=2.2start=0rows=500indent=on
>   real 0.07
>   user 0.00
>   sys 0.00
>   [chr...@c18-ssa-so-dfll-qry1 ~]$ du -sh /tmp/solr.out
>   1.6M    /tmp/solr.out
>
> ...that's 1.6 MB of 500 Solr documents with all of their fields in verbose XML format (including indenting), fetched in 70 ms. If it's taking 5 seconds for you to get just the ids of 200 docs, you've got a problem somewhere, and I'm 99% certain it's not in Solr. What does a similar timed curl command for your URL look like when you run it on your Solr server?
>
> -Hoss
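When timing this from a client, it helps to separate Solr's reported QTime from the wall-clock time spent on transfer and parsing. With wt=python (as used in the URLs above) the response body is a Python literal, so it can be parsed with ast.literal_eval. The sample response below is abbreviated and the surrounding HTTP client code is left out:

```python
import ast
import time

def parse_python_response(body):
    """Parse a Solr wt=python response body; return (QTime, numFound)."""
    rsp = ast.literal_eval(body)
    return rsp["responseHeader"]["QTime"], rsp["response"]["numFound"]

# Abbreviated example of a wt=python body:
sample = ("{'responseHeader':{'status':0,'QTime':600},"
          "'response':{'numFound':300,'start':0,'docs':[]}}")

start = time.time()
qtime, num_found = parse_python_response(sample)
elapsed_ms = (time.time() - start) * 1000.0
# In a real client, elapsed_ms would also include the HTTP round trip; a large
# gap between it and QTime points at transfer or client-side work, not Solr.
```

QTime is in milliseconds and covers only Solr's query execution, not writing the stored fields to the wire, which is why the two numbers can diverge so sharply in this thread.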
Re: Retrieving large num of docs
Hi Andrew,

I applied the patch you suggested. I am not finding any significant changes in the response times. I am wondering if I forgot some important configuration setting etc. Here is what I did:

1. Wrote a small program using solrj to use EmbeddedSolrServer (most of the code is from the Solr wiki), ran the server on an index of ~700k docs, and noted down the avg response time.
2. Applied the SOLR-797.patch to the source code of Solr 1.4.
3. Compiled the source code and rebuilt the jar files.
4. Reran step 1 using the new jar files.

Am I supposed to make any other config changes in order to see the performance jump that you are able to achieve?

Thanks a lot,
Raghu

On Fri, Nov 27, 2009 at 3:16 PM, AHMET ARSLAN iori...@yahoo.com wrote:

: Hi Andrew, We are running solr using its http interface from python. From
: the resources I could find, EmbeddedSolrServer is possible only if I am
: using solr from a java program. It will be useful to understand if a
: significant part of the performance increase is due to bypassing HTTP
: before going down this path. In the mean time I am trying my luck with the
: other suggestions. Can you share the patch that helps cache solr documents
: instead of lucene documents?

Maybe these links can help:

http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://www.lucidimagination.com/Downloads/LucidGaze-for-Solr

How often do you update your index? Is your index optimized? Configuring caching can also help:

http://wiki.apache.org/solr/SolrCaching
http://wiki.apache.org/solr/SolrPerformanceFactors
Re: Retrieving large num of docs
Hi Andrew,

We are running Solr using its HTTP interface from Python. From the resources I could find, EmbeddedSolrServer is possible only if I am using Solr from a Java program. It will be useful to understand if a significant part of the performance increase is due to bypassing HTTP before going down this path. In the meantime I am trying my luck with the other suggestions. Can you share the patch that helps cache Solr documents instead of Lucene documents?

On a different note, I am wondering why it takes 4 - 5 seconds for Solr to return the IDs of ranked documents when it can rank the results in about 20 milliseconds? Am I missing something here?

Thanks,
Raghu

On Fri, Nov 27, 2009 at 2:15 AM, Andrey Klochkov akloch...@griddynamics.com wrote:

Hi

We obtain ALL documents for every query; the index size is about 50k. We use a number of stored fields. Often the result set size is several thousand docs. We did the following things to make it faster:

1. Use EmbeddedSolrServer.
2. Patch Solr to avoid unnecessary marshalling while using EmbeddedSolrServer (there's an issue in Solr JIRA).
3. Patch Solr to cache SolrDocument instances instead of Lucene's Document instances. I was going to share this patch, but then decided that our usage of Solr is not common and this functionality is useless in most cases.
4. We have all documents in cache.
5. In fact our index is stored in a data grid, not a file system. But as tests showed, this is not important, because standard FSDirectory is faster if you have enough RAM free for OS caches.

These changes improved the performance very much, so in the end we have performance comparable (about 3-5 times slower) to the proper Solr usage (obtaining the first 20 documents). To get more details on how different Solr components perform, we injected perf4j statements into key points in the code. A profiler was helpful too.

Hope it helps somehow.
On Thu, Nov 26, 2009 at 8:48 PM, Raghuveer Kancherla raghuveer.kanche...@aplopio.com wrote:

: Hi, I am using Solr1.4 for searching through half a million documents. The
: problem is, I want to retrieve nearly 200 documents for each search query.
: The query time in Solr logs is showing 0.02 seconds and I am fairly happy
: with that. However Solr is taking a long time (4 to 5 secs) to return the
: results (I think it is because of the number of docs I am requesting). I
: tried returning only the id's (unique key) without any other stored fields,
: but it is not helping me improve the response times (time to return the
: id's of matching documents). I understand that retrieving 200 documents
: for each search term is impractical in most scenarios but I don't have any
: other option. Any pointers on how to improve the response times will be a
: great help.
: Thanks, Raghu

--
Andrew Klochkov
Senior Software Engineer, Grid Dynamics
Retrieving large num of docs
Hi,

I am using Solr 1.4 for searching through half a million documents. The problem is, I want to retrieve nearly 200 documents for each search query. The query time in the Solr logs shows 0.02 seconds, and I am fairly happy with that. However, Solr is taking a long time (4 to 5 secs) to return the results (I think it is because of the number of docs I am requesting). I tried returning only the ids (unique key) without any other stored fields, but it is not helping me improve the response times (the time to return the ids of matching documents).

I understand that retrieving 200 documents for each search term is impractical in most scenarios, but I don't have any other option. Any pointers on how to improve the response times will be a great help.

Thanks,
Raghu
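When the logged query time (0.02 s) and the observed response time (4-5 s) disagree this much, one way to narrow down where the time goes is to wall-clock the raw fetch separately from any response parsing in the client. A minimal sketch in Python (the helper name and the stand-in fetch function are mine, not from this thread; in practice you would pass the real HTTP request call):

```python
import time

def timed(fn, *args, **kwargs):
    # Wall-clock a single call, like `/usr/bin/time -p` does for curl,
    # but scoped to one function inside the client process.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for the HTTP fetch; swap in the real request call to see
# how much of the 4-5 s is spent on the wire vs. in post-processing.
body, elapsed = timed(lambda: "fake response body")
print(f"fetched {len(body)} bytes in {elapsed:.3f}s")
```

If the fetch alone accounts for most of the delay, the bottleneck is transfer or server-side serialization; if not, it is the client's parsing of the response.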