Merged segment warmer Solr 4.4
Hi, I have a machine with slow storage and not enough RAM to hold the whole index. This causes the first queries (~5000) to be very slow (they are read from disk and my CPU is mostly in iowait); after that, reads become very fast and come mainly from memory, as the OS cache has cached the most-used parts of the index. My concern is about new segments that are committed to disk, either merged segments or newly formed segments. My first thought was to tune the Linux caching policy (to favor caching of index files over uninverted files that are least frequently used) so that the right data ends up in the OS cache without having to explicitly query the index. Secondly, I thought of a newSearcher event listener that queries docs inserted since the last hard commit. A new feature of Solr 4.4 (SOLR-4761) is the ability to configure a mergedSegmentWarmer - how does this component work, and is it a good fit for my use case? Are there any other ideas for dealing with this use case? What would you propose as the most effective way to handle it?
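For reference, a merged segment warmer is a Lucene-level hook: it is invoked on each newly merged segment before that segment becomes visible to searchers, so the relevant files are already warm when the first real query arrives. Below is a minimal sketch of the underlying Lucene 4.4 API; the field name "content" is illustrative, and in Solr the warmer class (for example Lucene's SimpleMergedSegmentWarmer, or a custom one) is configured declaratively in the indexConfig section of solrconfig.xml rather than wired up in code like this.

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.util.Version;

    public class WarmerSketch {
      public static IndexWriterConfig configWithWarmer() {
        IndexWriterConfig iwc = new IndexWriterConfig(
            Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
        // Called once per newly merged segment, before searchers can see it.
        iwc.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
          @Override
          public void warm(AtomicReader reader) throws IOException {
            // Touch the term dictionary of a heavily queried field
            // ("content" is an illustrative name) so its pages land
            // in the OS cache before live queries hit the segment.
            Terms terms = reader.terms("content");
            if (terms != null) {
              terms.iterator(null).next();
            }
          }
        });
        return iwc;
      }
    }

A warmer only heats what it touches, so for the use case above it would make sense to iterate whichever structures the real queries rely on.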
Re: processing documents in solr
Hi, The easiest solution would be to have the timestamp indexed. Is there any issue in doing re-indexing? If you want to process records in batches then you need an ordered list and a bookmark: a field to sort on, plus a counter / last id maintained as the bookmark. This is mandatory to solve your problem. If you don't want to re-index, then you need to maintain information about visited documents: have a database / Solr core which maintains the list of IDs already processed. Fetch records from Solr and, for each record, check that DB to see whether the record was already processed. Regards Aditya www.findbestopensource.com On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang smartag...@gmail.com wrote: Basically, I was thinking about running a range query like Shawn suggested on the tstamp field, but unfortunately it was not indexed. Range queries only work on indexed fields, right? On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang smartag...@gmail.com wrote: I've been thinking about the tstamp solution in the past few days, but too bad, the field is available but not indexed... I'm not familiar with SolrJ. Again, it sounds like SolrJ is providing the counter value. If yes, that would be equivalent to an autoincrement id. I'm indexing from Nutch though; I don't know how to feed in such a counter... On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson erickerick...@gmail.com wrote: Why wouldn't a simple timestamp work for the ordering? Although I guess a simple timestamp isn't really simple if the time settings change. So how about a simple counter field in your documents? Assuming you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc. Take the counter from the first document returned, and increment it for each doc for the life of the indexing run. Now you've got, for all intents and purposes, an identity field, albeit manually maintained. Then use your counter field as Shawn suggests for pulling all the data out. FWIW, Erick On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara mcucchi...@apache.org wrote: In both cases, for better performance, first I'd load just the IDs; afterwards, during processing, I'd load each document. As for the incremental requirement, it should not be difficult to write a hash function which maps a non-numerical id to a value. On Jul 27, 2013 7:03 AM, Joe Zhang smartag...@gmail.com wrote: Dear list: I have an ever-growing Solr repository, and I need to process every single document to extract statistics. What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me to divide and conquer --- if I have processed the first 20k docs, next time I can start with doc 20001. A simple *:* query would satisfy the 1st but not the 2nd property. In fact, given that the processing will take very long, and the repository keeps growing, it is not even clear that exhaustiveness is achieved. I'm running Solr 3.6.2 in a single-machine setting; no Hadoop capability yet. But I guess the same issues still hold even in a SolrCloud environment, right, say in each shard? Any help would be greatly appreciated. Joe
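The bookmark approach Aditya and Erick describe might look roughly like this in SolrJ (4.x API shown; "counter" is the illustrative indexed numeric field used as the bookmark, and the URL is a placeholder):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class BookmarkWalker {
      public static void main(String[] args) throws SolrServerException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        long bookmark = 0;  // highest counter value already processed
        while (true) {
          // Sort ascending and ask only for documents past the bookmark,
          // so each pass resumes exactly where the previous one stopped.
          SolrQuery q = new SolrQuery("counter:[" + (bookmark + 1) + " TO *]");
          q.setSort("counter", SolrQuery.ORDER.asc);
          q.setRows(1000);
          QueryResponse rsp = solr.query(q);
          if (rsp.getResults().isEmpty()) break;
          for (SolrDocument doc : rsp.getResults()) {
            // ... extract statistics from doc ...
            bookmark = ((Number) doc.getFieldValue("counter")).longValue();
          }
        }
      }
    }

Because the filter moves forward with the bookmark, this also avoids the deep-paging cost of walking a *:* result set with an ever-growing start offset.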
RAM Usage Debugging
When I look at my dashboard I see 27.30 GB available for the JVM; 24.77 GB is gray and 16.50 GB is black. I'm not doing anything on my machine right now. Did it cache documents, or is there a problem? How can I find out?
RE: new field type - enum field
Thanks, Erick. I have tried it four times. It keeps failing. The problem reoccurred today. Thanks. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Monday, July 29, 2013 2:44 AM To: solr-user@lucene.apache.org Subject: Re: new field type - enum field You should be able to attach a patch; I wonder if there was some temporary glitch in JIRA. Is this persisting? Let us know if this continues... Erick On Sun, Jul 28, 2013 at 12:11 PM, Elran Dvir elr...@checkpoint.com wrote: Hi, I have created an issue: https://issues.apache.org/jira/browse/SOLR-5084 I tried to attach my patch, but it failed: Cannot attach file Solr-5084.patch: Unable to communicate with JIRA. What am I doing wrong? Thanks. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, July 25, 2013 3:25 PM To: solr-user@lucene.apache.org Subject: Re: new field type - enum field Start here: http://wiki.apache.org/solr/HowToContribute Then, when your patch is ready, submit a JIRA and attach your patch. Then nudge (gently) if none of the committers picks it up and applies it. NOTE: It is _not_ necessary that the first version of your patch is completely polished. I often put up partial/incomplete patches (comments with //nocommit are explicitly caught by the ant precommit target, for instance) to see if anyone has any comments before polishing. Best Erick On Thu, Jul 25, 2013 at 5:04 AM, Elran Dvir elr...@checkpoint.com wrote: Hi, I have implemented it like Chris described: the field is indexed as numeric, but displayed as string, according to configuration. It applies to facet, pivot, group and query. How do we proceed? How do I contribute it? Thanks. -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Thursday, July 25, 2013 4:40 AM To: solr-user@lucene.apache.org Subject: Re: new field type - enum field : Doable at Lucene level by any chance? Given how well the Trie fields compress (ByteField and ShortField have been deprecated in favor of TrieIntField for this reason) it probably just makes sense to treat it as a numeric at the Lucene level. : If there's positive feedback, I'll open an issue with a patch for the functionality. I've typically dealt with this sort of thing at the client layer using a simple numeric field in Solr, or used an UpdateProcessor to convert the String-numeric mapping when indexing and client logic or a DocTransformer to handle the stored value at query time -- but having a built-in FieldType that handles that for you automatically (and helps ensure the indexed values conform to the enum) would certainly be cool if you'd like to contribute it. -Hoss
Re: Two-step queries with different sorting criteria
Hi, Not sure if this was already answered, but... If the source of the problem is overly general queries, I would try to eliminate or minimize that. For example: * offering query autocomplete functionality can have an effect on query length and precision * showing related searches (derived from query logs) and exposing queries that are not as general could lead to people using that functionality after the initial search, without running into the issue you described. As for sorting by relevance and then sorting the top N of such hits by something else - yes, you can write a custom SearchComponent and do this in a single call from the client. We've implemented similar functionality a few times for a couple of Sematext clients and it worked well. Otis -- Search Analytics - http://sematext.com/search-analytics/index.html Performance Monitoring - http://sematext.com/spm/index.html On Thursday, July 18, 2013, Fabio Amato wrote: Hi all, I need to execute a Solr query in two steps: in the first step, a generic limited-results query ordered by relevance, and in the second step, the ordering of the results of the first step according to a given sorting criterion (different from relevance). This two-step query is meaningful when the query terms are so generic that the number of matched results exceeds the wanted number of results. In such circumstances, using single-step queries with different sorting criteria has a very confusing effect on the user experience, because at each change of sorting criterion the user gets different results even if the search query and the filtering conditions have not changed. On the contrary, using a two-step query where the sorting order of the first step is always relevance is more acceptable in the case of a large number of matched results, because the result set would not change with the sorting criterion of the second step. I am wondering if such a two-step query is achievable with a single Solr query, or if I am obliged to execute the sorting step of my two-step query outside of Solr (i.e., in my application). Another possibility could be the development of a Solr plugin, but I am afraid of the possible effects on performance. I am using Solr 3.4.0. Thanks in advance for your kind help. Fabio -- Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm
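Short of writing that SearchComponent, the two-step behavior can be approximated client-side: fetch the top N by relevance, then re-sort just those N hits. A sketch in SolrJ (4.x API shown; Fabio's 3.4 setup would differ slightly, and the secondary sort field "price" is illustrative):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class TwoStepSort {
      public static List<SolrDocument> topNReSorted(
          HttpSolrServer solr, String userQuery, int n) throws SolrServerException {
        // Step 1: top N by relevance (the default sort).
        SolrQuery q = new SolrQuery(userQuery);
        q.setRows(n);
        List<SolrDocument> hits =
            new ArrayList<SolrDocument>(solr.query(q).getResults());
        // Step 2: re-sort only those N hits by the secondary criterion.
        Collections.sort(hits, new Comparator<SolrDocument>() {
          public int compare(SolrDocument a, SolrDocument b) {
            Float pa = (Float) a.getFieldValue("price");
            Float pb = (Float) b.getFieldValue("price");
            return pa.compareTo(pb);
          }
        });
        return hits;
      }
    }

The result set stays fixed (always the same top N by relevance) while only the display order changes, which is exactly the user-experience property described above.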
.lock file not created when making a backup snapshot
Hi, when making a backup snapshot using a /replication?command=backup call, a snapshot directory is created and starts to be filled, but the corresponding .lock file is not created, so it's impossible to check when the backup is finished. I've taken a look at the code and it seems to me that the lock.obtain() call is missing: there is public class SnapShooter { ... void createSnapshot(final IndexCommit indexCommit, int numberToKeep, ReplicationHandler replicationHandler) { ... lock = lockFactory.makeLock(directoryName + ".lock"); ... lock.release(); so the lock file is never actually created. This is Solr 4.3.1; the release notes for Solr 4.4 do not mention this problem. Should I raise a JIRA issue for this? Or maybe you could suggest a more reliable way to make a backup? Regards, Artem.
AND Queries
I am searching for a keyword like this: lang:en AND url:book pencil cat It returns results, but none of them includes all of the book, pencil and cat keywords. How should I rewrite my query? I tried this: lang:en AND url:(book AND pencil AND cat) and it looks OK. However this does not: lang:en AND url:book AND pencil AND cat why?
Re: AND Queries
Hello! Try turning on debugQuery and see what is happening. From what I see you are searching for the en term in the lang field, the book term in the url field, and the pencil and cat terms in the default search field, but from your second query I see that you would like to find the last two terms in the url. -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch I am searching for a keyword like this: lang:en AND url:book pencil cat It returns results, but none of them includes all of the book, pencil and cat keywords. How should I rewrite my query? I tried this: lang:en AND url:(book AND pencil AND cat) and it looks OK. However this does not: lang:en AND url:book AND pencil AND cat why?
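The underlying rule is that a field prefix binds only to the term or parenthesized group immediately following it. A small illustration in SolrJ, using the fields from the question:

    import org.apache.solr.client.solrj.SolrQuery;

    public class FieldedQueries {
      public static void main(String[] args) {
        // Parentheses bind every term to the url field:
        SolrQuery good = new SolrQuery("lang:en AND url:(book AND pencil AND cat)");
        // Without them, only "book" is searched in url; "pencil" and "cat"
        // fall back to the default search field:
        SolrQuery bad = new SolrQuery("lang:en AND url:book AND pencil AND cat");
        System.out.println(good + "\n" + bad);
      }
    }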
Re: .lock file not created when making a backup snapshot
Hi Artem, I noticed this recently too. I created a JIRA issue here: https://issues.apache.org/jira/browse/SOLR-5040 Cheers, Mark Artem Karpenko a.karpe...@oxseed.com writes: Hi, when making a backup snapshot using a /replication?command=backup call, a snapshot directory is created and starts to be filled, but the corresponding .lock file is not created, so it's impossible to check when the backup is finished. I've taken a look at the code and it seems to me that the lock.obtain() call is missing: there is public class SnapShooter { ... void createSnapshot(final IndexCommit indexCommit, int numberToKeep, ReplicationHandler replicationHandler) { ... lock = lockFactory.makeLock(directoryName + ".lock"); ... lock.release(); so the lock file is never actually created. This is Solr 4.3.1; the release notes for Solr 4.4 do not mention this problem. Should I raise a JIRA issue for this? Or maybe you could suggest a more reliable way to make a backup? Regards, Artem. -- Mark Triggs m...@dishevelled.net
swap and GC
Something interesting I noticed today: after running my huge single index (49 million records / 137 GB index) for about a week and replicating today, I saw that the heap usage after replication did not go down as expected. Expected means: if Solr is started I have a heap size between 4 and 5 GB, and during the week under heavy load it might go up to 10 GB, but after replication in offline mode it recovers to between 5 and 6 GB. Today, though, it was not going under 8 GB, even with a forced GC from jvisualvm. So I first dropped the caches and tried again: no success. Next I turned off swap, which took quite a while, and turned it back on. This forced all content from swap back into memory. After calling Perform GC from jvisualvm the heap dropped below 5 GB. Bingo! This leads me to the conclusion that the Java GC is not seeing or reaching objects which are located in swap. Anyone else seen this? As I am not short on memory or having any other problems I don't need a solution, but for users having memory problems with old objects in swap I would suggest a cron job after replication with swapoff/swapon and a GC afterwards. Bernd
Re: AND Queries
When I send this query: select?pf=url^10+title^8&fl=url,content,title&start=0&q=lang:en+AND+(cat+AND+dog+AND+pencil)&qf=content^5+url^8.0+title^6&wt=xml&debugQuery=on It is debugged as: +(+lang:en +(+(content:cat^5.0 | title:cat^6.0 | url:cat^8.0) +(content:dog^5.0 | title:dog^6.0 | url:dog^8.0) +(content:pencil^5.0 | title:pencil^6.0 | url:pencil^8.0))) (url:"cat dog pencil"^10.0) (title:"cat dog pencil"^8.0) Why is the default field not applied in this situation? 2013/7/29 fbrisbart fbrisb...@bestofmedia.com It's because when you don't specify any field, it's the default field which is used. So, lang:en AND url:book AND pencil AND cat is interpreted as: lang:en AND url:book AND default_field:pencil AND default_field:cat The default search field is defined in your schema.xml file (defaultSearchField) Franck Brisbart On Monday, July 29, 2013 at 12:06 +0300, Furkan KAMACI wrote: I am searching for a keyword like this: lang:en AND url:book pencil cat It returns results, but none of them includes all of the book, pencil and cat keywords. How should I rewrite my query? I tried this: lang:en AND url:(book AND pencil AND cat) and it looks OK. However this does not: lang:en AND url:book AND pencil AND cat why?
Re: AND Queries
Because you specified the search fields to use with 'qf', which overrides the default search field. Franck Brisbart On Monday, July 29, 2013 at 13:01 +0300, Furkan KAMACI wrote: When I send this query: select?pf=url^10+title^8&fl=url,content,title&start=0&q=lang:en+AND+(cat+AND+dog+AND+pencil)&qf=content^5+url^8.0+title^6&wt=xml&debugQuery=on It is debugged as: +(+lang:en +(+(content:cat^5.0 | title:cat^6.0 | url:cat^8.0) +(content:dog^5.0 | title:dog^6.0 | url:dog^8.0) +(content:pencil^5.0 | title:pencil^6.0 | url:pencil^8.0))) (url:"cat dog pencil"^10.0) (title:"cat dog pencil"^8.0) Why is the default field not applied in this situation? 2013/7/29 fbrisbart fbrisb...@bestofmedia.com It's because when you don't specify any field, it's the default field which is used. So, lang:en AND url:book AND pencil AND cat is interpreted as: lang:en AND url:book AND default_field:pencil AND default_field:cat The default search field is defined in your schema.xml file (defaultSearchField) Franck Brisbart On Monday, July 29, 2013 at 12:06 +0300, Furkan KAMACI wrote: I am searching for a keyword like this: lang:en AND url:book pencil cat It returns results, but none of them includes all of the book, pencil and cat keywords. How should I rewrite my query? I tried this: lang:en AND url:(book AND pencil AND cat) and it looks OK. However this does not: lang:en AND url:book AND pencil AND cat why?
solr query range upper exclusive
q=price_1_1:[197 TO 249] and q=*:*&fq=price_1_1:[197 TO 249] return 2 records, but I also have two records with price_1_1 = 249 that are not returned; it seems that the upper bound of the range is exclusive and I can't figure out why. Can you help me? <dynamicField name="price_*" type="tfloat" indexed="true"/> <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
Re: processing documents in solr
No, SolrJ doesn't provide this automatically. You'd be providing the counter by inserting it into the document as you create new docs. You could do this with any kind of document creation you are using. Best Erick On Mon, Jul 29, 2013 at 2:51 AM, Aditya findbestopensou...@gmail.com wrote: Hi, The easiest solution would be to have the timestamp indexed. Is there any issue in doing re-indexing? If you want to process records in batches then you need an ordered list and a bookmark: a field to sort on, plus a counter / last id maintained as the bookmark. This is mandatory to solve your problem. If you don't want to re-index, then you need to maintain information about visited documents: have a database / Solr core which maintains the list of IDs already processed. Fetch records from Solr and, for each record, check that DB to see whether the record was already processed. Regards Aditya www.findbestopensource.com On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang smartag...@gmail.com wrote: Basically, I was thinking about running a range query like Shawn suggested on the tstamp field, but unfortunately it was not indexed. Range queries only work on indexed fields, right? On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang smartag...@gmail.com wrote: I've been thinking about the tstamp solution in the past few days, but too bad, the field is available but not indexed... I'm not familiar with SolrJ. Again, it sounds like SolrJ is providing the counter value. If yes, that would be equivalent to an autoincrement id. I'm indexing from Nutch though; I don't know how to feed in such a counter... On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson erickerick...@gmail.com wrote: Why wouldn't a simple timestamp work for the ordering? Although I guess a simple timestamp isn't really simple if the time settings change. So how about a simple counter field in your documents? Assuming you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc. Take the counter from the first document returned, and increment it for each doc for the life of the indexing run. Now you've got, for all intents and purposes, an identity field, albeit manually maintained. Then use your counter field as Shawn suggests for pulling all the data out. FWIW, Erick On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara mcucchi...@apache.org wrote: In both cases, for better performance, first I'd load just the IDs; afterwards, during processing, I'd load each document. As for the incremental requirement, it should not be difficult to write a hash function which maps a non-numerical id to a value. On Jul 27, 2013 7:03 AM, Joe Zhang smartag...@gmail.com wrote: Dear list: I have an ever-growing Solr repository, and I need to process every single document to extract statistics. What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me to divide and conquer --- if I have processed the first 20k docs, next time I can start with doc 20001. A simple *:* query would satisfy the 1st but not the 2nd property. In fact, given that the processing will take very long, and the repository keeps growing, it is not even clear that exhaustiveness is achieved. I'm running Solr 3.6.2 in a single-machine setting; no Hadoop capability yet. But I guess the same issues still hold even in a SolrCloud environment, right, say in each shard? Any help would be greatly appreciated. Joe
Re: new field type - enum field
OK, if you can attach it to an e-mail, I'll attach it. Just to check, though, make sure you're logged in. I've been fooled once or twice by being automatically signed out... Erick On Mon, Jul 29, 2013 at 3:17 AM, Elran Dvir elr...@checkpoint.com wrote: Thanks, Erick. I have tried it four times. It keeps failing. The problem reoccurred today. Thanks. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Monday, July 29, 2013 2:44 AM To: solr-user@lucene.apache.org Subject: Re: new field type - enum field You should be able to attach a patch; I wonder if there was some temporary glitch in JIRA. Is this persisting? Let us know if this continues... Erick On Sun, Jul 28, 2013 at 12:11 PM, Elran Dvir elr...@checkpoint.com wrote: Hi, I have created an issue: https://issues.apache.org/jira/browse/SOLR-5084 I tried to attach my patch, but it failed: Cannot attach file Solr-5084.patch: Unable to communicate with JIRA. What am I doing wrong? Thanks. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, July 25, 2013 3:25 PM To: solr-user@lucene.apache.org Subject: Re: new field type - enum field Start here: http://wiki.apache.org/solr/HowToContribute Then, when your patch is ready, submit a JIRA and attach your patch. Then nudge (gently) if none of the committers picks it up and applies it. NOTE: It is _not_ necessary that the first version of your patch is completely polished. I often put up partial/incomplete patches (comments with //nocommit are explicitly caught by the ant precommit target, for instance) to see if anyone has any comments before polishing. Best Erick On Thu, Jul 25, 2013 at 5:04 AM, Elran Dvir elr...@checkpoint.com wrote: Hi, I have implemented it like Chris described: the field is indexed as numeric, but displayed as string, according to configuration. It applies to facet, pivot, group and query. How do we proceed? How do I contribute it? Thanks. -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Thursday, July 25, 2013 4:40 AM To: solr-user@lucene.apache.org Subject: Re: new field type - enum field : Doable at Lucene level by any chance? Given how well the Trie fields compress (ByteField and ShortField have been deprecated in favor of TrieIntField for this reason) it probably just makes sense to treat it as a numeric at the Lucene level. : If there's positive feedback, I'll open an issue with a patch for the functionality. I've typically dealt with this sort of thing at the client layer using a simple numeric field in Solr, or used an UpdateProcessor to convert the String-numeric mapping when indexing and client logic or a DocTransformer to handle the stored value at query time -- but having a built-in FieldType that handles that for you automatically (and helps ensure the indexed values conform to the enum) would certainly be cool if you'd like to contribute it. -Hoss
Re: Performance vs. maxBufferedAddsPerServer=10
SOLR-4816 won't address this - it will just speed up *different* parts. There are other things that will need to be done to speed up that part. - Mark On Jul 26, 2013, at 3:53 PM, Erick Erickson erickerick...@gmail.com wrote: This is currently a hard-coded limit from what I've understood. From what I remember, Mark said Yonik said that there are reasons to make the packets that size. But whether this is empirically a Good Thing I don't know. SOLR-4816 will address this a different way by making SolrJ batch up the docs and send them to the right leader, which should pretty much remove any performance consideration here. There's some anecdotal evidence that changing that in the code might improve throughput, but I don't remember the details. FWIW Erick On Thu, Jul 25, 2013 at 7:09 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Context: * https://issues.apache.org/jira/browse/SOLR-4956 * http://search-lucene.com/c/Solr:/core/src/java/org/apache/solr/update/SolrCmdDistributor.java%7C%7CmaxBufferedAddsPerServer As you can see, maxBufferedAddsPerServer = 10. We have an app that sends 20K docs to SolrCloud using CloudSolrServer. We batch 20K docs for performance reasons. But then the receiving node ends up sending VERY small batches of just 10 docs around for indexing, and we lose the benefit of batching those 20K docs in the first place. Our app is add-only. Is there anything one can do to avoid the performance loss associated with maxBufferedAddsPerServer=10? Thanks, Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm
DIH to index the data - 250 million - Need the best architecture
Hi, I have a huge volume of DB records, close to 250 million. I am going to use DIH to index the data into Solr. I need the best architecture to index and query the data in an efficient manner. I am using Windows Server 2008 with 16 GB RAM, a Xeon processor and Solr 4.4. With Regards, Santanu
Re: DIH to index the data - 250 million - Need the best architecture
On 29 July 2013 17:30, Santanu8939967892 mishra.sant...@gmail.com wrote: Hi, I have a huge volume of DB records, close to 250 million. I am going to use DIH to index the data into Solr. I need the best architecture to index and query the data in an efficient manner. [...] This is difficult to answer without knowing the details of your particular use case. Your best bet would be to prototype a system and measure performance on at least a subset of the data. If you search through earlier messages on the list, you should also come across some numbers for performance, but it is best to test for your own needs. Regards, Gora
Re: DIH to index the data - 250 million - Need the best architecture
The initial question is not how to index the data, but how you want to use or query the data. Use cases for query and data access should drive the data model that you will use to index the data. So, what are some sample queries? How will users want to search and access the data? What data will they expect to see, and in what form? Not so much from a UI perspective, but in terms of how the client app(s) will access data. -- Jack Krupansky -Original Message- From: Santanu8939967892 Sent: Monday, July 29, 2013 8:00 AM To: solr-user@lucene.apache.org Subject: DIH to index the data - 250 million - Need the best architecture Hi, I have a huge volume of DB records, close to 250 million. I am going to use DIH to index the data into Solr. I need the best architecture to index and query the data in an efficient manner. I am using Windows Server 2008 with 16 GB RAM, a Xeon processor and Solr 4.4. With Regards, Santanu
Re: .lock file not created when making a backup snapshot
Thanks Mark! On 29.07.2013 12:32, Mark Triggs wrote: Hi Artem, I noticed this recently too. I created a JIRA issue here: https://issues.apache.org/jira/browse/SOLR-5040 Cheers, Mark Artem Karpenko a.karpe...@oxseed.com writes: Hi, when making a backup snapshot using a /replication?command=backup call, a snapshot directory is created and starts to be filled, but the corresponding .lock file is not created, so it's impossible to check when the backup is finished. I've taken a look at the code and it seems to me that the lock.obtain() call is missing: there is public class SnapShooter { ... void createSnapshot(final IndexCommit indexCommit, int numberToKeep, ReplicationHandler replicationHandler) { ... lock = lockFactory.makeLock(directoryName + ".lock"); ... lock.release(); so the lock file is never actually created. This is Solr 4.3.1; the release notes for Solr 4.4 do not mention this problem. Should I raise a JIRA issue for this? Or maybe you could suggest a more reliable way to make a backup? Regards, Artem.
Re: solr query range upper exclusive
Square brackets are inclusive and curly braces are exclusive for range queries. I tried a similar example with the standard Solr example and it works fine: curl "http://localhost:8983/solr/update?commit=true" -H 'Content-type:application/json' -d '[{"id": "doc-1", "price_f": 249}]' curl "http://localhost:8983/solr/select/?q=price_f:%5b149+TO+249%5d&indent=true&wt=json" Make sure that you don't have some other dynamic field pattern that is overriding or overlapping the one you showed us. -- Jack Krupansky -Original Message- From: alin1918 Sent: Monday, July 29, 2013 6:38 AM To: solr-user@lucene.apache.org Subject: solr query range upper exclusive q=price_1_1:[197 TO 249] and q=*:*&fq=price_1_1:[197 TO 249] return 2 records, but I also have two records with price_1_1 = 249 that are not returned; it seems that the upper bound of the range is exclusive and I can't figure out why. Can you help me? <dynamicField name="price_*" type="tfloat" indexed="true"/> <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
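To spell out the bracket semantics (a sketch; "price_f" is the example field from Jack's curl commands):

    import org.apache.solr.client.solrj.SolrQuery;

    public class RangeSyntax {
      public static void main(String[] args) {
        // [ ] is inclusive at both ends, so a document with price_f = 249 matches:
        SolrQuery inclusive = new SolrQuery("price_f:[197 TO 249]");
        // { } is exclusive, so only 197 < price_f < 249 matches and 249 does not:
        SolrQuery exclusive = new SolrQuery("price_f:{197 TO 249}");
        System.out.println(inclusive + "\n" + exclusive);
      }
    }

Solr 4.x parsers also accept mixed endpoints such as price_f:[197 TO 249}, inclusive on one end and exclusive on the other.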
RE: swap and GC
This is interesting... How are you measuring the heap size? -Michael -Original Message- From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] Sent: Monday, July 29, 2013 5:34 AM To: solr-user@lucene.apache.org Subject: swap and GC Something interesting I noticed today: after running my huge single index (49 million records / 137 GB index) for about a week and replicating today, I saw that the heap usage after replication did not go down as expected. Expected means: if Solr is started I have a heap size between 4 and 5 GB, and during the week under heavy load it might go up to 10 GB, but after replication in offline mode it recovers to between 5 and 6 GB. Today, though, it was not going under 8 GB, even with a forced GC from jvisualvm. So I first dropped the caches and tried again: no success. Next I turned off swap, which took quite a while, and turned it back on. This forced all content from swap back into memory. After calling Perform GC from jvisualvm the heap dropped below 5 GB. Bingo! This leads me to the conclusion that the Java GC is not seeing or reaching objects which are located in swap. Anyone else seen this? As I am not short on memory or having any other problems I don't need a solution, but for users having memory problems with old objects in swap I would suggest a cron job after replication with swapoff/swapon and a GC afterwards. Bernd
Re: DIH to index the data - 250 million - Need the best architecture
Hi Jack, My sample query will be a keyword (text) and probably 2 to 3 filters. There is a Java interface for display of the data, which will consume a class, and the class returns a data set object using SolrJ. So for display we will use a list for binding. We may display 20 or 30 metadata fields. I believe I have provided the information you asked for. With Regards, Santanu On Mon, Jul 29, 2013 at 5:50 PM, Jack Krupansky j...@basetechnology.com wrote: The initial question is not how to index the data, but how you want to use or query the data. Use cases for query and data access should drive the data model that you will use to index the data. So, what are some sample queries? How will users want to search and access the data? What data will they expect to see, and in what form? Not so much from a UI perspective, but in terms of how the client app(s) will access data. -- Jack Krupansky -Original Message- From: Santanu8939967892 Sent: Monday, July 29, 2013 8:00 AM To: solr-user@lucene.apache.org Subject: DIH to index the data - 250 million - Need the best architecture Hi, I have a huge volume of DB records, close to 250 million. I am going to use DIH to index the data into Solr. I need the best architecture to index and query the data in an efficient manner. I am using Windows Server 2008 with 16 GB RAM, a Xeon processor and Solr 4.4. With Regards, Santanu
Re: DIH to index the data - 250 million - Need the best architecture
You neglected to provide information about the filters or the 20 or 30 metadata fields. Did you mean to imply that you will not be querying against the metadata (only returning it)? -- Jack Krupansky -Original Message- From: Santanu8939967892 Sent: Monday, July 29, 2013 9:41 AM To: solr-user@lucene.apache.org Subject: Re: DIH to index the data - 250 million - Need the best architecture Hi Jack, My sample query will be a keyword (text) and probably 2 to 3 filters. There is a Java interface for display of the data, which will consume a class, and the class returns a data set object using SolrJ. So for display we will use a list for binding. We may display 20 or 30 metadata fields. I believe I have provided the information you asked for. With Regards, Santanu On Mon, Jul 29, 2013 at 5:50 PM, Jack Krupansky j...@basetechnology.com wrote: The initial question is not how to index the data, but how you want to use or query the data. Use cases for query and data access should drive the data model that you will use to index the data. So, what are some sample queries? How will users want to search and access the data? What data will they expect to see, and in what form? Not so much from a UI perspective, but in terms of how the client app(s) will access data. -- Jack Krupansky -Original Message- From: Santanu8939967892 Sent: Monday, July 29, 2013 8:00 AM To: solr-user@lucene.apache.org Subject: DIH to index the data - 250 million - Need the best architecture Hi, I have a huge volume of DB records, close to 250 million. I am going to use DIH to index the data into Solr. I need the best architecture to index and query the data in an efficient manner. I am using Windows Server 2008 with 16 GB RAM, a Xeon processor and Solr 4.4. With Regards, Santanu
Re: solr query range upper exclusive
what query parser should I use? http://wiki.apache.org/solr/SolrQuerySyntax Differences From Lucene Query Parser Differences in the Solr Query Parser include: Range queries [a TO z], prefix queries a*, and wildcard queries a*b are constant-scoring (all matching documents get an equal score). The scoring factors tf, idf, index boost, and coord are not used. There is no limitation on the number of terms that match (as there was in past versions of Lucene). Lucene 2.1 has also switched to use ConstantScoreRangeQuery for its range queries. A * may be used for either or both endpoints to specify an open-ended range query. field:[* TO 100] finds all field values less than or equal to 100 field:[100 TO *] finds all field values greater than or equal to 100 field:[* TO *] matches all documents with the field Pure negative queries (all clauses prohibited) are allowed. -inStock:false finds all field values where inStock is not false -field:[* TO *] finds all documents without a value for field A hook into FunctionQuery syntax. Quotes will be necessary to encapsulate the function when it includes parentheses. Example: _val_:myfield Example: _val_:"recip(rord(myfield),1,2,3)" Nested query support for any type of query parser (via QParserPlugin). Quotes will often be necessary to encapsulate the nested query if it contains reserved characters. Example: _query_:"{!dismax qf=myfield}how now brown cow"
restricting a query by a set of field values
Hi, Is it possible to construct a query in Solr that is restricted to only those documents that have a field value in a particular set of values, similar to what would be done in Postgres with the SQL query: SELECT date_deposited FROM stats WHERE date BETWEEN '2013-07-01 00:00:00' AND '2013-07-31 23:59:00' AND collection_id IN () In my Solr schema.xml, date_deposited is a TrieDateField and collection_id is an IntField. Regards, Ben -- Dr Ben Ryan Jorum Technical Manager 5.12 Roscoe Building The University of Manchester Oxford Road Manchester M13 9PL Tel: 0160 275 6039 E-mail: benjamin.r...@manchester.ac.uk --
The meaning of the doc= on the debugQuery output
Hello One line in the debugQuery output of my query is 2.1706323e-6 = score(doc=49578,freq=1.0 = termfreq=1.0), product of: I wanted to know what the doc= means. It seems to be something used in the fieldWeight, but on the other hand it is the same for all fields of the document, regardless of the query made or the fields searched... Regards Bruno -- Bruno René Santos Lisboa - Portugal
Re: restricting a query by a set of field values
Ben, This could be constructed as so: fl=date_deposited&fq=date[2013-07-01T00:00:00Z TO 2013-07-31T23:59:00Z]&fq=collection_id(1 2 n)&q.op=OR The parentheses around the 1 2 n set indicate a boolean query, and we're ensuring the clauses are OR'ed together via the q.op parameter. This should get you the result set you desire. Please beware that a very large boolean set (your IN(…) parameter) may be expensive to run. Jason On Jul 29, 2013, at 7:33 AM, Benjamin Ryan benjamin.r...@manchester.ac.uk wrote: Hi, Is it possible to construct a query in Solr that is restricted to only those documents that have a field value in a particular set of values, similar to what would be done in Postgres with the SQL query: SELECT date_deposited FROM stats WHERE date BETWEEN '2013-07-01 00:00:00' AND '2013-07-31 23:59:00' AND collection_id IN () In my Solr schema.xml, date_deposited is a TrieDateField and collection_id is an IntField. Regards, Ben -- Dr Ben Ryan Jorum Technical Manager 5.12 Roscoe Building The University of Manchester Oxford Road Manchester M13 9PL Tel: 0160 275 6039 E-mail: benjamin.r...@manchester.ac.uk --
SolrCloud and Joins
I'm setting up SolrCloud with around 600 million documents. The basic structure of each document is: stories_id: integer, media_id: integer, sentence: text_en We have a number of stories from different media, and we treat each sentence as a separate document because we need to run sentence-level analytics. We also have a concept of groups or sets of sources. We've imported this media source to media sets mapping into Solr using the following structure: media_id_inner: integer, media_sets_id: integer For the single-node case, we're able to filter our sources by media_sets_id using a join query like the following: http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1 However, this does not work correctly with SolrCloud. The problem is that the join query is performed separately on each of the shards, and no shard has the complete media set to source mapping data, so SolrCloud returns incomplete results. Since the complete media set to source mapping data is comparatively small (~50,000 rows), I would like to replicate it on every shard, so that the results of the individual join queries on separate shards would be equivalent to performing the same query on a single-shard system. However, I can't figure out how to replicate documents on separate shards. The compositeId router has the ability to colocate documents based on a prefix in the document ID, but this isn't what I need. What I would like is some way to either have the media set to source data replicated on every shard, or to be able to explicitly upload this data to the individual shards. (For the rest of the data I like the compositeId autorouting.) Any suggestions? -- Thanks, David
Re: DIH to index the data - 250 million - Need the best architecture
On 7/29/2013 6:00 AM, Santanu8939967892 wrote: Hi, I have a huge volume of DB records, close to 250 million. I am going to use DIH to index the data into Solr. I need the best architecture to index and query the data in an efficient manner. I am using Windows Server 2008 with 16 GB RAM, a Xeon processor and Solr 4.4. Gora and Jack have given you great information. I would add that when you are dealing with an index of this size, you need to be prepared to spend some real money on hardware if you want maximum performance. With 20-30 fields, I would imagine that each document is probably a few KB in size. Even if they will be much smaller than that, with 250 million of them, your index will be pretty large. I'd be VERY surprised if the index is less than 100GB, and something larger than 500GB is probably more likely. For illustration purposes, let's be conservative and say it's 200GB. 16GB of RAM isn't enough for an index that size. An ideal round memory size for a 200GB index would be 256GB - 200GB of RAM for the OS disk cache and enough memory for whatever size java heap you might need. In truth, you probably don't need to cache the ENTIRE index ... most searches will involve only certain parts of the index and won't touch the entire thing. A good enough memory size might be 128GB, which would keep the most relevant parts of the index in RAM at all times. If you were to put a 200GB index onto an SSD, you could probably get away with 64GB of RAM - 50GB or so for the OS disk cache and the rest for the java heap. If your index will be larger than 200GB, then the numbers I have given you will go up. These numbers also assume that you have your entire index on one server, which is probably not a good idea. http://wiki.apache.org/solr/SolrPerformanceProblems SolrCloud would likely be the best architecture. It would spread out your system requirements and load across multiple machines. If you had 20 machines, each with 16-32GB of RAM, you could do a SolrCloud installation with 10 shards and a replicationFactor of 2, and there wouldn't be any memory problems. Each machine would have 25 million records on it, and you'd have two complete copies of your index so you'd be able to keep running if a machine completely failed -- which DOES happen. The information I've given you is for an ideal setup. You can go smaller, and budget needs might indeed cause you to go smaller. If you don't need extremely good performance from Solr, then you don't need to spend the money required for an architecture like I've described. Thanks, Shawn
Re: The meaning of the doc= on the debugQuery output
Hi, doc is the internal docId in the index. Each doc in the index has an internal id, assigned in insertion order (starting from 0 for the first document added). Franck Brisbart On Monday, July 29, 2013 at 15:34 +0100, Bruno René Santos wrote: Hello One line in the debugQuery output of my query is 2.1706323e-6 = score(doc=49578,freq=1.0 = termfreq=1.0), product of: I wanted to know what the doc= means. It seems to be something used in the fieldWeight, but on the other hand it is the same for all fields of the document, regardless of the query made or the fields searched... Regards Bruno
Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1
Hi, I am using Solr 4.3.1 with 2 shards and a replication factor of 1, running on Apache Tomcat 7.0.42 with external ZooKeeper 3.4.5. When I query select?q=*:* I only get the number of documents found, but no actual documents. When I query with rows=0, I do get the correct count of documents in the index. Faceting queries as well as group-by queries also work with rows=0. However, when rows is not equal to 0 I do not get any documents. When I query the index I see that a query is being sent to both shards, and subsequently I see a query being sent with just ids; however, after that query returns I do not see any documents come back. Not sure what I need to change, please help. Thanks, Nitin
Solr Out Of Memory with Field Collapsing
Hi, We are using the field collapsing feature with multiple shards. We ran into Out of Memory errors on one of the shards. We use field collapsing on a particular field which has only one specific value on the shard that goes out of memory. Interestingly, the Out of Memory error recurred multiple times during the day (about 4 times in 24 hours) without any significant deviation from normal traffic or in the nature of the queries being run. The max heap size allocated to the shard is 8 GB. Since then we have done the following and the problem seems to be arrested for now: 1. Added more horizontal slaves; from 3 we have brought this up to 6. 2. Increased the replication poll interval from 5 minutes to 20 minutes. 3. Decreased the minimum heap allocation for this Tomcat to 1 GB. Earlier this was 4 GB. The typical size of the index directory on the problem shard is around 1 GB, about 1 million documents in all. The average request rate is about 10/second for this shard. We have tried replaying the entire day's logs on a test environment, but somehow it never goes out of memory with the same heap settings. Now we are not certain that this will not happen again. Can someone suggest what could be the problem here? Any help would be greatly appreciated. Regards, Tushar
Re: SolrCloud and Joins
Denormalize. Add media_sets_id to each sentence document. Done. wunder On Jul 29, 2013, at 7:58 AM, David Larochelle wrote: I'm setting up SolrCloud with around 600 million documents. The basic structure of each document is: stories_id: integer, media_id: integer, sentence: text_en We have a number of stories from different media, and we treat each sentence as a separate document because we need to run sentence-level analytics. We also have a concept of groups or sets of sources. We've imported this media source to media sets mapping into Solr using the following structure: media_id_inner: integer, media_sets_id: integer For the single-node case, we're able to filter our sources by media_sets_id using a join query like the following: http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1 However, this does not work correctly with SolrCloud. The problem is that the join query is performed separately on each of the shards, and no shard has the complete media set to source mapping data, so SolrCloud returns incomplete results. Since the complete media set to source mapping data is comparatively small (~50,000 rows), I would like to replicate it on every shard, so that the results of the individual join queries on separate shards would be equivalent to performing the same query on a single-shard system. However, I can't figure out how to replicate documents on separate shards. The compositeId router has the ability to colocate documents based on a prefix in the document ID, but this isn't what I need. What I would like is some way to either have the media set to source data replicated on every shard, or to be able to explicitly upload this data to the individual shards. (For the rest of the data I like the compositeId autorouting.) Any suggestions? -- Thanks, David -- Walter Underwood wun...@wunderwood.org
solr - set fields as default search field
The following query works well for me: http://[]:8983/solr/vault/select?q=VersionComments%3AWhite It returns all the documents whose version comments include White. I try to omit the field name and set it as a default value as follows: in solrconfig.xml I write
<requestHandler name="/select" class="solr.SearchHandler">
  <!-- default values for query parameters can be specified, these
       will be overridden by parameters in the request -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">PackageName</str>
    <str name="df">Tag</str>
    <str name="df">VersionComments</str>
    <str name="df">VersionTag</str>
    <str name="df">Description</str>
    <str name="df">SKU</str>
    <str name="df">SKUDesc</str>
  </lst>
I restart Solr and run a full import. Then I try http://[]:8983/solr/vault/select?q=White (where http://[]:8983/solr/vault/select?q=VersionComments%3AWhite still works), but I don't get any document as an answer. What am I doing wrong?
Re: solr - set fields as default search field
Hi, df is a single-valued parameter. Only one field can be the default field. To query multiple fields use the (e)dismax query parser: http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29 From: Mysurf Mail stammail...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, July 29, 2013 6:31 PM Subject: solr - set fields as default search field The following query works well for me: http://[]:8983/solr/vault/select?q=VersionComments%3AWhite It returns all the documents whose version comments include White. I try to omit the field name and set it as a default value as follows: in solrconfig.xml I write
<requestHandler name="/select" class="solr.SearchHandler">
  <!-- default values for query parameters can be specified, these
       will be overridden by parameters in the request -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">PackageName</str>
    <str name="df">Tag</str>
    <str name="df">VersionComments</str>
    <str name="df">VersionTag</str>
    <str name="df">Description</str>
    <str name="df">SKU</str>
    <str name="df">SKUDesc</str>
  </lst>
I restart Solr and run a full import. Then I try http://[]:8983/solr/vault/select?q=White (where http://[]:8983/solr/vault/select?q=VersionComments%3AWhite still works), but I don't get any document as an answer. What am I doing wrong?
Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1
Nitin, You need to ensure the fields you wish to see are marked stored=true in your schema.xml file, and you should include fields in your fl= parameter (fl=*,score is a good place to start). Jason On Jul 29, 2013, at 8:08 AM, Nitin Agarwal 2nitinagar...@gmail.com wrote: Hi, I am using Solr 4.3.1 with 2 shards and a replication factor of 1, running on Apache Tomcat 7.0.42 with external ZooKeeper 3.4.5. When I query select?q=*:* I only get the number of documents found, but no actual documents. When I query with rows=0, I do get the correct count of documents in the index. Faceting queries as well as group-by queries also work with rows=0. However, when rows is not equal to 0 I do not get any documents. When I query the index I see that a query is being sent to both shards, and subsequently I see a query being sent with just ids; however, after that query returns I do not see any documents come back. Not sure what I need to change, please help. Thanks, Nitin
Re: solr - set fields as default search field
Or use the copyField technique to copy everything into a single searchable field and set df to that field. The example schema does this with the field called text. On Jul 29, 2013, at 8:35 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi, df is a single-valued parameter. Only one field can be the default field. To query multiple fields use the (e)dismax query parser: http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29 From: Mysurf Mail stammail...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, July 29, 2013 6:31 PM Subject: solr - set fields as default search field The following query works well for me: http://[]:8983/solr/vault/select?q=VersionComments%3AWhite It returns all the documents whose version comments include White. I try to omit the field name and set it as a default value as follows: in solrconfig.xml I write
<requestHandler name="/select" class="solr.SearchHandler">
  <!-- default values for query parameters can be specified, these
       will be overridden by parameters in the request -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">PackageName</str>
    <str name="df">Tag</str>
    <str name="df">VersionComments</str>
    <str name="df">VersionTag</str>
    <str name="df">Description</str>
    <str name="df">SKU</str>
    <str name="df">SKUDesc</str>
  </lst>
I restart Solr and run a full import. Then I try http://[]:8983/solr/vault/select?q=White (where http://[]:8983/solr/vault/select?q=VersionComments%3AWhite still works), but I don't get any document as an answer. What am I doing wrong?
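Putting Ahmet's edismax suggestion into code, a sketch in SolrJ using the field names from the original message (boosts omitted; add ^weights to qf entries if some fields should count more):

    import org.apache.solr.client.solrj.SolrQuery;

    public class DefaultFieldsQuery {
      public static SolrQuery acrossFields(String text) {
        SolrQuery q = new SolrQuery(text);  // e.g. "White"
        q.set("defType", "edismax");
        // qf spreads the query across many fields; df can name only one.
        q.set("qf", "PackageName Tag VersionComments VersionTag Description SKU SKUDesc");
        return q;
      }
    }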
Re: SolrCloud and Joins
We'd like to be able to easily update the media set to source mapping. I'm concerned that if we store the media_sets_id in the sentence documents, it will be very difficult to add additional media set to source mappings. I imagine that adding a new media set would require either reimporting all 600 million documents or writing complicated application logic to find out which sentences to update. Hence joins seem like a cleaner solution. -- David On Mon, Jul 29, 2013 at 11:22 AM, Walter Underwood wun...@wunderwood.org wrote: Denormalize. Add media_sets_id to each sentence document. Done. wunder On Jul 29, 2013, at 7:58 AM, David Larochelle wrote: I'm setting up SolrCloud with around 600 million documents. The basic structure of each document is: stories_id: integer, media_id: integer, sentence: text_en We have a number of stories from different media, and we treat each sentence as a separate document because we need to run sentence-level analytics. We also have a concept of groups or sets of sources. We've imported this media source to media sets mapping into Solr using the following structure: media_id_inner: integer, media_sets_id: integer For the single-node case, we're able to filter our sources by media_sets_id using a join query like the following: http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1 However, this does not work correctly with SolrCloud. The problem is that the join query is performed separately on each of the shards, and no shard has the complete media set to source mapping data, so SolrCloud returns incomplete results. Since the complete media set to source mapping data is comparatively small (~50,000 rows), I would like to replicate it on every shard, so that the results of the individual join queries on separate shards would be equivalent to performing the same query on a single-shard system. However, I can't figure out how to replicate documents on separate shards. The compositeId router has the ability to colocate documents based on a prefix in the document ID, but this isn't what I need. What I would like is some way to either have the media set to source data replicated on every shard, or to be able to explicitly upload this data to the individual shards. (For the rest of the data I like the compositeId autorouting.) Any suggestions? -- Thanks, David -- Walter Underwood wun...@wunderwood.org
Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1
Jason, all my fields are set with stored=true and indexed=true, and I used select?q=*:*&fl=*,score but still I get the same response:
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">138</int>
    <lst name="params">
      <str name="fl">*,score</str>
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="167906126" start="0" maxScore="1.0"/>
</response>
Here is what my schema looks like:
<fields>
  <field name="_version_" type="long" indexed="true" stored="true" multiValued="false" />
  <field name="bill_account_name" type="lowercase" indexed="true" stored="true" required="false" />
  <field name="bill_account_nbr" type="lowercase" indexed="true" stored="true" required="false" />
  <field name="cust_name" type="lowercase" indexed="true" stored="true" required="false" />
  <field name="tn_lookup_key_id" type="lowercase" indexed="true" stored="true" required="true" />
</fields>
Nitin On Mon, Jul 29, 2013 at 9:38 AM, Jason Hellman jhell...@innoventsolutions.com wrote: Nitin, You need to ensure the fields you wish to see are marked stored=true in your schema.xml file, and you should include fields in your fl= parameter (fl=*,score is a good place to start). Jason On Jul 29, 2013, at 8:08 AM, Nitin Agarwal 2nitinagar...@gmail.com wrote: Hi, I am using Solr 4.3.1 with 2 shards and a replication factor of 1, running on Apache Tomcat 7.0.42 with external ZooKeeper 3.4.5. When I query select?q=*:* I only get the number of documents found, but no actual documents. When I query with rows=0, I do get the correct count of documents in the index. Faceting queries as well as group-by queries also work with rows=0. However, when rows is not equal to 0 I do not get any documents. When I query the index I see that a query is being sent to both shards, and subsequently I see a query being sent with just ids; however, after that query returns I do not see any documents come back. Not sure what I need to change, please help. Thanks, Nitin
Re: restricting a query by a set of field values
: fl=date_deposited&fq=date[2013-07-01T00:00:00Z TO 2013-07-31T23:59:00Z]&fq=collection_id(1 2 n)&q.op=OR

Typo -- the colon is missing:

fq=collection_id:(1 2 n)

If you don't want the q.op to apply globally to your request, you can also scope it only for that filter. Likewise, the field_name: and paren syntax can be replaced by using the df param:

fq={!lucene q.op=OR df=collection_id}1 2 3 4 5

-Hoss
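For reference, here is a minimal SolrJ sketch of the same request with the corrected filter queries. The core URL and the use of HttpSolrServer are assumptions; the field names are taken from the thread:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CollectionFilterExample {
        public static void main(String[] args) throws Exception {
            // assumed core URL; adjust to your deployment
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            // range filter on the date field (note the colon after the field name)
            q.addFilterQuery("date:[2013-07-01T00:00:00Z TO 2013-07-31T23:59:00Z]");
            // q.op=OR scoped to this one filter via local params, as Hoss suggests
            q.addFilterQuery("{!lucene q.op=OR df=collection_id}1 2 3 4 5");
            q.setFields("date_deposited");
            QueryResponse rsp = server.query(q);
            System.out.println("Found " + rsp.getResults().getNumFound() + " documents");
        }
    }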
Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1
Check the /select request handler in solrconfig.xml. See if it defaults start or rows. start is the zero-based offset of the first document to return (default 0), and rows is the number of documents to actually return in the response (it has nothing to do with numFound). The internal Solr default is rows=10, but you can set it to 20, 50, 100, or whatever -- just DO NOT set it to 0 unless you want only the header without any actual documents.

-- Jack Krupansky
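If the handler's defaults are the culprit, they would look something like this in solrconfig.xml. This is a hedged sketch; the values shown are illustrations only:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <!-- a rows default of 0 here would explain getting numFound but no documents -->
        <int name="rows">10</int>
        <int name="start">0</int>
      </lst>
    </requestHandler>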
Re: processing documents in solr
I'll try reindexing the timestamp. The id-creation approach suggested by Erick sounds attractive, but the Nutch/Solr integration seems rather tight; I don't know where to break in to insert the id into Solr.

On Mon, Jul 29, 2013 at 4:11 AM, Erick Erickson erickerick...@gmail.com wrote:

No, SolrJ doesn't provide this automatically. You'd be providing the counter by inserting it into the document as you created new docs. You could do this with any kind of document creation you are using.

Best, Erick
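To make the bookmark approach concrete, here is a minimal SolrJ sketch of an exhaustive, resumable walk over the index. The counter field name is an assumption, and the bookmark would need to be persisted between runs:

    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class BatchWalker {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            long lastSeen = 0;  // bookmark: highest counter processed so far
            while (true) {
                SolrQuery q = new SolrQuery("*:*");
                // exclusive lower bound: only documents added after the bookmark
                q.addFilterQuery("counter:{" + lastSeen + " TO *}");
                q.setSortField("counter", SolrQuery.ORDER.asc);
                q.setRows(1000);
                List<SolrDocument> docs = server.query(q).getResults();
                if (docs.isEmpty()) break;  // exhausted the index (for now)
                for (SolrDocument doc : docs) {
                    // ... extract statistics from doc here ...
                    lastSeen = ((Number) doc.getFieldValue("counter")).longValue();
                }
            }
        }
    }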
Re: SolrCloud and Joins
A join may seem clean, but it will be slow and (currently) doesn't work in a cluster. You find all the sentences in a media set by searching for that set id and requesting only the sentence_id (yes, you need that). Then you reindex them. With small documents like this, it is probably fairly fast.

If you can't estimate how often the media sets will change or the size of the changes, then you aren't ready to choose a design.

wunder

-- Walter Underwood wun...@wunderwood.org
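A hedged sketch of Walter's reindex-on-change approach using Solr 4.x atomic updates. It assumes the documents have an "id" uniqueKey, that media_sets_id is a stored multivalued field, that all other fields are stored (which atomic updates require), and the set/source ids used are hypothetical:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class MediaSetUpdater {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            int newMediaSetId = 42;  // hypothetical new media set
            // find the sentences belonging to the sources in the new set
            SolrQuery q = new SolrQuery("media_id:(101 102 103)");  // hypothetical source ids
            q.setFields("id");
            q.setRows(1000);  // page through in batches in real use
            for (SolrDocument hit : server.query(q).getResults()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", hit.getFieldValue("id"));
                Map<String, Object> op = new HashMap<String, Object>();
                op.put("add", newMediaSetId);  // atomic update: append to the multivalued field
                doc.addField("media_sets_id", op);
                server.add(doc);
            }
            server.commit();
        }
    }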
Re: SolrCloud shard down
I am using Solr 4.3.1. I did a hard commit after indexing. I think you're right that the node was still recovering. I didn't think so since it didn't show up as yellow (recovering) on the visual display, but after quite a while it went from Down to Active. Thanks!

On Fri, Jul 26, 2013 at 7:59 PM, Anshum Gupta ans...@anshumgupta.net wrote:

Can you also let me know what version of Solr you are on?

On Sat, Jul 27, 2013 at 8:26 AM, Anshum Gupta ans...@anshumgupta.net wrote:

Hi Katie,

1. First things first, I would strongly advise against manually updating/removing zk or any other info when you're running things in SolrCloud mode, unless you are sure of what you're doing.
2. Also, your node could be currently recovering from the transaction log (did you issue a hard commit after indexing?). The mailing list doesn't allow long texts inline, so it'd be good if you could use something like http://pastebin.com/ to share the log in detail.
3. If you had replicas, you wouldn't need to manually switch. It gets taken care of automatically.

On Sat, Jul 27, 2013 at 4:16 AM, Katie McCorkell katiemccork...@gmail.com wrote:

Hello,

I am using SolrCloud with a ZooKeeper ensemble like example C from the wiki, except with a total of 3 shards and no replicas (oops). After indexing a whole bunch of documents, shard 2 went down and I'm not sure why. I tried restarting it with the jar command and I tried deleting shard 1's zoo_data folder and then restarting, but it is still down, and I'm not sure what to do.

1) Is there any way to avoid reindexing all the data? It's no good to proceed without shard 2 because I don't know which documents are there vs. the other shards, and indexing and querying don't work when one shard is down. I can't exactly tell why restarting it is failing; all I can see is that on the admin tool webpage the shard is yellow in the little cloud diagram. Console messages are copied below.
2) How can I tell the exact problem?
3) If I had had replicas, I could have just switched to shard 2's replica at this point, correct?

Thanks!
Katie

Console message from start.jar ---

2325 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.cloud.ZkController – We are http://172.16.2.182:/solr/collection1/ and leader is http://172.16.2.182:/solr/collection1/
12329 [recoveryExecutor-6-thread-1] WARN org.apache.solr.update.UpdateLog – Starting log replay tlog{file=/opt/solr-4.3.1/example/solr/collection1/data/tlog/tlog.0005179 refcount=2} active=false starting pos=0
12534 [recoveryExecutor-6-thread-1] INFO org.apache.solr.core.SolrCore – SolrDeletionPolicy.onInit: commits:num=1 commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr-4.3.1/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f99ea3c; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_404,generation=5188,filenames=[_1gqo.fdx, _1h1q.nvm, _1h8x.fdt, _1gmi_Lucene41_0.pos, _1gqo.fdt, _1h8s.nvd, _1gmi.si, _1h1q.nvd, _1h6l.fnm, _1h8q.nvm, _1h6l_Lucene41_0.tim, _1h6l_Lucene41_0.tip, _1h8o_Lucene41_0.tim, _1h8o_Lucene41_0.tip, _1aq9_67.del, _1gqo.nvm, _1aq9_Lucene41_0.pos, _1h8q.fdx, _1h1q.fdt, _1h8r.fdt, _1h8q.fdt, _1h8p_Lucene41_0.pos, _1h8s_Lucene41_0.pos, _1h8r.fdx, _1gqo.nvd, _1h8s.fdx, _1h8s.fdt, _1h8x_Lucene41_.

-- Anshum Gupta http://www.anshumgupta.net
Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1
: Here is what my schema looks like

What is your uniqueKey field? I'm going to bet it's tn_lookup_key_id, and I'm going to bet your "lowercase" fieldType has an interesting analyzer on it.

You are probably hitting a situation where the analyzer you have on your uniqueKey field is munging the values in such a way that when the coordinator node decides which N docs to include in the response, and then asks the various shards to give it those specific N docs, those subsequent field-fetching queries fail because of an analysis mismatch.

You need to keep your uniqueKey field simple -- I strongly recommend a basic StrField. If you also want to do lowercase lookups on your key field, index it redundantly in a second field.

: <fields>
:   <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
:   <field name="bill_account_name" type="lowercase" indexed="true" stored="true" required="false"/>
:   <field name="bill_account_nbr" type="lowercase" indexed="true" stored="true" required="false"/>
:   <field name="cust_name" type="lowercase" indexed="true" stored="true" required="false"/>
:   <field name="tn_lookup_key_id" type="lowercase" indexed="true" stored="true" required="true"/>
: </fields>

-Hoss
Re: SolrCloud shard down
On Jul 29, 2013, at 12:49 PM, Katie McCorkell katiemccork...@gmail.com wrote:

I didn't think so since it didn't show up as yellow (recovering) on the visual display, but after quite a while it went from Down to Active. Thanks!

Thanks, I think we should improve this! We should publish a recovery state when replaying the log on startup - right now it uses the down state and only advertises recovery when recovering from the leader. It would be useful to be able to tell when it's recovering from the log replay on startup as well, though.

Feel free to create a JIRA issue - I'll try and get to it otherwise.

- Mark
Re: Performance vs. maxBufferedAddsPerServer=10
Why wouldn't it? Or are you saying that the routing to replicas from the leader is also 10/packet? Hmmm, hadn't thought of that...

On Mon, Jul 29, 2013 at 7:58 AM, Mark Miller markrmil...@gmail.com wrote:

SOLR-4816 won't address this - it will just speed up *different* parts. There are other things that will need to be done to speed up that part.

- Mark

On Jul 26, 2013, at 3:53 PM, Erick Erickson erickerick...@gmail.com wrote:

This is currently a hard-coded limit from what I've understood. From what I remember, Mark said Yonik said that there are reasons to make the packets that size. But whether this is empirically a Good Thing I don't know.

SOLR-4816 will address this a different way by making SolrJ batch up the docs and send them to the right leader, which should pretty much remove any performance consideration here. There's some anecdotal evidence that changing that in the code might improve throughput, but I don't remember the details.

FWIW, Erick

On Thu, Jul 25, 2013 at 7:09 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi,

Context:
* https://issues.apache.org/jira/browse/SOLR-4956
* http://search-lucene.com/c/Solr:/core/src/java/org/apache/solr/update/SolrCmdDistributor.java%7C%7CmaxBufferedAddsPerServer

As you can see, maxBufferedAddsPerServer = 10. We have an app that sends 20K docs to SolrCloud using CloudSolrServer. We batch 20K docs for performance reasons. But then the receiving node ends up sending VERY small batches of just 10 docs around for indexing, and we lose the benefit of batching those 20K docs in the first place. Our app is add-only. Is there anything one can do to avoid the performance loss associated with maxBufferedAddsPerServer=10?

Thanks, Otis
-- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm
Re: DIH to index the data - 250 millions - Need a best architecture
Mishra,

What if you set up DIH with a single SqlEntityProcessor without caching -- does it work for you?

On Mon, Jul 29, 2013 at 4:00 PM, Santanu8939967892 mishra.sant...@gmail.com wrote:

Hi, I have a huge volume of DB records, which is close to 250 million. I am going to use DIH to index the data into Solr. I need the best architecture to index and query the data in an efficient manner. I am using Windows Server 2008 with 16 GB RAM, a Xeon processor, and Solr 4.4.

With Regards, Santanu

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
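For what it's worth, here is a minimal data-config.xml sketch of that suggestion -- a single flat SqlEntityProcessor entity, no caching. The JDBC driver, connection URL, credentials, and table/column names are all placeholders:

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                  url="jdbc:sqlserver://dbhost;databaseName=mydb"
                  user="solr" password="secret"
                  batchSize="10000"/>
      <document>
        <!-- one flat entity; SqlEntityProcessor streams rows from the query -->
        <entity name="record" processor="SqlEntityProcessor"
                query="SELECT id, title, body FROM records"/>
      </document>
    </dataConfig>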
Pentaho Kettle vs DIH
Hello,

Does anyone have experience using Pentaho Kettle to process RDBMS data and pour it into Solr? Is it some sort of replacement for the DIH?

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1
Erick, I had typed tn_lookup_key_id as "lowercase" and it was defined as:

    <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Nitin

On Mon, Jul 29, 2013 at 1:23 PM, Erick Erickson erickerick...@gmail.com wrote:

Nitin:

What was your tn_lookup_key_id field definition when things didn't work? The stock "lowercase" is KeywordTokenizerFactory + LowerCaseFilterFactory, and if this leads to mismatches as Hoss outlined, it'd surprise me, so I need to file it away in my list of things not to do.

Thanks, Erick

On Mon, Jul 29, 2013 at 3:01 PM, Nitin Agarwal 2nitinagar...@gmail.com wrote:

Hoss, you rock! That was the issue. I changed tn_lookup_key_id, which was my unique key field, to string and reloaded the index, and it works. Jason, Jack and Hoss, thanks for your help.

Nitin
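Putting Hoss's advice together, a hedged schema.xml sketch of the fix -- a plain string uniqueKey, with a redundant lowercase copy for case-insensitive lookups. The name of the second field is made up here:

    <field name="tn_lookup_key_id" type="string" indexed="true" stored="true" required="true"/>
    <!-- hypothetical second field for case-insensitive lookups -->
    <field name="tn_lookup_key_id_lc" type="lowercase" indexed="true" stored="false"/>
    <copyField source="tn_lookup_key_id" dest="tn_lookup_key_id_lc"/>

    <uniqueKey>tn_lookup_key_id</uniqueKey>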
solr sizing
Hi all,

We have 70 to 100 million documents, and we want 800 requests per second. How many servers (Amazon EC2 or real hardware) do we need for this? Solr 4.x with SolrCloud, or is it better to use shards with a load balancer?

Is anyone here who can give me some information, or who operates a similar system themselves?

Regards, Torsten
Re: Merged segment warmer Solr 4.4
: I have a slow storage machine and non sufficient RAM for the whole index to
: store all the index. This causes the first queries (~5000) to be very slow
...
: Secondly I thought of initiating a new searcher event listener that queries
: on docs that were inserted since the last hard commit.

The first step in a situation like this should always be to configure at least some autowarming on your queryResultCache and filterCache -- this will not only ensure that some basic warming of your index is done, but will also prime the caches for your newSearcher with actual queries that your Solr instance has already received. Using a newSearcher listener on top of this can be useful for guaranteeing that specific sorts or facets are fast against each new searcher (even if they haven't been queried on before), but I really wouldn't worry about that until you are certain you have autowarming enabled.

: A new ability of solr 4.4 (SOLR-4761) is to configure a mergedSegmentWarmer
: - how does this component work and is it good for my usecase?

The new mergedSegmentWarmer option is extremely low level. It may be useful, but it may also be redundant if you already configure autowarming and/or a newSearcher listener to execute basic queries -- it won't help with things like seeding your filterCache, queryResultCache, or FieldCaches.

-Hoss
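A hedged solrconfig.xml sketch of what Hoss describes; the cache sizes, autowarm counts, and the example warming query are illustrative only, and the sort field is hypothetical:

    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>

    <!-- optional: explicit warming queries against each new searcher -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">timestamp desc</str>
        </lst>
      </arr>
    </listener>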
Re: solr sizing
On 7/29/2013 2:18 PM, Torsten Albrecht wrote:

We have 70 to 100 million documents, and we want 800 requests per second. How many servers (Amazon EC2 or real hardware) do we need for this? Solr 4.x with SolrCloud, or is it better to use shards with a load balancer?

Your question is impossible to answer, aside from generalities that won't really help all that much.

I have a similarly sized system (82 million docs), but I don't have query volume anywhere near what yours is; I've got less than 10 queries per second. I have two copies of my index. I use a load balancer with traditional sharding. I don't do replication; my two index copies are completely independent. I set it up this way long before SolrCloud was released. Having two completely independent indexes lets me do a lot of experimentation that a typical SolrCloud setup won't let me do.

One copy of the index is running 3.5.0 and is about 142GB if you add up all the shards. The other copy of the index is running 4.2.1 and is about 87GB on disk. Each copy of the index runs on two servers: six large cold shards and one small hot shard. Each of those servers has two quad-core processors (Xeon E5400 series, so fairly old now) and 64GB of RAM. I can get away with multiple shards per host because my query volume is so low.

Here's a screenshot of a status servlet that I wrote for my index. There's tons of info here about my index stats: https://dl.dropboxusercontent.com/u/97770508/statuspagescreenshot.png

If I needed to start over from scratch with your higher query volume, I would probably set up two independent SolrCloud installs, each with a replicationFactor of at least two, and I'd use 4-8 shards. I would put a load balancer in front of it so that I could bring one cloud down and have everything still work, though with lower performance. Because of the query volume, I'd only have one shard per host. Depending on how big the index ended up being, I'd want 16-32GB (or possibly more) RAM per host.

You might not need the flexibility of two independent clouds, and it would require additional complexity in your indexing software. If you only went with one cloud, you'd just need a higher replicationFactor. I'd also want to have another set of servers (not as beefy) running an independent SolrCloud with a replicationFactor of 1 or 2 for dev purposes.

That's a LOT of hardware, and it would NOT be cheap. Can I be sure that you'd really need that much hardware? Not really. To be quite honest, you'll just have to set up a proof-of-concept system and be prepared to make it bigger.

Thanks, Shawn
SOLR replication question?
I am currently using Solr 4.4 but not planning to use SolrCloud in the very near future. I have a 3 master / 3 slave setup; each master is linked to its corresponding slave. I have disabled auto polling. We do both push (using MQ) and pull indexing using a SolrJ indexing program. I have enabled soft commit on the slaves (to view the changes pushed by the queue immediately). I am thinking of doing the batch indexing on the masters (optimize and hard commit) and push indexing on both master and slave. I am trying to do more testing with my configuration, but thought of getting some answers before diving very deep...

Since the queue pushes the docs to master and slave, there is a possibility of the slave having more records than the master (when the master is busy doing batch indexing). What would happen if the slave has additional segments compared to the master? Will those be deleted when the replication happens?

If a message is pushed from a queue to both master and slave during replication, will there be a latency in seeing that document even if we use soft commit on the slave? We want to make sure that we are not missing any documents from the queue (since it's updated via UI and we don't really store that data anywhere except in the index).
Solr Cloud - How to balance Batch and Queue indexing?
I need some advice on the best way to implement batch indexing alongside push indexing (via queue) with soft commit when using SolrCloud.

I am trying to figure out a way to:

1. Make the push indexing available almost in real time (using soft commit, sketched below) without degrading search/indexing performance.
2. Not overwrite existing documents (based on listing_id; I assume I can use the overwrite=false flag to disable overwriting).
3. Not block the push indexing when delta indexing happens (push indexing happens via UI, and the user should be able to search for the document pushed via UI almost instantaneously). Delta processing might take more time to complete indexing, and I don't want the queue to wait until the batch processing is complete.
4. Copy the updated collection for backup.

More information on the setup: We have 100 million records (around 6 stored fields / 12 indexed fields). We are planning to have 5 cores (each with 20 million documents) with 5 replicas. We will always be doing delta batch indexing.
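As a reference point for the soft-commit side of this, near-real-time visibility is usually configured in the updateHandler section of solrconfig.xml. This is a hedged sketch; the intervals are illustrative, not a recommendation:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>300000</maxTime>          <!-- hard commit every 5 minutes, for durability -->
        <openSearcher>false</openSearcher> <!-- don't open a new searcher on hard commit -->
      </autoCommit>
      <autoSoftCommit>
        <maxTime>2000</maxTime>            <!-- soft commit every 2 seconds, for NRT visibility -->
      </autoSoftCommit>
    </updateHandler>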
Re: SOLR replication question?
If you are doing replication, then all updates must go to the master server. You cannot update the slave directly. When replication happens, the slave will be identical to the master; any documents sent only to the slave will be lost. Replication will happen according to the interval you have configured, or -- since you say you have disabled polling -- according to whatever schedule you manually trigger a replication on.

SolrCloud would probably be a better fit for you. With a properly configured SolrCloud you just index to any host in the cloud, documents end up exactly where they need to go, and all replicas get updated.

Thanks, Shawn
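For reference, with polling disabled a replication pass can be triggered on a slave through the replication handler; the host and core name here are placeholders:

    http://slave-host:8983/solr/collection1/replication?command=fetchindex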
Re: Streaming Updates Using HttpSolrServer.add(Iterator) In Solr 4.3
I am indexing more than 300 million records; it takes less than 7 hours to index all the records. Send the documents in batches and also use CUSS (ConcurrentUpdateSolrServer) for multi-threading support. Ex:

    ConcurrentUpdateSolrServer server =
        new ConcurrentUpdateSolrServer(solrServerUrl, queueSize, threadCount);
    List<SolrInputDocument> solrDocList = new ArrayList<SolrInputDocument>();
    while (moreRecordsToIndex) {
        solrDocList.add(doc);             // add the document to the current batch
        if (solrDocList.size() >= 100) {
            server.add(solrDocList);      // send the batch of documents to Solr
            solrDocList.clear();          // start a new batch
        }
    }
    if (!solrDocList.isEmpty()) {
        server.add(solrDocList);          // flush the final partial batch
    }
    server.commit();                      // commit after adding all the documents

Using CUSS is only acceptable if you don't care about error handling. If you shut down the Solr server, your program will only see an error on the commit. It will think the update worked perfectly, even though the server is down.
Re: Performance question on Spatial Search
Can you compare with the old geo handler as a baseline?

Bill Bell
Sent from mobile

On Jul 29, 2013, at 4:25 PM, Erick Erickson erickerick...@gmail.com wrote:

This is very strange. I'd expect slow queries on the first few queries while these caches were warmed, but after that I'd expect things to be quite fast. For a 12G index and 256G RAM, you have on the surface a LOT of hardware to throw at this problem. You can _try_ giving the JVM, say, 18G, but that really shouldn't be a big issue; your index files should be MMapped. Let's try the crude thing first and give the JVM more memory.

FWIW, Erick

On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower smb-apa...@alcyon.net wrote:

I've been doing some performance analysis of a spatial search use case I'm implementing in Solr 4.3.0. Basically I'm seeing search times a lot higher than I'd like them to be, and I'm hoping people may have some suggestions for how to optimize further. Here are the specs of what I'm doing now:

Machine:
- 16 cores @ 2.8GHz
- 256GB RAM
- 1TB (RAID 1+0 on 10 SSDs)

Content:
- 45M docs (not very big, only a few fields with no large textual content)
- 1 geo field (using the config below)
- index is 12GB
- 1 shard
- using MMapDirectory

Field config:

    <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
               distErrPct="0.025" maxDistErr="0.00045"
               spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
               units="degrees"/>

    <field name="geopoint" indexed="true" multiValued="false" required="false" stored="true" type="geo"/>

What I've figured out so far:

- Most of my time (98%) is being spent in java.nio.Bits.copyToByteArray(long,Object,long,long), which is being driven by BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(), which from what I gather is basically reading terms from the .tim file in blocks
- I moved from Java 1.6 to 1.7 based upon what I read here: http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/ and it definitely had some positive impact (I haven't been able to measure this independently yet)
- I changed maxDistErr from 0.000009 (which is 1m precision per the docs) to 0.00045 (50m precision)
- It looks to me that the .tim files are being memory-mapped fully (i.e., they show up in pmap output); the virtual size of the JVM is ~18GB (heap is 6GB)
- I've optimized the index, but this doesn't have a dramatic impact on performance

Changing the precision and the JVM upgrade yielded a drop from ~18s avg query time to ~9s avg query time. This is fantastic, but I want to get this down into the 1-2 second range. At this point it seems that I am basically bottlenecked on copying memory out of the mapped .tim file, which leads me to think that the only solution to my problem would be to read less data or somehow read it more efficiently.

If anyone has any suggestions of where to go with this, I'd love to know.

thanks, steve
Re: Performance vs. maxBufferedAddsPerServer=10
Yes, the internal document forwarding path is different and does not use the CloudSolrServer. It currently works with a buffer of 10.

- Mark