Re: SOLR 4.4 - Slave always replicates full index
How are you committing? Are you committing every document? (You shouldn't.) Or, sin of all sins, are you _optimizing_ frequently? That'll cause your entire index to be replicated every time.

Best,
Erick

On Thu, Jan 23, 2014 at 3:26 PM, sureshrk19 <sureshr...@gmail.com> wrote:

Hi,

I have configured a single-core master and slave on 2 different machines. The replication configuration is fine and it is working, but what I observed is that every change to the master index triggers a full replication on the slave. I was expecting only incremental replication on every change.

*Master config:*

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">startup</str>
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
        <str name="commitReserveDuration">00:00:20</str>
      </lst>
      <str name="maxNumberOfBackups">1</str>
    </requestHandler>

*Slave config:*

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://IP:Port/solr/core0/replication</str>
        <str name="pollInterval">00:00:20</str>
      </lst>
    </requestHandler>

What I observed is that the index directory name on the slave instance is appended with a timestamp, i.e., index.<timestamp>. I have seen a similar issue on an older version of SOLR that was fixed in 4.2 (per the description), so I'm not sure if this is related:

https://issues.apache.org/jira/browse/SOLR-4471
http://lucene.472066.n3.nabble.com/Slaves-always-replicate-entire-index-amp-Index-versions-td4041256.html#a4041808

Any pointers would be highly appreciated.

Thanks,
Suresh
Re: solrcloud shards backup/restoration
We've managed some success restoring existing/backed-up indexes into SolrCloud, and even building the indexes offline and dumping the Lucene files into the directories that Solr expects. The general steps we follow are:

1) Round up your files. It doesn't matter if you pull from a master or slave, so long as you've committed and get a consistent copy of the data.

2) Use the collections API to create a collection in Solr (see the curl sketch after this message). The collection you're creating must have the same number of shards as the collection you've backed up and are restoring.

3) Stop all Solr nodes.

4) Remove the index_name/data/ directory from the shards you're going to make the leader. In our case we've got 6 shards and a replication factor of 3 on a 6-node cluster, so each server/JVM has three shards on it. Conveniently, the shards are all either even or odd per JVM.

5) Populate the index_name/data/ directories on your intended leaders. As mentioned above, since we've got six shards and any two JVMs contain the entire index, we only populate the data on two servers.

6) Start up *JUST* the servers that you've just populated. The goal here is to make the servers you've populated the leaders for the new collection and to have the official full copy of the index. Upon startup you might have to wait $leaderVoteWait for previously non-leader servers to time out and become leaders.

7) Once you've got at least one core up in each shard of your collection, go ahead and start the others up.

I think Aditya was failing by removing all the ZooKeeper data and starting everything up at once. If you force Solr's hand a bit to pick leaders with the data that you want, you'll have success when it replicates out to the other nodes.

It might also be possible to do this online by not stopping Solr after creating the empty collection, then copying the files into place on the leaders and issuing a RELOAD to pick up the changed indexes. I'm not sure how replicas would handle that, though.

Thanks,
Greg

On Jan 24, 2014, at 12:47 AM, Allan Mascarenhas <allan.mascarenhas1...@gmail.com> wrote:

Any update on this? I am also stuck with the same problem. I want to install a snapshot of the master Solr server in my local environment, but I couldn't :( Spent almost 2 days trying to figure out the way. Please help!!
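For step 2, the collections API call looks something like this (a sketch; the host, collection name, and config name are hypothetical, and the counts must match the collection that was backed up):

    curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=restored_collection&numShards=6&replicationFactor=3&collection.configName=myconf'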
Re: Solr solr.JSONResponseWriter not escaping backslash '\' characters
Hello,

Thanks to all for the help :-) We have managed to narrow down what exactly is going wrong. My initial thinking that the backslashes within field values were the problem was incorrect. The source of the problem is in fact submitting a document with a blank field value. The JSON containing the problematic value is returned when doing a facet search. Details below:

    # cat test.xml
    <add>
      <doc>
        <field name="id">9553524</field>
        <field name="year"/>
      </doc>
    </add>

    # curl 'http://localhost:8983/solr/collection1/update?commit=true' --data-binary @test.xml -H 'Content-Type: application/xml'
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">369</int></lst>
    </response>

    # curl 'http://localhost:8983/solr/collection1/select?wt=json&facet=true&facet.field=year&facet.mincount=1&json.nl=map&q=id%3A9553524&start=0&rows=3&indent=true'
    {
      "responseHeader": {
        "status": 0,
        "QTime": 8669,
        "params": {
          "facet": "true",
          "facet.mincount": "1",
          "start": "0",
          "q": "id:9553524",
          "facet.field": ["year"],
          "json.nl": "map",
          "wt": "json",
          "rows": "3"
        }
      },
      "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{
          "id": "9553524",
          "year": [],
          "_version_": 1458116227706650624
        }]
      },
      "facet_counts": {
        "facet_queries": {},
        "facet_fields": {
          "year": {
            "": 1
          }
        },
        "facet_dates": {},
        "facet_ranges": {}
      }
    }

As you can see above, the facet counts for the 'year' field contain a blank JSON field name. This errors when parsing with PHP's json_decode(...):

    Fatal error: Cannot access empty property in

The workaround is to not submit empty field values into the index, but this isn't a great solution :-(

Kind Regards,
Steven

On 23 January 2014 18:49, Chris Hostetter-3 [via Lucene] <ml-node+s472066n4113050...@n3.nabble.com> wrote:

: The problem I have is if I try to parse this response in *php* using
: *json_decode()* I get a syntax error because of the '*\n*'s that are in
: the response. I could escape them before doing the *json_decode()*, or at the
: point of submitting to the index, but this seems wrong...

I don't really know anything about PHP, but I managed to muddle my way through both of the little experiments below and couldn't reproduce any error from json_decode when the response contains \n (i.e., the two-byte sequence representing an escaped newline character) inside of a JSON string, but I do get the expected error if a literal, one-byte newline character is in the string (something that Solr doesn't do).

Are you sure that when you fetch the data from Solr you aren't pre-parsing it in some way that's evaluating the \n and converting it to a real newline?

: I am probably doing something silly and a good night's sleep will reveal
: what I am doing wrong ;-)

Good luck.

### Experiment #1, locally created strings, one bogus json

    hossman@frisbee:~$ php -a
    Interactive shell

    php > $valid = '{"id": "newline: (\n)"}';
    php > $bogus = "{\"id\": \"newline: (\n)\"}";
    php > var_dump($valid);
    string(23) "{"id": "newline: (\n)"}"
    php > var_dump($bogus);
    string(22) "{"id": "newline: (
    )"}"
    php > var_dump(json_decode($valid));
    object(stdClass)#1 (1) {
      ["id"]=>
      string(12) "newline: (
    )"
    }
    php > var_dump(json_decode($bogus));
    NULL
    php > var_dump(json_last_error());
    int(4)

### Experiment #2, fetching json data from Solr...
    hossman@frisbee:~$ curl 'http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=json&indent=true&omitHeader=true'
    {
      "response":{"numFound":1,"start":0,"docs":[
          {
            "id":"HOSS",
            "name":"quote: (\") backslash: (\\) backslash-quote: (\\\") newline: (\n) backslash-n: (\\n)",
            "_version_":1458038130437259264}]
      }}

    hossman@frisbee:~$ php -a
    Interactive shell

    php > $data = file_get_contents('http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=json&indent=true&omitHeader=true');
    php > var_dump($data);
    string(227) "{
      "response":{"numFound":1,"start":0,"docs":[
          {
            "id":"HOSS",
            "name":"quote: (\") backslash: (\\) backslash-quote: (\\\") newline: (\n) backslash-n: (\\n)",
            "_version_":1458038130437259264}]
      }}"
    php > var_dump(json_decode($data));
    object(stdClass)#1 (1) {
      ["response"]=>
      object(stdClass)#2 (3) {
        ["numFound"]=>
        int(1)
        ["start"]=>
        int(0)
        ["docs"]=>
        array(1) {
          [0]=>
          object(stdClass)#3 (3) {
            ["id"]=>
            string(4) "HOSS"
            ["name"]=>
            string(78) "quote: (") backslash: (\) backslash-quote: (\") newline: (
    ) backslash-n: (\n)"
            ["_version_"]=>
            int(1458038130437259264)
          }
        }
      }
    }

-Hoss
http://www.lucidworks.com/
Re: Solr solr.JSONResponseWriter not escaping backslash '\' characters
How about using
http://lucene.apache.org/solr/4_6_0/solr-core/org/apache/solr/update/processor/RemoveBlankFieldUpdateProcessorFactory.html

On Friday, January 24, 2014 5:39 PM, stevenNabble <ste...@actual-systems.com> wrote:

> Hello, thanks to all for the help :-) We have managed to narrow down what
> exactly is going wrong. [...]
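If you go that route, a minimal solrconfig.xml wiring for it looks something like this (a sketch; the chain name is arbitrary, and it assumes you want the chain applied to all updates by making it the default):

    <updateRequestProcessorChain name="remove-blanks" default="true">
      <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

With this in place, the empty <field name="year"/> value is dropped before indexing, so the blank facet bucket never appears.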
Loading resources from Zookeeper
Hi,

I'm in the process of moving our organization's search infrastructure to SOLR4/SolrCloud. One of the main points is to centralize our cores' configuration in Zookeeper in order to roll out changes without redeploying all the nodes in our cluster.

Unfortunately I have some code (custom indexers extending org.apache.solr.handler.dataimport.EntityProcessorBase) that assumes it can load resources from the filesystem, and this is now a problem given that everything under solr.home/core/conf is hosted in Zookeeper.

My question is: what is the best way to load a resource from Zookeeper using SOLR APIs?

Regards,
Ugo
Loading resources from Zookeeper using SolrCloud API
Hi,

We have a quite large SOLR 3.6 installation and we are trying to upgrade to 4.6.x. One of the main points in doing this is to get SolrCloud and centralized configuration using Zookeeper.

Unfortunately, some custom code we have (a custom indexer extending org.apache.solr.handler.dataimport.EntityProcessorBase) tries to load resources from the file system, and this is now a problem given that everything under solr.home/core/conf is under Zookeeper.

What is the best way to load resources from Zookeeper using the SolrCloud API?

Regards,
Ugo
Re: Loading resources from Zookeeper
Hi Ugo,

You can load things from the conf/ directory via SolrResourceLoader, which will load either from the filesystem or from ZooKeeper, depending on whether or not you're running in SolrCloud mode.

Alan Woodward
www.flax.co.uk

On 24 Jan 2014, at 16:02, Ugo Matrangolo wrote:

> Hi, I'm in the process of moving our organization's search infrastructure
> to SOLR4/SolrCloud. [...]
Re: Loading resources from Zookeeper using SolrCloud API
The best way is to use the ResourceLoader without relying on ResourceLoader#getConfigDir (which will fail in SolrCloud mode). For example, see openSchema, openConfig, openResource. If you use these APIs, your code will work both with those files being on the local filesystem in non-SolrCloud mode and being in ZooKeeper in SolrCloud mode.

There are also low-level APIs you could use, but I wouldn't normally recommend that.

- Mark

On Jan 24, 2014, at 11:16 AM, Ugo Matrangolo <ugo.matrang...@gmail.com> wrote:

> Hi, we have a quite large SOLR 3.6 installation and we are trying to
> upgrade to 4.6.x. [...]
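For example, from inside a custom DIH entity processor the core's resource loader is reachable through the Context; a minimal sketch (the resource name is hypothetical):

    import java.io.InputStream;
    import org.apache.solr.core.SolrResourceLoader;

    // Inside an EntityProcessorBase subclass, where 'context' is the DIH Context.
    // openResource() reads from conf/ on the local filesystem in standalone mode
    // and from ZooKeeper in SolrCloud mode.
    SolrResourceLoader loader = context.getSolrCore().getResourceLoader();
    InputStream in = loader.openResource("my-custom-rules.txt");
    try {
        // ... consume the stream ...
    } finally {
        in.close();
    }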
What is the right way to bring a failed SolrCloud node back online?
I have an environment where new collections are being added frequently (isolated per customer), and the backup is virtually guaranteed to be missing some of them. As it stands, bringing up the restored/out-of-date instance results in those collections being stuck in the 'Recovering' state, because the cores don't exist on the resulting server. This can also be extended to the case of restoring a completely blank instance.

Is there any way to tell SolrCloud "Try recreating any missing cores for this collection based on where you know they should be located"? Or do I need to actually determine a list of cores (..._shardX_replicaY) and trigger the core creates myself, at which point I gather that it will start recovery for each of them?

-- Nathan

Nathan Neulinger nn...@neulinger.org
Neulinger Consulting (573) 612-1412
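If it comes down to triggering the core creates yourself, the core admin call looks something like this per missing replica (a sketch; host, collection, shard, and core names are hypothetical):

    curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=customerA_shard1_replica2&collection=customerA&shard=shard1'

In 4.x this registers the core under the given collection/shard in ZooKeeper, after which it recovers its index from the shard leader.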
Re: Distributed search with Terms Component and Solr Cloud.
Hi Ryan,

Just take a look at the thread 'TermsComponent/SolrCloud'. Setting your parameters as defaults in solrconfig.xml should help (see the config sketch after this message).

Uwe

On 13.01.2014 20:24, Ryan Fox wrote:

Hello,

I am running Solr 4.6.0. I am experiencing some difficulties using the terms component across multiple shards. I see according to the documentation that it should work, but I am unable to do so with SolrCloud. When I have one shard, queries using the terms component respond as I would expect. However, when I split my index across two shards, I get empty results for the same query.

I am querying Solr with a CloudSolrServer object. When I manually add the query params shards and shards.qt to my SolrQuery, I get the expected response. It's not ideal, but if there's a way to get a list of all shards programmatically, I could set that parameter.

From the documentation, it appears to me the terms component should be supported by SolrCloud, but I can't find anything that explicitly says one way or the other. If there is a better way to do it, or perhaps something I have misconfigured, any advice would be much appreciated. If it's just not possible, I will manage. I can provide more configuration or specifics on how I am running the query if that would help.

Ryan Fox
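For reference, those defaults can look something like this in solrconfig.xml (a sketch; /terms is the conventional handler name, and shards.qt must point back at that same handler so distributed sub-requests use it too):

    <requestHandler name="/terms" class="solr.SearchHandler">
      <lst name="defaults">
        <bool name="terms">true</bool>
        <bool name="distrib">true</bool>
        <str name="shards.qt">/terms</str>
      </lst>
      <arr name="components">
        <str>terms</str>
      </arr>
    </requestHandler>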
Re: Searching and scoring with block join
Quoting Mikhail Khludnev <mkhlud...@griddynamics.com>:

> Nesting query parsers is shown at
> http://blog.griddynamics.com/2013/12/grandchildren-and-siblings-with-block.html
> Try to start from the following:
> title:Test _query_:"{!parent which=is_parent:true}{!dismax qf=content_de}Test"
> Keep in mind local-param referencing, e.g. {!... v=$nest} nest=...

Thank you for the hint. I don't really know how {!dismax ...} and local parameter referencing solve my problem. I read your blog entry, but I have some trouble understanding how I can use your explanations. Would you mind giving me a short example of how these query params help me get a proper result with a combined score for parent and children? Thank you very much.

> There is no such param in
> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/join/BlockJoinParentQParser.java#L67
> Raise a feature request issue, at least; don't hesitate to contribute.

Ah, okay, it was a misunderstanding then. I created an issue:
https://issues.apache.org/jira/browse/SOLR-5662

Sorry if I ask stupid questions, but I have just started to work with Solr and some techniques are not very familiar.

Thanks
-Gesh
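A sketch of the referencing variant Mikhail mentions, with the child clause pulled out into its own parameter (the parameter name 'childq' is arbitrary):

    q=title:Test _query_:"{!parent which=is_parent:true v=$childq}"
    childq={!dismax qf=content_de}Test

Keeping the dismax child query in a separate parameter avoids the quoting and escaping problems that come with nesting one parser's syntax inside another's.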
Re: SOLR 4.4 - Slave always replicates full index
Erick,

Thanks for the reply. I'm not committing each document but have the following configuration in solrconfig.xml (commit every 5 mins):

    <autoCommit>
      <maxTime>30</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

Also, if you look at my master config, I do not have 'optimize':

    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>

Is there any other option which triggers 'optimize'?

Thanks,
Suresh
Solr server requirements for 100+ million documents
Hi,

Currently we are indexing 10 million documents from a database (10 db data entities); index size is around 8 GB, on a Windows virtual box. Indexing in one shot takes 12+ hours, while indexing in parallel in separate cores and merging them together takes 4+ hours.

We are looking to scale to 100+ million documents and looking for recommendations on server requirements for a Production environment, on the parameters below. There can be 200+ users performing searches at the same time.

- Number of physical servers (considering SolrCloud)
- Memory requirement
- Processor requirement (# cores)
- Linux as OS as opposed to Windows

Thanks in advance.
Susheel
Re: SOLR 4.4 - Slave always replicates full index
On 1/24/2014 10:36 AM, sureshrk19 wrote:

> I'm not committing each document but have the following configuration in
> solrconfig.xml (commit every 5 mins):
>
>     <autoCommit>
>       <maxTime>30</maxTime>
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>
> Also, if you look at my master config, I do not have 'optimize':
>
>     <str name="replicateAfter">startup</str>
>     <str name="replicateAfter">commit</str>
>
> Is there any other option which triggers 'optimize'?

I think Erick was actually asking if you are optimizing your index frequently, not whether you have replication configured to replicate after optimize. Optimizing your index (a forced merge down to one Lucene index segment) is something you have to do yourself. It won't happen automatically. If you optimize your index, all old segments are gone and only a single new segment remains. Even if you don't replicate immediately, the next time you commit, the entire index will need to be copied to the slave.

Your autoCommit cannot be the only committing that you do, because that configuration will not make new documents visible - it has openSearcher=false. Therefore, if you are adding new content, you must be doing additional soft commits, or hard commits with openSearcher=true. This might be accomplished with a parameter on your updates, like commit, softCommit, or commitWithin. It might also be an explicit commit.

Optimizing *IS* a useful feature, but if you optimize very frequently (especially if it's done every time you add new documents), Solr's performance will really suffer.

Personal anecdote: One of my shards is very tiny and holds all new content. That gets optimized once an hour. In general, this is pretty frequent, but it happens very quickly, so in my setup it's not excessive. That is a LOT more often than what I do for my other shards, the large ones. I optimize one of those once every day, so each one only gets optimized once every six days.

Thanks,
Shawn
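For reference, the common pattern is a hard autoCommit with openSearcher=false for durability, paired with a soft commit that controls visibility; a sketch with illustrative intervals:

    <autoCommit>
      <maxTime>300000</maxTime>           <!-- hard commit every 5 minutes; flushes segments, no new searcher -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>10000</maxTime>            <!-- new documents become searchable within ~10 seconds -->
    </autoSoftCommit>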
Complex nested structure in solr
Hi guys,

I have to load extra metadata into an existing collection. This is what I am looking for:

For a UPC: store availability by merchantId per location (which has lat/lon).

My query pattern will be: given a keyword, find all available products for a merchantId around the given lat/lon.

Example:
Input: keyword=ipod, merchantId=922, lat/lon=28.222,82.333
Output: list of UPCs which match the criteria

So how should I go about doing it? Any suggestions?

--
Thanks,
-Utkarsh
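One way to sketch the query side of this, assuming a schema with a merchantId field and a spatial location field (field names here are hypothetical):

    q=ipod&fq=merchantId:922&fq={!geofilt sfield=location pt=28.222,82.333 d=10}

Here pt is lat,lon and d is the radius in kilometers. The modeling question is then how to flatten availability into documents (e.g. one document per UPC + merchant + store location) so that this filter combination works.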
Re: Solr server requirements for 100+ million documents
Can't be done with the information you provided, and can only be guessed at even with more comprehensive information. Here's why:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Also, at a guess, your indexing speed is so slow due to data acquisition; I rather doubt you're being limited by raw Solr indexing. If you're using SolrJ, try commenting out the server.add() bit and running again. My guess is that your indexing speed will be almost unchanged, in which case the data acquisition process is where you should concentrate your efforts. As a comparison, I can index 11M Wikipedia docs on my laptop in 45 minutes without any attempt at parallelization.

Best,
Erick

On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar <susheel.ku...@thedigitalgroup.net> wrote:

> Hi, Currently we are indexing 10 million documents from a database (10 db
> data entities); index size is around 8 GB, on a Windows virtual box. [...]
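A minimal SolrJ sketch of that experiment (the URL, record type, and field names are hypothetical; HttpSolrServer is the SolrJ 4.x client class):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch: time the full loop, then comment out server.add() and re-run.
    // If the elapsed time barely changes, data acquisition is the bottleneck.
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    long start = System.currentTimeMillis();
    for (MyRecord rec : fetchFromDatabase()) {   // hypothetical acquisition step
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", rec.getId());
        doc.addField("title", rec.getTitle());
        server.add(doc);                         // comment out to measure acquisition alone
    }
    server.commit();
    System.out.println("indexed in " + (System.currentTimeMillis() - start) + " ms");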
RE: Solr server requirements for 100+ million documents
Thanks, Erick, for the info. For indexing, I agree more time is consumed in data acquisition, which in our case is from a database. Currently we are indexing via a manual process, i.e. the Data Import screen on the Solr dashboard, but are now looking to automate. How do you suggest we automate the indexing part? Do you recommend SolrJ, or should we try to automate using curl?

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, January 24, 2014 2:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

> Can't be done with the information you provided, and can only be guessed
> at even with more comprehensive information. [...]
Re: Solr server requirements for 100+ million documents
I just indexed 100 million db docs (records) with 22 fields (4 multivalued) in 9524 sec using libcurl. 11 million took 763 seconds, so the speed drops somewhat with increasing db size. We write 1000 docs (just an arbitrary number) in each request, from two threads. If you will be using SolrCloud you will want more writer threads.

The hardware is a single cheap HP DL320E GEN8 V2 1P E3-1220V3 with one SSD and 32GB RAM, and Solr runs on Ubuntu 13.10 inside an ESXi virtual machine.

/svante

2014/1/24 Susheel Kumar <susheel.ku...@thedigitalgroup.net>

> Thanks, Erick, for the info. For indexing, I agree more time is consumed
> in data acquisition, which in our case is from a database. [...]
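The shape of such a batched request, shown with plain curl (a sketch; field names are hypothetical, and svante's setup drives the equivalent from libcurl):

    curl 'http://localhost:8983/solr/collection1/update' \
      -H 'Content-Type: application/json' \
      --data-binary '[
        {"id":"doc1","title":"first"},
        {"id":"doc2","title":"second"}
      ]'

with a commit issued separately at the end of the run.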
Re: Solr server requirements for 100+ million documents
Hi Susheel,

Like Erick said, it's impossible to give precise recommendations, but making a few assumptions and combining them with experience (+ a licked finger in the air):

* 3 servers
* 32 GB
* 2+ CPU cores
* Linux

Assuming docs are not bigger than a few KB, that they are not being reindexed over and over, that you don't have a search rate higher than a few dozen QPS, that your queries are not a page long, etc. - assuming best practices are followed, the above should be sufficient.

I hope this helps.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar <susheel.ku...@thedigitalgroup.net> wrote:

> Hi, Currently we are indexing 10 million documents from a database (10 db
> data entities); index size is around 8 GB, on a Windows virtual box. [...]
Replica not consistent after update request?
How can we issue an update request and be certain that all of the replicas in the SolrCloud cluster are up to date?

I found this post: http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/79886 which seems to indicate that all replicas for a shard must finish/succeed before it returns to the client that the operation succeeded - but we've been seeing behavior lately (until we configured automatic soft commits) where the replicas were almost always not current, i.e., the replicas were missing documents, etc.

Is this something wrong with our cloud setup/replication, or am I misinterpreting the way that updates in a cloud deployment are supposed to function? If it's a problem with our cloud setup, do you have any suggestions on diagnostics? Alternatively, are we perhaps just using it wrong?

-- Nathan

Nathan Neulinger nn...@neulinger.org
Neulinger Consulting (573) 612-1412
Re: Replica not consistent after update request?
If you're on Solr 4.6 then this is likely the issue: https://issues.apache.org/jira/browse/SOLR-4260. The issue is resolved for Solr 4.6.1, which should be out next week.

Joel Bernstein
Search Engineer at Heliosearch

On Fri, Jan 24, 2014 at 9:52 PM, Nathan Neulinger <nn...@neulinger.org> wrote:

> How can we issue an update request and be certain that all of the replicas
> in the SolrCloud cluster are up to date? [...]
Re: Replica not consistent after update request?
Hi Nathan,

It'd be great to have more information about your setup - Solr version? Depending upon your version, you might want to also look at https://issues.apache.org/jira/browse/SOLR-4260 (which is now fixed).

On Fri, Jan 24, 2014 at 6:52 PM, Nathan Neulinger <nn...@neulinger.org> wrote:

> How can we issue an update request and be certain that all of the replicas
> in the SolrCloud cluster are up to date? [...]

--
Anshum Gupta
http://www.anshumgupta.net
Re: Replica not consistent after update request?
Wow, the detail in that jira issue makes my brain hurt... Great to see it's got a quick answer/fix! Thank you!

-- Nathan

On 01/24/2014 09:43 PM, Joel Bernstein wrote:

> If you're on Solr 4.6 then this is likely the issue:
> https://issues.apache.org/jira/browse/SOLR-4260. [...]

Nathan Neulinger nn...@neulinger.org
Neulinger Consulting (573) 612-1412
Re: Replica not consistent after update request?
It's 4.6.0. Pair of servers with an external 3-node ZK ensemble.

SOLR-4260 looks like a very promising answer. Will check it out as soon as 4.6.1 is released. May also check out the nightly builds, since this is still just development/prototype usage.

-- Nathan

On 01/24/2014 09:45 PM, Anshum Gupta wrote:

> Hi Nathan, It'd be great to have more information about your setup - Solr
> version? [...]

Nathan Neulinger nn...@neulinger.org
Neulinger Consulting (573) 612-1412
Re: Replica not consistent after update request?
Right. The updates are guaranteed to be on the replicas and in their transaction logs. That doesn't mean they're searchable, however. For a document to be found in a search there must be a commit, either soft, or hard with openSearcher=true. Here's a post that outlines all this.

If you have discrepancies even after commits, that's a problem.

Best,
Erick

On Fri, Jan 24, 2014 at 8:52 PM, Nathan Neulinger <nn...@neulinger.org> wrote:

> How can we issue an update request and be certain that all of the replicas
> in the SolrCloud cluster are up to date? [...]
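For completeness, an explicit soft commit can be issued as a simple update request (a sketch; host and collection are hypothetical):

    curl 'http://localhost:8983/solr/collection1/update?softCommit=true'

after which the added documents should be visible to searches on the replicas.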
RE: Solr server requirements for 100+ million documents
Thanks, Svante. Your indexing speed using the db seems really fast. Can you please provide some more detail on how you are indexing db records? Is it through DataImportHandler? And what database? Is it a local db?

We are indexing around 70 fields (60 multivalued), but data is not always populated in all fields. The average document size is 5-10 KB.

-----Original Message-----
From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of svante karlsson
Sent: Friday, January 24, 2014 5:05 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

> I just indexed 100 million db docs (records) with 22 fields (4 multivalued)
> in 9524 sec using libcurl. [...]
Re: Solr server requirements for 100+ million documents
Can you post the complete solrconfig.xml and schema.xml files, so we can review all of the settings that would impact your indexing performance?

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa

On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <susheel.ku...@thedigitalgroup.net> wrote:

> Thanks, Svante. Your indexing speed using the db seems really fast. Can you
> please provide some more detail on how you are indexing db records? [...]