RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
The easiest, and coarsest, measure of response time (not service time, in a distributed system) can be picked up in your localhost_access.log file. You're using Tomcat, right? Look up AccessLogValve in the docs and in server.xml; you can add configuration to report the payload size and the time to service each request without touching any code.

Queueing theory is what Otis was talking about when he said you've saturated your environment. In AWS people just auto-scale up and don't worry about where the load comes from; it's dumb if it happens more than a couple of times. Capacity planning is tough; let's hope it doesn't disappear altogether. G'luck

-----Original Message-----
From: S.L [mailto:simpleliving...@gmail.com]
Sent: Monday, October 27, 2014 9:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

Good point about the ZK logs; I do see the following exceptions intermittently in the ZK log:

2014-10-27 06:54:14,621 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO [CommitProcessor:1:ZooKeeperServer@617] - Established session 0x14949db9da40037 with negotiated timeout 1 for client /xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14949db9da40037, likely client has closed socket at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:744)

As for queueing theory, I don't know of any way to see how fast the requests are being served by SolrCloud, or whether a queue builds up when the service rate is slower than the rate of requests from the incoming multiple threads.

On Mon, Oct 27, 2014 at 7:09 PM, Will Martin wmartin...@gmail.com wrote:
2 naïve comments, of course.
- Queuing theory
- Zookeeper logs.

From: S.L [mailto:simpleliving...@gmail.com]
Sent: Monday, October 27, 2014 1:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

Please find the clusterstate.json attached. Also, in this case at least the Shard1 replicas are out of sync, as can be seen below.

Shard 1 replica 1 *does not* return a result with distrib=false.

Query:
http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

Result:
<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="0" start="0"/><lst name="debug"/></response>

Shard1 replica 2 *does* return the result with distrib=false.
Query:
http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

Result:
<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="thingURL">http://www.xyz.com</str><str name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str><long name="_version_">1483135330558148608</long></doc></result><lst name="debug"/></response>

On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
On Mon, Oct 27, 2014 at 9:40 PM, S.L simpleliving...@gmail.com wrote:
One is not smaller than the other, because numDocs is the same for both replicas and essentially they seem to be disjoint sets.
That is strange. Can we see your clusterstate.json? With that, please also
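Checks like the two above can be scripted: query each replica with distrib=false and compare numFound per document id. A minimal sketch, where the two XML strings stand in for responses you would really fetch from each replica's /select handler (the host labels are hypothetical):

```python
import xml.etree.ElementTree as ET

def num_found(response_xml):
    """Extract numFound from a Solr XML response (wt=xml)."""
    root = ET.fromstring(response_xml)
    result = root.find(".//result[@name='response']")
    return int(result.get("numFound"))

# Stand-ins for the two distrib=false responses shown above.
replica1 = '<response><result name="response" numFound="0" start="0"/></response>'
replica2 = '<response><result name="response" numFound="1" start="0"/></response>'

counts = {"server3": num_found(replica1), "server2": num_found(replica2)}
out_of_sync = len(set(counts.values())) > 1
print(counts, "OUT OF SYNC" if out_of_sync else "in sync")
```

Running this over a sample of ids per shard gives a quick picture of how far the replicas have drifted.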
Log message zkClient has disconnected.
Hi, I am getting the following INFO log messages many times during my indexing. The indexing process reads records from a database and, using multiple threads, sends them for indexing in batches. There are four shards and one embedded Zookeeper on one of the shards.

org.apache.zookeeper.ClientCnxn$SendThread run
INFO: Client session timed out, have not heard from server in 9276ms for sessionid id, closing socket connection and attempting reconnect
org.apache.solr.common.cloud.ConnectionManager process
INFO: Watcher org.apache.solr.common.cloud.ConnectionManager@3debc153 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:Disconnected type:None path:null path:null type:None
org.apache.solr.common.cloud.ConnectionManager process
INFO: zkClient has disconnected

Kindly help me understand the possible cause of the Zookeeper disconnection.

Thanks,
Modassar
unable to build solr 4.10.1
Hi, I am getting the below error while doing ant dist.

:: problems summary ::
[ivy:retrieve] WARNINGS
[ivy:retrieve] [FAILED ] javax.activation#activation;1.1.1!activation.jar(javadoc): (0ms)
[ivy:retrieve] shared: tried
[ivy:retrieve] /home/.ivy2/shared/javax.activation/activation/1.1.1/javadocs/activation.jar
[ivy:retrieve] public: tried
[ivy:retrieve] http://repo1.maven.org/maven2/javax/activation/activation/1.1.1/activation-1.1.1-javadoc.jar
[ivy:retrieve] :: FAILED DOWNLOADS ::
[ivy:retrieve] :: ^ see resolution messages for details ^ ::
[ivy:retrieve] :: javax.activation#activation;1.1.1!activation.jar(javadoc)
[ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

BUILD FAILED
/home/solr_trunk/solr-4.10.1/solr/build.xml:339: The following error occurred while executing this line:
/home/solr_trunk/solr-4.10.1/solr/common-build.xml:438: The following error occurred while executing this line:
/home/solr_trunk/solr-4.10.1/solr/contrib/contrib-build.xml:52: impossible to resolve dependencies: resolve failed - see output for details

Please tell me if I am doing something wrong.

Thanks,
Karunakar.
RE: unable to build solr 4.10.1
There is no Javadoc jar at that location. Does that help?

-----Original Message-----
From: Karunakar Reddy [mailto:karunaka...@gmail.com]
Sent: Tuesday, October 28, 2014 2:41 AM
To: solr-user@lucene.apache.org
Subject: unable to build solr 4.10.1

Hi, I am getting the below error while doing ant dist.
[...]
RE: Log message zkClient has disconnected.
Modassar: Can you share your hw setup? And what size are your batches? Can you make them smaller? It doesn't mean your throughput will necessarily suffer.

Regards,
Will

-----Original Message-----
From: Modassar Ather [mailto:modather1...@gmail.com]
Sent: Tuesday, October 28, 2014 2:12 AM
To: solr-user@lucene.apache.org
Subject: Log message zkClient has disconnected.

Hi, I am getting the following INFO log messages many times during my indexing. The indexing process reads records from a database and, using multiple threads, sends them for indexing in batches. There are four shards and one embedded Zookeeper on one of the shards.
[...]
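Besides batch size, the ZooKeeper session timeout that governs these disconnect messages is worth checking. As a sketch, in a 4.x-style solr.xml it looks like the following (the 30000 ms value is illustrative, not a recommendation):

```xml
<solr>
  <solrcloud>
    <!-- Session timeout used by Solr's ZooKeeper client; a pause
         longer than this drops the session and logs a disconnect. -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>
```

Raising the timeout only hides short stalls; if pauses are regular, the underlying cause (GC, network, load) still needs fixing.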
Re: unable to build solr 4.10.1
Hi Martin, Thanks for your quick response. Yes, the specified file is not present at that location. How can I disable this? I have no experience with the ant/ivy build tools.

Thanks and Regards,
Karunakar.

On Tue, Oct 28, 2014 at 12:15 PM, Will Martin wmartin...@gmail.com wrote:
There is no Javadoc jar at that location. Does that help?
[...]
Re: SolrCloud config question and zookeeper
Yes, garbage collection is a very good argument to have external zookeepers; I hadn't thought about that. But does this also mean a separate server for each zookeeper, or can they live side by side with Solr on the same server? What is the problem with 4 zookeepers, besides that I have no real gain over 3 zookeepers (only 1 can fail)?

Regards,
Bernd

Am 27.10.2014 um 15:41 schrieb Michael Della Bitta:
You want external zookeepers. Partially because you don't want your Solr garbage collections holding up zookeeper availability, but also because you don't want your zookeepers going offline if you have to restart Solr for some reason. Also, you want 3 or 5 zookeepers, not 4 or 8.

On 10/27/14 10:35, Bernd Fehling wrote:
While starting now with SolrCloud I tried to understand the sense of an external zookeeper. Let's assume I want to split 1 huge collection across 4 servers. My straightforward idea is to set up a cloud with 4 shards (one on each server) and also have a replica of each shard on another server:

server_1: shard_1, shard_replication_4
server_2: shard_2, shard_replication_1
server_3: shard_3, shard_replication_2
server_4: shard_4, shard_replication_3

In this configuration I always have all 4 shards available if one server fails. But now to zookeeper. I would start the internal zookeeper for all shards including replicas. Does this make sense? Or I only start the internal zookeeper for shards 1 to 4 but not the replicas. Should be good enough, one server can fail, or not? Or I follow the recommendations and install on all 4 servers an external separate zookeeper, but what is the advantage over having the internal zookeeper on each server? I really don't get it at this point. Can anyone help me here?

Regards,
Bernd
Re: unable to build solr 4.10.1
The following link might help: https://wiki.apache.org/solr/HowToCompileSolr

You might need to run ant ivy-bootstrap as described in the link above.

On Tue, Oct 28, 2014 at 12:20 PM, Karunakar Reddy karunaka...@gmail.com wrote:
Hi Martin, Thanks for your quick response. Yes, the specified file is not present at that location. How can I disable this? I have no experience with the ant/ivy build tools.
[...]
Re: Log message zkClient has disconnected.
Hi Will, Thanks for your response.

These SolrCloud instances are 8-core machines with 24 GB of RAM each assigned to Tomcat. The indexer machine starts with -Xmx16g. All these machines are connected to the same switch. The batch size is 5000 documents, and there are 8 threads, each adding a 5000-document batch to SolrCloud. I have tried a bigger batch size, but that caused an OutOfMemory error. I can see the SolrCloud instances are not running out of memory or going low on memory, and CPU utilization is also around 50% on each core. The indexer is using the maximum of its assigned memory (-Xmx16g) but is not going out of memory.

Thanks,
Modassar

On Tue, Oct 28, 2014 at 12:18 PM, Will Martin wmartin...@gmail.com wrote:
Modassar: Can you share your hw setup? And what size are your batches? Can you make them smaller? It doesn't mean your throughput will necessarily suffer.
[...]
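A rough back-of-envelope from the numbers above shows why bigger batches hit OutOfMemory: with 8 threads each holding a 5000-document batch, a 16 GB indexer heap leaves on the order of 400 KB per buffered document before any other overhead. A sketch of the arithmetic:

```python
heap_bytes = 16 * 1024**3          # indexer started with -Xmx16g
threads, batch = 8, 5000           # setup described in the message above
docs_in_flight = threads * batch   # documents buffered at once
per_doc_budget = heap_bytes // docs_in_flight
print(docs_in_flight, "docs in flight;", per_doc_budget, "bytes/doc budget")
```

If average document size approaches that budget, the indexer heap, not Solr, is the limit, which matches the observation that the Solr nodes themselves have memory to spare.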
Re: SolrCloud config question and zookeeper
As Michael says, you really want an odd number of zookeepers in order to meet the quorum requirements (which, based on your comments, you seem to be aware of). There is nothing wrong with 4 ZKs as such, just that it doesn't buy you anything above having 3, so it's one more thing that might go wrong and cause you problems. In your case, I would suggest you just pick the first 3 machines to run ZK, or even have 3 other machines outside the cloud to house ZK.

The offline argument is also a good one: you really want your ZK instances to be longer lived than Solr. Whilst you can restart individual Cores within a Solr instance, it is often (at least for us) more convenient to bounce the whole Java instance. In that scenario (again, just re-iterating what Michael said), you don't want ZK to be down at the same time.

If you are using Solr Cloud, then all your replicas need to be connected to ZK; you can't have the master instances in ZK and the replicas not connected (that's more of the old master-slave replication system, which is still available but orthogonal to Cloud).

On 28 October 2014 07:01, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
Yes, garbage collection is a very good argument to have external zookeepers. I hadn't thought about that. But does this also mean a separate server for each zookeeper, or can they live side by side with Solr on the same server? What is the problem with 4 zookeepers, besides that I have no real gain over 3 zookeepers (only 1 can fail)?

Regards,
Bernd

Am 27.10.2014 um 15:41 schrieb Michael Della Bitta:
You want external zookeepers. Partially because you don't want your Solr garbage collections holding up zookeeper availability, but also because you don't want your zookeepers going offline if you have to restart Solr for some reason. Also, you want 3 or 5 zookeepers, not 4 or 8.

On 10/27/14 10:35, Bernd Fehling wrote:
While starting now with SolrCloud I tried to understand the sense of an external zookeeper.
[...]
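The 3-vs-4 question in this thread is plain majority-quorum arithmetic: an ensemble of n ZooKeeper nodes needs floor(n/2) + 1 nodes up, so it tolerates floor((n-1)/2) failures. A quick sketch:

```python
def zk_failures_tolerated(n):
    """Failures a ZooKeeper ensemble of n nodes survives: it needs a
    strict majority (n // 2 + 1) up, so it can lose the rest."""
    quorum = n // 2 + 1
    return n - quorum

for n in (3, 4, 5):
    print(f"{n} ZK nodes: quorum={n // 2 + 1}, failures tolerated={zk_failures_tolerated(n)}")
```

3 and 4 nodes both tolerate exactly one failure, which is why the fourth zookeeper buys nothing.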
Sharding configuration
Hi,

We have a SolrCloud configuration of 10 servers, no sharding, 20 million documents; the index is 26 GB. As the number of documents has increased recently, the performance of the cluster has decreased. We are considering sharding the index in order to measure the latency. What is the best approach?

- use shard splitting and have several sub-shards on the same server and in the same Tomcat instance
- have several shards on the same server but in different Tomcat instances
- have one shard on each server (for example 2 shards / 5 replicas on 10 servers)

What is the impact of these 3 configurations on performance?

Thanks,
Anca

--
Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively for their addressees. If you are not the intended recipient, please delete this message and notify the sender.
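For the first option, shard splitting in Solr 4.x is driven through the Collections API. A sketch of the call (host, collection, and shard names are placeholders):

```shell
# Split shard1 of "collection1" into two sub-shards; the parent shard
# stays up until the sub-shards are active, then becomes inactive.
curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1'
```

The resulting sub-shard replicas can then be added on other nodes, so splitting on one server first and spreading out afterwards is a workable path.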
Re: SolrCloud config question and zookeeper
Thanks for the explanations.

My idea about 4 zookeepers is a result of having the same software (Java, Zookeeper, Solr, ...) installed on all 4 servers. But yes, I don't need to start a zookeeper on the 4th server. 3 other machines outside the cloud for ZK seems a bit oversized, and you have another point of failure with the network between ZK and the cloud. If one of the cloud servers goes up in smoke, the ZK system should still work with ZK and the cloud on the same servers.

So the offline argument says the first thing I start is ZK and the last thing I shut down is ZK. Good point.

While moving from master-slave to cloud I'm aware of the fact that all shards have to be connected to ZK. But how can I tell ZK that on server_1 is the leader of shard_1 AND a replica of shard_4? Unfortunately the "Getting Started with SolrCloud" documentation is a bit short on this.

Regards,
Bernd

Am 28.10.2014 um 09:15 schrieb Daniel Collins:
As Michael says, you really want an odd number of zookeepers in order to meet the quorum requirements (which, based on your comments, you seem to be aware of). There is nothing wrong with 4 ZKs as such, just that it doesn't buy you anything above having 3, so it's one more thing that might go wrong and cause you problems. In your case, I would suggest you just pick the first 3 machines to run ZK, or even have 3 other machines outside the cloud to house ZK. The offline argument is also a good one: you really want your ZK instances to be longer lived than Solr. Whilst you can restart individual Cores within a Solr instance, it is often (at least for us) more convenient to bounce the whole Java instance. In that scenario (again, just re-iterating what Michael said), you don't want ZK to be down at the same time.
On 28 October 2014 07:01, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
[...]
RE: suggestion for new custom atomic update
Shalin and Matthew, Thank you very much.

-----Original Message-----
From: Matthew Nigl [mailto:matthew.n...@gmail.com]
Sent: Monday, October 27, 2014 7:24 PM
To: solr-user@lucene.apache.org
Subject: Re: suggestion for new custom atomic update

No problem Elran. As Shalin mentioned, you will need to do it like this:

<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="mycode.solr_plugins.FieldManipulationProcessorFactory"/>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>

On 28 October 2014 03:22, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
Hi Elran, You need to explicitly specify the DistributedUpdateProcessorFactory in the chain and then add your custom processor after it.

On Mon, Oct 27, 2014 at 9:26 PM, Elran Dvir elr...@checkpoint.com wrote:
Thank you very much for your suggestion. I created an update processor factory with my logic. I changed the update processor chain to be:

<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
<processor class="mycode.solr_plugins.FieldManipulationProcessorFactory"/>

But nothing seems to happen. When I move my class to be first in the chain, the logic runs (not as I want, of course; it's calculated based on the update value rather than the stored value). How can I define a custom update processor factory that will run after DistributedUpdateProcessorFactory?

Thank you very much.

-----Original Message-----
From: Matthew Nigl [mailto:matthew.n...@gmail.com]
Sent: Monday, October 27, 2014 12:10 PM
To: solr-user@lucene.apache.org
Subject: Re: suggestion for new custom atomic update

You can get the summed value, 13, if you add a processor after DistributedUpdateProcessorFactory in the URP chain. Then one possibility would be to clone this value to another field, such as field_b, and run other processors on that field.
Or for something more customized, you can use the StatelessScriptUpdateProcessorFactory and retrieve the value of field_a with:

var doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
var field_a = doc.getFieldValue("field_a");

Note that if you try to get the value of field_a before DistributedUpdateProcessorFactory, then, using your example with an atomic update, the value would be 5 (the value of the increment from the input document).

On 27 October 2014 18:03, Elran Dvir elr...@checkpoint.com wrote:
I will explain with an example. Let's say field_a is sent in the update with the value of 5. field_a is already stored in the document with the value 8. After the update, field_a should have the value 13 (sum). The value of field_b will be based on the value of 13, not 5. Is there a way in a URP to know what value is already stored in field_a?

Thank you very much.

-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Sunday, October 26, 2014 6:07 PM
To: solr-user
Subject: Re: suggestion for new custom atomic update

I am not sure what the problem is. URP catches all operations. So, you can modify the source document to add the calculation when field_a is either new or updated. Or are you trying to calculate things across multiple documents? In that case, neither mine nor your solution will work, I think.

Regards,
Alex.

Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On 26 October 2014 12:00, Elran Dvir elr...@checkpoint.com wrote:
Thanks for your response. If the calculation is based on the most recent summed value of field_a and the value of field_a in the update, how can I?

Thanks.
-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Sunday, October 26, 2014 2:11 PM
To: solr-user
Subject: RE: suggestion for new custom atomic update

Can't you do the calculation in a custom UpdateRequestProcessor?

Regards,
Alex

On 26/10/2014 4:17 am, Elran Dvir elr...@checkpoint.com wrote:
Hi all, Did anyone have a chance to review my idea? Thanks.

-----Original Message-----
From: Elran Dvir
Sent: Monday, October 20, 2014 12:42 PM
To: solr-user
Subject: suggestion for new custom atomic update

Hi all, This is my use case: I have a stored field, field_a, which is atomically updated (let's say by inc). field_a is stored but not indexed due to the large number of distinct values it can have. I need to index field_b (I need facet and stats on it), which is not in the document but whose value is based on a calculation of the
Re: SolrCloud config question and zookeeper
On Tuesday 28 October 2014 10:42:11 Bernd Fehling wrote:
Thanks for the explanations. My idea about 4 zookeepers is a result of having the same software (Java, Zookeeper, Solr, ...) installed on all 4 servers. But yes, I don't need to start a zookeeper on the 4th server. 3 other machines outside the cloud for ZK seems a bit oversized, and you have another point of failure with the network between ZK and the cloud. If one of the cloud servers goes up in smoke, the ZK system should still work with ZK and the cloud on the same servers. So the offline argument says the first thing I start is ZK and the last thing I shut down is ZK. Good point. While moving from master-slave to cloud I'm aware of the fact that all shards have to be connected to ZK. But how can I tell ZK that on server_1 is the leader of shard_1 AND a replica of shard_4?

You don't, it will elect a leader by itself.

Unfortunately the "Getting Started with SolrCloud" documentation is a bit short on this.

Regards,
Bernd

Am 28.10.2014 um 09:15 schrieb Daniel Collins:
As Michael says, you really want an odd number of zookeepers in order to meet the quorum requirements (which, based on your comments, you seem to be aware of). There is nothing wrong with 4 ZKs as such, just that it doesn't buy you anything above having 3, so it's one more thing that might go wrong and cause you problems. In your case, I would suggest you just pick the first 3 machines to run ZK, or even have 3 other machines outside the cloud to house ZK. The offline argument is also a good one: you really want your ZK instances to be longer lived than Solr. Whilst you can restart individual Cores within a Solr instance, it is often (at least for us) more convenient to bounce the whole Java instance. In that scenario (again, just re-iterating what Michael said), you don't want ZK to be down at the same time.
If you are using Solr Cloud, then all your replicas need to be connected to ZK, you can't have the master instances in ZK, and the replicas not connected (that's more of the old Master-Slave replication system which is still available but orthogonal to Cloud). On 28 October 2014 07:01, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Yes, garbage collection is a very good argument to have external zookeepers. I haven't thought about that. But does this also mean seperate server for each zookeeper or can they live side by side with solr on the same server? What is the problem with 4 zookeepers beside that I have no real gain against 3 zookeepers (only 1 can fail)? Regards Bernd Am 27.10.2014 um 15:41 schrieb Michael Della Bitta: You want external zookeepers. Partially because you don't want your Solr garbage collections holding up zookeeper availability, but also because you don't want your zookeepers going offline if you have to restart Solr for some reason. Also, you want 3 or 5 zookeeepers, not 4 or 8. On 10/27/14 10:35, Bernd Fehling wrote: While starting now with SolrCloud I tried to understand the sense of external zookeeper. Let's assume I want to split 1 huge collection accross 4 server. My straight forward idea is to setup a cloud with 4 shards (one on each server) and also have a replication of the shard on another server. server_1: shard_1, shard_replication_4 server_2: shard_2, shard_replication_1 server_3: shard_3, shard_replication_2 server_4: shard_4, shard_replication_3 In this configuration I always have all 4 shards available if one server fails. But now to zookeeper. I would start the internal zookeeper for all shards including replicas. Does this make sense? Or I only start the internal zookeeper for shard 1 to 4 but not the replicas. Should be good enough, one server can fail, or not? 
Or I follow the recommendations and install on all 4 servers an external separate zookeeper, but what is the advantage over having the internal zookeeper on each server? I really don't get it at this point. Can anyone help me here? Regards Bernd
Total term frequency in solr includes deleted documents
Currently I am working on getting the term frequency (not document frequency) of a term in a particular field for the whole index. For that I am using the function query ttf(field_name,'term'), which returns the total occurrences of the term in that field. But it seems it also counts deleted documents. I have verified this using index optimization: after optimization it shows the correct count. How can we get the exact term frequency, excluding deleted documents, without optimization? Optimization is expensive in our case. Is there any other way to get the term frequency for the entire collection in Solr? I have also explored these options: 1. term vector component - returns per-document term frequency for the documents which matched the query. 2. facet - returns document frequency. 3. Luke request handler - returns top terms from a given field (document frequency). 4. terms component - returns document frequency. -- View this message in context: http://lucene.472066.n3.nabble.com/Total-term-frequency-in-solr-includes-deleted-documents-tp4166288.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Log message zkClient has disconnected.
On 10/28/2014 1:48 AM, Modassar Ather wrote: These Solrcloud instances are 8-core machines with a RAM of 24 GB each assigned to tomcat. The Indexer machine starts with -Xmx16g. All these machines are connected to the same switch. If you have not tuned your garbage collection, a 16GB heap will be enough to create garbage collection pauses that are long enough to exceed a 15 second zkClientTimeout, which is the setting that is commonly seen in example configs. I was seeing pauses longer than 12 seconds with ConcurrentMarkSweep enabled on an 8GB heap, before I tuned the GC. With a 16GB heap, it would even be possible to exceed a 30 second timeout, which is the default in later releases. After I tuned the CMS collector, my GC pauses are no longer long enough to cause problems. These are GC settings that have worked for me and for others: http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning Thanks, Shawn
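As a hedged illustration of what the CMS tuning Shawn describes typically looks like in a Tomcat/Solr startup script: the flags below are standard HotSpot options, but the specific values are example placeholders for this sketch, not Shawn's actual settings -- take real values from the wiki page linked above and test against your own workload.

```shell
# Illustration of CMS tuning in a Tomcat setenv.sh (values are examples only).
JAVA_OPTS="$JAVA_OPTS \
  -Xms16g -Xmx16g \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseParNewGC \
  -XX:+CMSParallelRemarkEnabled \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:CMSInitiatingOccupancyFraction=70"
```

Lowering the occupancy fraction makes CMS start concurrent collections earlier, trading CPU for shorter stop-the-world pauses; pair any change with GC logging (-verbose:gc or -Xloggc) so the effect on pause times is measurable against the zkClientTimeout.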
Re: SolrCloud config question and zookeeper
On 10/28/2014 3:42 AM, Bernd Fehling wrote: Thanks for the explanations. My idea about 4 zookeepers is a result of having the same software (java, zookeeper, solr, ...) installed on all 4 servers. But yes, I don't need to start a zookeeper on the 4th server. 3 other machines outside the cloud for ZK seems a bit oversized. And you have another point of failure with the network between ZK and cloud. If one of the cloud servers ends up in smoke the ZK system should still work with ZK and cloud on the same servers. The only problem with 4 zookeepers instead of 3 is that in both situations, only one server can go down. With 3 servers, you have one less possible point of failure, so it's actually a little more stable. Might as well stick with three, or move to five, where you can have two failures. So the offline argument says the first thing I start is ZK and the last thing I shut down is ZK. Good point. While moving from master-slave to cloud I'm aware of the fact that all shards have to be connected to ZK. But how can I tell ZK that on server_1 is leader shard_1 AND replica shard_4? As Markus already said, the leaders will be automatically chosen; there's very little you can do to influence the elections. This should be changing in version 5.0, though. See this umbrella issue and its dependent issues, which add the ability to choose which replicas are the preferred leaders, and force an election: https://issues.apache.org/jira/browse/SOLR-6491 Thanks, Shawn
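The quorum arithmetic behind the 3-vs-4-vs-5 advice in this thread can be written down in a few lines (a toy illustration of majority-quorum math, not ZooKeeper code): an ensemble of n servers needs a strict majority of floor(n/2)+1 voters, so it tolerates floor((n-1)/2) failures.

```python
# Toy illustration of ZooKeeper-style quorum math (not ZooKeeper code):
# an ensemble of n servers stays available while a strict majority
# (floor(n/2) + 1) is up, so it tolerates floor((n-1)/2) failures.

def quorum_size(n: int) -> int:
    """Smallest number of servers that forms a majority of n."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """How many servers can fail before quorum is lost."""
    return n - quorum_size(n)  # equivalently (n - 1) // 2

if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "servers ->", tolerated_failures(n), "tolerated failures")
    # 3 -> 1, 4 -> 1, 5 -> 2: the 4th server adds a failure point
    # without adding any fault tolerance.
```

This is exactly the point above: 4 tolerates no more failures than 3, and 5 is the next size that actually buys anything.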
Re: Total term frequency in solr includes deleted documents
On 10/28/2014 7:16 AM, nutchsolruser wrote: How can we get the exact term frequency, excluding deleted documents, without optimization? Optimization is expensive in our case. Is there any other way we can get the term frequency for the entire collection in Solr? This is not possible except through index optimization. Lucene is amazingly efficient at computing information across the entire index. If it were possible to keep that efficiency while also excluding info from deleted documents, I'm sure it would have already been implemented. Thanks, Shawn
Collapse and Expand Results in Solr 4.10 / Highlighting
Hello! I'm testing the »Collapse and Expand« functionality of Solr 4.10. Collapsing and expanding results is working pretty well but it seems that there's no way to get highlighting snippets for the expanded results. Highlighting is only available for the result name=»response». Am I right or did I miss something? Thanks, Michael
Re: Total term frequency in solr includes deleted documents
Merge policy would probably affect how often _some_ of the deleted documents are purged, at a cost lower than a full optimization. https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig#IndexConfiginSolrConfig-MergingIndexSegments But it is still not a 100% solution. Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On 28 October 2014 09:42, Shawn Heisey apa...@elyograg.org wrote: On 10/28/2014 7:16 AM, nutchsolruser wrote: How can we get exact term frequency with excluding deleted documents term frequency, and that is without optimization because optimization is expensive in our case ? Is there any other way we can get term frequency for entire collection in solr? This is not possible except through index optimization. Lucene is amazingly efficient at computing information across the entire index. If it were possible to keep that efficiency while also excluding info from deleted documents, I'm sure it would have already been implemented. Thanks, Shawn
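The behaviour discussed in this thread is easy to picture: segment-level statistics such as total term frequency are precomputed, while deletes are only a mask over doc ids, so the cheap statistic keeps counting deleted documents until a merge physically drops them. A toy model (plain Python, not Lucene code; the class and field names are invented for illustration):

```python
# Toy model of why an index-level ttf counts deleted documents.
# A "segment" stores per-document term counts plus a set of deleted
# doc ids; the precomputed statistic ignores the deletion mask.

class Segment:
    def __init__(self, term_counts):
        self.term_counts = term_counts   # doc_id -> occurrences of the term
        self.deleted = set()             # doc ids marked deleted, not purged

    def ttf(self):
        # Cheap, precomputed-style stat: sums over ALL docs, deleted or not.
        return sum(self.term_counts.values())

    def exact_ttf(self):
        # Exact count: walk postings and skip deleted docs -- the
        # per-document work the cheap statistic exists to avoid.
        return sum(c for d, c in self.term_counts.items()
                   if d not in self.deleted)

    def merge(self):
        # An optimize/merge physically drops deleted documents,
        # after which the cheap statistic is exact again.
        self.term_counts = {d: c for d, c in self.term_counts.items()
                            if d not in self.deleted}
        self.deleted = set()

seg = Segment({0: 3, 1: 2, 2: 5})
seg.deleted.add(1)
print(seg.ttf(), seg.exact_ttf())  # 10 8: the stat still counts doc 1
seg.merge()
print(seg.ttf())                   # 8: counts agree after the merge
```

This matches the observation in the original question: the count becomes correct only after optimization, and a merge policy that purges deletes sooner narrows (but does not close) the gap.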
Define default Shard in implicit collection
Hi All, I have a collection with the implicit router. When I try to load a document whose route doesn't exist I get the error: org.apache.solr.common.SolrException: No shard called =2015 in DocCollection(COL)={ Is it possible to define a default shard where documents will be loaded when the route doesn't exist? Regards, Nabil.
Re: [ANN] Heliosearch 0.08 released
Is the new faceted search module the cause why I don't have any lucene-facet-hs_0.08.jar in the binary distribution? And what is with lucene-classification and lucene-replicator? How can I build from source, with solr/hs.xml? Regards Bernd Am 27.10.2014 um 17:25 schrieb Yonik Seeley: http://heliosearch.org/download Heliosearch v0.08 Features: o Heliosearch v0.08 is based on (and contains all features of) Lucene/Solr 4.10.2 o Streaming Aggregations over search results API: http://heliosearch.org/streaming-aggregation-for-solrcloud/ o Optimized request logging, and added a logLimit request parameter that limits the size of logged request parameters o A new faceted search module to more easily support future search features o A JSON Facet API to more naturally express Facet Statistics and Nested Sub-Facets http://heliosearch.org/json-facet-api/ Example: curl http://localhost:8983/solr/query -d 'q=*:* json.facet={ categories:{ terms:{// terms facet creates a bucket for each indexed term in the field field : cat, facet:{ avg_price : avg(price), // average price per bucket num_manufacturers : unique(manu), // number of unique manufacturers per bucket my_subfacet: {terms: {...}} // do a sub-facet for every bucket } } } } ' -Yonik http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
Re: [ANN] Heliosearch 0.08 released
On Tue, Oct 28, 2014 at 10:10 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Is the new faceted search module the cause why I don't have any lucene-facet-hs_0.08.jar in the binary distribution? Solr has never used that (and Heliosearch doesn't either). ES never has either AFAIK. And what is with lucene-classification and lucene-replicator? Ditto for these. How can I build from source, with solr/hs.xml? The only thing hs.xml is used for is building the final package. Other stuff uses the straight build.xml... ant test, ant example, etc... There is a shell script in the solr/native directory to build the native code libraries, but if you aren't changing them it's easiest to just take the solr/example/native directory from the heliosearch download. -Yonik http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
Re: Collapse and Expand Results in Solr 4.10 / Highlighting
You are correct. Highlighting is working from the DocList, which only includes the collapsed set when using Collapse/Expand. Joel Bernstein Search Engineer at Heliosearch On Tue, Oct 28, 2014 at 9:46 AM, Michael Hagström mhagstr...@brox.de wrote: Hello! I'm testing the »Collapse and Expand« functionality of Solr 4.10. Collapsing and expanding results is working pretty well but it seems that there's no way to get highlighting snippets for the expanded results. Highlighting is only available for the result name=»response». Am I right or did I miss something? Thanks, Michael
Re: unable to build solr 4.10.1
Hi Karunakar, 4.10.2 (which will be released some time this week) has a fix for this: https://issues.apache.org/jira/browse/LUCENE-6007 The build failure is caused by solr/contrib/dataimporthandler-extras/ivy.xml Apply this patch to make the build succeed: https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_10/solr/contrib/dataimporthandler-extras/ivy.xml?r1=1633651r2=1633650pathrev=1633651 Steve On Oct 28, 2014, at 2:41 AM, Karunakar Reddy karunaka...@gmail.com wrote: Hi , I am getting below error while doing ant dist . :: problems summary :: [ivy:retrieve] WARNINGS [ivy:retrieve] [FAILED ] javax.activation#activation;1.1.1!activation.jar(javadoc): (0ms) [ivy:retrieve] shared: tried [ivy:retrieve] /home/.ivy2/shared/javax.activation/activation/1.1.1/javadocs/activation.jar [ivy:retrieve] public: tried [ivy:retrieve] http://repo1.maven.org/maven2/javax/activation/activation/1.1.1/activation-1.1.1-javadoc.jar [ivy:retrieve] :: [ivy:retrieve] :: FAILED DOWNLOADS:: [ivy:retrieve] :: ^ see resolution messages for details ^ :: [ivy:retrieve] :: [ivy:retrieve] :: javax.activation#activation;1.1.1!activation.jar(javadoc) [ivy:retrieve] :: [ivy:retrieve] [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS BUILD FAILED /home/solr_trunk/solr-4.10.1/solr/build.xml:339: The following error occurred while executing this line: /home/solr_trunk/solr-4.10.1/solr/common-build.xml:438: The following error occurred while executing this line: /home/solr_trunk/solr-4.10.1/solr/contrib/contrib-build.xml:52: impossible to resolve dependencies: resolve failed - see output for details Please tell me if i am doing something wrong. Thanks, Karunakar.
Re: unable to build solr 4.10.1
: I am getting below error while doing ant dist. The build system (up to 4.10.1) was unintentionally requiring that javadoc jars existed -- and this recently manifested as a problem when this particular javadoc jar somehow vanished from maven.org. This issue tracks the fix, which will be in 4.10.2... https://issues.apache.org/jira/browse/LUCENE-6007 Since the missing jar is just javadocs, and is not actually needed to build/run anything, you can just create a dummy file to satisfy the dependency checker, using the shared path mentioned in the error... : [ivy:retrieve] shared: tried : [ivy:retrieve] : /home/.ivy2/shared/javax.activation/activation/1.1.1/javadocs/activation.jar like so... mkdir -p /home/.ivy2/shared/javax.activation/activation/1.1.1/javadocs/ touch /home/.ivy2/shared/javax.activation/activation/1.1.1/javadocs/activation.jar -Hoss http://www.lucidworks.com/
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
Will, I think in one of your other emails (which I am not able to find) you had asked if I was indexing directly from MapReduce jobs. Yes, I am indexing directly from the map task, and that is done using SolrJ with a SolrCloudServer initialized with the ZK ensemble URLs. Do I need to use something like MapReduceIndexerTool, which I suppose writes to HDFS and in a subsequent step is moved to the Solr index? If so, why? I don't use any soft commits and do an autocommit every 15 seconds; the snippet from the configuration is below. <autoSoftCommit> <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime> </autoSoftCommit> <autoCommit> <maxTime>${solr.autoCommit.maxTime:15000}</maxTime> <openSearcher>true</openSearcher> </autoCommit> I looked at the localhost_access.log file; all the GET and POST requests have a sub-second response time. On Tue, Oct 28, 2014 at 2:06 AM, Will Martin wmartin...@gmail.com wrote: The easiest, and coarsest, measure of response time [not service time in a distributed system] can be picked up in your localhost_access.log file. You're using tomcat, right? Look up AccessLogValve in the docs and server.xml. You can add configuration to report the payload and time to service the request without touching any code. Queueing theory is what Otis was talking about when he said you've saturated your environment. In AWS people just auto-scale up and don't worry about where the load comes from; it's dumb if it happens more than 2 times. Capacity planning is tough; let's hope it doesn't disappear altogether. G'luck -Original Message- From: S.L [mailto:simpleliving...@gmail.com] Sent: Monday, October 27, 2014 9:25 PM To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Good point about ZK logs, I do see the following exceptions intermittently in the ZK log. 
2014-10-27 06:54:14,621 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029 2014-10-27 07:00:06,697 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /xxx.xxx.xxx.xxx:37336 2014-10-27 07:00:06,725 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /xxx.xxx.xxx.xxx:37336 2014-10-27 07:00:06,746 [myid:1] - INFO [CommitProcessor:1:ZooKeeperServer@617] - Established session 0x14949db9da40037 with negotiated timeout 1 for client /xxx.xxx.xxx.xxx:37336 2014-10-27 07:01:06,520 [myid:1] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x14949db9da40037, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) at java.lang.Thread.run(Thread.java:744) For queueing theory, I don't know of any way to see how fast the requests are being served by SolrCloud, and whether a queue is maintained when the service rate is slower than the rate of requests from the incoming multiple threads. On Mon, Oct 27, 2014 at 7:09 PM, Will Martin wmartin...@gmail.com wrote: 2 naïve comments, of course. - Queuing theory - Zookeeper logs. From: S.L [mailto:simpleliving...@gmail.com] Sent: Monday, October 27, 2014 1:42 PM To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Please find the clusterstate.json attached. Also in this case at least the Shard1 replicas are out of sync, as can be seen below. Shard 1 replica 1 *does not* return a result with distrib=false. 
Query: http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true Result: <response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="0" start="0"/><lst name="debug"/></response> Shard1 replica 2 *does* return the result with distrib=false. Query: http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%
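To act on the AccessLogValve suggestion earlier in this thread, the log can be summarized offline once the valve records service time. A minimal sketch, assuming a pattern whose last fields are status, bytes, and time in milliseconds (Tomcat's %s, %b, and %D placeholders are real, but your configured pattern may differ, so adjust the parsing accordingly; the sample lines are made up):

```python
# Sketch: summarize request times from a Tomcat access log whose
# AccessLogValve pattern ends in "%s %b %D" (status, bytes, millis).

def request_millis(lines):
    """Extract the trailing milliseconds field from each log line."""
    times = []
    for line in lines:
        parts = line.rsplit(None, 1)          # split off the last token
        if len(parts) == 2 and parts[1].isdigit():
            times.append(int(parts[1]))
    return times

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

# Hypothetical log lines for illustration only.
sample = [
    '10.0.0.1 [27/Oct/2014:21:25:01 -0400] "POST /solr/update HTTP/1.1" 200 41 12',
    '10.0.0.2 [27/Oct/2014:21:25:02 -0400] "GET /solr/select HTTP/1.1" 200 900 250',
]
times = request_millis(sample)
print(max(times))  # 250 -- the slowest request in ms
```

Tail percentiles of these times, compared against the arrival rate from the indexing threads, is the queueing-theory check discussed above: if service time grows as load grows, the environment is saturating.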
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
We index directly from mappers using SolrJ. It does work, but you pay the price of having to instantiate all those sockets vs. the way MapReduceIndexerTool works, where you're writing to an EmbeddedSolrServer directly in the Reduce task. You don't *need* to use MapReduceIndexerTool, but it's more efficient, and if you don't, you then have to make sure to appropriately tune your Hadoop implementation to match what your Solr installation is capable of. On 10/28/14 12:39, S.L wrote: Will, I think in one of your other emails(which I am not able to find) you has asked if I was indexing directly from MapReduce jobs, yes I am indexing directly from the map task and that is done using SolrJ with a SolrCloudServer initialized with the ZK ensemble URLs.Do I need to use something like MapReducerIndexerTool , which I suupose writes to HDFS and that is in a subsequent step moved to Solr index ? If so why ? I dont use any softCommits and do autocommit every 15 seconds , the snippet in the configuration can be seen below. autoSoftCommit maxTime${solr. autoSoftCommit.maxTime:-1}/maxTime /autoSoftCommit autoCommit maxTime${solr.autoCommit.maxTime:15000}/maxTime openSearchertrue/openSearcher /autoCommit I looked at the localhost_access.log file , all the GET and POST requests have a sub-second response time. On Tue, Oct 28, 2014 at 2:06 AM, Will Martin wmartin...@gmail.com wrote: The easiest, and coarsest measure of response time [not service time in a distributed system] can be picked up in your localhost_access.log file. You're using tomcat write? Lookup AccessLogValve in the docs and server.xml. You can add configuration to report the payload and time to service the request without touching any code. Queueing theory is what Otis was talking about when he said you've saturated your environment. In AWS people just auto-scale up and don't worry about where the load comes from; its dumb if it happens more than 2 times. 
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
I'm using Apache Hadoop and Solr; do I need to switch to Cloudera? On Tue, Oct 28, 2014 at 1:27 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: We index directly from mappers using SolrJ. It does work, but you pay the price of having to instantiate all those sockets vs. the way MapReduceIndexerTool works, where you're writing to an EmbeddedSolrServer directly in the Reduce task. You don't *need* to use MapReduceIndexerTool, but it's more efficient, and if you don't, you then have to make sure to appropriately tune your Hadoop implementation to match what your Solr installation is capable of. On 10/28/14 12:39, S.L wrote: Will, I think in one of your other emails(which I am not able to find) you has asked if I was indexing directly from MapReduce jobs, yes I am indexing directly from the map task and that is done using SolrJ with a SolrCloudServer initialized with the ZK ensemble URLs.Do I need to use something like MapReducerIndexerTool , which I suupose writes to HDFS and that is in a subsequent step moved to Solr index ? If so why ? I dont use any softCommits and do autocommit every 15 seconds , the snippet in the configuration can be seen below. autoSoftCommit maxTime${solr. autoSoftCommit.maxTime:-1}/maxTime /autoSoftCommit autoCommit maxTime${solr.autoCommit.maxTime:15000}/maxTime openSearchertrue/openSearcher /autoCommit I looked at the localhost_access.log file , all the GET and POST requests have a sub-second response time. On Tue, Oct 28, 2014 at 2:06 AM, Will Martin wmartin...@gmail.com wrote: The easiest, and coarsest measure of response time [not service time in a distributed system] can be picked up in your localhost_access.log file. You're using tomcat write? Lookup AccessLogValve in the docs and server.xml. You can add configuration to report the payload and time to service the request without touching any code. Queueing theory is what Otis was talking about when he said you've saturated your environment. 
Slow forwarding requests to collection leader
I have three equal machines each running Solr Cloud (4.8). I have multiple collections that are replicated but not sharded. I also have document generation processes running on these nodes, which involves querying the collection ~5 times per document generated. Node 1 has a replica of collection A and is running document generation code that pushes to the HTTP /update/json handler. Node 2 is the leader of collection A. Node 3 does not have a replica of collection A, but is running document generation code for collection A. The issue I see is that node 1 can push documents into Solr 3-5 times faster than node 3 when they both talk to the Solr instance on their localhost. If either of them talks directly to the Solr instance on node 2, the performance is excellent (on par with node 1). To me it seems that the only difference in these cases is the query/put request forwarding. Does this involve some slow zookeeper communication that should be avoided? Any other insights? Thanks
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
Yeah, I get that not using the MapReduceIndexerTool could be more resource intensive, but the way this issue manifests, resulting in disjoint SolrCloud replicas, perplexes me. While you were tuning your SolrCloud environment to cater to the Hadoop indexing requirements, did you ever face the issue of disjoint replicas? Is the MapReduceIndexerTool Cloudera distro specific? I am using Apache Solr and Hadoop. Thanks On Tue, Oct 28, 2014 at 1:27 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: We index directly from mappers using SolrJ. It does work, but you pay the price of having to instantiate all those sockets vs. the way MapReduceIndexerTool works, where you're writing to an EmbeddedSolrServer directly in the Reduce task. You don't *need* to use MapReduceIndexerTool, but it's more efficient, and if you don't, you then have to make sure to appropriately tune your Hadoop implementation to match what your Solr installation is capable of. On 10/28/14 12:39, S.L wrote: Will, I think in one of your other emails(which I am not able to find) you has asked if I was indexing directly from MapReduce jobs, yes I am indexing directly from the map task and that is done using SolrJ with a SolrCloudServer initialized with the ZK ensemble URLs.Do I need to use something like MapReducerIndexerTool , which I suupose writes to HDFS and that is in a subsequent step moved to Solr index ? If so why ? I dont use any softCommits and do autocommit every 15 seconds , the snippet in the configuration can be seen below. autoSoftCommit maxTime${solr. autoSoftCommit.maxTime:-1}/maxTime /autoSoftCommit autoCommit maxTime${solr.autoCommit.maxTime:15000}/maxTime openSearchertrue/openSearcher /autoCommit I looked at the localhost_access.log file , all the GET and POST requests have a sub-second response time. 
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
No you do not, although you may consider it, because you'd be getting a sort of integrated stack. But really, the decision to switch to running Solr in HDFS should not be taken lightly. Unless you are on a team familiar with running a Hadoop stack, or you're willing to devote a lot of effort toward becoming proficient with one, I would recommend against it. On 10/28/14 15:27, S.L wrote: I'm using Apache Hadoop and Solr, do I need to switch to Cloudera? On Tue, Oct 28, 2014 at 1:27 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: We index directly from mappers using SolrJ. It does work, but you pay the price of having to instantiate all those sockets vs. the way MapReduceIndexerTool works, where you're writing to an EmbeddedSolrServer directly in the Reduce task. You don't *need* to use MapReduceIndexerTool, but it's more efficient, and if you don't, you then have to make sure to appropriately tune your Hadoop implementation to match what your Solr installation is capable of. On 10/28/14 12:39, S.L wrote: Will, I think in one of your other emails (which I am not able to find) you had asked if I was indexing directly from MapReduce jobs. Yes, I am indexing directly from the map task, and that is done using SolrJ with a SolrCloudServer initialized with the ZK ensemble URLs. Do I need to use something like MapReduceIndexerTool, which I suppose writes to HDFS and is then moved into the Solr index in a subsequent step? If so, why? I don't use any soft commits and do an autocommit every 15 seconds; the snippet from the configuration can be seen below.

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>

I looked at the localhost_access.log file; all the GET and POST requests have a sub-second response time.
On Tue, Oct 28, 2014 at 2:06 AM, Will Martin wmartin...@gmail.com wrote: The easiest, and coarsest, measure of response time [not service time in a distributed system] can be picked up in your localhost_access.log file. You're using Tomcat, right? Look up AccessLogValve in the docs and server.xml. You can add configuration to report the payload and time to service the request without touching any code. Queueing theory is what Otis was talking about when he said you've saturated your environment. In AWS people just auto-scale up and don't worry about where the load comes from; it's dumb if it happens more than 2 times. Capacity planning is tough, let's hope it doesn't disappear altogether. G'luck -----Original Message----- From: S.L [mailto:simpleliving...@gmail.com] Sent: Monday, October 27, 2014 9:25 PM To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Good point about ZK logs, I do see the following exceptions intermittently in the ZK log.
2014-10-27 06:54:14,621 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO [CommitProcessor:1:ZooKeeperServer@617] - Established session 0x14949db9da40037 with negotiated timeout 1 for client /xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14949db9da40037, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:744)
For queuing theory, I don't know of any way to see how fast the requests are being served by SolrCloud, and whether a queue is being maintained if the service rate is slower than the rate of requests from the incoming multiple threads. On Mon, Oct 27, 2014 at 7:09 PM, Will Martin wmartin...@gmail.com wrote: 2 naïve comments, of course. - Queuing theory - Zookeeper logs. From: S.L [mailto:simpleliving...@gmail.com] Sent: Monday, October 27, 2014 1:42 PM To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Please find the clusterstate.json attached. Also in this case at least the Shard1 replicas are out of sync, as can be seen below. Shard 1 replica 1 *does not* return a result with
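Will's AccessLogValve suggestion amounts to one element in Tomcat's server.xml. A hedged sketch (directory/prefix values are illustrative; in the pattern, %b reports the response payload in bytes and %D the time taken to service the request in milliseconds):

```xml
<!-- Goes inside the <Host> element of server.xml; attribute values are illustrative. -->
<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="localhost_access_log." suffix=".txt"
       pattern="%h %l %u %t &quot;%r&quot; %s %b %D" />
```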
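On the queueing-theory question: even without instrumentation inside Solr, the request rates you can read off the access log allow a back-of-the-envelope check. The sketch below assumes a deliberately simplistic M/M/1 queue with made-up rates; the point is that as the arrival rate approaches the service rate, the queue and the response time blow up, which is what "saturating your environment" means.

```python
def mm1_stats(arrival_rate, service_rate):
    """Classic M/M/1 queue formulas: utilization, mean number in system,
    and mean response time. Only valid while arrival_rate < service_rate."""
    if arrival_rate >= service_rate:
        raise ValueError("saturated: the queue grows without bound")
    rho = arrival_rate / service_rate      # utilization
    n = rho / (1 - rho)                    # mean number of requests in system
    w = 1 / (service_rate - arrival_rate)  # mean response time, in seconds
    return rho, n, w

# Hypothetical numbers: 80 indexing requests/s arriving, capacity of 100/s.
rho, n, w = mm1_stats(80.0, 100.0)
print(rho, n, w)  # 0.8 utilization, 4 requests in system on average, 0.05 s
```

At 95 requests/s against the same capacity, the mean response time quadruples to 0.2 s; real systems are messier than M/M/1, but the shape of the curve is the lesson.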
Re: Sharding configuration
As far as the second option goes, unless you are using a large amount of memory and you reach a point where a JVM can't sensibly deal with the GC load, having multiple JVMs wouldn't buy you much. With a 26GB index, you probably haven't reached that point. There are also other shared resources at an instance level, like connection pools and ZK connections, but those are tunable and you probably aren't pushing them either (I would imagine you are just trying to have only a handful of shards, given that you aren't sharded at all currently). That leaves single vs. multiple machines. Assuming the network isn't a bottleneck, and given the same amount of resources overall (number of cores, amount of memory, IO bandwidth times number of machines), it shouldn't matter between the two. If you are procuring new hardware, I would say buy more, smaller machines, but if you already have the hardware, you could serve as much as possible off a machine before moving to a second. There's nothing which limits the number of shards as long as the underlying machine has a sufficient amount of parallelism. Again, this advice is for a small number of shards; if you had a lot more (hundreds) of shards and a significant volume of requests, things start to become a bit more fuzzy, with other limits kicking in. On 28 Oct 2014 09:26, Anca Kopetz anca.kop...@kelkoo.com wrote: Hi, We have a SolrCloud configuration of 10 servers, no sharding, 20 million documents, and the index is 26 GB. As the number of documents has increased recently, the performance of the cluster has decreased. We thought of sharding the index, in order to measure the latency. What is the best approach? - to use shard splitting and have several sub-shards on the same server and in the same tomcat instance - having several shards on the same server but on different tomcat instances - having one shard on each server (for example 2 shards / 5 replicas on 10 servers) What's the impact of these 3 configurations on performance?
Thanks, Anca -- Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur.
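For the first option Anca lists, shard splitting is driven through the Collections API (available since Solr 4.3). A sketch of building such a request; the collection name "mycoll" and the host are placeholders, not from the thread:

```python
# Hypothetical SPLITSHARD request against the Collections API of a live
# SolrCloud node. Splits shard1 of collection "mycoll" into two sub-shards
# hosted on the same node as the parent.
from urllib.parse import urlencode

params = {"action": "SPLITSHARD", "collection": "mycoll", "shard": "shard1"}
url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
print(url)  # send with curl or urllib against the running cluster
```

After the split completes, the parent shard goes inactive and the two sub-shards serve its hash range; replicas for the sub-shards can then be placed on other nodes.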
Indexing documents/files for production use
Hi All, I am reading the Solr documentation. I have understood that post.jar http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29 is not meant for production use, and that cURL https://cwiki.apache.org/confluence/display/solr/Introduction+to+Solr+Indexing is not recommended. Is SolrJ better for production? Thank you. Regards Olivier
Re: Indexing documents/files for production use
What is your production use? You have to answer that for yourself. post.jar makes a couple of things easy. If your production use fits into those (e.g. no cluster) - great, use it. It is certainly not any worse than cURL. But if you are running a cluster and have specific requirements, then yes, use something that's cluster aware. Whether it is a custom client on top of SolrJ, Spring Data, or a Cloudera pipeline will depend on your particular use case. Don't make your life over-complicated in advance. Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On 28 October 2014 17:12, Olivier Austina olivier.aust...@gmail.com wrote: Hi All, I am reading the Solr documentation. I have understood that post.jar http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29 is not meant for production use, and that cURL https://cwiki.apache.org/confluence/display/solr/Introduction+to+Solr+Indexing is not recommended. Is SolrJ better for production? Thank you. Regards Olivier
Re: Indexing documents/files for production use
Hello Olivier, for real production use, you won't really want to use any toys like post.jar or curl. You want a decent connector to whatever data source there is, that fetches data, possibly massages it a bit, and then feeds it into Solr - by means of SolrJ or directly into the web service of Solr via binary protocols. This way, you can properly handle incremental feeding, processing of data from remote locations (with the connector being closer to the data source), and also source data security. Also think about what happens if you do processing of incoming documents in Solr. What happens if Tika runs out of memory because of PDF problems? What if this crashes your Solr node? In our Solr projects, we generally do not do any sizable processing within Solr, as document processing and document indexing or querying all have different scaling properties. Production use is most typically not achieved by deploying a vanilla Solr, but rather by having a bit more glue and wrappage, so the whole will fit your requirements in terms of functionality, scaling, monitoring and robustness. Some similar platforms like Elasticsearch try to alleviate these pains of going to a production-style infrastructure, but that's at the expense of flexibility and comes with limitations. For proof-of-concept or demonstrator-style applications, the plain tools out of the box will be fine. For production applications, you want to have more robust components. Best regards, --Jürgen On 28.10.2014 22:12, Olivier Austina wrote: Hi All, I am reading the Solr documentation. I have understood that post.jar http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29 is not meant for production use, and that cURL https://cwiki.apache.org/confluence/display/solr/Introduction+to+Solr+Indexing is not recommended. Is SolrJ better for production? Thank you. Regards Olivier -- Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением *i.A.
Jürgen Wagner* Head of Competence Center Intelligence Senior Cloud Consultant Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543 E-Mail: juergen.wag...@devoteam.com mailto:juergen.wag...@devoteam.com, URL: www.devoteam.de http://www.devoteam.de/ Managing Board: Jürgen Hatzipantelis (CEO) Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
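Jürgen's point about keeping extraction outside Solr can be sketched minimally: the connector massages raw source data into an update payload in its own process, so a Tika or parser crash never touches a Solr node. The field names, suffixes, and endpoint here are assumptions for illustration, not anything from the thread:

```python
# Minimal "connector" sketch: massage source data into a Solr JSON update
# payload outside the Solr JVM. A real connector would then POST the payload
# to http://<host>:8983/solr/<collection>/update (endpoint assumed here).
import json

def build_update(doc_id, title, body):
    """Normalize raw fields and wrap them as a one-document JSON update."""
    doc = {
        "id": doc_id,
        "title_t": title.strip(),  # "_t" dynamic-field suffix is an assumption
        "body_t": body.strip(),
    }
    return json.dumps([doc])

payload = build_update("doc-1", "  Hello ", "world\n")
print(payload)
```

Because the payload is plain JSON, the same connector can retry, batch, or queue documents independently of Solr's availability.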
Re: Indexing documents/files for production use
And one other consideration in addition to the two excellent responses so far: in a SolrCloud environment, SolrJ via CloudSolrServer will automatically route the documents to the correct shard leader, saving some additional overhead. post.jar and cURL send the docs to a node, which in turn forwards the docs to the correct shard leader, which lowers throughput. Best, Erick On Tue, Oct 28, 2014 at 2:32 PM, Jürgen Wagner (DVT) juergen.wag...@devoteam.com wrote: Hello Olivier, for real production use, you won't really want to use any toys like post.jar or curl. You want a decent connector to whatever data source there is, that fetches data, possibly massages it a bit, and then feeds it into Solr - by means of SolrJ or directly into the web service of Solr via binary protocols. This way, you can properly handle incremental feeding, processing of data from remote locations (with the connector being closer to the data source), and also source data security. Also think about what happens if you do processing of incoming documents in Solr. What happens if Tika runs out of memory because of PDF problems? What if this crashes your Solr node? In our Solr projects, we generally do not do any sizable processing within Solr, as document processing and document indexing or querying all have different scaling properties. Production use is most typically not achieved by deploying a vanilla Solr, but rather by having a bit more glue and wrappage, so the whole will fit your requirements in terms of functionality, scaling, monitoring and robustness. Some similar platforms like Elasticsearch try to alleviate these pains of going to a production-style infrastructure, but that's at the expense of flexibility and comes with limitations. For proof-of-concept or demonstrator-style applications, the plain tools out of the box will be fine. For production applications, you want to have more robust components.
Best regards, --Jürgen On 28.10.2014 22:12, Olivier Austina wrote: Hi All, I am reading the Solr documentation. I have understood that post.jar http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29 is not meant for production use, and that cURL https://cwiki.apache.org/confluence/display/solr/Introduction+to+Solr+Indexing is not recommended. Is SolrJ better for production? Thank you. Regards Olivier -- Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением i.A. Jürgen Wagner Head of Competence Center Intelligence Senior Cloud Consultant Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543 E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de Managing Board: Jürgen Hatzipantelis (CEO) Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
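Erick's routing point can be illustrated with a toy model. SolrCloud's compositeId router hashes each document id into a 32-bit space partitioned into one contiguous range per shard; a cloud-aware client computes the hash locally and sends each document straight to the right leader, skipping the extra forwarding hop that post.jar or cURL incur. This is only an illustration of the idea: CRC32 stands in for Solr's actual MurmurHash3, and the shard count is made up.

```python
# Toy compositeId-style routing: map a document id onto one of NUM_SHARDS
# equal hash ranges. Not Solr's real algorithm (which uses MurmurHash3),
# but the same divide-the-hash-space principle.
import zlib

NUM_SHARDS = 4
HASH_SPACE = 2 ** 32

def shard_for(doc_id: str) -> int:
    h = zlib.crc32(doc_id.encode("utf-8"))  # deterministic 32-bit hash
    return h * NUM_SHARDS // HASH_SPACE     # which range, hence which shard

for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "-> shard", shard_for(doc_id))
```

Because the mapping is deterministic, every client computes the same answer, so no central coordination is needed per document.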
RE: Sharding configuration
Informational only. FYI: machine parallelism has been empirically proven to be application dependent. See the DaCapo benchmarks (lucene indexing and lucene searching) used in http://dx.doi.org/10.1145/2479871.2479901 Parallelism profiling and wall-time prediction for multi-threaded applications, 2013. -----Original Message----- From: Ramkumar R. Aiyengar [mailto:andyetitmo...@gmail.com] Sent: Tuesday, October 28, 2014 3:44 PM To: solr-user@lucene.apache.org Subject: Re: Sharding configuration As far as the second option goes, unless you are using a large amount of memory and you reach a point where a JVM can't sensibly deal with the GC load, having multiple JVMs wouldn't buy you much. With a 26GB index, you probably haven't reached that point. There are also other shared resources at an instance level, like connection pools and ZK connections, but those are tunable and you probably aren't pushing them either (I would imagine you are just trying to have only a handful of shards, given that you aren't sharded at all currently). That leaves single vs. multiple machines. Assuming the network isn't a bottleneck, and given the same amount of resources overall (number of cores, amount of memory, IO bandwidth times number of machines), it shouldn't matter between the two. If you are procuring new hardware, I would say buy more, smaller machines, but if you already have the hardware, you could serve as much as possible off a machine before moving to a second. There's nothing which limits the number of shards as long as the underlying machine has a sufficient amount of parallelism. Again, this advice is for a small number of shards; if you had a lot more (hundreds) of shards and a significant volume of requests, things start to become a bit more fuzzy, with other limits kicking in. On 28 Oct 2014 09:26, Anca Kopetz anca.kop...@kelkoo.com wrote: Hi, We have a SolrCloud configuration of 10 servers, no sharding, 20 million documents, and the index is 26 GB.
As the number of documents has increased recently, the performance of the cluster has decreased. We thought of sharding the index, in order to measure the latency. What is the best approach? - to use shard splitting and have several sub-shards on the same server and in the same tomcat instance - having several shards on the same server but on different tomcat instances - having one shard on each server (for example 2 shards / 5 replicas on 10 servers) What's the impact of these 3 configurations on performance? Thanks, Anca
Re: Log message zkClient has disconnected.
Thanks Shawn for your response and the GC tuning link. Regards, Modassar On Tue, Oct 28, 2014 at 7:01 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/28/2014 1:48 AM, Modassar Ather wrote: These SolrCloud instances are 8-core machines with a RAM of 24 GB each assigned to tomcat. The indexer machine starts with -Xmx16g. All these machines are connected to the same switch. If you have not tuned your garbage collection, a 16GB heap will be enough to create garbage collection pauses that are long enough to exceed a 15 second zkClientTimeout, which is the setting that is commonly seen in example configs. I was seeing pauses longer than 12 seconds with ConcurrentMarkSweep enabled on an 8GB heap, before I tuned the GC. With a 16GB heap, it would even be possible to exceed a 30 second timeout, which is the default in later releases. After I tuned the CMS collector, my GC pauses are no longer long enough to cause problems. These are GC settings that have worked for me and for others: http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning Thanks, Shawn
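For reference, CMS tuning of the kind Shawn describes is applied as JVM flags in the Tomcat startup options (e.g. CATALINA_OPTS). The flags below are a hedged illustration of that style of tuning, not Shawn's actual settings; the heap size and occupancy fraction are assumptions, and the wiki page above has the tested values:

```
# Illustrative CATALINA_OPTS additions for a tuned CMS collector;
# values are assumptions, workload-dependent, and should be measured.
-Xms6g -Xmx6g
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
```

Equal -Xms/-Xmx avoids heap resizing, and triggering CMS at a fixed occupancy keeps the collector from falling behind and degenerating into a long stop-the-world full GC, which is what breaks the ZooKeeper session.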