Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
I'm curious how on-the-fly updates are handled as a new shard is added to an alias. E.g., how does the system know which shard to send an update to?

On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček lukas.vl...@gmail.com wrote: [...]
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
AFAIK it cannot. You can only add new shards by creating a new index, and you will then need to index new data into that new index. Index aliases are useful mainly for the search side. So it means that you need to plan for this when you implement your indexing logic. On the other hand, the query logic does not need to change, as you only add new indices and give them all the same alias.

I am not an expert on this, but I think index splitting and re-sharding can be expensive for a [near] real-time search system, and the point is that you can probably use different techniques to support your large-scale needs. Index aliasing and routing in elasticsearch can help a lot in supporting various large-scale data scenarios; check the following thread in the ES ML for some examples: https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ

Just to sum it up: the fact that elasticsearch has a fixed number of shards per index and does not support resharding or index splitting does not mean you cannot scale your data easily. (I was not following this whole thread in every detail, so you may have specific needs that can only be solved by splitting or resharding; in that case I would recommend asking further questions on the ES ML. I do not want to run into a system X vs. system Y flame war here...)

Regards,
Lukas

On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: [...]
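A minimal sketch of the index-per-period pattern Lukáš describes, in Python against the 2012-era ES REST API; the node address, index, alias, and type names here are all assumptions for illustration:

```python
import requests

ES = "http://localhost:9200"

def roll_new_index(name: str, alias: str) -> None:
    # Each period gets its own concrete index; its shard count is fixed at creation.
    requests.put(f"{ES}/{name}", json={"settings": {"number_of_shards": 5}})
    # Add it to the shared alias, alongside the older indices.
    requests.post(f"{ES}/_aliases",
                  json={"actions": [{"add": {"index": name, "alias": alias}}]})

# Writes must target a concrete index -- this is the "plan your indexing logic" part.
roll_new_index("docs_2012_04", "docs")
requests.put(f"{ES}/docs_2012_04/page/1", json={"url": "http://example.com/"})

# Reads go through the alias, so query logic never changes as indices are added.
requests.get(f"{ES}/docs/_search", params={"q": "url:example.com"})
```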
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
The main point being made is that established NoSQL solutions (e.g., Cassandra, HBase, et al.) have had the update problem solved, among many other scalability issues, for several years. If an update is being performed and it is not known where the record exists, the system's update capability is inefficient. In addition, in a production system, the mere possibility of losing data or applying inaccurate updates is usually a red flag.

On Wed, Apr 18, 2012 at 6:40 AM, Lukáš Vlček lukas.vl...@gmail.com wrote: [...]
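To make the routing problem concrete, here is a hedged sketch of fixed-modulo document routing, the kind of simple scheme Jason is pointing at; all numbers are illustrative:

```python
# Every writer must agree on the shard count; changing it silently re-routes
# most documents, so an update can land on a shard that never held the record.
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)  # stable across processes
    return h % num_shards

ids = [f"doc{i}" for i in range(10_000)]
moved = sum(1 for i in ids if shard_for(i, 4) != shard_for(i, 5))
print(f"{moved / len(ids):.0%} of ids re-route when 4 shards become 5")  # ~80%
```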
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Hi,

I think Katta integration is nice, but it is not very real-time. What if you want both? Perhaps a Katta/SolrCloud integration could make the two frameworks play together, so that some shards in SolrCloud may be marked as static while others are realtime. SolrCloud would handle indexing the realtime shards as it does today, while indexing the static shards would be handled by Katta. If Katta adds a shard, it would tell SolrCloud by updating the ZK tree, and SolrCloud would pick up the shard and start serving search for it.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. apr. 2012, at 02:42, Jason Rutherglen wrote: [...]
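A sketch of that handoff using the kazoo ZooKeeper client; the tree layout, names, and payload are invented for illustration and are not SolrCloud's actual clusterstate format:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181")
zk.start()

def publish_static_shard(collection: str, shard: str, index_url: str) -> None:
    # Katta-side: announce a finished, read-only shard. A SolrCloud-side
    # watcher on /search/<collection>/shards would pick it up and start
    # serving searches, while never routing updates to a "static" shard.
    path = f"/search/{collection}/shards/{shard}"
    zk.ensure_path(path)
    zk.set(path, f"type=static,url={index_url}".encode())

publish_static_shard("webcrawl", "shard42", "hdfs://namenode/indexes/shard42")
```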
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
I think Jason is right - there is no index splitting in ES or SolrCloud, so one has to think ahead, overshard, and then count on redistributing shards from oversubscribed nodes to other nodes. No resharding on demand and no index/shard splitting yet.

Otis
--
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

From: Jason Rutherglen jason.rutherg...@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, April 16, 2012 8:42 PM
Subject: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

[...]
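A toy illustration of the overshard-then-redistribute approach Otis describes, with all counts assumed: fix the shard count at the expected maximum node count up front, then spread whole shards over however many nodes exist today, so no shard ever needs splitting.

```python
MAX_NODES = 16
NUM_SHARDS = MAX_NODES  # chosen once, at index-creation time

def placement(num_nodes: int) -> dict:
    # Round-robin whole shards over the nodes that exist right now.
    nodes = {}
    for shard in range(NUM_SHARDS):
        nodes.setdefault(f"node{shard % num_nodes}", []).append(shard)
    return nodes

print(placement(4))  # 4 shards per node on today's small cluster...
print(placement(8))  # ...2 per node after doubling it; same 16 shards throughout
```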
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
> redistributing shards from oversubscribed nodes to other nodes

Redistributing shards on a live system is not possible, however, because the in-flight updates will likely be lost. It is also not simple technology to build from the ground up. As it stands today, one would need to schedule downtime; for multi-terabyte live realtime systems, that's not acceptable and will cause the system to miss its SLAs.

Solr Cloud seems limited to a simple hashing algorithm for sending updates to the appropriate shard. This is precisely what Dynamo (and Cassandra) solves, e.g., elastically and dynamically rearranging the hash 'ring' both logically and physically. In addition, there is the potential for data loss, which Cassandra has the technology to handle.

On Tue, Apr 17, 2012 at 1:33 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: [...]
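For contrast with the fixed-modulo sketch earlier in the thread, here is a toy Dynamo-style consistent hash ring of the kind Jason is referring to; adding a node remaps only a fraction of the keys instead of almost all of them. Node names and virtual-node counts are illustrative.

```python
import bisect
import hashlib

def stable_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self.ring = []  # sorted (point, node) pairs; many virtual points per node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (stable_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # Walk clockwise to the first virtual point at or past the key's hash.
        i = bisect.bisect(self.ring, (stable_hash(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["node1", "node2", "node3"])
before = {f"doc{i}": ring.node_for(f"doc{i}") for i in range(10_000)}
ring.add_node("node4")
moved = sum(1 for k, v in before.items() if ring.node_for(k) != v)
print(f"{moved / len(before):.0%} of keys moved")  # roughly a quarter, not ~80%
```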
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Hi,

Speaking about ES, I think it would be fair to mention that one has to specify the number of shards upfront, when the index is created - that is correct. However, it is possible to give an index one or more aliases, which basically means that you can add new indices on the fly and give them the same alias, which is then used to search against. Given that you can add/remove indices, nodes and aliases on the fly, I think there is a way to handle a growing data set with ease. If anyone is interested, such a scenario has been discussed in detail on the ES mailing list.

Regards,
Lukas

On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: [...]
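A short companion sketch to the aliasing pattern, again with made-up names against a local node: the _aliases action list is applied as one atomic change, so an alias can also be swung from an old index to a rebuilt one without a window where searches see neither.

```python
import requests

requests.post("http://localhost:9200/_aliases", json={
    "actions": [
        {"remove": {"index": "docs_v1", "alias": "docs"}},
        {"add":    {"index": "docs_v2", "alias": "docs"}},
    ]
})
```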
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
One of the big weaknesses of Solr Cloud (and ES?) is the lack of the ability to redistribute shards across servers - meaning, splitting a shard as it grows too large, while taking live updates. How do you plan on elastically adding more servers without this feature? Cassandra and HBase handle elasticity in their own ways. Cassandra has successfully implemented the Dynamo model and HBase uses the traditional BigTable 'split'. Both systems are complex, though each is at a singular level of maturity. Also, Cassandra [successfully] implements multiple data center support; is that available in SC or ES?

On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hello Ali,

> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds.

That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things.

> Needless to mention, the search index needs to scale to 5 Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index.

Yup, OK.

> However, I would like for such a system to be homogeneous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts).

There is no such thing just yet. There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase.

> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above.

Here is a summary of all of them:
* Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready.
* Solandra - uses Cassandra+Solr; plus, DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good.
* Lily - data stored in an HBase cluster gets indexed to separate Solr instance(s) on the side. Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a many-B documents index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it.
* IndexTank - has some technical weaknesses, not integrated with Hadoop, not sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is again not integrated.

If I were you and I had to pick today - I'd pick ElasticSearch if I were completely open. If I had a Solr bias I'd give SolrCloud a try first.

> Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale?

I don't know off the top of my head, but I'm guessing several hundred for serving search requests.

HTH,
Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

> Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance.
>
> Thanks,
> Safdar
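Otis's "several hundred" guess can be sanity-checked with back-of-envelope arithmetic; every constant below is an assumption for illustration, not a measurement:

```python
TOTAL_DOCS = 5_000_000_000
DOCS_PER_SHARD = 50_000_000  # assumed per-shard ceiling for sub-0.5 s queries
REPLICATION = 2              # assumed: one extra copy for availability
SHARDS_PER_NODE = 1          # assumed: one busy shard per medium instance

shards = TOTAL_DOCS // DOCS_PER_SHARD              # 100 shards
nodes = shards * REPLICATION // SHARDS_PER_NODE    # 200 nodes
print(f"{shards} shards -> ~{nodes} medium instances, before query-load headroom")
```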
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
This was done in SOLR-1301 going on several years ago now.

On Sat, Apr 14, 2012 at 4:11 PM, Lance Norskog goks...@gmail.com wrote: [...]
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Hi,

This won't give you the performance you need, unless you have enough RAM on the Solr box to cache the whole index in memory. Have you tested this yourself?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 12. apr. 2012, at 15:27, Darren Govoni wrote:

You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure Solr to use that directory for its data.

[1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS

On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:

Hi,

I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. Needless to mention, the search index needs to scale to 5 Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index.

However, I would like for such a system to be homogeneous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts).

However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above.

Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale?

Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance.

Thanks,
Safdar
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Hello,

Unfortunately I don't know when exactly the SolrCloud release will be ready, but we've used trunk versions in the past and didn't have major issues.

Otis
--
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

From: Ali S Kureishy safdar.kurei...@gmail.com
To: Otis Gospodnetic otis_gospodne...@yahoo.com
Cc: solr-user@lucene.apache.org
Sent: Friday, April 13, 2012 7:16 PM
Subject: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

[...]
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
It sounds like you really want the final map/reduce phase to put Solr index files into HDFS. Solr has a feature for this called 'Embedded Solr', which packages Solr as a library instead of an HTTP servlet. The Solr committers mostly hate it and want it to go away, but it is useful for exactly this problem.

There is some integration work here: both to bolt Embedded Solr onto the Hadoop output libraries, and also some trickery to write out the HDFS files - HDFS only appends, and most of the codecs (Lucene segment formats) like to seek a lot. Then, at the end, it needs a way to tell SolrCloud about the files.

If someone wants a great Summer of Code project, Hadoop -> Lucene indexes -> SolrCloud would be a lot of fun and make you widely loved by people with money. I'm not kidding. Do a good job of this and write clean code, and you'll get offers for very cool jobs.

On Sat, Apr 14, 2012 at 2:27 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: [...]
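A rough Python sketch of the shape Lance describes; build_index() is a hypothetical stand-in, since the real indexing work (Embedded Solr or a raw Lucene IndexWriter) is Java. The point is the dance around HDFS being append-only: build the segment files on local disk, where Lucene can seek freely, then ship them up with a plain sequential copy.

```python
import subprocess
import tempfile

def build_index(local_dir: str, docs) -> None:
    raise NotImplementedError("stand-in for an Embedded Solr / Lucene indexer")

def reduce_partition(shard_id: int, docs) -> str:
    local_dir = tempfile.mkdtemp(prefix=f"shard{shard_id}-")
    build_index(local_dir, docs)  # hypothetical: writes Lucene segments locally
    hdfs_dir = f"/indexes/shard{shard_id}"
    # Only now touch HDFS, append-friendly and sequential.
    subprocess.run(["hadoop", "fs", "-copyFromLocal", local_dir, hdfs_dir],
                   check=True)
    return hdfs_dir  # something still has to tell SolrCloud about this path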
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Hi,

For a web crawl+search like this you will probably need a lot of additional Big Data crunching, so a Hadoop-based solution is wise. In addition to the products mentioned, we also now have Amazon's own CloudSearch: http://aws.amazon.com/cloudsearch/. It's new and not as cool as Solr (not even Lucene based), but it gives you the elasticity you request, I guess. If you run your Hadoop cluster in EC2 already, it would be quite efficient to batch-load the crawled and processed data into a SearchDomain in the same availability zone. However, both cost and features may prohibit this as a realistic choice for you.

It would be cool to explore a Hadoop/HDFS + SolrCloud integration. SolrCloud would not build the indexes, but would pull pre-built indexes from HDFS down to local disk every time it's told to. Or perhaps the SolrCloud nodes could be part of the Hadoop cluster, being responsible for the Reduce part that builds the indexes?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 13. apr. 2012, at 04:23, Otis Gospodnetic wrote: [...]
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 13. apr. 2012, at 04:23, Otis Gospodnetic wrote:
[full quote of Otis's reply snipped; it appears in full later in the thread]
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Thanks Otis, I really appreciate the details offered here - this was very helpful information. I'm going to go through Solandra and ElasticSearch and see if those make sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two recommendations for SolrCloud so far), so I will give that a shot when it is available. However, do you know when SolrCloud IS expected to be available? Thanks again! Warm regards, Safdar

On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
[quoted reply snipped; it appears in full later in the thread]
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory, then configure Solr to use that directory for its data. A rough sketch of what that looks like is below.

[1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS

On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
[original message snipped; it is quoted in full later in the thread]
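For concreteness, a minimal sketch of the fuse approach, assuming CDH's hadoop-fuse-dfs package; the namenode address, mount point, and paths below are placeholders, not tested values:

  # Mount HDFS on the local filesystem via fuse
  mkdir -p /mnt/hdfs
  hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs

  # Then, in solrconfig.xml, point the core's data directory at the mount:
  #   <dataDir>/mnt/hdfs/solr/core0/data</dataDir>

Note that Lucene does lots of small random reads and relies on file locking, both of which can behave poorly over a fuse mount, so this would need careful testing before trusting it at 5-billion-page scale.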
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Thanks Darren. Actually, I would like the system to be homogeneous - i.e., to use Hadoop-based tools that already provide all the necessary scaling for the Lucene index (in terms of throughput, latency of writes/reads, etc.). Since SolrCloud adds its own layer of sharding/replication outside Hadoop, I feel that using SolrCloud would be redundant - a step in the opposite direction, which is what I'm trying to avoid in the first place. Or am I mistaken? Thanks, Safdar

On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni dar...@ontrenet.com wrote:
[quoted message snipped; see Darren's reply above]
RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
SolrCloud or any other tech-specific replication isn't going to 'just work' with Hadoop replication. But with some significant custom coding, anything should be possible. Interesting idea.

--- Original Message ---
On 4/12/2012 09:21 AM, Ali S Kureishy wrote:
[quoted thread snipped; the original message is quoted in full later in the thread]
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Hello Ali,

I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds.

That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things.

Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index.

Yup, OK.

However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts).

There is no such thing just yet. There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase.

However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above.

Here is a summary of all of them:
* Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good.
* Lily - data stored in an HBase cluster gets indexed to separate Solr instance(s) on the side. Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a many-B documents index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it.
* IndexTank - has some technical weaknesses, not integrated with Hadoop, and I'm not sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is again not integrated.

If I were you and I had to pick today - I'd pick ElasticSearch if I were completely open. If I had a Solr bias I'd give SolrCloud a try first.

Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale?

I don't know off the top of my head, but I'm guessing several hundred for serving search requests.

HTH,
Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :).
Many many thanks in advance. Thanks, Safdar