RE: Architecture Question
If you just want to store the data, you can dump it into HDFS sequence files. While HBase is really nice if you want to process and serve data real-time, it adds overhead to use it as pure storage.

Dave

-----Original Message-----
From: Cool Techi [mailto:cooltec...@outlook.com]
Sent: Friday, November 16, 2012 8:26 PM
To: solr-user@lucene.apache.org
Subject: RE: Architecture Question

Hi Otis,

Thanks for your reply, just wanted to check what NoSql structure would be best suited to store data and use the least amount of memory, since for most of my work Solr would be sufficient and I want to store data just in case we want to reindex and as a backup.

Regards,
Ayush
Re: Architecture Question
> Hello,
>
> I am not sure if this is the right forum for this question, but it would be great if I could be pointed in the right direction. We have been using a combination of MySQL and Solr for all our company's full-text and query needs. But as our customer base has grown, so has the amount of data, and MySQL is just not proving to be the right option for storing/querying.
>
> I have been looking at SolrCloud and it looks really impressive, but I am not sure if we should give up our storage system. I have been exploring DataStax, but a commercial option is out of the question. So we were thinking of using HBase to store the data and at the same time index the data into SolrCloud, but for many reasons this design doesn't seem convincing (I have also looked at the basics of Lily).
>
> 1) Would it be recommended to just use SolrCloud with multiple replicas, or does HBase + Solr seem like a good option?

If you trust SolrCloud with replication and keep all your fields stored, then you could live without an external DB. At this point I personally would still want an external DB. Whether HBase is the right DB for the job I can't tell, because I don't know anything about your data, volume, access patterns, etc. I can tell you that HBase does scale well: we have tables with many billions of rows stored in it, for instance.

> 2) How much strain would it be to keep both a Solr shard and an HBase node on the same machine?

HBase loves memory. So does Solr. They both dislike disk IO (who doesn't!). Solr can use a lot of CPU for indexing/searching, depending on the volume. HBase RegionServers can use a lot of CPU if you run MapReduce on data in HBase.

> 3) Is there a way to calculate what kind of machine configuration, and how many shards, I would need to store 500-1000 million records? Most of these will be social data (Twitter/Facebook/blogs etc.).

No recipe here, unfortunately. You'd have to experiment and test, do load and performance testing, etc.

If you need help with Solr + HBase, we happen to have a lot of experience with both and have even used them together for some of our clients.

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html
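One way to organise the experimentation recommended above is to index a representative sample, measure the resulting index size, and extrapolate. The sketch below does only that arithmetic. The function name `estimate_shards`, the sample measurement (1 million documents producing a 2 GiB index) and the 50 GiB per-shard budget are all made-up illustration values, and linear scaling is an assumption that real indexes only approximate.

```python
GIB = 1024 ** 3


def estimate_shards(sample_docs, sample_index_bytes, target_docs, max_shard_bytes):
    """Extrapolate index size linearly from a measured sample.

    Returns (projected_bytes, shard_count). Integer ceiling division is
    used so the result never rounds down. Linear growth is an assumption.
    """
    projected_bytes = -(-sample_index_bytes * target_docs // sample_docs)
    shard_count = -(-projected_bytes // max_shard_bytes)
    return projected_bytes, shard_count


# Hypothetical measurement: 1M sample documents produced a 2 GiB index,
# and we assume (arbitrarily) that one shard should stay under 50 GiB.
for target in (500_000_000, 1_000_000_000):
    total, shards = estimate_shards(1_000_000, 2 * GIB, target, 50 * GIB)
    print(f"{target:>13,} docs: ~{total / GIB:.0f} GiB, {shards} shards")
```

A figure produced this way is only a starting point for load and performance testing; it says nothing about query latency, memory pressure, or the cost of running HBase on the same machines.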
RE: Architecture Question
Hi Otis,

Thanks for your reply, just wanted to check what NoSql structure would be best suited to store data and use the least amount of memory, since for most of my work Solr would be sufficient and I want to store data just in case we want to reindex and as a backup.

Regards,
Ayush

Date: Fri, 16 Nov 2012 15:47:40 -0500
Subject: Re: Architecture Question
From: otis.gospodne...@gmail.com
To: solr-user@lucene.apache.org
RE: Architecture question about solr sharding
I'd separate the splitting of the binary documents from the sharding in Solr. They're different things, and the split may be required at different levels because the numbers of documents differ. Splitting the dependency means that you can store the path in the document and not need to infer anything, and you can re-organise the Solr shards without having to worry about moving the binary documents around.

Also, if you think you're going to need to change Jan to Jan2011, then maybe you should just start with Jan2011. Alternatively, considering that you think change is likely in the future, why not name the directories in such a way that you don't need to change the earlier ones when the requirement to change the structure arises?

Does that make sense?

Rob

On Tue, Mar 22, 2011 at 3:20 PM, JohnRodey timothydd...@yahoo.com wrote:

> I have an issue and I'm wondering if there is an easy way around it with just SOLR.
>
> I have multiple SOLR servers and a field in my schema is a relative path to a binary file. Each SOLR server is responsible for a different subset of data that belongs to a different base path.
>
> For example, my directory structure may look like this:
>
> /someDir/Jan/binaryfiles/...
> /someDir/Feb/binaryfiles/...
> /someDir/Mar/binaryfiles/...
> /someDir/Apr/binaryfiles/...
>
> Server1 is responsible for Jan, Server2 for Feb, etc...
>
> And a response document may have a field like this:
>
> my entry binaryfiles/12345.bin
>
> How can I tell from my main search server which server returned a result?
>
> I cannot put the full path in the index because my path structure might change in the future. Using this example it may go to '/someDir/Jan2011/'.
>
> I basically need to find a way to say 'Ah! server01 returned this result, so it must be in /someDir/Jan'
>
> Thanks!
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Architecture-question-about-solr-sharding-tp2716417p2716417.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Architecture question about solr sharding
I'd just put the data in the document. That way, you're not inferring anything, you *know* which shard (or even the logical shard) the data came from. Does that make sense in your problem space?

Erick
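The advice in this thread (store a stable logical label in each document and keep the physical directory layout outside the index) can be sketched as follows. This is only an illustration under assumptions: the field names `bucket` and `path`, the label format `2011-01`, and the `BASE_DIRS` table are hypothetical and not part of the original poster's schema.

```python
import posixpath

# Hypothetical lookup table: logical bucket label -> current base directory.
# It lives outside the index, so it can change without reindexing.
BASE_DIRS = {
    "2011-01": "/someDir/Jan",
    "2011-02": "/someDir/Feb",
}


def resolve_binary_path(doc, base_dirs):
    """Build the full file path from fields stored in a search result."""
    return posixpath.join(base_dirs[doc["bucket"]], doc["path"])


# A result document as it might come back from any of the servers.
doc = {"bucket": "2011-01", "path": "binaryfiles/12345.bin"}
print(resolve_binary_path(doc, BASE_DIRS))

# Renaming a directory later only touches the table, not the documents.
renamed = {**BASE_DIRS, "2011-01": "/someDir/Jan2011"}
print(resolve_binary_path(doc, renamed))
```

Because the label travels with the document, the search front end never needs to know which server returned a hit, and the shards can be reorganised independently of where the binary files live.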