RE: Architecture Question

2012-11-19 Thread Buttler, David
If you just want to store the data, you can dump it into HDFS SequenceFiles.
While HBase is really nice if you want to process and serve data in real time,
it adds overhead when used purely for storage.
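
For illustration, a minimal sketch of dumping raw records into an HDFS
SequenceFile (the output path, key/value types, and payload format are just
assumptions for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RawDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/backup/raw/2012-11-19.seq");   // hypothetical HDFS path

        // Key = record id, value = the raw payload you would reindex from later.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
            writer.append(new Text("doc-1"), new Text("{\"id\":\"doc-1\",\"body\":\"...\"}"));
            writer.append(new Text("doc-2"), new Text("{\"id\":\"doc-2\",\"body\":\"...\"}"));
        } finally {
            writer.close();
        }
    }
}
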
Dave

-Original Message-
From: Cool Techi [mailto:cooltec...@outlook.com] 
Sent: Friday, November 16, 2012 8:26 PM
To: solr-user@lucene.apache.org
Subject: RE: Architecture Question

Hi Otis,

Thanks for your reply. I just wanted to check which NoSQL store would be best
suited to hold the data while using the least amount of memory, since for most
of my work Solr would be sufficient; I only want to store the data in case we
need to reindex, and as a backup.

Regards,
Ayush



Re: Architecture Question

2012-11-16 Thread Otis Gospodnetic
Hello,



 I am not sure if this is the right forum for this question, but it would
 be great if I could be pointed in the right direction. We have been using a
 combination of MySQL and Solr for all our company's full-text and query
 needs.  But as our customers have grown, so has the amount of data, and
 MySQL is just not proving to be the right option for storing/querying.

 I have been looking at SolrCloud and it looks really impressive, but I am
 not sure if we should give up our storage system. I have been exploring
 DataStax, but a commercial option is out of the question. So we were
 thinking of using HBase to store the data and at the same time index the
 data into SolrCloud, but for several reasons this design doesn't seem
 convincing (we have also looked at the basics of Lily).

 1) Would it be recommended to just use SolrCloud with multiple replicas, or
 does HBase + Solr seem like the better option?


If you trust SolrCloud with replication and keep all your fields stored, then
you could live without an external DB.  At this point I personally would
still want an external DB.  Whether HBase is the right DB for the job I can't
tell, because I don't know anything about your data, volume, access patterns,
etc.  I can tell you that HBase does scale well - we have tables with many
billions of rows in them, for instance.
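
As a rough sketch of what keeping everything stored buys you - being able to
reindex straight out of Solr rather than out of an external DB - something
like this would do it (the ZooKeeper addresses and collection name are made
up, and a real job would page through the results rather than pull one batch):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class ReindexFromStoredFields {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.setFields("*");      // a full copy only works if every field is stored
        q.setRows(1000);       // real code would page with start/rows

        for (SolrDocument doc : solr.query(q).getResults()) {
            SolrInputDocument in = new SolrInputDocument();
            for (String field : doc.getFieldNames()) {
                if (!"_version_".equals(field)) {     // skip Solr-internal fields
                    in.addField(field, doc.getFieldValue(field));
                }
            }
            solr.add(in);      // re-add, e.g. to a rebuilt collection
        }
        solr.commit();
        solr.shutdown();
    }
}
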


 2) How much strain would it be to keep both a Solr shard and an HBase node
 on the same machine?


HBase loves memory.  So does Solr.  They both dislike disk IO (who
doesn't!).  Solr can use a lot of CPU for indexing/searching, depending on
the volume.  HBase RegionServers can use a lot of CPU if you run MapReduce
on data in HBase.


 3) Is there a calculation for what kind of machine configuration I would
 need to store 500-1000 million records, and how many shards? Most of these
 will be social data (Twitter/Facebook/blogs, etc.).


No recipe here, unfortunately.  You'd have to experiment and test, do load
and performance testing, etc.  If you need help with Solr + HBase, we
happen to have a lot of experience with both and have even used them
together for some of our clients.

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html


RE: Architecture Question

2012-11-16 Thread Cool Techi
Hi Otis,

Thanks for your reply. I just wanted to check which NoSQL store would be best
suited to hold the data while using the least amount of memory, since for most
of my work Solr would be sufficient; I only want to store the data in case we
need to reindex, and as a backup.

Regards,
Ayush


RE: Architecture question about solr sharding

2011-03-23 Thread Baillie, Robert
I'd separate the splitting of the binary documents from the sharding in Solr - 
they're different things and the split may be required at different levels, due 
to different numbers of documents.

Splitting the dependency means that you can store the path in the document and 
not need to infer anything, and you can re-organise the Solr shards without 
having to worry about moving the binary documents around.
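
A minimal SolrJ sketch of that idea (the URL and field names here are made
up, not taken from your schema):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithBasePath {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://server01:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "12345");
        doc.addField("relative_path", "binaryfiles/12345.bin");
        // Store the base directory explicitly, so nothing has to be inferred
        // from which server answered the query and the shards can be
        // re-organised without touching the binary files themselves.
        doc.addField("base_path", "/someDir/Jan2011");

        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}
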

Also, if you think you're going to need to change Jan to Jan2011, then maybe
you should just start with Jan2011.  Alternatively, considering that you think
change is likely in the future, why not name the directories in such a way
that you don't need to rename the earlier ones as the requirement to change
the structure arises?

Does that make sense?

Rob

On Tue, Mar 22, 2011 at 3:20 PM, JohnRodey timothydd...@yahoo.com wrote:
 I have an issue and I'm wondering if there is an easy way around it 
 with just SOLR.

 I have multiple SOLR servers and a field in my schema is a relative 
 path to a binary file.  Each SOLR server is responsible for a 
 different subset of data that belongs to a different base path.

 For Example...

 My directory structure may look like this:
 /someDir/Jan/binaryfiles/...
 /someDir/Feb/binaryfiles/...
 /someDir/Mar/binaryfiles/...
 /someDir/Apr/binaryfiles/...

 Server1 is responsible for Jan, Server2 for Feb, etc...

 And a response document may have a field like this: 'my entry' =
 binaryfiles/12345.bin

 How can I tell from my main search server which server returned a result?
 I cannot put the full path in the index because my path structure 
 might change in the future.  Using this example it may go to 
 '/someDir/Jan2011/'.

 I basically need to find a way to say 'Ah! server01 returned this 
 result, so it must be in /someDir/Jan'

 Thanks!

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Architecture-question-about-solr-sharding-tp2716417p2716417.html
 Sent from the Solr - User mailing list archive at Nabble.com.







Re: Architecture question about solr sharding

2011-03-22 Thread Erick Erickson
I'd just put the data in the document. That way, you're not
inferring anything, you *know* which shard (or even the
logical shard) the data came from.
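
At query time that could look something like this minimal sketch (the field
names and URL are made up for illustration):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class ResolveBinaryPath {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://searchhead:8983/solr");

        SolrQuery q = new SolrQuery("*:*");
        q.setFields("id", "base_path", "relative_path");  // stored at index time

        for (SolrDocument doc : solr.query(q).getResults()) {
            // No inference from server names: the document says where its file lives.
            String fullPath = doc.getFieldValue("base_path") + "/"
                    + doc.getFieldValue("relative_path");
            System.out.println(doc.getFieldValue("id") + " -> " + fullPath);
        }
        solr.shutdown();
    }
}
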

Does that make sense in your problem space?

Erick

On Tue, Mar 22, 2011 at 3:20 PM, JohnRodey timothydd...@yahoo.com wrote:
 I have an issue and I'm wondering if there is an easy way around it with just
 SOLR.

 I have multiple SOLR servers and a field in my schema is a relative path to
 a binary file.  Each SOLR server is responsible for a different subset of
 data that belongs to a different base path.

 For Example...

 My directory structure may look like this:
 /someDir/Jan/binaryfiles/...
 /someDir/Feb/binaryfiles/...
 /someDir/Mar/binaryfiles/...
 /someDir/Apr/binaryfiles/...

 Server1 is responsible for Jan, Server2 for Feb, etc...

 And a response document may have a field like this: 'my entry' =
 binaryfiles/12345.bin

 How can I tell from my main search server which server returned a result?
 I cannot put the full path in the index because my path structure might
 change in the future.  Using this example it may go to '/someDir/Jan2011/'.

 I basically need to find a way to say 'Ah! server01 returned this result, so
 it must be in /someDir/Jan'

 Thanks!

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Architecture-question-about-solr-sharding-tp2716417p2716417.html
 Sent from the Solr - User mailing list archive at Nabble.com.