Re: indexing bigdata

2012-03-09 Thread Robert Stewart
It very much depends on your data and also on which query features you will use:
how many fields, the size of each field, how many unique values per field, how
many fields are stored vs. only indexed, etc.  I have a system with 3+ billion
docs, and each instance (each index core) has 120 million docs, and it flies.
But the documents are tiny, only 3 fields each, and the search is a very simple
single-keyword match.  On another system we have only 7 million docs per
instance and it is slower, because the documents are much larger with many more
fields, and we do a lot of faceting and other advanced search features.
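
As a rough illustration of the arithmetic behind those numbers, here is a minimal sketch that estimates how many cores a corpus needs for a given per-core document budget. The function name `cores_needed` is hypothetical, and the per-core budget is something you have to measure for your own data; it is not a Solr limit.

```python
def cores_needed(total_docs: int, docs_per_core: int) -> int:
    """Return the smallest number of cores that can hold total_docs.

    docs_per_core is an assumed budget found by testing your own
    documents and queries, not a limit imposed by Solr.
    """
    if total_docs < 0 or docs_per_core <= 0:
        raise ValueError("total_docs must be >= 0 and docs_per_core > 0")
    # Integer ceiling division, avoids floating-point rounding.
    return (total_docs + docs_per_core - 1) // docs_per_core


# 3 billion tiny docs at 120 million docs per core.
print(cores_needed(3_000_000_000, 120_000_000))  # prints 25
```

The same corpus size can land anywhere between one core and many, depending on the budget your documents and query features allow.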

The types of search features you use (faceting, field collapsing, wildcard
queries, etc.) can all increase search time compared with a simple keyword
search.

Unfortunately, it is one of those things you need to try out to really get an
answer, IMO.
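
One way to "try it out" is to time a batch of representative queries and look at the latency distribution. The following is only a sketch: `measure_latency` and `solr_query` are hypothetical helper names, and the base URL you pass to `solr_query` (including any core name in the path) is an assumption that depends on your installation.

```python
import json
import math
import statistics
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterable


def measure_latency(run_query: Callable[[str], object],
                    queries: Iterable[str],
                    repeats: int = 3) -> Dict[str, float]:
    """Run every query `repeats` times and summarize latencies in ms."""
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    if not samples:
        raise ValueError("no queries to measure")
    samples.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "count": len(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def solr_query(base_url: str, text: str) -> dict:
    """Send one keyword query to a Solr /select handler.

    base_url is an assumption, e.g. "http://localhost:8983/solr";
    adjust it to wherever your instance or core actually lives.
    """
    params = urllib.parse.urlencode({"q": text, "wt": "json", "rows": 10})
    with urllib.request.urlopen(f"{base_url}/select?{params}", timeout=10) as resp:
        return json.load(resp)
```

A call such as `measure_latency(lambda q: solr_query(base_url, q), sample_queries)` would then give numbers for your own data. Use queries that exercise the features you actually plan to rely on, since faceting or wildcard queries will behave differently from plain keyword matches.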





Re: indexing bigdata

2012-03-08 Thread Erick Erickson
Your question is really unanswerable; there are about a zillion
factors that could influence the answer. I can index 5-7K docs/second,
so it's efficient for me. Others can index only a fraction of that. It all depends...

"Try it and see" is about the only way to answer.
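
For a sense of scale, the quoted indexing rate can be combined with the 30 million documents from the question. This is a back-of-the-envelope sketch only: the rate comes from a different setup and may not carry over, and `indexing_time_seconds` is a hypothetical helper name.

```python
def indexing_time_seconds(total_docs: int, docs_per_second: float) -> float:
    """Estimate wall-clock indexing time at a constant throughput."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return total_docs / docs_per_second


total_docs = 30_000_000  # corpus size from the original question
for rate in (5_000, 7_000):  # docs/second, the range quoted above
    minutes = indexing_time_seconds(total_docs, rate) / 60
    print(f"{rate} docs/s -> about {minutes:.0f} minutes")
# prints:
# 5000 docs/s -> about 100 minutes
# 7000 docs/s -> about 71 minutes
```

At a fraction of that throughput the same job takes proportionally longer, which is why measuring on your own documents matters.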

Best
Erick

On Thu, Mar 8, 2012 at 1:35 PM, Sharath Jagannath
shotsonclo...@gmail.com wrote:
 Is indexing around 30 million documents in a single Solr instance efficient?
 Has anybody tried it? I am planning to use it for an autosuggest feature
 I am implementing, so I expect responses within a few milliseconds.
 Should I be looking at sharding?

 Thanks,
 Sharath


Re: indexing bigdata

2012-03-08 Thread Sharath Jagannath
OK, my bad. I should have put it a better way.
Is it a good idea to have all 30M docs on a single instance, or should I
consider a distributed setup?
I have synthesized the data, configured the schema, and made
suitable changes to the config. I have tested with a smaller data set on
my laptop and have a good workflow set up.

I do not have a big machine to test it on.
I wanted to make sure I have some insight into both options before I decide
to spin up an Amazon instance.

Thanks,
Sharath
