Re: Index Sizes
On 1/7/2014 7:48 AM, Steven Bower wrote:
> I was looking at the code for getIndexSize() on the ReplicationHandler to
> get at the size of the index on disk. From what I can tell, because this
> does directory.listAll() to get all the files in the directory, the size on
> disk includes not only what is searchable at the moment but potentially
> also files that are being created by background merges/etc. I am wondering
> if there is an API that would give me the size of the "currently
> searchable" index files (doubt this exists, but maybe).
>
> If not what is the most appropriate way to get a list of the segments/files
> that are currently in use by the active searcher such that I could then ask
> the directory implementation for the size of all those files?
>
> For a more complete picture of what I'm trying to accomplish, I am looking
> at building a quota/monitoring component that will trigger when index size
> on disk gets above a certain size. I don't want to trigger if index is
> doing a merge and ephemerally uses disk for that process. If anyone has any
> suggestions/recommendations here too I'd be interested.

Dredging up a VERY old thread here. As I was replying to your most recent query, I was looking through my email archive for your previous messages, and this one caught my eye, especially because it never got a reply. It must have escaped my notice last year.

This is a very good idea. I imagine that the active searcher object directly or indirectly knows exactly which files are in use for that searcher, so I think it should be relatively easy for it to retrieve a list, and the index size code should be able to return both the active index size and the total directory size.

I've been putting a little bit of work into getting the index size code moved out of the replication handler so that it is available even if replication is completely disabled, but my free time has been limited. I don't recall the issue number(s) for that work.

Thanks,
Shawn
Index Sizes
I was looking at the code for getIndexSize() on the ReplicationHandler to get at the size of the index on disk. From what I can tell, because this does directory.listAll() to get all the files in the directory, the size on disk includes not only what is searchable at the moment but potentially also files that are being created by background merges, etc. I am wondering if there is an API that would give me the size of the "currently searchable" index files (doubt this exists, but maybe).

If not, what is the most appropriate way to get a list of the segments/files that are currently in use by the active searcher, such that I could then ask the directory implementation for the size of all those files?

For a more complete picture of what I'm trying to accomplish, I am looking at building a quota/monitoring component that will trigger when index size on disk gets above a certain size. I don't want to trigger if the index is doing a merge and ephemerally uses disk for that process. If anyone has any suggestions/recommendations here too I'd be interested.

Thanks,
steve
Re: Prediction About Index Sizes of Solr
Interesting bit, thanks Rafał!

On Mon, Apr 8, 2013 at 12:54 PM, Rafał Kuć wrote:
> Hello!
>
> Let me answer the first part of your question. Please have a look at
> https://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls
> It should help you make an estimation about your index size.
>
> --
> Regards,
> Rafał Kuć
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
>
> > This may not be a well detailed question but I will try to make it clear.
> > I am crawling web pages and will index them at SolrCloud 4.2. What I want
> > to predict is the index size.
> >
> > I will have approximately 2 billion web pages and I consider each of them
> > will be 100 Kb. I know that it depends on storing documents, stop words,
> > etc. If you want to ask about detail of my question I may give you more
> > explanation. However there should be some analysis to help me because I
> > should predict something about what will be the index size for me.
> >
> > On the other hand my other important question is how SolrCloud makes
> > replicas for indexes, can I change it how many replicas will be. Because I
> > should multiply the total amount of index size with replica size.
> >
> > Here I found an article related to my analysis:
> > http://juanggrande.wordpress.com/2010/12/20/solr-index-size-analysis/
> >
> > I know this question may not be details but if you give ideas about it
> > you are welcome.
Re: Prediction About Index Sizes of Solr
Hello!

Let me answer the first part of your question. Please have a look at
https://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls
It should help you make an estimation about your index size.

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> This may not be a well detailed question but I will try to make it clear.
> I am crawling web pages and will index them at SolrCloud 4.2. What I want
> to predict is the index size.
>
> I will have approximately 2 billion web pages and I consider each of them
> will be 100 Kb. I know that it depends on storing documents, stop words,
> etc. If you want to ask about detail of my question I may give you more
> explanation. However there should be some analysis to help me because I
> should predict something about what will be the index size for me.
>
> On the other hand my other important question is how SolrCloud makes
> replicas for indexes, can I change it how many replicas will be. Because I
> should multiply the total amount of index size with replica size.
>
> Here I found an article related to my analysis:
> http://juanggrande.wordpress.com/2010/12/20/solr-index-size-analysis/
>
> I know this question may not be details but if you give ideas about it you
> are welcome.
Prediction About Index Sizes of Solr
This may not be a well-detailed question, but I will try to make it clear. I am crawling web pages and will index them with SolrCloud 4.2. What I want to predict is the index size.

I will have approximately 2 billion web pages, and I estimate each of them will be about 100 KB. I know that the size depends on storing documents, stop words, etc. If you want to ask about details of my question I can give more explanation. However, there should be some analysis to help me predict what the index size will be.

On the other hand, my other important question is how SolrCloud makes replicas of indexes, and whether I can change how many replicas there will be, because I should multiply the total index size by the number of replicas.

Here I found an article related to my analysis: http://juanggrande.wordpress.com/2010/12/20/solr-index-size-analysis/

I know this question may not be detailed, but any ideas about it are welcome.
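The arithmetic behind this kind of prediction can be sketched directly from the numbers in the question. The index-to-raw ratio below is a placeholder assumption; the real value varies enormously with stored fields, analyzers, and compression, and should be calibrated by indexing a representative sample of the crawl:

```python
# Back-of-envelope sizing for ~2 billion pages of ~100 KB each.
num_docs = 2_000_000_000         # approximate number of web pages
avg_doc_bytes = 100 * 1024       # ~100 KB per page
index_ratio = 0.3                # ASSUMED index/raw ratio; measure this!
replication_factor = 2           # total copies of each shard, incl. leader

raw_bytes = num_docs * avg_doc_bytes
index_bytes = raw_bytes * index_ratio
total_bytes = index_bytes * replication_factor

TiB = 1024 ** 4
print(f"raw crawl:   {raw_bytes / TiB:.1f} TiB")
print(f"one index:   {index_bytes / TiB:.1f} TiB")
print(f"with copies: {total_bytes / TiB:.1f} TiB")
```

Even with an optimistic ratio, 2 billion pages at 100 KB is roughly 186 TiB of raw data, so the replication factor multiplies a number that is already very large; sampling before committing to hardware is essential.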
RE: Question about index sizes.
That's a great question. And the answer is, of course, it depends. Mostly on the size of the documents you are indexing. 50 million rows from a database table with a handful of columns is very different from 50 million web pages, PDF documents, books, etc.

We currently have about 50 million documents split across 2 servers with reasonable performance, sub-second response time in most cases. The total size of the 2 indices is about 300G. I'd say most of the size is from stored fields, though we index just about everything. This is on 64-bit Ubuntu boxes with 32G of memory. We haven't pushed this into production yet, but initial load-testing results look promising.

Hope this helps!

> -----Original Message-----
> From: Jim Adams [mailto:jasolru...@gmail.com]
> Sent: Tuesday, June 23, 2009 1:24 PM
> To: solr-user@lucene.apache.org
> Subject: Question about index sizes.
>
> Can anyone give me a rule of thumb for knowing when you need to go to
> multicore or shards? How many records can be in an index before it breaks
> down? Does it break down? Is it 10 million? 20 million? 50 million?
>
> Thanks, Jim
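As a rough yardstick, the deployment numbers quoted above imply a per-document index cost, which is often a more portable figure than a raw document count. This is only an order-of-magnitude check, since (as the reply says) per-document cost depends heavily on what you store and index:

```python
# Per-document cost implied by the numbers above:
# ~50 million documents, ~300G of index across two servers.
total_index_bytes = 300 * 1024**3   # ~300 GiB combined index size
num_docs = 50_000_000

bytes_per_doc = total_index_bytes / num_docs
print(f"~{bytes_per_doc / 1024:.1f} KiB of index per document")
```

That works out to roughly 6 KiB of index per document for this workload; multiplying your own expected document count by a measured per-document figure is usually more reliable than any fixed "N million records" rule of thumb.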
Question about index sizes.
Can anyone give me a rule of thumb for knowing when you need to go to multicore or shards? How many records can be in an index before it breaks down? Does it break down? Is it 10 million? 20 million? 50 million? Thanks, Jim