On 11/21/2011 12:41 AM, Husain, Yavar wrote:
> Number of rows in the SQL table (indexed so far using Solr): 1 million
> Total size of data in the table: 4 GB
> Total index size: 3.5 GB
>
> Total number of rows I have to index: 20 million (approximately
> 100 GB of data) and growing.
>
> What are the best practices for distributing the index? That is,
> when should I distribute, and is there a magic number for index
> size per instance?
>
> For 1 million rows, a Solr instance running on a VM takes me
> roughly 2.5 hours to index, so 20 million would take roughly 60-70
> hours. That would be too much.
>
> What would be the best distributed architecture for my case? It
> would be great if people could share their best practices and
> experience.

I have a MySQL database with 66 million rows at the moment, always growing. My Solr index is split into six large shards plus a small shard holding the newest data. The boundary for the small (incremental) shard is chosen by counting documents in hourly increments between 7 and 3.5 days old, then picking either a boundary that yields fewer than 500,000 documents or the 3.5-day boundary. This index is usually about 1GB in size.
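
To make that concrete, here is a minimal sketch of what the incremental core's DIH config could look like, with the chosen boundary passed in as a request parameter at import time. The table and column names (docs, did, post_date) are made up for illustration, not my actual schema:

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://dbhost/mydb"
                user="solr" password="secret"/>
    <document>
      <!-- 'boundary' is supplied on the import URL, for example:
           /dataimport?command=full-import&boundary=2011-11-18T00:00:00 -->
      <entity name="incremental"
              query="SELECT did, title, post_date FROM docs
                     WHERE post_date &gt;= '${dataimporter.request.boundary}'"/>
    </document>
  </dataConfig>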

The rest of the documents are split between the other six shards using crc32(did) % 6. The did field is a MySQL BIGINT AUTO_INCREMENT column. These large shards are very close to 11 million records and 20GB each. By indexing all six at once, I can complete a full index rebuild in about 3.5 hours.
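
The DIH entity for one of the large shards might look something like the following sketch (same caveat about made-up names; each shard's config hard-codes its own number in place of the 0):

  <entity name="shard0"
          query="SELECT did, title, post_date FROM docs
                 WHERE crc32(did) % 6 = 0
                   AND post_date &lt; '${dataimporter.request.boundary}'"/>

MySQL's CRC32() is cheap to compute and spreads the did values evenly, so each of the six shards ends up with roughly the same number of documents.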

Each full index chain lives on two 64GB Dell servers with dual quad-core processors. Each server runs a Solr instance with an 8GB heap hosting three of the large shards. One server also holds the incremental index; the other runs the load balancer. Both servers run an index-free Solr core that we call the broker; its search handlers have the shards parameter set in solrconfig.xml, pointing at the appropriate cores for that index chain.
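
The broker's handler definition amounts to little more than a shards default. The hostnames and core names below are invented; the incremental core sits alongside the first three large shards as described above:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="shards">idx1:8983/solr/inc,idx1:8983/solr/s0,idx1:8983/solr/s1,idx1:8983/solr/s2,idx2:8983/solr/s3,idx2:8983/solr/s4,idx2:8983/solr/s5</str>
    </lst>
  </requestHandler>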

To keep index size down and search speed up, it's important that your index contain only the fields needed for two purposes: searching (indexed fields) and displaying a results grid (stored fields). Anything else should be excluded from your schema.xml and/or DIH config. Full item details should be populated from the database or another information store (possibly a filesystem), using the unique identifier from the search results.
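
In schema.xml terms, that means being deliberate about indexed and stored on every field. The field names here are only examples:

  <!-- searched but never displayed: indexed, not stored -->
  <field name="body" type="text_general" indexed="true" stored="false"/>
  <!-- shown in the results grid but never searched: stored, not indexed -->
  <field name="thumb_url" type="string" indexed="false" stored="true"/>
  <!-- unique key, used to fetch full details from the database -->
  <field name="did" type="long" indexed="true" stored="true" required="true"/>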

If you are aggregating data from more than one table, see whether your database can gather the information into a single SELECT statement with JOINs, rather than having more than one entity in your DIH config. Alternatively, if your secondary tables are small, try using the CachedSqlEntityProcessor on them so they are loaded entirely into RAM for the import. Your database software is usually much better at combining tables than Solr is, so take advantage of it.
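
A sketch of the cached approach, again with hypothetical table and column names:

  <entity name="item" query="SELECT did, title, category_id FROM docs">
    <!-- the whole categories table is read once and held in RAM;
         each item row then does a cache lookup instead of a query -->
    <entity name="category"
            processor="CachedSqlEntityProcessor"
            query="SELECT id, name AS category FROM categories"
            where="id=item.category_id"/>
  </entity>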

If you have multivalued search fields coming from secondary entities in DIH, you can often get your database software to CONCAT them into a single delimited field, then use an appropriate tokenizer to split that field back into separate terms. I have one such field that a database view assembles into a semicolon-separated string with a JOIN; a pattern tokenizer then splits it apart at index time.
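
For example, if the SQL side builds the value with something like GROUP_CONCAT(tag SEPARATOR '; '), a field type along these lines would split it back apart. This is a sketch, not my exact analyzer:

  <fieldType name="semicolon_list" class="solr.TextField">
    <analyzer>
      <!-- turns "red; green; blue" into the terms red / green / blue -->
      <tokenizer class="solr.PatternTokenizerFactory" pattern=";\s*"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>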

I hope this is helpful.

Thanks,
Shawn
