On 11/21/2011 12:41 AM, Husain, Yavar wrote:
> Number of rows in the SQL table (indexed so far using Solr): 1 million
> Total size of data in the table: 4 GB
> Total index size: 3.5 GB
>
> Total number of rows I have to index: 20 million (approximately
> 100 GB of data) and growing.
>
> What are the best practices for distributing the index? That is,
> when should I distribute, and is there a magic number for index
> size per instance?
>
> For 1 million rows, a Solr instance running on a VM takes me
> roughly 2.5 hours to index, so 20 million would take roughly 60-70
> hours. That would be too much.
>
> What would be the best distributed architecture for my case? It
> would be great if people could share their best practices and
> experience.

I have a MySQL database with 66 million rows at the moment, always growing. My Solr index is split into six large shards plus a small shard holding the newest data. The boundary for the small (incremental) shard is chosen by counting documents in hourly increments between 7 and 3.5 days old, then picking either a boundary that yields fewer than 500,000 documents or the 3.5-day boundary. This index is usually about 1GB in size.
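
To make that concrete, here is a minimal sketch of what the incremental core's DIH config could look like, with the chosen boundary passed in as a request parameter at import time. The table and column names (docs, did, post_date) are made up for illustration, not my actual schema:

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://dbhost/mydb"
                user="solr" password="secret"/>
    <document>
      <!-- 'boundary' is supplied on the import URL, for example:
           /dataimport?command=full-import&boundary=2011-11-18T00:00:00 -->
      <entity name="incremental"
              query="SELECT did, title, post_date FROM docs
                     WHERE post_date &gt;= '${dataimporter.request.boundary}'"/>
    </document>
  </dataConfig>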

The rest of the documents are split between the other six shards using crc32(did) % 6. The did field is a MySQL BIGINT AUTO_INCREMENT column. These large shards are very close to 11 million records and 20GB each. By indexing all six at once, I can complete a full index rebuild in about 3.5 hours.
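
The DIH entity for one of the large shards might look something like the following sketch (same caveat about made-up names; each shard's config hard-codes its own number in place of the 0):

  <entity name="shard0"
          query="SELECT did, title, post_date FROM docs
                 WHERE crc32(did) % 6 = 0
                   AND post_date &lt; '${dataimporter.request.boundary}'"/>

MySQL's CRC32() is cheap to compute and spreads the did values evenly, so each of the six shards ends up with roughly the same number of documents.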

Each full index chain lives on two 64GB Dell servers with dual quad-core processors. Each server runs a Solr instance with an 8GB heap hosting three of the large shards. One server also holds the incremental index; the other runs the load balancer. Both servers run an index-free Solr core that we call the broker; its search handlers have the shards parameter set in solrconfig.xml, pointing at the appropriate cores for that index chain.
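
The broker's handler definition amounts to little more than a shards default. The hostnames and core names below are invented; the incremental core sits alongside the first three large shards as described above:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="shards">idx1:8983/solr/inc,idx1:8983/solr/s0,idx1:8983/solr/s1,idx1:8983/solr/s2,idx2:8983/solr/s3,idx2:8983/solr/s4,idx2:8983/solr/s5</str>
    </lst>
  </requestHandler>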

To keep index size down and search speed up, it's important that your index contain only the fields needed for two purposes: searching (indexed fields) and displaying a results grid (stored fields). Anything else should be excluded from your schema.xml and/or DIH config. Full item details should be populated from the database or another information store (possibly a filesystem), using the unique identifier from the search results.
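
In schema.xml terms, that means being deliberate about indexed and stored on every field. The field names here are only examples:

  <!-- searched but never displayed: indexed, not stored -->
  <field name="body" type="text_general" indexed="true" stored="false"/>
  <!-- shown in the results grid but never searched: stored, not indexed -->
  <field name="thumb_url" type="string" indexed="false" stored="true"/>
  <!-- unique key, used to fetch full details from the database -->
  <field name="did" type="long" indexed="true" stored="true" required="true"/>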

If you are aggregating data from more than one table, see whether your database can gather the information into a single SELECT statement with JOINs, rather than having more than one entity in your DIH config. Alternatively, if your secondary tables are small, try using the CachedSqlEntityProcessor on them so they are loaded entirely into RAM for the import. Your database software is usually much better at combining tables than Solr is, so take advantage of it.
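
A sketch of the cached approach, again with hypothetical table and column names:

  <entity name="item" query="SELECT did, title, category_id FROM docs">
    <!-- the whole categories table is read once and held in RAM;
         each item row then does a cache lookup instead of a query -->
    <entity name="category"
            processor="CachedSqlEntityProcessor"
            query="SELECT id, name AS category FROM categories"
            where="id=item.category_id"/>
  </entity>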

If you have multivalued search fields coming from secondary entities in DIH, you can often get your database software to CONCAT them into a single delimited field, then use an appropriate tokenizer to split that field back into separate terms. I have one such field that a database view assembles into a semicolon-separated string with a JOIN; a pattern tokenizer then splits it apart at index time.
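
For example, if the SQL side builds the value with something like GROUP_CONCAT(tag SEPARATOR '; '), a field type along these lines would split it back apart. This is a sketch, not my exact analyzer:

  <fieldType name="semicolon_list" class="solr.TextField">
    <analyzer>
      <!-- turns "red; green; blue" into the terms red / green / blue -->
      <tokenizer class="solr.PatternTokenizerFactory" pattern=";\s*"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>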

I hope this is helpful.

Thanks,
Shawn
