On 11/21/2011 12:41 AM, Husain, Yavar wrote:
> Number of rows in the SQL table (indexed so far using Solr): 1 million
> Total size of data in the table: 4GB
> Total index size: 3.5GB
> Total number of rows I have to index: 20 million (approximately 100GB
> of data) and growing
>
> What are the best practices for distributing the index? What I mean
> is: when should I distribute, and is there a magic number for index
> size per instance?
>
> For 1 million rows, a Solr instance running on a VM takes roughly 2.5
> hours to index for me, so 20 million would take roughly 60-70 hours.
> That would be too much.
>
> What would be the best distributed architecture for my case? It would
> be great if people could share their best practices and experience.
I have a MySQL database with 66 million rows at the moment, always
growing. My Solr index is split into six large shards plus a small
shard with the newest data. The small (incremental) shard is sized by
looking at document counts in hourly increments between 3.5 and 7 days
old, then choosing either a boundary that yields fewer than 500,000
documents or the 3.5-day boundary. This index is usually about 1GB in size.
The rest of the documents are split between the other six shards using
crc32(did) % 6. The did field is a MySQL bigint autoincrement column.
These large shards are very close to 11 million records and 20GB each.
By indexing all six at once, I can complete a full index rebuild in
about 3.5 hours.
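The shard assignment can be sketched in Python (a hedged illustration: MySQL's CRC32() hashes the decimal string form of the number, which zlib.crc32 matches when given the same bytes):

```python
import zlib

NUM_SHARDS = 6  # six large shards, as described above

def shard_for(did: int) -> int:
    """Mimic MySQL's crc32(did) % 6 shard assignment.

    MySQL's CRC32() converts the bigint to its decimal string first;
    zlib.crc32 over the same bytes yields the same unsigned value.
    """
    return zlib.crc32(str(did).encode("ascii")) % NUM_SHARDS

# A given document always lands on the same shard, so a full rebuild
# can index all six shards in parallel from disjoint row sets.
```

At import time, each shard's import query would then select only the rows where crc32(did) % 6 equals that shard's number.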
Each full index chain lives on two 64GB Dell servers with dual quad-core
processors. Each server contains a Solr instance with 8GB of heap,
running three large shards. One server contains the incremental index,
the other server runs the load balancer. Both servers run an index-free
Solr core that we call the broker. Its search handlers have the shards
parameter in solrconfig.xml, pointed at the appropriate cores for that
index chain.
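A broker core like that can supply the shards parameter as a search-handler default in solrconfig.xml. A minimal sketch (the host and core names here are hypothetical):

```xml
<!-- solrconfig.xml on the index-free broker core -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- hypothetical hosts/cores; list every shard in this chain -->
    <str name="shards">idxa1:8983/solr/s0,idxa1:8983/solr/s1,idxa1:8983/solr/s2,idxb1:8983/solr/s3,idxb1:8983/solr/s4,idxb1:8983/solr/s5,idxa1:8983/solr/inc</str>
  </lst>
</requestHandler>
```

Queries sent to the broker are then fanned out to all seven cores and the results merged, without the client needing to know the shard layout.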
To keep index size down and search speed up, it's important that your
index contain only the fields needed for two purposes: searching
(indexed fields) and displaying a results grid (stored fields). Any
other information should be excluded from your schema.xml and/or DIH
config. Full item details should be populated from the database or
other information store (possibly a filesystem), using the unique
identifier from the search results.
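In schema.xml terms, that means being deliberate about the indexed and stored attributes on each field. A sketch with illustrative field names:

```xml
<!-- schema.xml: keep only what search and the results grid need -->
<field name="did"   type="long" indexed="true" stored="true"/>  <!-- unique key -->
<field name="title" type="text" indexed="true" stored="true"/>  <!-- searched and displayed -->
<field name="body"  type="text" indexed="true" stored="false"/> <!-- searched only -->
<!-- full item details are fetched from the database by did, not stored here -->
```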
If you are aggregating data from more than one table, see if you can
have your database get the information into one SELECT statement with
JOINs, rather than having more than one entity in your DIH config.
Alternatively, if your secondary tables are small, try using the
CachedSQLEntityProcessor on them so they are loaded entirely into RAM
for the import. Your database software is usually much better at
combining tables than Solr, so take advantage of it.
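As a sketch, a DIH data-config.xml along these lines (table and column names are hypothetical) keeps the heavy lifting in the database:

```xml
<document>
  <!-- one entity, one SELECT: the database does the JOIN -->
  <entity name="item"
          query="SELECT i.did, i.title, c.name AS category
                 FROM item i JOIN category c ON c.id = i.category_id">
    <!-- alternative for a small secondary table: cache it in RAM
    <entity name="category" processor="CachedSqlEntityProcessor"
            query="SELECT id, name FROM category"
            where="id=item.category_id"/>
    -->
  </entity>
</document>
```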
If you have multivalued search fields from secondary entities in DIH,
you can often get your database software to CONCAT them together into a
single field, then use an appropriate tokenizer to split them into
separate terms. I have one such field that is semicolon-separated,
built by a JOIN specified in a database view; a pattern tokenizer then
splits it apart at index time.
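For example, MySQL's GROUP_CONCAT can produce the semicolon-separated column, and a fieldType like this (a sketch; the type name is illustrative) splits it back into terms:

```xml
<!-- schema.xml: split a ";"-delimited database column into terms -->
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=";\s*"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```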
I hope this is helpful.
Thanks,
Shawn