Need suggestions for storing TBs of structured data in SolrCloud

2014-03-05 Thread Chia-Chun Shih
Hi,

I am planning a system for searching terabytes of structured data in SolrCloud,
and I need suggestions for handling this much data (e.g., the number of shards
per collection, the number of nodes, etc.).

Here are some specs of the system:

   1. Raw data is 35,000 CSV files per day. Each file is about 5 MB.
   2. One collection holds one day of data. 200 days of history are required.
   3. Building the index for one day must take less than 10 hours.
   4. An ordinary query (which may span 1 to 7 days) must complete within
   10 minutes.
   5. Fewer than 10 concurrent users.
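
To make the scale concrete, here is a rough back-of-the-envelope calculation using only the figures in the list above. It assumes decimal units (1 GB = 1000 MB) and treats "about 5 MB" as exactly 5 MB, so the results are estimates only; the function names are just for illustration.

```python
# Rough sizing from the specs above. All inputs are the stated figures;
# decimal units (1 GB = 1000 MB, 1 TB = 1000 GB) are an assumption.
FILES_PER_DAY = 35_000
FILE_SIZE_MB = 5           # "about 5 MB" per file, taken as exact here
HISTORY_DAYS = 200
BUILD_WINDOW_HOURS = 10


def daily_raw_gb():
    """Raw CSV volume for one day, in GB."""
    return FILES_PER_DAY * FILE_SIZE_MB / 1000


def history_raw_tb():
    """Raw CSV volume for the full history window, in TB."""
    return daily_raw_gb() * HISTORY_DAYS / 1000


def required_mb_per_second():
    """Sustained ingest rate needed to load one day within the build window."""
    return FILES_PER_DAY * FILE_SIZE_MB / (BUILD_WINDOW_HOURS * 3600)


if __name__ == "__main__":
    print(f"raw data per day:   {daily_raw_gb():.0f} GB")
    print(f"raw data, 200 days: {history_raw_tb():.0f} TB")
    print(f"required ingest:    {required_mb_per_second():.2f} MB/s")
```

So the raw input is roughly 175 GB per day and 35 TB over 200 days, and meeting the 10-hour window means sustaining a little under 5 MB/s (about one file per second).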

I have built an experimental SolrCloud cluster on 3 VMs, each with 8
cores and 64 GB RAM. Each collection has 3 shards and no replication. Here are
my findings:

   1. Each collection's actual index size is between 30 GB and 90 GB,
   depending on the number of stored fields.
   2. It takes 6 to 12 hours to load the raw data. I use multiple (15 to 30)
   threads to send HTTP requests. (http://wiki.apache.org/solr/UpdateCSV)
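
For reference, the multi-threaded loading described in finding 2 looks roughly like the sketch below. This is not the actual loader: the host, port, collection name (`collection1`), and the function names are placeholders, and the `/update/csv` path and content type follow the UpdateCSV wiki page linked above. No commit parameter is sent; when to commit is left open.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Placeholder endpoint: host, port, and collection name are assumptions.
UPDATE_URL = "http://localhost:8983/solr/collection1/update/csv"


def post_csv(path, url=UPDATE_URL, timeout=600):
    """POST one CSV file to the CSV update handler and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=Path(path).read_bytes(),  # supplying data makes this a POST
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def load_all(paths, workers=15, post=post_csv):
    """Upload files concurrently.

    Returns (path, result) pairs in input order, where result is the HTTP
    status on success or the exception raised for that file.
    """
    def safe_post(path):
        try:
            return path, post(path)
        except Exception as exc:  # record the failure and keep loading
            return path, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(safe_post, paths))
```

Calling `load_all(sorted(Path("day1").glob("*.csv")), workers=30)` would push one day's files with 30 threads (the directory name is hypothetical); failed files can then be picked out of the returned pairs and retried.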


Thanks,
Chia-Chun


SolrCloud can't correctly create a collection after ZooKeeper ensemble recovery

2014-02-20 Thread Chia-Chun Shih
Hi all,

This is my test procedure:

1. Start a ZooKeeper ensemble and a SolrCloud node.
2. Stop the ZooKeeper ensemble.
3. Start the ZooKeeper ensemble.
4. Creating a collection (with 1 shard and 1 replica) fails because of a
timeout.
5. Restart the SolrCloud node.
6. Creating a collection with the same name as in step 4 fails because the
collection already exists. However, the collection is not assigned to any
SolrCloud node.
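
The steps above do not say how the collection is created. Assuming it goes through the Collections API `CREATE` action, the request in steps 4 and 6 would look something like the sketch below. The base URL and the collection name `test` are placeholders, and `create_collection_url` is just an illustrative helper.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards=1, replication_factor=1):
    """Build a Collections API CREATE URL (1 shard, 1 replica by default)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return base_url.rstrip("/") + "/admin/collections?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder host and collection name.
    print(create_collection_url("http://localhost:8983/solr", "test"))
```

Sending the same URL again after the restart in step 5 is what reproduces the "already exists" failure in step 6.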

I am using Solr 4.6.1 and ZooKeeper 3.4.5.

Thanks,
Chia-Chun