As an aside - I just spoke with someone the other day who is using Hadoop for re-indexing in order to save a lot of time. I don't know the details, but I assume they're using Hadoop to call Lucene code and index documents using the map-reduce approach...
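Since only the general approach is known, here is a minimal illustrative sketch of map-reduce style indexing - NOT the actual Hadoop/Lucene code from that shop (which is not open source); all function names and data here are made up:

```python
# Illustrative sketch of map-reduce indexing, not the real system, which
# calls Lucene from Hadoop. Each "mapper" builds an inverted index over
# its slice of documents; the "reducer" merges the partial indexes.
from collections import defaultdict

def build_partial_index(docs):
    """Map step: invert one slice of (doc_id, text) pairs."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def merge_indexes(partials):
    """Reduce step: union the postings lists of every partial index."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return merged

docs = [(1, "solr scales well"),
        (2, "hadoop builds the index"),
        (3, "the index lives in solr")]
# Hadoop would run the map step on many nodes in parallel; here it is
# just a sequential loop over two naive partitions of the documents.
partials = [build_partial_index(part) for part in (docs[0::2], docs[1::2])]
index = merge_indexes(partials)
print(sorted(index["index"]))  # doc ids containing "index" -> [2, 3]
```

The win for large data sets comes from the map step: partial indexes are built independently per partition, so adding workers cuts wall-clock re-indexing time almost linearly until the merge dominates.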
This was made in their own shop - I don't think the code is available as open source, but it works for them as a way to really cut down re-indexing time for extremely large data sets.

On Sat, Sep 24, 2016 at 8:15 AM, Yago Riveiro <yago.rive...@gmail.com> wrote:
> "LucidWorks achieved 150k docs/second"
>
> This is only valid if you don't have replication. I don't know your use
> case, but a realistic use case normally uses some type of redundancy to
> not lose data on a hardware failure - at least 2 replicas; more implies a
> reduction in throughput. Also don't forget that in a realistic use case
> you should handle reads too.
>
> Our cluster is small for the data we hold (12 machines with SSD and 32G
> of RAM), but we don't need sub-second queries; we need facets with high
> cardinality (in worst-case scenarios we aggregate 5M unique string values).
>
> As Shawn probably told you, sizing your cluster is a trial-and-error path.
> Our cluster is optimized to handle a low rate of reads, facet queries,
> and a high rate of inserts.
>
> In a peak of inserts we can handle around 25K docs per second with 2
> replicas without any issue, and without compromising reads or putting a
> node under stress. Nodes under stress can eject themselves from the
> ZooKeeper cluster due to a GC pause or a lack of CPU to communicate.
>
> If you want accurate data you need to test.
>
> Keep in mind the most important thing about Solr, in my opinion: at a
> terabyte scale, any field-type schema change or Lucene codec change will
> force you to do a full reindex. Each time I need to update Solr to a
> major release it's a pain in the ass to convert the segments if they are
> not compatible with the newer version. This can take months, will not
> ensure your data ends up equal to a cleanly built index (voodoo-magic
> things can happen, trust me), and it will drain a huge amount of hardware
> resources to do it without downtime.
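Yago's replication point can be put into rough numbers. This is a back-of-envelope sketch under a simplifying assumption that is mine, not the thread's: every replica does the full indexing work for each document, so cluster-wide indexing capacity divides by the replication factor.

```python
# Back-of-envelope only: assumes indexing work scales linearly with the
# replication factor, i.e. every replica re-indexes every document.
def effective_indexing_rate(single_copy_rate, replication_factor):
    """Docs/s a cluster sustains when each doc is indexed on every replica."""
    return single_copy_rate / replication_factor

# Hypothetical numbers: a cluster able to index 50K docs/s with a single
# copy drops to about 25K docs/s with 2 replicas - in the ballpark of the
# 25K docs/s peak quoted above.
print(effective_indexing_rate(50_000, 2))  # -> 25000.0
```

The real penalty is usually worse than this linear model, since replicas also compete with reads for CPU and network, which is why the headline 150k docs/second benchmark number doesn't transfer to a redundant production cluster.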
>
> --
>
> /Yago Riveiro
>
> On Sep 24 2016, at 7:48 am, S G <sg.online.em...@gmail.com> wrote:
>
>> Hey Yago,
>>
>> 12 T is very impressive.
>>
>> Can you also share some numbers about the shards, replicas, machine
>> count/specs, and docs/second for your case?
>> I assume you don't have a single 12 TB index either, so some details on
>> that would be really helpful too.
>>
>> https://lucidworks.com/blog/2014/06/03/introducing-the-solr-scale-toolkit/
>> is a good post on how LucidWorks achieved 150k docs/second.
>> If you have any similar blog, that would be quite useful and popular too.
>>
>> --SG
>>
>> On Fri, Sep 23, 2016 at 5:00 PM, Yago Riveiro <yago.rive...@gmail.com> wrote:
>>
>>> In my company we have a SolrCloud cluster with 12T.
>>>
>>> My advice:
>>>
>>> Be nice with CPU; you will need it at some point (very important if you
>>> have no control over the kind of queries hitting the cluster - clients
>>> are greedy, they want all results at the same time).
>>>
>>> SSD and memory (as much as you can afford if you will do facets).
>>>
>>> Full recoveries are a pain; the network is important and should be as
>>> fast as possible, never less than 1 Gbit.
>>>
>>> Divide and conquer, but too much can drive you to an expensive
>>> overhead; data travels over the network. Find the sweet spot (only by
>>> testing your use case will you know).
>>>
>>> --
>>>
>>> /Yago Riveiro
>>>
>>> On 23 Sep 2016, 23:44 +0100, Pushkar Raste <pushkar.ra...@gmail.com>, wrote:
>>>> Solr is RAM hungry. Make sure that you have enough RAM to keep most of
>>>> the index of a core in RAM itself.
>>>>
>>>> You should also consider using really good SSDs.
>>>>
>>>> That would be a good start. Like others said, test and verify your setup.
>>>>
>>>> --Pushkar Raste
>>>>
>>>> On Sep 23, 2016 4:58 PM, "Jeffery Yuan" <yuanyun...@gmail.com> wrote:
>>>>
>>>>> Thanks so much for your prompt reply.
>>>>>
>>>>> We are definitely going to use SolrCloud.
>>>>>
>>>>> I am just wondering whether SolrCloud can scale even at the TB data
>>>>> level and what kind of hardware configuration it should have.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://lucene.472066.n3.nabble.com/Whether-solr-can-support-2-TB-data-tp4297790p4297800.html
>>>>> Sent from the Solr - User mailing list archive at Nabble.com.
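Pushkar's RAM advice comes down to the OS page cache: Solr reads index files through the filesystem cache, so the RAM left over after the JVM heap determines how much of the index stays hot. A rough sketch - the heap and per-node index sizes below are hypothetical, not figures from the thread:

```python
# Rough sizing sketch. Assumption (mine, not from the thread): queries
# stay fast while the hot part of the index fits in the OS page cache,
# i.e. in the RAM left over after the JVM heap is carved out.
def cached_index_fraction(index_gb, ram_gb, heap_gb):
    """Fraction of the on-disk index that fits in leftover RAM."""
    page_cache_gb = max(ram_gb - heap_gb, 0)
    return min(page_cache_gb / index_gb, 1.0)

# Hypothetical node: 32 GB RAM (as in Yago's cluster), an assumed 8 GB
# heap, and an assumed 1 TB of index on disk -- only ~2.4% stays cached,
# one reason SSDs matter so much at this scale.
print(round(cached_index_fraction(1000, 32, 8), 3))  # -> 0.024
```

When this fraction is far below 1.0, most queries hit disk, which is exactly the regime where Yago's cluster lives ("we don't need sub-second queries") and where "really good SSDs" become a requirement rather than a luxury.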