As an aside - I just spoke with someone the other day who is using Hadoop for re-indexing in order to save a lot of time. I don't know the details, but I assume they're using Hadoop to call Lucene code and index documents using the map-reduce approach...
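Since only the general approach is known, here is a minimal illustrative sketch of map-reduce style indexing - NOT the actual Hadoop/Lucene code from that shop (which is not open source); all function names and data here are made up:

```python
# Illustrative sketch of map-reduce indexing, not the real system, which
# calls Lucene from Hadoop. Each "mapper" builds an inverted index over
# its slice of documents; the "reducer" merges the partial indexes.
from collections import defaultdict

def build_partial_index(docs):
    """Map step: invert one slice of (doc_id, text) pairs."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def merge_indexes(partials):
    """Reduce step: union the postings lists of every partial index."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return merged

docs = [(1, "solr scales well"),
        (2, "hadoop builds the index"),
        (3, "the index lives in solr")]
# Hadoop would run the map step on many nodes in parallel; here it is
# just a sequential loop over two naive partitions of the documents.
partials = [build_partial_index(part) for part in (docs[0::2], docs[1::2])]
index = merge_indexes(partials)
print(sorted(index["index"]))  # doc ids containing "index" -> [2, 3]
```

The win for large data sets comes from the map step: partial indexes are built independently per partition, so adding workers cuts wall-clock re-indexing time almost linearly until the merge dominates.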
This was made in their own shop - I don't think the code is available as open source, but it works for them as a way to really cut down re-indexing time for extremely large data sets.

On Sat, Sep 24, 2016 at 8:15 AM, Yago Riveiro <yago.rive...@gmail.com> wrote:
> "LucidWorks achieved 150k docs/second"
>
> This is only valid if you don't have replication. I don't know your use
> case, but a realistic use case normally uses some type of redundancy to
> not lose data on a hardware failure - at least 2 replicas; more implies a
> reduction in throughput. Also don't forget that in a realistic use case
> you should handle reads too.
>
> Our cluster is small for the data we hold (12 machines with SSD and 32G
> of RAM), but we don't need sub-second queries; we need facets with high
> cardinality (in worst-case scenarios we aggregate 5M unique string values).
>
> As Shawn probably told you, sizing your cluster is a trial-and-error path.
> Our cluster is optimized to handle a low rate of reads, facet queries,
> and a high rate of inserts.
>
> In a peak of inserts we can handle around 25K docs per second with 2
> replicas without any issue, and without compromising reads or putting a
> node under stress. Nodes under stress can eject themselves from the
> ZooKeeper cluster due to a GC pause or a lack of CPU to communicate.
>
> If you want accurate data you need to test.
>
> Keep in mind the most important thing about Solr, in my opinion: at a
> terabyte scale, any field-type schema change or Lucene codec change will
> force you to do a full reindex. Each time I need to update Solr to a
> major release it's a pain in the ass to convert the segments if they are
> not compatible with the newer version. This can take months, will not
> ensure your data ends up equal to a cleanly built index (voodoo-magic
> things can happen, trust me), and it will drain a huge amount of hardware
> resources to do it without downtime.
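Yago's replication point can be put into rough numbers. This is a back-of-envelope sketch under a simplifying assumption that is mine, not the thread's: every replica does the full indexing work for each document, so cluster-wide indexing capacity divides by the replication factor.

```python
# Back-of-envelope only: assumes indexing work scales linearly with the
# replication factor, i.e. every replica re-indexes every document.
def effective_indexing_rate(single_copy_rate, replication_factor):
    """Docs/s a cluster sustains when each doc is indexed on every replica."""
    return single_copy_rate / replication_factor

# Hypothetical numbers: a cluster able to index 50K docs/s with a single
# copy drops to about 25K docs/s with 2 replicas - in the ballpark of the
# 25K docs/s peak quoted above.
print(effective_indexing_rate(50_000, 2))  # -> 25000.0
```

The real penalty is usually worse than this linear model, since replicas also compete with reads for CPU and network, which is why the headline 150k docs/second benchmark number doesn't transfer to a redundant production cluster.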
>
> --
>
> /Yago Riveiro
>
> On Sep 24 2016, at 7:48 am, S G <sg.online.em...@gmail.com> wrote:
>
>> Hey Yago,
>>
>> 12 T is very impressive.
>>
>> Can you also share some numbers about the shards, replicas, machine
>> count/specs, and docs/second for your case?
>> I assume you don't have a single 12 TB index either, so some details on
>> that would be really helpful too.
>>
>> https://lucidworks.com/blog/2014/06/03/introducing-the-solr-scale-toolkit/
>> is a good post on how LucidWorks achieved 150k docs/second.
>> If you have any similar blog, that would be quite useful and popular too.
>>
>> --SG
>>
>> On Fri, Sep 23, 2016 at 5:00 PM, Yago Riveiro <yago.rive...@gmail.com> wrote:
>>
>>> In my company we have a SolrCloud cluster with 12T.
>>>
>>> My advice:
>>>
>>> Be nice with CPU; you will need it at some point (very important if you
>>> have no control over the kind of queries hitting the cluster - clients
>>> are greedy, they want all results at the same time).
>>>
>>> SSD and memory (as much as you can afford if you will do facets).
>>>
>>> Full recoveries are a pain; the network is important and should be as
>>> fast as possible, never less than 1 Gbit.
>>>
>>> Divide and conquer, but too much can drive you to an expensive
>>> overhead; data travels over the network. Find the sweet spot (only by
>>> testing your use case will you know).
>>>
>>> --
>>>
>>> /Yago Riveiro
>>>
>>> On 23 Sep 2016, 23:44 +0100, Pushkar Raste <pushkar.ra...@gmail.com>, wrote:
>>>> Solr is RAM hungry. Make sure that you have enough RAM to keep most of
>>>> the index of a core in RAM itself.
>>>>
>>>> You should also consider using really good SSDs.
>>>>
>>>> That would be a good start. Like others said, test and verify your setup.
>>>>
>>>> --Pushkar Raste
>>>>
>>>> On Sep 23, 2016 4:58 PM, "Jeffery Yuan" <yuanyun...@gmail.com> wrote:
>>>>
>>>>> Thanks so much for your prompt reply.
>>>>>
>>>>> We are definitely going to use SolrCloud.
>>>>>
>>>>> I am just wondering whether SolrCloud can scale even at the TB data
>>>>> level and what kind of hardware configuration it should have.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://lucene.472066.n3.nabble.com/Whether-solr-can-support-2-TB-data-tp4297790p4297800.html
>>>>> Sent from the Solr - User mailing list archive at Nabble.com.
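Pushkar's RAM advice comes down to the OS page cache: Solr reads index files through the filesystem cache, so the RAM left over after the JVM heap determines how much of the index stays hot. A rough sketch - the heap and per-node index sizes below are hypothetical, not figures from the thread:

```python
# Rough sizing sketch. Assumption (mine, not from the thread): queries
# stay fast while the hot part of the index fits in the OS page cache,
# i.e. in the RAM left over after the JVM heap is carved out.
def cached_index_fraction(index_gb, ram_gb, heap_gb):
    """Fraction of the on-disk index that fits in leftover RAM."""
    page_cache_gb = max(ram_gb - heap_gb, 0)
    return min(page_cache_gb / index_gb, 1.0)

# Hypothetical node: 32 GB RAM (as in Yago's cluster), an assumed 8 GB
# heap, and an assumed 1 TB of index on disk -- only ~2.4% stays cached,
# one reason SSDs matter so much at this scale.
print(round(cached_index_fraction(1000, 32, 8), 3))  # -> 0.024
```

When this fraction is far below 1.0, most queries hit disk, which is exactly the regime where Yago's cluster lives ("we don't need sub-second queries") and where "really good SSDs" become a requirement rather than a luxury.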