You are of course right, but we do our own normalization (among other things "to_lower") before we insert and before search queries are entered.
We do not use wildcards in searches either, so in our problem domain it works quite well.
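To illustrate what I mean by normalization: something along these lines is applied to every value before insert and to every term before a query is built. This is a simplified Java sketch for illustration only, not the actual C++ inserter code, and the helper name is made up.

import java.util.Locale;

// Apply the same lower-casing to every value before it is indexed and to every
// term before a query is built, so exact matches on "string"-typed fields
// behave case-insensitively.
final class Normalize {

    static String apply(String raw) {
        if (raw == null) {
            return null;
        }
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Both sides go through the same helper, so "I like Strings" in the
        // index and "i LIKE strings" in a query end up as the identical value.
        String indexed = apply("I like Strings");
        String queried = apply("i LIKE strings");
        System.out.println(indexed.equals(queried)); // prints: true
    }
}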
/svante


2014/1/25 Erick Erickson <erickerick...@gmail.com>

> Hmmm, I'm always suspicious when I see a schema.xml with a lot of "string"
> types. This is tangential to your question, but I thought I'd butt in
> anyway.
>
> String types are totally unanalyzed. So if the input for a field is
> "I like Strings", the only match will be "I like Strings". "I like strings"
> won't match due to the lower-case 's' in strings. "like" won't match since
> it isn't the complete input.
>
> You may already know this, but thought I'd point it out. For tokenized
> searches, text_general is a good place to start. Pardon me if this is
> repeating what you already know....
>
> Lots of string types sometimes lead people with DB backgrounds to
> search for *like*, which will be slow FWIW.
>
> Best,
> Erick
>
> On Sat, Jan 25, 2014 at 5:51 AM, svante karlsson <s...@csi.se> wrote:
> > That got away a little early...
> >
> > The inserter is a small C++ program that uses pglib to speak to postgres
> > and an http-client library that uses libcurl under the hood. The
> > inserter draws very little CPU and we normally use 2 writer threads that
> > each post 1000 records at a time. It's very inefficient to post one at a
> > time, but I've not done any specific testing to know if 1000 is better
> > than 500....
> >
> > What we're doing now is trying to figure out how to get the query
> > performance up, since it's not where we need it to be, so we're not done
> > either...
> >
> >
> > 2014/1/25 svante karlsson <s...@csi.se>
> >
> >> We are using a postgres server on a different host (same hardware as
> >> the test solr server). The reason we take the data from the postgres
> >> server is that it's easy to automate testing, since we use the same
> >> server to produce queries. In production we preload the solr from a csv
> >> file from a hive (hadoop) job and then only write updates ( < 500 / sec ).
> >> In our use case we use solr as a NoSQL database since we really want to
> >> do SHOULD queries against all the fields. The fields are typically very
> >> small text fields (<30 chars) but occasionally bigger, though I don't
> >> think I have more than 128 chars on anything in the whole dataset.
> >>
> >> <?xml version="1.0" encoding="UTF-8" ?>
> >> <schema name="example" version="1.1">
> >>   <types>
> >>     <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
> >>     <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
> >>     <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
> >>     <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
> >>     <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
> >>     <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
> >>   </types>
> >>   <fields>
> >>     <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
> >>     <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
> >>     <field name="name" type="int" indexed="true" stored="true"/>
> >>     <field name="fieldA" type="string" indexed="true" stored="true"/>
> >>     <field name="fieldB" type="string" indexed="true" stored="true"/>
> >>     <field name="fieldC" type="int" indexed="true" stored="true"/>
> >>     <field name="fieldD" type="int" indexed="true" stored="true"/>
> >>     <field name="fieldE" type="int" indexed="true" stored="true"/>
> >>     <field name="fieldF" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldG" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldH" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldI" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldJ" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldK" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldL" type="string" indexed="true" stored="true"/>
> >>     <field name="fieldM" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldN" type="string" indexed="true" stored="true"/>
> >>
> >>     <field name="fieldO" type="string" indexed="false" stored="true" required="false" />
> >>     <field name="ts" type="long" indexed="true" stored="true"/>
> >>   </fields>
> >>   <uniqueKey>id</uniqueKey>
> >>   <solrQueryParser defaultOperator="OR"/>
> >> </schema>
> >>
> >>
> >> 2014/1/25 Kranti Parisa <kranti.par...@gmail.com>
> >>
> >>> can you post the complete solrconfig.xml file and schema.xml files to
> >>> review all of your settings that would impact your indexing performance.
> >>>
> >>> Thanks,
> >>> Kranti K. Parisa
> >>> http://www.linkedin.com/in/krantiparisa
> >>>
> >>>
> >>> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
> >>> susheel.ku...@thedigitalgroup.net> wrote:
> >>>
> >>> > Thanks, Svante. Your indexing speed using the db seems to be really
> >>> > fast. Can you please provide some more detail on how you are indexing
> >>> > db records? Is it through the DataImportHandler? And what database? Is
> >>> > that a local db? We are indexing around 70 fields (60 multivalued),
> >>> > but the data is not always populated in all fields. The average size
> >>> > of a document is 5-10 KB.
> >>> >
> >>> > -----Original Message-----
> >>> > From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of
> >>> > svante karlsson
> >>> > Sent: Friday, January 24, 2014 5:05 PM
> >>> > To: solr-user@lucene.apache.org
> >>> > Subject: Re: Solr server requirements for 100+ million documents
> >>> >
> >>> > I just indexed 100 million db docs (records) with 22 fields (4
> >>> > multivalued) in 9524 sec using libcurl.
> >>> > 11 million took 763 seconds, so the speed drops somewhat with
> >>> > increasing db size.
> >>> >
> >>> > We write 1000 docs (just an arbitrary number) in each request from two
> >>> > threads. If you will be using solrcloud you will want more writer
> >>> > threads.
> >>> >
> >>> > The hardware is a single cheap HP DL320E GEN8 V2 1P E3-1220V3 with one
> >>> > SSD and 32GB, and the solr runs on ubuntu 13.10 inside an esxi virtual
> >>> > machine.
> >>> >
> >>> > /svante
> >>> >
> >>> >
> >>> > 2014/1/24 Susheel Kumar <susheel.ku...@thedigitalgroup.net>
> >>> >
> >>> > > Thanks, Erick for the info.
> >>> > >
> >>> > > For indexing, I agree that more time is consumed in data acquisition,
> >>> > > which in our case is from a database. For indexing we currently use a
> >>> > > manual process, i.e. the Solr dashboard Data Import, but we are now
> >>> > > looking to automate it. How do you suggest we automate the indexing
> >>> > > part? Do you recommend using SolrJ, or should we try to automate it
> >>> > > using curl?
> >>> > >
> >>> > >
> >>> > > -----Original Message-----
> >>> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>> > > Sent: Friday, January 24, 2014 2:59 PM
> >>> > > To: solr-user@lucene.apache.org
> >>> > > Subject: Re: Solr server requirements for 100+ million documents
> >>> > >
> >>> > > Can't be done with the information you provided, and can only be
> >>> > > guessed at even with more comprehensive information.
> >>> > >
> >>> > > Here's why:
> >>> > >
> >>> > > http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >>> > >
> >>> > > Also, at a guess, your indexing speed is so slow due to data
> >>> > > acquisition; I rather doubt you're being limited by raw Solr indexing.
> >>> > > If you're using SolrJ, try commenting out the server.add() bit and
> >>> > > running again. My guess is that your indexing speed will be almost
> >>> > > unchanged, in which case the data acquisition process is where you
> >>> > > should concentrate your efforts. As a comparison, I can index 11M
> >>> > > Wikipedia docs on my laptop in 45 minutes without any attempts at
> >>> > > parallelization.
> >>> > >
> >>> > >
> >>> > > Best,
> >>> > > Erick
> >>> > >
> >>> > > On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar <
> >>> > > susheel.ku...@thedigitalgroup.net> wrote:
> >>> > > > Hi,
> >>> > > >
> >>> > > > Currently we are indexing 10 million documents from a database (10
> >>> > > > db data entities) & the index size is around 8 GB on a windows
> >>> > > > virtual box. Indexing in one shot takes 12+ hours, while indexing
> >>> > > > in parallel in separate cores & merging them together takes 4+ hours.
> >>> > > >
> >>> > > > We are looking to scale to 100+ million documents and are looking
> >>> > > > for recommendations on server requirements, on the parameters below,
> >>> > > > for a production environment. There can be 200+ users performing
> >>> > > > searches at the same time.
> >>> > > >
> >>> > > > No of physical servers (considering solr cloud)
> >>> > > > Memory requirement
> >>> > > > Processor requirement (# cores)
> >>> > > > Linux as OS as opposed to windows
> >>> > > >
> >>> > > > Thanks in advance.
> >>> > > > Susheel
> >>> > > >
> >>> > >
> >>> >
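For the SolrJ route that Erick and Susheel discuss above, the batched approach svante describes (1000 docs per request, a single commit at the end) looks roughly like the sketch below. This is an illustration only, not code from the thread: the URL, core name, field names, and the loop that stands in for reading rows from the database are all placeholders, written against the SolrJ 4.x API of that era.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    // Arbitrary chunk size, as in the thread; worth benchmarking against e.g. 500.
    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        // Placeholder URL and core name.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {          // stand-in for rows fetched from the database
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("fieldA", "some value");   // already normalized (lower-cased) by the caller
            batch.add(doc);

            if (batch.size() == BATCH_SIZE) {
                server.add(batch);                  // one HTTP request per 1000 docs
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();                            // single commit at the end of the run
        server.shutdown();
    }
}

As Erick notes, temporarily commenting out the server.add() call is a quick way to see whether data acquisition or Solr itself is the bottleneck, and more writer threads help once SolrCloud is in the picture.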