You are of course right, but we do our own normalization (among other things "to_lower") before we insert and before search queries are entered.
We do not use wildcards in searches either, so in our problem domain it works quite well.
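To illustrate what I mean by normalization: something along these lines is applied to every value before insert and to every term before a query is built. This is a simplified Java sketch for illustration only, not the actual C++ inserter code, and the helper name is made up.

import java.util.Locale;

// Apply the same lower-casing to every value before it is indexed and to every
// term before a query is built, so exact matches on "string"-typed fields
// behave case-insensitively.
final class Normalize {

    static String apply(String raw) {
        if (raw == null) {
            return null;
        }
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Both sides go through the same helper, so "I like Strings" in the
        // index and "i LIKE strings" in a query end up as the identical value.
        String indexed = apply("I like Strings");
        String queried = apply("i LIKE strings");
        System.out.println(indexed.equals(queried)); // prints: true
    }
}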
/svante


2014/1/25 Erick Erickson <erickerick...@gmail.com>

> Hmmm, I'm always suspicious when I see a schema.xml with a lot of "string"
> types. This is tangential to your question, but I thought I'd butt in
> anyway.
>
> String types are totally unanalyzed. So if the input for a field is
> "I like Strings", the only match will be "I like Strings". "I like strings"
> won't match due to the lower-case 's' in strings. "like" won't match since
> it isn't the complete input.
>
> You may already know this, but thought I'd point it out. For tokenized
> searches, text_general is a good place to start. Pardon me if this is
> repeating what you already know....
>
> Lots of string types sometimes lead people with DB backgrounds to
> search for *like*, which will be slow FWIW.
>
> Best,
> Erick
>
> On Sat, Jan 25, 2014 at 5:51 AM, svante karlsson <s...@csi.se> wrote:
> > That got away a little early...
> >
> > The inserter is a small C++ program that uses pglib to speak to postgres
> > and an http-client library that uses libcurl under the hood. The
> > inserter draws very little CPU and we normally use 2 writer threads that
> > each post 1000 records at a time. It's very inefficient to post one at a
> > time, but I've not done any specific testing to know if 1000 is better
> > than 500....
> >
> > What we're doing now is trying to figure out how to get the query
> > performance up, since it's not where we need it to be, so we're not done
> > either...
> >
> >
> > 2014/1/25 svante karlsson <s...@csi.se>
> >
> >> We are using a postgres server on a different host (same hardware as
> >> the test solr server). The reason we take the data from the postgres
> >> server is that it's easy to automate testing, since we use the same
> >> server to produce queries. In production we preload the solr from a csv
> >> file from a hive (hadoop) job and then only write updates ( < 500 / sec ).
> >> In our use case we use solr as a NoSQL database since we really want to
> >> do SHOULD queries against all the fields. The fields are typically very
> >> small text fields (<30 chars) but occasionally bigger, though I don't
> >> think I have more than 128 chars on anything in the whole dataset.
> >>
> >> <?xml version="1.0" encoding="UTF-8" ?>
> >> <schema name="example" version="1.1">
> >>   <types>
> >>     <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
> >>     <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
> >>     <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
> >>     <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
> >>     <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
> >>     <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
> >>   </types>
> >>   <fields>
> >>     <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
> >>     <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
> >>     <field name="name" type="int" indexed="true" stored="true"/>
> >>     <field name="fieldA" type="string" indexed="true" stored="true"/>
> >>     <field name="fieldB" type="string" indexed="true" stored="true"/>
> >>     <field name="fieldC" type="int" indexed="true" stored="true"/>
> >>     <field name="fieldD" type="int" indexed="true" stored="true"/>
> >>     <field name="fieldE" type="int" indexed="true" stored="true"/>
> >>     <field name="fieldF" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldG" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldH" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldI" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldJ" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldK" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldL" type="string" indexed="true" stored="true"/>
> >>     <field name="fieldM" type="string" indexed="true" stored="true" multiValued="true"/>
> >>     <field name="fieldN" type="string" indexed="true" stored="true"/>
> >>
> >>     <field name="fieldO" type="string" indexed="false" stored="true" required="false" />
> >>     <field name="ts" type="long" indexed="true" stored="true"/>
> >>   </fields>
> >>   <uniqueKey>id</uniqueKey>
> >>   <solrQueryParser defaultOperator="OR"/>
> >> </schema>
> >>
> >>
> >> 2014/1/25 Kranti Parisa <kranti.par...@gmail.com>
> >>
> >>> can you post the complete solrconfig.xml file and schema.xml files to
> >>> review all of your settings that would impact your indexing performance.
> >>>
> >>> Thanks,
> >>> Kranti K. Parisa
> >>> http://www.linkedin.com/in/krantiparisa
> >>>
> >>>
> >>> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
> >>> susheel.ku...@thedigitalgroup.net> wrote:
> >>>
> >>> > Thanks, Svante. Your indexing speed using the db seems to be really
> >>> > fast. Can you please provide some more detail on how you are indexing
> >>> > db records? Is it through the DataImportHandler? And what database? Is
> >>> > that a local db? We are indexing around 70 fields (60 multivalued),
> >>> > but the data is not always populated in all fields. The average size
> >>> > of a document is 5-10 KB.
> >>> >
> >>> > -----Original Message-----
> >>> > From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of
> >>> > svante karlsson
> >>> > Sent: Friday, January 24, 2014 5:05 PM
> >>> > To: solr-user@lucene.apache.org
> >>> > Subject: Re: Solr server requirements for 100+ million documents
> >>> >
> >>> > I just indexed 100 million db docs (records) with 22 fields (4
> >>> > multivalued) in 9524 sec using libcurl.
> >>> > 11 million took 763 seconds, so the speed drops somewhat with
> >>> > increasing db size.
> >>> >
> >>> > We write 1000 docs (just an arbitrary number) in each request from two
> >>> > threads. If you will be using solrcloud you will want more writer
> >>> > threads.
> >>> >
> >>> > The hardware is a single cheap HP DL320E GEN8 V2 1P E3-1220V3 with one
> >>> > SSD and 32GB, and the solr runs on ubuntu 13.10 inside an esxi virtual
> >>> > machine.
> >>> >
> >>> > /svante
> >>> >
> >>> >
> >>> > 2014/1/24 Susheel Kumar <susheel.ku...@thedigitalgroup.net>
> >>> >
> >>> > > Thanks, Erick for the info.
> >>> > >
> >>> > > For indexing, I agree that more time is consumed in data acquisition,
> >>> > > which in our case is from a database. For indexing we currently use a
> >>> > > manual process, i.e. the Solr dashboard Data Import, but we are now
> >>> > > looking to automate it. How do you suggest we automate the indexing
> >>> > > part? Do you recommend using SolrJ, or should we try to automate it
> >>> > > using curl?
> >>> > >
> >>> > >
> >>> > > -----Original Message-----
> >>> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>> > > Sent: Friday, January 24, 2014 2:59 PM
> >>> > > To: solr-user@lucene.apache.org
> >>> > > Subject: Re: Solr server requirements for 100+ million documents
> >>> > >
> >>> > > Can't be done with the information you provided, and can only be
> >>> > > guessed at even with more comprehensive information.
> >>> > >
> >>> > > Here's why:
> >>> > >
> >>> > > http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >>> > >
> >>> > > Also, at a guess, your indexing speed is so slow due to data
> >>> > > acquisition; I rather doubt you're being limited by raw Solr indexing.
> >>> > > If you're using SolrJ, try commenting out the server.add() bit and
> >>> > > running again. My guess is that your indexing speed will be almost
> >>> > > unchanged, in which case the data acquisition process is where you
> >>> > > should concentrate your efforts. As a comparison, I can index 11M
> >>> > > Wikipedia docs on my laptop in 45 minutes without any attempts at
> >>> > > parallelization.
> >>> > >
> >>> > >
> >>> > > Best,
> >>> > > Erick
> >>> > >
> >>> > > On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar <
> >>> > > susheel.ku...@thedigitalgroup.net> wrote:
> >>> > > > Hi,
> >>> > > >
> >>> > > > Currently we are indexing 10 million documents from a database (10
> >>> > > > db data entities) & the index size is around 8 GB on a windows
> >>> > > > virtual box. Indexing in one shot takes 12+ hours, while indexing
> >>> > > > in parallel in separate cores & merging them together takes 4+ hours.
> >>> > > >
> >>> > > > We are looking to scale to 100+ million documents and are looking
> >>> > > > for recommendations on server requirements, on the parameters below,
> >>> > > > for a production environment. There can be 200+ users performing
> >>> > > > searches at the same time.
> >>> > > >
> >>> > > > No of physical servers (considering solr cloud)
> >>> > > > Memory requirement
> >>> > > > Processor requirement (# cores)
> >>> > > > Linux as OS as opposed to windows
> >>> > > >
> >>> > > > Thanks in advance.
> >>> > > > Susheel
> >>> > > >
> >>> > >
> >>> >
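For the SolrJ route that Erick and Susheel discuss above, the batched approach svante describes (1000 docs per request, a single commit at the end) looks roughly like the sketch below. This is an illustration only, not code from the thread: the URL, core name, field names, and the loop that stands in for reading rows from the database are all placeholders, written against the SolrJ 4.x API of that era.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    // Arbitrary chunk size, as in the thread; worth benchmarking against e.g. 500.
    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        // Placeholder URL and core name.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {          // stand-in for rows fetched from the database
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("fieldA", "some value");   // already normalized (lower-cased) by the caller
            batch.add(doc);

            if (batch.size() == BATCH_SIZE) {
                server.add(batch);                  // one HTTP request per 1000 docs
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();                            // single commit at the end of the run
        server.shutdown();
    }
}

As Erick notes, temporarily commenting out the server.add() call is a quick way to see whether data acquisition or Solr itself is the bottleneck, and more writer threads help once SolrCloud is in the picture.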