Re: Solr searching performance issues, using large documents

Peter Spam Thu, 29 Jul 2010 11:08:10 -0700

Any ideas?  I've got 5000 documents with an average size of 850k each, and it 
sometimes takes 2 minutes for a query to come back when highlighting is turned 
on!  Help!



-Pete

On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:

> From the mailing list archive, Koji wrote:
> 
>> 1. Provide another field for highlighting and use copyField to copy 
>> plainText to the highlighting field.
> 
> and Lance wrote: 
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html
> 
>> If you want to highlight field X, doing the 
>> termOffsets/termPositions/termVectors will make highlighting that field 
>> faster. You should make a separate field and apply these options to that 
>> field.
>> 
>> Now: doing a copyfield adds a "value" to a multiValued field. For a text 
>> field, you get a multi-valued text field. You should only copy one value to 
>> the highlighted field, so just copyField the document to your special field. 
>> To enforce this, I would add multiValued="false" to that field, just to 
>> avoid mistakes.
>> 
>> So, all_text should be indexed without the term* attributes, and should not 
>> be stored. Then your document is stored in a separate field that you use 
>> for highlighting and that has the term* attributes.
> 
> I've been experimenting with this, and here's what I've tried:
> 
>   <field name="body" type="text_pl" indexed="true" stored="false" 
> multiValued="true" termVectors="true" termPositions="true" 
> termOffsets="true" />
>   <field name="body_all" type="text_pl" indexed="false" stored="true" 
> multiValued="true" />
>   <copyField source="body" dest="body_all"/>
> 
> ... but it's still very slow (10+ seconds).  Why is it better to have two 
> fields (one indexed but not stored, and the other not indexed but stored) 
> rather than just one field that's both indexed and stored?
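One reading of Lance's advice, as a rough sketch only: the attributes in the attempt above look reversed relative to what he describes. The highlighter reads the stored text, so the field named in hl.fl has to be stored, and term vectors can only be kept for a field that is indexed. The field name body_hl below is hypothetical, and whether this layout actually speeds up highlighting on documents this large is untested here.

```xml
<!-- Searched field: indexed only, with no stored text and no term vectors. -->
<field name="body" type="text_pl" indexed="true" stored="false"
       multiValued="false"/>

<!-- Highlighted field (hypothetical name body_hl): indexed, stored, and
     carrying the term* attributes. -->
<field name="body_hl" type="text_pl" indexed="true" stored="true"
       multiValued="false" termVectors="true" termPositions="true"
       termOffsets="true"/>

<!-- Copy the single document body into the highlighting field. -->
<copyField source="body" dest="body_hl"/>
```

With a layout like this, the query would still search body, and the highlighting parameter would become hl.fl=body_hl.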
> 
> 
> From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors
> 
>> If you aren't always using all the stored fields, then enabling lazy field 
>> loading can be a huge boon, especially if compressed fields are used.
> 
> What does this mean?  How do you load a field lazily?
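For reference, lazy field loading is a solrconfig.xml switch rather than a per-field schema attribute. A minimal sketch, assuming the stock layout where the setting lives in the query section:

```xml
<!-- solrconfig.xml -->
<query>
  <!-- Stored fields that were not requested via fl are loaded only
       when something actually accesses them. -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```

With this on, a request whose fl omits a large stored field such as body should not need to read that field's stored value just to return the other fields. It is unlikely to help when highlighting still has to read the body text.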
> 
> Thanks for your time, guys - this has started to become frustrating, since it 
> works so well, but is very slow!
> 
> 
> -Pete
> 
> On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:
> 
>> Data set: About 4,000 log files (will eventually grow to millions).  Average 
>> log file is 850k.  Largest log file (so far) is about 70MB.
>> 
>> Problem: When I search for common terms, the query time goes from 2-3 
>> seconds to about 60 seconds.  TermVectors etc. are enabled.  When I disable 
>> highlighting, performance improves a lot, but is still slow for some queries 
>> (7 seconds).  Thanks in advance for any ideas!
>> 
>> 
>> -Peter
>> 
>> 
>> -------------------------------------------------------------------------------------------------------------------------------------
>> 
>> 4GB RAM server
>> % java -Xms2048M -Xmx3072M -jar start.jar
>> 
>> -------------------------------------------------------------------------------------------------------------------------------------
>> 
>> schema.xml changes:
>> 
>>   <fieldType name="text_pl" class="solr.TextField">
>>     <analyzer>
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" 
>> generateNumberParts="0" catenateWords="0" catenateNumbers="0" 
>> catenateAll="0" splitOnCaseChange="0"/>
>>     </analyzer>
>>   </fieldType>
>> 
>> ...
>> 
>>   <field name="body" type="text_pl" indexed="true" stored="true" 
>> multiValued="false" termVectors="true" termPositions="true" 
>> termOffsets="true" />
>>   <field name="timestamp" type="date" indexed="true" stored="true" 
>> default="NOW" multiValued="false"/>
>>   <field name="version" type="string" indexed="true" stored="true" 
>> multiValued="false"/>
>>   <field name="device" type="string" indexed="true" stored="true" 
>> multiValued="false"/>
>>   <field name="filename" type="string" indexed="true" stored="true" 
>> multiValued="false"/>
>>   <field name="filesize" type="long" indexed="true" stored="true" 
>> multiValued="false"/>
>>   <field name="pversion" type="int" indexed="true" stored="true" 
>> multiValued="false"/>
>>   <field name="first2md5" type="string" indexed="false" stored="true" 
>> multiValued="false"/>
>>   <field name="ckey" type="string" indexed="true" stored="true" 
>> multiValued="false"/>
>> 
>> ...
>> 
>> <dynamicField name="*" type="ignored" multiValued="true" />
>> <defaultSearchField>body</defaultSearchField>
>> <solrQueryParser defaultOperator="AND"/>
>> 
>> -------------------------------------------------------------------------------------------------------------------------------------
>> 
>> solrconfig.xml changes:
>> 
>>   <maxFieldLength>2147483647</maxFieldLength>
>>   <ramBufferSizeMB>128</ramBufferSizeMB>
>> 
>> -------------------------------------------------------------------------------------------------------------------------------------
>> 
>> The query:
>> 
>> rowStr = "&rows=10"
>> facet = 
>> "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>> regexv = "(?m)^.*\n.*\n.*$"
>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + 
>> "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, 
>> '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy + minLogSizeStr)
>> 
>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : 
>> ('&fq='+p['fq'].to_s) ) + justq + rowStr + facet + fields + termvectors + hl 
>> + hl_regex
>> 
>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + 
>> p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>> 
>
