Are you storing the entire log file text in Solr? That's almost 3 GB of
text you are keeping in the index. A few things to try:
1) Check whether this is first-time performance, or whether repeat queries with the same fields are also slow.
2) Optimize the index and test performance again (a sketch of the optimize command is below).
3) Index without storing the text and see what the performance looks like.
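
For (2), assuming a stock Solr instance at localhost:8983 (adjust host/port
to your setup), an optimize can be triggered by posting an XML update
message to the update handler:

  <!-- POST to http://localhost:8983/solr/update with Content-Type: text/xml -->
  <optimize/>

Keep in mind an optimize rewrites the whole index, so expect heavy disk I/O
while it runs, and re-test queries only after it finishes.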


On 7/29/10, Peter Spam <ps...@mac.com> wrote:
> Any ideas?  I've got 5000 documents with an average size of 850k each, and
> it sometimes takes 2 minutes for a query to come back when highlighting is
> turned on!  Help!
>
>
> -Pete
>
> On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:
>
>> From the mailing list archive, Koji wrote:
>>
>>> 1. Provide another field for highlighting and use copyField to copy
>>> plainText to the highlighting field.
>>
>> and Lance wrote:
>> http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html
>>
>>> If you want to highlight field X, doing the
>>> termOffsets/termPositions/termVectors will make highlighting that field
>>> faster. You should make a separate field and apply these options to that
>>> field.
>>>
>>> Now: doing a copyField adds a "value" to a multiValued field. For a text
>>> field, you get a multi-valued text field. You should only copy one value
>>> to the highlighted field, so just copyField the document to your special
>>> field. To enforce this, I would add multiValued="false" to that field,
>>> just to avoid mistakes.
>>>
>>> So, all_text should be indexed without the term* attributes, and should
>>> not be stored. Then store your document in a separate field that you use
>>> for highlighting, with the term* attributes.
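>>
>> (Taken literally, that would look roughly like the following sketch, where
>> "hl_text" is an assumed name for the stored highlighting field:
>>
>>   <field name="all_text" type="text_pl" indexed="true" stored="false"
>>          multiValued="false" />
>>   <field name="hl_text" type="text_pl" indexed="true" stored="true"
>>          multiValued="false" termVectors="true" termPositions="true"
>>          termOffsets="true" />
>>   <copyField source="all_text" dest="hl_text"/>
>>
>> ...with searches going against all_text and hl.fl=hl_text.)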
>>
>> I've been experimenting with this, and here's what I've tried:
>>
>>   <field name="body" type="text_pl" indexed="true" stored="false"
>>          multiValued="true" termVectors="true" termPositions="true"
>>          termOffsets="true" />
>>   <field name="body_all" type="text_pl" indexed="false" stored="true"
>>          multiValued="true" />
>>   <copyField source="body" dest="body_all"/>
>>
>> ... but it's still very slow (10+ seconds).  Why is it better to have two
>> fields (one indexed but not stored, and the other not indexed but stored)
>> rather than just one field that's both indexed and stored?
>>
>>
>> From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors
>>
>>> If you aren't always using all the stored fields, then enabling lazy
>>> field loading can be a huge boon, especially if compressed fields are
>>> used.
>>
>> What does this mean?  How do you load a field lazily?
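>>
>> (My best guess, assuming the stock example solrconfig.xml: there seems to
>> be a switch for it in the <query> section, e.g.
>>
>>   <query>
>>     ...
>>     <enableLazyFieldLoading>true</enableLazyFieldLoading>
>>     ...
>>   </query>
>>
>> ...but I'm not sure whether anything else is needed.)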
>>
>> Thanks for your time, guys - this has started to become frustrating, since
>> it works so well, but is very slow!
>>
>>
>> -Pete
>>
>> On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:
>>
>>> Data set: About 4,000 log files (will eventually grow to millions).
>>> Average log file is 850k.  Largest log file (so far) is about 70MB.
>>>
>>> Problem: When I search for common terms, the query time goes from under
>>> 2-3 seconds to about 60 seconds.  TermVectors etc are enabled.  When I
>>> disable highlighting, performance improves a lot, but is still slow for
>>> some queries (7 seconds).  Thanks in advance for any ideas!
>>>
>>>
>>> -Peter
>>>
>>>
>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> 4GB RAM server
>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>
>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> schema.xml changes:
>>>
>>>   <fieldType name="text_pl" class="solr.TextField">
>>>     <analyzer>
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
>>>               generateNumberParts="0" catenateWords="0" catenateNumbers="0"
>>>               catenateAll="0" splitOnCaseChange="0"/>
>>>     </analyzer>
>>>   </fieldType>
>>>
>>> ...
>>>
>>>   <field name="body" type="text_pl" indexed="true" stored="true"
>>>          multiValued="false" termVectors="true" termPositions="true"
>>>          termOffsets="true" />
>>>   <field name="timestamp" type="date" indexed="true" stored="true"
>>>          default="NOW" multiValued="false"/>
>>>   <field name="version" type="string" indexed="true" stored="true"
>>>          multiValued="false"/>
>>>   <field name="device" type="string" indexed="true" stored="true"
>>>          multiValued="false"/>
>>>   <field name="filename" type="string" indexed="true" stored="true"
>>>          multiValued="false"/>
>>>   <field name="filesize" type="long" indexed="true" stored="true"
>>>          multiValued="false"/>
>>>   <field name="pversion" type="int" indexed="true" stored="true"
>>>          multiValued="false"/>
>>>   <field name="first2md5" type="string" indexed="false" stored="true"
>>>          multiValued="false"/>
>>>   <field name="ckey" type="string" indexed="true" stored="true"
>>>          multiValued="false"/>
>>>
>>> ...
>>>
>>> <dynamicField name="*" type="ignored" multiValued="true" />
>>> <defaultSearchField>body</defaultSearchField>
>>> <solrQueryParser defaultOperator="AND"/>
>>>
>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> solrconfig.xml changes:
>>>
>>>   <maxFieldLength>2147483647</maxFieldLength>
>>>   <ramBufferSizeMB>128</ramBufferSizeMB>
>>>
>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> The query:
>>>
>>> # Assemble the Solr query string piece by piece:
>>> rowStr = "&rows=10"
>>> facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>> regexv = "(?m)^.*\n.*\n.*$"   # a fragment is three consecutive lines
>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>> # escape Solr metacharacters in the user's query, then scope it to body:
>>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, '').gsub(/([:~!<>="])/, '\\\\\1') + fuzzy + minLogSizeStr)
>>>
>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : ('&fq=' + p['fq'].to_s)) + justq + rowStr + facet + fields + termvectors + hl + hl_regex
>>>
>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
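>>>
>>> (For reference: with q=foo, an empty fq, and empty fuzzy/minLogSizeStr,
>>> thequery assembles to roughly the following, wrapped here for readability
>>> and with the escaped regex elided:
>>>
>>>   /solr/select?timeAllowed=5000&wt=ruby&q=body%3Afoo&rows=10
>>>     &facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version
>>>     &fl=id,score,filename,version,device,first2md5,filesize,ckey
>>>     &tv=true&qt=tvrh&tv.all=true
>>>     &hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400
>>>     &hl.regex.pattern=<CGI-escaped regexv>&hl.regex.slop=1&hl.fragmenter=regex
>>>     &hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647)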
>>>
>>
>
>

-- 
Sent from my mobile device
