Re: Can Solr handle large text files?

Peter Spam Tue, 01 Nov 2011 17:19:32 -0700

Oh by the way - what analyzer are you using for your log files?  Here's what 
I'm trying:


    <fieldType name="text_pl" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                splitOnCaseChange="0"/>
      </analyzer>
    </fieldType>


Thanks!
Pete

On Oct 31, 2011, at 9:28 PM, anand.ni...@rbs.com wrote:

> Hi,
> 
> Basically, I need to index very large log files. I have modified the 
> ExtractingDocumentLoader to create a new document for every 50 lines of the 
> log file being indexed (the chunk size is configurable through a system 
> property). The 'Filename' field is the same for every document created from 
> one log file, and the unique id is generated by appending the line range to 
> the file name, e.g. 'log.txt (line no. 100-150)'. Each doc is given a custom 
> score, stored in a field called 'custom_score', which is directly 
> proportional to its distance from the beginning of the file.
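
[Editor's sketch] The chunking scheme described above could look roughly like the following in Python. This is not Anand's modified ExtractingDocumentLoader, which was not posted; `chunk_log` and `make_doc` are hypothetical names, the field names are taken from the message and the query further down, and using the chunk's first line number as the score is an assumption that merely satisfies "proportional to its distance from the beginning of the file".

```python
from typing import Iterable, Iterator


def make_doc(filename: str, first: int, last: int, lines: list) -> dict:
    """Build one document for the lines first..last (1-based, inclusive)."""
    return {
        "id": f"{filename} (line no. {first}-{last})",
        "FileName": filename,
        "Content": "\n".join(lines),
        # Assumption: the score is the chunk's first line number, so it
        # grows with the distance from the beginning of the file.
        "custom_score": first,
    }


def chunk_log(filename: str, lines: Iterable[str], chunk_size: int = 50) -> Iterator[dict]:
    """Yield one document per `chunk_size` lines of a log file."""
    buffer = []
    start = 1  # line number of the first line in the current chunk
    for line_no, line in enumerate(lines, start=1):
        buffer.append(line.rstrip("\n"))
        if len(buffer) == chunk_size:
            yield make_doc(filename, start, line_no, buffer)
            buffer = []
            start = line_no + 1
    if buffer:  # final chunk, possibly shorter than chunk_size
        yield make_doc(filename, start, start + len(buffer) - 1, buffer)
```

Each yielded dict would then be sent to Solr as a separate document; because the function only holds one chunk in memory at a time, it can be fed an open file object directly.
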
> 
> I also found 'hitGrouped.vm' on the net. Since I am reading only 50 lines 
> per document, the default max chunk size works for me, but it can easily be 
> adjusted depending on the number of lines you read per doc.
> 
> I group on the 'filename' field and show the results from the docs with the 
> highest score, so I am able to show the last matching results from the log 
> file. The query parameters I am using for search are:
> 
> http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&bf=sub(1000,caprice_score)&group=true&group.field=FileName
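
[Editor's sketch] A query like the one above can be assembled programmatically so that the function query in `bf` is escaped correctly. The sketch below uses only the Python standard library; `build_grouped_query` is a hypothetical helper, and the host, port and field names are simply copied from the URL in the message.

```python
from urllib.parse import urlencode


def build_grouped_query(term: str,
                        base_url: str = "http://localhost:8080/solr/select") -> str:
    """Return a dismax query URL that groups hits by the FileName field."""
    params = [
        ("defType", "dismax"),                 # use the dismax query parser
        ("qf", "Content"),                     # field to search
        ("q", term),                           # the user's search term
        ("fl", "id,score"),                    # fields to return
        ("bf", "sub(1000,caprice_score)"),     # boost function from the message
        ("group", "true"),                     # turn on Result Grouping
        ("group.field", "FileName"),           # one group per original log file
    ]
    return base_url + "?" + urlencode(params)


print(build_grouped_query("Solr"))
```
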
> 
> The results are amazing: I am able to index and search very large log files 
> (a few hundred MB) with very low memory requirements. Highlighting is also 
> working fine.
> 
> Thanks & Regards,
> Anand
> 
> 
> Anand Nigam
> RBS Global Banking & Markets
> Office: +91 124 492 5506   
> 
> -----Original Message-----
> From: Peter Spam [mailto:ps...@mac.com] 
> Sent: 21 October 2011 23:04
> To: solr-user@lucene.apache.org
> Subject: Re: Can Solr handle large text files?
> 
> Thanks for your note, Anand.  What was the maximum chunk size for you?  Could 
> you post the relevant portions of your configuration file?
> 
> 
> Thanks!
> Pete
> 
> On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote:
> 
>> Hi,
>> 
>> I was also facing the issue of highlighting large text files. I applied 
>> the solution proposed here and it worked, but I am getting the error shown 
>> below.
>> 
>> Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where 
>> can I get this file from? It is referenced in browse.vm:
>> 
>> <div class="results">
>> #if($response.response.get('grouped'))
>>   #foreach($grouping in $response.response.get('grouped'))
>>     #parse("hitGrouped.vm")
>>   #end
>> #else
>>   #foreach($doc in $response.results)
>>     #parse("hit.vm")
>>   #end
>> #end
>> </div>
>> 
>> 
>> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>>   at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
>>   at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
>>   at org.apache.velocity.Template.process(Template.java:98)
>>   at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
>>   at
>> 
>> Thanks & Regards,
>> Anand
>> Anand Nigam
>> RBS Global Banking & Markets
>> Office: +91 124 492 5506   
>> 
>> 
>> -----Original Message-----
>> From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de]
>> Sent: 21 October 2011 14:58
>> To: solr-user@lucene.apache.org
>> Subject: Re: Can Solr handle large text files?
>> 
>> Hi Peter,
>> 
>> Highlighting in large text files cannot be fast without dividing the 
>> original text into small pieces.
>> So take a look at
>> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
>> and in
>> http://www.lucidimagination.com/blog/2010/09/16/2446/
>> 
>> This means that you should divide your files and use Result Grouping / 
>> Field Collapsing to list only one hit per original document.
>> 
>> (xtf would also solve your problem "out of the box", but xtf does not use 
>> Solr.)
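
[Editor's sketch] To illustrate what "one hit per original document" means, here is a small client-side Python version of the collapsing step. Solr's Result Grouping does this on the server (via `group=true` and `group.field`), so this is only a model of the idea; `collapse_by_file` and the hit dictionaries with `FileName` and `score` keys are assumptions made for the example.

```python
def collapse_by_file(hits):
    """Keep only the best-scoring chunk hit for each original file.

    `hits` is a list of dicts with at least 'FileName' and 'score' keys,
    one per matching chunk. The result has one entry per file, best first.
    """
    best = {}
    for hit in hits:
        name = hit["FileName"]
        # Remember a hit if it is the first for this file or beats the current best.
        if name not in best or hit["score"] > best[name]["score"]:
            best[name] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


chunk_hits = [
    {"id": "a.log (line no. 1-50)", "FileName": "a.log", "score": 1.2},
    {"id": "a.log (line no. 51-100)", "FileName": "a.log", "score": 3.4},
    {"id": "b.log (line no. 1-50)", "FileName": "b.log", "score": 2.0},
]
for hit in collapse_by_file(chunk_hits):
    print(hit["FileName"], hit["id"], hit["score"])
```

With the sample data, two chunks of `a.log` match but only one entry for `a.log` is kept, which is the "1 result, not 5 results" behaviour asked for in the original question.
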
>> 
>> Best regards
>> Karsten
>> 
>> -------- Original Message --------
>>> Date: Thu, 20 Oct 2011 17:59:04 -0700
>>> From: Peter Spam <ps...@mac.com>
>>> To: solr-user@lucene.apache.org
>>> Subject: Can Solr handle large text files?
>> 
>>> I have about 20k text files, some very small, but some up to 300MB, 
>>> and would like to do text searching with highlighting.
>>> 
>>> Imagine the text is the contents of your syslog.
>>> 
>>> I would like to type in some terms, such as "error" and "mail", and 
>>> have Solr return the syslog lines with those terms PLUS two lines of 
>>> context.
>>> Pretty much just like Google's highlighting.
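
[Editor's sketch] The desired output (matching lines plus two lines of context) can be stated precisely with a few lines of Python. This only pins down the behaviour being asked for; it is not how Solr's highlighter works. `matches_with_context` is a hypothetical name, and treating a line as a match when it contains any of the terms, case-insensitively, is an assumption.

```python
def matches_with_context(lines, terms, context=2):
    """Return (line_no, text) pairs for matching lines and their context.

    A line matches if it contains any of `terms` (case-insensitive).
    `context` lines before and after each match are included, and
    overlapping context windows are merged so no line appears twice.
    """
    wanted = [t.lower() for t in terms]
    keep = set()
    for i, line in enumerate(lines):
        lowered = line.lower()
        if any(t in lowered for t in wanted):
            first = max(0, i - context)
            last = min(len(lines), i + context + 1)
            keep.update(range(first, last))
    # Line numbers are reported 1-based, as in a log viewer.
    return [(i + 1, lines[i]) for i in sorted(keep)]
```
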
>>> 
>>> 1) Can Solr handle this?  I had extremely long query times when I 
>>> tried this with Solr 1.4.1 (yes, I was using TermVectors, etc.).  I 
>>> tried breaking the files into 1MB pieces, but searching became 
>>> wonky: it returned the wrong number of documents (i.e., if one file 
>>> had a term 5 times, and that was the only file that had the term, I 
>>> want 1 result, not 5 results).
>>> 
>>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>>> 
>>>   <field name="body" type="text_pl" indexed="true" stored="true"
>>>          multiValued="false" termVectors="true" termPositions="true"
>>>          termOffsets="true" />
>>> 
>>>   <fieldType name="text_pl" class="solr.TextField">
>>>     <analyzer>
>>>       <tokenizer class="solr.StandardTokenizerFactory"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>               generateWordParts="0" generateNumberParts="0"
>>>               catenateWords="0" catenateNumbers="0" catenateAll="0"
>>>               splitOnCaseChange="0"/>
>>>     </analyzer>
>>>   </fieldType>
>>> 
>>> 
>>> Thanks!
>>> Pete
>> 
