RE: Can Solr handle large text files?

Anand.Nigam Mon, 31 Oct 2011 21:28:48 -0700

Hi,

Basically, I need to index very large log files. I have modified 
ExtractingDocumentLoader to create a new document for every 50 lines of the log 
file being indexed (the count is configurable through a system property). The 
'Filename' field is the same for every document created from one log file, and 
the unique id is generated by appending the line range to the file name, e.g. 
'log.txt (line no. 100-150)'. Each doc is given a custom score, stored in a 
field called 'custom_score', which is directly proportional to its distance from 
the beginning of the file.
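
To illustrate the chunking idea, here is a minimal sketch in Python. It is not the actual ExtractingDocumentLoader change, which is Java and is not shown in this thread. The function names, the exact id format, the field names 'FileName' and 'Content' (taken from the query parameters), and the use of the starting line number as the score are all assumptions made for illustration.

```python
from typing import Iterable, Iterator


def make_doc(filename: str, first: int, last: int, block: list) -> dict:
    """Build one document for lines first..last (1-based, inclusive)."""
    return {
        # Assumed id format: file name plus the line range it covers.
        "id": f"{filename} (line no. {first}-{last})",
        # Same value for every chunk of one file, so results can be grouped on it.
        "FileName": filename,
        "Content": "".join(block),
        # Assumption: the score grows with the distance from the start of the
        # file, so the starting line number is used directly.
        "custom_score": first,
    }


def chunk_log(filename: str, lines: Iterable[str], lines_per_doc: int = 50) -> Iterator[dict]:
    """Yield one document per block of `lines_per_doc` lines."""
    block = []
    start = 1  # line number of the first line in the current block
    for line_no, line in enumerate(lines, start=1):
        block.append(line)
        if len(block) == lines_per_doc:
            yield make_doc(filename, start, line_no, block)
            block, start = [], line_no + 1
    if block:  # trailing partial block at the end of the file
        yield make_doc(filename, start, start + len(block) - 1, block)
```

Because the input is consumed line by line, only one block of lines is held in memory at a time, which fits the low memory requirement mentioned in this mail.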


I have also found 'hitGrouped.vm' on the net. Since I am reading only 50 
lines for each document, the default max chunk size works for me, but it can 
be easily adjusted depending on the number of lines you are reading per doc.

I now group on the 'filename' field and show the results from the docs with 
the highest score; as a result, I am able to show the last matching results 
from each log file. The query parameters I am using for search are:

http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&bf=sub(1000,caprice_score)&group=true&group.field=FileName
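
For anyone scripting this, the sketch below builds the same request in Python and picks the top document of each group from a grouped response. The helper names are hypothetical. The response layout (grouped -> field name -> groups -> doclist -> docs) is what I would expect from a JSON response (wt=json, which is not part of the query above), so treat it as an assumption and check it against your own output.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8080/solr/select"


def build_query_url(term: str) -> str:
    """Return the grouped dismax query URL for a search term."""
    params = [
        ("defType", "dismax"),
        ("qf", "Content"),
        ("q", term),
        ("fl", "id,score"),
        ("bf", "sub(1000,caprice_score)"),  # boost function from the query above
        ("group", "true"),
        ("group.field", "FileName"),        # one group per original log file
    ]
    # Keep parentheses and commas readable instead of percent-encoding them.
    return SOLR_SELECT + "?" + urlencode(params, safe="(),")


def top_doc_per_file(response: dict, field: str = "FileName") -> dict:
    """Map each group value (file name) to the first document of its group."""
    groups = response["grouped"][field]["groups"]
    return {
        g["groupValue"]: g["doclist"]["docs"][0]
        for g in groups
        if g["doclist"]["docs"]
    }
```

The first document of each group is the best-ranked chunk of that file under the sort in effect, so this gives one hit per log file.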

The results are amazing: I am able to index and search very large log files 
(a few hundred MB) with very low memory requirements. Highlighting is also 
working fine.

Thanks & Regards,
Anand





Anand Nigam
RBS Global Banking & Markets
Office: +91 124 492 5506   

-----Original Message-----
From: Peter Spam [mailto:ps...@mac.com] 
Sent: 21 October 2011 23:04
To: solr-user@lucene.apache.org
Subject: Re: Can Solr handle large text files?

Thanks for your note, Anand.  What was the maximum chunk size for you?  Could 
you post the relevant portions of your configuration file?


Thanks!
Pete

On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote:

> Hi,
> 
> I was also facing the issue of highlighting large text files. I applied 
> the solution proposed here and it worked, but I am getting the error 
> quoted further down.
> 
> Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where 
> can I get this file from? It is referenced in browse.vm:
> 
> <div class="results">
>  #if($response.response.get('grouped'))
>    #foreach($grouping in $response.response.get('grouped'))
>      #parse("hitGrouped.vm")
>    #end
>  #else
>    #foreach($doc in $response.results)
>      #parse("hit.vm")
>    #end
>  #end
> </div>
> 
> 
> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 
> 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', 
> cwd=C:\glassfish3\glassfish\domains\domain1\config
> 
> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in 
> classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', 
> cwd=C:\glassfish3\glassfish\domains\domain1\config
>   at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
>   at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
>   at org.apache.velocity.Template.process(Template.java:98)
>   at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
>   at
> 
> Thanks & Regards,
> Anand
> 
> 
> -----Original Message-----
> From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de]
> Sent: 21 October 2011 14:58
> To: solr-user@lucene.apache.org
> Subject: Re: Can Solr handle large text files?
> 
> Hi Peter,
> 
> Highlighting in large text files cannot be fast without dividing the 
> original text into small pieces.
> So take a look at
> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
> and at
> http://www.lucidimagination.com/blog/2010/09/16/2446/
> 
> This means that you should divide your files and use Result Grouping / Field 
> Collapsing to list only one hit per original document.
> 
> (XTF would also solve your problem "out of the box", but XTF does not use 
> Solr.)
> 
> Best regards
>  Karsten
> 
> -------- Original Message --------
>> Date: Thu, 20 Oct 2011 17:59:04 -0700
>> From: Peter Spam <ps...@mac.com>
>> To: solr-user@lucene.apache.org
>> Subject: Can Solr handle large text files?
> 
>> I have about 20k text files, some very small, but some up to 300MB, 
>> and would like to do text searching with highlighting.
>> 
>> Imagine the text is the contents of your syslog.
>> 
>> I would like to type in some terms, such as "error" and "mail", and 
>> have Solr return the syslog lines with those terms PLUS two lines of context.
>> Pretty much just like Google's highlighting.
>> 
>> 1) Can Solr handle this?  I had extremely long query times when I 
>> tried this with Solr 1.4.1 (yes, I was using TermVectors, etc.).  I 
>> tried breaking the files into 1MB pieces, but searching would be 
>> wonky => return the wrong number of documents (i.e., if one file had a 
>> term 5 times, and that was the only file that had the term, I want 1 result, 
>> not 5 results).
>> 
>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>> 
>>   <field name="body" type="text_pl" indexed="true" stored="true"
>>          multiValued="false" termVectors="true" termPositions="true"
>>          termOffsets="true" />
>> 
>>   <fieldType name="text_pl" class="solr.TextField">
>>     <analyzer>
>>       <tokenizer class="solr.StandardTokenizerFactory"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.WordDelimiterFilterFactory"
>>               generateWordParts="0" generateNumberParts="0"
>>               catenateWords="0" catenateNumbers="0"
>>               catenateAll="0" splitOnCaseChange="0"/>
>>     </analyzer>
>>   </fieldType>
>> 
>> 
>> Thanks!
>> Pete
> 
