Re: Can Solr handle large text files?

Peter Spam Fri, 04 Nov 2011 00:24:02 -0700

Solr 4.0 (11/1 snapshot)
Data: 80k files, average size 2.5MB, largest is 750MB
Solr: Each document is max 256k; total docs = 800k
Machine: Early 2009 Mac Pro, 6GB RAM, 1GB min / 2GB max given to Solr's JVM; 
the Admin page shows 30% memory usage


I originally tried indexing each entire file as a single Solr document, and 
this had disastrous results when trying to highlight.  I've now tried splitting 
each file into 256k segments, one per Solr document, and the results are 
better, but still not what I was hoping for.  Queries take around 2-8 seconds, 
with some reaching into 30+ second territory.

Ideally, I'd like to feed Solr the metadata and the entire file at once, and 
have the back-end split the file into thousands of pieces.  Is this possible?
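
For illustration, here is a minimal client-side sketch of the splitting step 
described above. It is not a Solr feature and not the code used in this 
thread. The function name, the "#index" id format, and the field names 
(id, FileName, Content) are assumptions. It cuts on line boundaries so that 
no log line is split across two documents.

```python
from typing import Dict, Iterable, Iterator, List

CHUNK_BYTES = 256 * 1024  # segment size mentioned in the message above


def iter_chunk_docs(lines: Iterable[str], filename: str,
                    chunk_bytes: int = CHUNK_BYTES) -> Iterator[Dict[str, object]]:
    """Yield one dict per segment of roughly chunk_bytes, cut on line boundaries.

    Field names and the id format are hypothetical, not taken from a real schema.
    A single line longer than chunk_bytes is emitted as its own oversized segment.
    """
    buf: List[str] = []
    size = 0
    index = 0
    for line in lines:
        n = len(line.encode("utf-8"))
        # Flush the current segment before it would grow past the limit.
        if buf and size + n > chunk_bytes:
            yield {"id": f"{filename}#{index}", "FileName": filename,
                   "Content": "".join(buf)}
            buf, size, index = [], 0, index + 1
        buf.append(line)
        size += n
    if buf:  # final, possibly short, segment
        yield {"id": f"{filename}#{index}", "FileName": filename,
               "Content": "".join(buf)}
```

Passing an open text file handle as `lines` streams the file, so memory use 
stays near one segment regardless of file size. Each yielded dict would then 
be sent to Solr as a separate document by whatever client is in use.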


Thanks!
Pete

On Nov 1, 2011, at 5:15 PM, Peter Spam wrote:

> Wow, 50 lines is tiny!  Is that how small you need to go, to get good 
> highlighting performance?
> 
> I'm looking at documents that can be up to 800MB in size, so I've decided to 
> split them down into 256k chunks.  I'm still indexing right now - I'm curious 
> to see how performance is when the injection is finished.
> 
> Has anyone done analysis on where the knee in the curve is, wrt document size 
> vs. # of documents?
> 
> 
> Thanks!
> Pete
> 
> On Oct 31, 2011, at 9:28 PM, anand.ni...@rbs.com wrote:
> 
>> Hi,
>> 
>> Basically, I need to index very large log files. I have modified the 
>> ExtractingDocumentLoader to create a new document for every 50 lines of the 
>> log file being indexed (the line count is configurable through a system 
>> property). The 'Filename' field is the same for every document created from 
>> one log file, and the unique id is generated by appending the line range to 
>> the file name, e.g. 'log.txt (line no. 100-150)'. Each doc is given a custom 
>> score, stored in a field called 'custom_score', which is directly 
>> proportional to its distance from the beginning of the file.
>> 
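The scheme described above can be sketched outside Solr as follows. This is 
not the modified ExtractingDocumentLoader, which is Java and is not shown in 
this thread. The function name is hypothetical; the field names follow the 
message and the query URL; and the custom score is assumed to be simply the 
starting line number.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator

LINES_PER_DOC = 50  # the per-document line count mentioned in the message


def iter_line_docs(lines: Iterable[str], filename: str,
                   lines_per_doc: int = LINES_PER_DOC) -> Iterator[Dict[str, object]]:
    """Yield one dict per block of lines_per_doc lines.

    The id mimics the 'log.txt (line no. 100-150)' pattern from the message.
    custom_score is assumed to be the block's starting line number, so it
    grows with the distance from the beginning of the file.
    """
    it = iter(lines)
    start = 0
    while True:
        block = list(islice(it, lines_per_doc))
        if not block:
            break
        end = start + len(block)
        yield {
            "id": f"{filename} (line no. {start}-{end})",
            "FileName": filename,          # same for every block of one file
            "Content": "".join(block),
            "custom_score": start,
        }
        start = end
```

Because every block of one file shares the same FileName value, grouping on 
that field later collapses all blocks of a file into one group.
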
>> I have also found 'hitGrouped.vm' on the net. Since I am reading only 50 
>> lines per document, the default max chunk size works for me, but it can 
>> easily be adjusted depending on the number of lines you read per doc.
>> 
>> I group on the 'filename' field and show the results from the docs with the 
>> highest score. As a result, I am able to show the last matching results 
>> from each log file. The query parameters that I am using for the search 
>> are:
>> 
>> http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&bf=sub(1000,caprice_score)&group=true&group.field=FileName
>> 
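For anyone scripting this, the request can also be assembled from a parameter 
list rather than typed by hand. A small sketch using only the Python standard 
library follows. The parameter names and values are the ones from the URL 
above; the function name is hypothetical.

```python
from urllib.parse import urlencode


def build_grouped_url(base: str, query: str) -> str:
    """Build the dismax + grouping request using the parameters from the thread."""
    params = [
        ("defType", "dismax"),
        ("qf", "Content"),                    # field searched by dismax
        ("q", query),
        ("fl", "id,score"),
        ("bf", "sub(1000,caprice_score)"),    # boost function, as posted
        ("group", "true"),
        ("group.field", "FileName"),          # one group per original file
    ]
    # Keep commas and parentheses readable; everything else is percent-encoded.
    return base + "?" + urlencode(params, safe="(),")


print(build_grouped_url("http://localhost:8080/solr/select", "Solr"))
```

Building the query string this way also takes care of encoding spaces and 
other special characters in the user's search terms.
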
>> The results are amazing: I am able to index and search very large log 
>> files (a few hundred MB) with very low memory requirements. Highlighting is 
>> also working fine.
>> 
>> Thanks & Regards,
>> Anand
>> 
>> 
>> 
>> 
>> 
>> Anand Nigam
>> RBS Global Banking & Markets
>> Office: +91 124 492 5506   
>> 
>> -----Original Message-----
>> From: Peter Spam [mailto:ps...@mac.com] 
>> Sent: 21 October 2011 23:04
>> To: solr-user@lucene.apache.org
>> Subject: Re: Can Solr handle large text files?
>> 
>> Thanks for your note, Anand.  What was the maximum chunk size for you?  
>> Could you post the relevant portions of your configuration file?
>> 
>> 
>> Thanks!
>> Pete
>> 
>> On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote:
>> 
>>> Hi,
>>> 
>>> I was also facing the issue of highlighting large text files. I applied 
>>> the solution proposed here and it worked, but I am getting the error shown 
>>> at the end of this message.
>>> 
>>> Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where 
>>> can I get this file from? It is referenced in browse.vm:
>>> 
>>> <div class="results">
>>> #if($response.response.get('grouped'))
>>>  #foreach($grouping in $response.response.get('grouped'))
>>>    #parse("hitGrouped.vm")
>>>  #end
>>> #else
>>>  #foreach($doc in $response.results)
>>>    #parse("hit.vm")
>>>  #end
>>> #end
>>> </div>
>>> 
>>> 
>>> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>>> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>>>   at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
>>>   at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
>>>   at org.apache.velocity.Template.process(Template.java:98)
>>>   at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
>>>   at
>>> 
>>> Thanks & Regards,
>>> Anand
>>> Anand Nigam
>>> RBS Global Banking & Markets
>>> Office: +91 124 492 5506   
>>> 
>>> 
>>> -----Original Message-----
>>> From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de]
>>> Sent: 21 October 2011 14:58
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Can Solr handle large text files?
>>> 
>>> Hi Peter,
>>> 
>>> Highlighting in large text files cannot be fast without dividing the 
>>> original text into small pieces.
>>> So take a look at
>>> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
>>> and at
>>> http://www.lucidimagination.com/blog/2010/09/16/2446/
>>> 
>>> This means that you should divide your files and use Result Grouping / 
>>> Field Collapsing to list only one hit per original document.
>>> 
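To make the "one hit per original document" idea concrete, here is a small 
client-side illustration of what collapsing chunk hits by file achieves. Solr's 
Result Grouping does this on the server. This sketch is not its implementation, 
and the function name and hit fields (id, FileName, score) are assumptions.

```python
from typing import Dict, Iterable, List


def collapse_by_file(hits: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep only the best-scoring chunk hit for each original file."""
    best: Dict[object, Dict[str, object]] = {}
    for hit in hits:
        name = hit["FileName"]
        # Replace the stored hit only when this chunk scores strictly higher.
        if name not in best or hit["score"] > best[name]["score"]:
            best[name] = hit
    # Present files in descending order of their best chunk's score.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

With this kind of collapsing, a file whose chunks match a term five times 
still shows up as a single result.
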
>>> (XTF would also solve your problem "out of the box", but XTF does not use 
>>> Solr.)
>>> 
>>> Best regards
>>> Karsten
>>> 
>>> -------- Original Message --------
>>>> Date: Thu, 20 Oct 2011 17:59:04 -0700
>>>> From: Peter Spam <ps...@mac.com>
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Can Solr handle large text files?
>>> 
>>>> I have about 20k text files, some very small, but some up to 300MB, 
>>>> and would like to do text searching with highlighting.
>>>> 
>>>> Imagine the text is the contents of your syslog.
>>>> 
>>>> I would like to type in some terms, such as "error" and "mail", and 
>>>> have Solr return the syslog lines with those terms PLUS two lines of 
>>>> context.
>>>> Pretty much just like Google's highlighting.
>>>> 
>>>> 1) Can Solr handle this?  I had extremely long query times when I 
>>>> tried this with Solr 1.4.1 (yes, I was using TermVectors, etc.).  I 
>>>> tried breaking the files into 1MB pieces, but searching would be 
>>>> wonky and return the wrong number of documents (i.e., if one file had a 
>>>> term 5 times, and that was the only file that had the term, I want 1 
>>>> result, not 5 results).
>>>> 
>>>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>>>> 
>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>        multiValued="false" termVectors="true" termPositions="true"
>>>>        termOffsets="true" />
>>>> 
>>>> <fieldType name="text_pl" class="solr.TextField">
>>>>   <analyzer>
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>             generateWordParts="0" generateNumberParts="0"
>>>>             catenateWords="0" catenateNumbers="0"
>>>>             catenateAll="0" splitOnCaseChange="0"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>> 
>>>> 
>>>> Thanks!
>>>> Pete
>>> 
