Re: Can Solr handle large text files?

Peter Spam Fri, 27 Jul 2012 23:51:17 -0700

Has the performance of highlighting large text documents been improved in Solr 
4?



Thanks!
Pete

On Nov 5, 2011, at 9:03 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Sure, if you write a custom update handler. But I'm not at all sure
> this is "ideal".
> You're requiring all that data to be transmitted across the wire and processed
> by Solr. Assuming you have more than one input source, the Solr server in
> the background will be handling up to N documents simultaneously. Plus
> the effort to index. I think I'd recommend splitting them up on the client 
> side.
> 
> Best
> Erick
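
A minimal sketch of the client-side splitting recommended above, assuming fixed-size segments of 256k characters as in Pete's setup. The function name `split_into_segments`, the field names (`id`, `filename`, `segment`, `body`) and the id format are hypothetical, and sending the resulting documents to Solr is left out.

```python
import io
from typing import Dict, Iterator, TextIO

# Assumed segment size, taken from the 256k figure mentioned in the thread.
SEGMENT_CHARS = 256 * 1024


def split_into_segments(filename: str, stream: TextIO,
                        segment_chars: int = SEGMENT_CHARS) -> Iterator[Dict[str, object]]:
    """Yield one dict per segment, reading the stream incrementally so that
    a very large file never has to be held in memory all at once."""
    index = 0
    while True:
        body = stream.read(segment_chars)
        if not body:  # empty string means end of stream
            break
        yield {
            "id": f"{filename}#{index}",  # hypothetical unique key
            "filename": filename,         # shared by all segments of one file
            "segment": index,
            "body": body,
        }
        index += 1


if __name__ == "__main__":
    sample = io.StringIO("x" * 10)
    for doc in split_into_segments("syslog.txt", sample, segment_chars=4):
        print(doc["id"], len(doc["body"]))
```

Fixed character offsets can cut a line or a term in half at a segment boundary; splitting at line boundaries avoids that at the cost of uneven segment sizes.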
> 
> On Fri, Nov 4, 2011 at 3:23 AM, Peter Spam <ps...@mac.com> wrote:
>> Solr 4.0 (11/1 snapshot)
>> Data: 80k files, average size 2.5MB, largest is 750MB;
>> Solr: Each document is max 256k; total docs = 800k
>> Machine: Early 2009 Mac Pro, 6GB RAM, 1GBmin/2GBmax given to Solr Java; 
>> Admin shows 30% mem usage
>> 
>> I originally tried injecting the entire file into a single Solr document, 
>> and this had disastrous results when trying to highlight.  I've now tried 
>> splitting each file into 256k segments per Solr document, and the results 
>> are better, but still not what I was hoping for.  Queries are around 2-8 
>> seconds, with some reaching into 30+ second territory.
>> 
>> Ideally, I'd like to feed Solr the metadata and the entire file at once, and 
>> have the back-end split the file into thousands of pieces.  Is this possible?
>> 
>> 
>> Thanks!
>> Pete
>> 
>> On Nov 1, 2011, at 5:15 PM, Peter Spam wrote:
>> 
>>> Wow, 50 lines is tiny!  Is that how small you need to go, to get good 
>>> highlighting performance?
>>> 
>>> I'm looking at documents that can be up to 800MB in size, so I've decided 
>>> to split them down into 256k chunks.  I'm still indexing right now - I'm 
>>> curious to see how performance is when the injection is finished.
>>> 
>>> Has anyone done analysis on where the knee in the curve is, wrt document 
>>> size vs. # of documents?
>>> 
>>> 
>>> Thanks!
>>> Pete
>>> 
>>> On Oct 31, 2011, at 9:28 PM, anand.ni...@rbs.com wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Basically, I need to index very large log files. I have modified the 
>>>> ExtractingDocumentLoader to create a new document for every 50 lines of 
>>>> the log file being indexed (the number is configurable through a system 
>>>> property). The 'Filename' field is the same for every document created 
>>>> from one log file, and the unique id is generated by appending the line 
>>>> numbers to the file name, e.g. 'log.txt (line no. 100-150)'. Each doc is 
>>>> given a custom score, stored in a field called 'custom_score', which is 
>>>> directly proportional to its distance from the beginning of the file.
>>>> 
>>>> I have also found 'hitGrouped.vm' on the net. Since I am reading only 50 
>>>> lines for each document, the default max chunk size works for me, but it 
>>>> can easily be adjusted depending on the number of lines you are reading 
>>>> per doc.
>>>> 
>>>> I now group on the 'filename' field and show the results from the docs 
>>>> with the highest score. As a result, I am able to show the last matching 
>>>> results from each log file. The query parameters that I am using for 
>>>> search are:
>>>> 
>>>> http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&defType=dismax&bf=sub(1000,caprice_score)&group=true&group.field=FileName
>>>> 
>>>> The results are amazing: I am able to index and search very large log 
>>>> files (a few hundred MB) with very low memory requirements. Highlighting 
>>>> is also working fine.
>>>> 
>>>> Thanks & Regards,
>>>> Anand
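
A rough sketch of the line-based chunking described in the message above, not Anand's actual ExtractingDocumentLoader change (which is not shown in the thread). The field names `FileName` and `Content` are taken from the query URL, `custom_score` from the description; using the chunk index as the score is an assumption, chosen only because it grows with distance from the start of the file.

```python
from typing import Dict, Iterable, Iterator, List


def chunk_log_lines(filename: str, lines: Iterable[str],
                    lines_per_doc: int = 50) -> Iterator[Dict[str, object]]:
    """Group consecutive lines into documents of at most lines_per_doc lines."""
    buffer: List[str] = []
    start = 1      # 1-based line number of the first buffered line
    chunk_no = 0   # position of the chunk within the file
    for line in lines:
        buffer.append(line)
        if len(buffer) == lines_per_doc:
            yield _make_doc(filename, buffer, start, chunk_no)
            start += len(buffer)
            chunk_no += 1
            buffer = []
    if buffer:  # final, possibly shorter chunk
        yield _make_doc(filename, buffer, start, chunk_no)


def _make_doc(filename: str, buffer: List[str], start: int,
              chunk_no: int) -> Dict[str, object]:
    end = start + len(buffer) - 1
    return {
        # Unique id: file name plus the line range covered by this chunk.
        "id": f"{filename} (line no. {start}-{end})",
        "FileName": filename,        # identical for all chunks, used for grouping
        "custom_score": chunk_no,    # assumption: larger means further into the file
        "Content": "".join(buffer),  # lines are expected to keep their newlines
    }
```

Because `lines` can be an open file object, the log is streamed line by line and only one chunk is held in memory at a time.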
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Anand Nigam
>>>> RBS Global Banking & Markets
>>>> Office: +91 124 492 5506
>>>> 
>>>> -----Original Message-----
>>>> From: Peter Spam [mailto:ps...@mac.com]
>>>> Sent: 21 October 2011 23:04
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Can Solr handle large text files?
>>>> 
>>>> Thanks for your note, Anand.  What was the maximum chunk size for you?  
>>>> Could you post the relevant portions of your configuration file?
>>>> 
>>>> 
>>>> Thanks!
>>>> Pete
>>>> 
>>>> On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I was also facing the issue of highlighting large text files. I 
>>>>> applied the solution proposed here and it worked, but I am getting the 
>>>>> error shown further below.
>>>>> 
>>>>> Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where
>>>>> can I get this file from? It is referenced in browse.vm:
>>>>> 
>>>>> <div class="results">
>>>>> #if($response.response.get('grouped'))
>>>>>  #foreach($grouping in $response.response.get('grouped'))
>>>>>    #parse("hitGrouped.vm")
>>>>>  #end
>>>>> #else
>>>>>  #foreach($doc in $response.results)
>>>>>    #parse("hit.vm")
>>>>>  #end
>>>>> #end
>>>>> </div>
>>>>> 
>>>>> 
>>>>> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>>>>> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>>>>>   at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
>>>>>   at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
>>>>>   at org.apache.velocity.Template.process(Template.java:98)
>>>>>   at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
>>>>> 
>>>>> Thanks & Regards,
>>>>> Anand
>>>>> Anand Nigam
>>>>> RBS Global Banking & Markets
>>>>> Office: +91 124 492 5506
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de]
>>>>> Sent: 21 October 2011 14:58
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Can Solr handle large text files?
>>>>> 
>>>>> Hi Peter,
>>>>> 
>>>>> Highlighting in large text files cannot be fast without dividing the 
>>>>> original text into small pieces.
>>>>> So take a look at
>>>>> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
>>>>> and at
>>>>> http://www.lucidimagination.com/blog/2010/09/16/2446/
>>>>> 
>>>>> This means that you should divide your files and use Result Grouping / 
>>>>> Field Collapsing to list only one hit per original document.
>>>>> 
>>>>> (XTF would also solve your problem "out of the box", but XTF does not use 
>>>>> Solr.)
>>>>> 
>>>>> Best regards
>>>>> Karsten
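
A small sketch of what a grouped, highlighted query along these lines might look like once the files have been divided. It only builds the request URL. The base URL and the field names `FileName` and `Content` are borrowed from the query posted elsewhere in this thread and are assumptions for any other schema; the helper name `grouped_highlight_query` and the snippet count are hypothetical.

```python
from urllib.parse import urlencode


def grouped_highlight_query(terms: str,
                            base_url: str = "http://localhost:8080/solr/select",
                            group_field: str = "FileName",
                            text_field: str = "Content") -> str:
    """Build a Solr select URL that collapses chunk documents to one group
    per original file and asks for highlighted snippets."""
    params = [
        ("q", terms),
        ("defType", "dismax"),
        ("qf", text_field),
        ("group", "true"),             # enable Result Grouping
        ("group.field", group_field),  # one group per original file
        ("group.limit", "1"),          # show only the top chunk of each file
        ("hl", "true"),                # enable highlighting
        ("hl.fl", text_field),
        ("hl.snippets", "3"),          # assumed number of snippets per chunk
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(grouped_highlight_query("error mail"))
```

Since every chunk is a small document, the highlighter only has to process the chunks that are actually returned, not the whole original file.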
>>>>> 
>>>>> -------- Original Message --------
>>>>>> Date: Thu, 20 Oct 2011 17:59:04 -0700
>>>>>> From: Peter Spam <ps...@mac.com>
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: Can Solr handle large text files?
>>>>> 
>>>>>> I have about 20k text files, some very small, but some up to 300MB,
>>>>>> and would like to do text searching with highlighting.
>>>>>> 
>>>>>> Imagine the text is the contents of your syslog.
>>>>>> 
>>>>>> I would like to type in some terms, such as "error" and "mail", and
>>>>>> have Solr return the syslog lines with those terms PLUS two lines of 
>>>>>> context.
>>>>>> Pretty much just like Google's highlighting.
>>>>>> 
>>>>>> 1) Can Solr handle this?  I had extremely long query times when I
>>>>>> tried this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I
>>>>>> tried breaking the files into 1MB pieces, but searching would be
>>>>>> wonky => return the wrong number of documents (ie. if one file had a
>>>>>> term 5 times, and that was the only file that had the term, I want 1 
>>>>>> result, not 5 results).
>>>>>> 
>>>>>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>>>>>> 
>>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>>>        multiValued="false" termVectors="true" termPositions="true"
>>>>>>        termOffsets="true" />
>>>>>> 
>>>>>> <fieldType name="text_pl" class="solr.TextField">
>>>>>>   <analyzer>
>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>             generateWordParts="0" generateNumberParts="0"
>>>>>>             catenateWords="0" catenateNumbers="0"
>>>>>>             catenateAll="0" splitOnCaseChange="0"/>
>>>>>>   </analyzer>
>>>>>> </fieldType>
>>>>>> 
>>>>>> 
>>>>>> Thanks!
>>>>>> Pete
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>>
