Has the performance of highlighting large text documents been improved in Solr 4?

Thanks!
Pete

On Nov 5, 2011, at 9:03 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Sure, if you write a custom update handler. But I'm not at all sure
> this is "ideal".
> You're requiring all that data to be transmitted across the wire and processed
> by Solr. Assuming you have more than one input source, the Solr server in
> the background will be handling up to N documents simultaneously. Plus
> the effort to index. I think I'd recommend splitting them up on the client
> side.
>
> Best
> Erick
>
> On Fri, Nov 4, 2011 at 3:23 AM, Peter Spam <ps...@mac.com> wrote:
>> Solr 4.0 (11/1 snapshot)
>> Data: 80k files, average size 2.5MB, largest is 750MB;
>> Solr: Each document is max 256k; total docs = 800k
>> Machine: Early 2009 Mac Pro, 6GB RAM, 1GBmin/2GBmax given to Solr Java;
>> Admin shows 30% mem usage
>>
>> I originally tried injecting the entire file into a single Solr document,
>> and this had disastrous results when trying to highlight. I've now tried
>> splitting each file into 256k segments per Solr document, and the results
>> are better, but still not what I was hoping for. Queries are around 2-8
>> seconds, with some reaching into 30+ second territory.
>>
>> Ideally, I'd like to feed Solr the metadata and the entire file at once, and
>> have the back-end split the file into thousands of pieces. Is this possible?
>>
>>
>> Thanks!
>> Pete
>>
>> On Nov 1, 2011, at 5:15 PM, Peter Spam wrote:
>>
>>> Wow, 50 lines is tiny! Is that how small you need to go, to get good
>>> highlighting performance?
>>>
>>> I'm looking at documents that can be up to 800MB in size, so I've decided
>>> to split them down into 256k chunks. I'm still indexing right now - I'm
>>> curious to see how performance is when the injection is finished.
>>>
>>> Has anyone done analysis on where the knee in the curve is, wrt document
>>> size vs. # of documents?
>>>
>>>
>>> Thanks!
>>> Pete
>>>
>>> On Oct 31, 2011, at 9:28 PM, anand.ni...@rbs.com wrote:
>>>
>>>> Hi,
>>>>
>>>> Basically I need to index very large log files. I have modified the
>>>> ExtractingDocumentLoader to create a new document for every 50 lines (it
>>>> is made configurable by keeping it as a system property) of the log file
>>>> being indexed. The 'Filename' field for documents created from one log file
>>>> is kept the same, and the unique id is generated by appending the line no.
>>>> to the file name, e.g. 'log.txt (line no. 100 -150)'. Each doc is given a
>>>> custom score stored in a field called 'custom_score' which is directly
>>>> proportional to its distance from the beginning of the file.
>>>>
>>>> I have also found 'hitGrouped.vm' from the net. Since I am reading only 50
>>>> lines for each document, the default max chunk size works for me, but it
>>>> can be easily adjusted depending upon the no. of lines you are reading per
>>>> doc.
>>>>
>>>> Now I have done the grouping based on the 'filename' field and show the
>>>> results from docs having the highest score; as a result I am able to show
>>>> the last matching results from the log file. Query parameters that I am
>>>> using for search are:
>>>>
>>>> http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&defType=dismax&bf=sub(1000,caprice_score)&group=true&group.field=FileName
>>>>
>>>> Results are amazing, I am able to index and search very large log
>>>> files (a few 100 MBs) with very low memory requirements. Highlighting is
>>>> also working fine.
>>>>
>>>> Thanks & Regards,
>>>> Anand
>>>>
>>>> Anand Nigam
>>>> RBS Global Banking & Markets
>>>> Office: +91 124 492 5506
>>>>
>>>> -----Original Message-----
>>>> From: Peter Spam [mailto:ps...@mac.com]
>>>> Sent: 21 October 2011 23:04
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Can Solr handle large text files?
>>>>
>>>> Thanks for your note, Anand. What was the maximum chunk size for you?
>>>> Could you post the relevant portions of your configuration file?
>>>>
>>>>
>>>> Thanks!
>>>> Pete
>>>>
>>>> On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I was also facing the issue of highlighting the large text files. I
>>>>> applied the solution proposed here and it worked. But I am getting
>>>>> following error :
>>>>>
>>>>>
>>>>> Basically 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where
>>>>> can I get this file from. Its reference is present in browse.vm
>>>>>
>>>>> <div class="results">
>>>>>   #if($response.response.get('grouped'))
>>>>>     #foreach($grouping in $response.response.get('grouped'))
>>>>>       #parse("hitGrouped.vm")
>>>>>     #end
>>>>>   #else
>>>>>     #foreach($doc in $response.results)
>>>>>       #parse("hit.vm")
>>>>>     #end
>>>>>   #end
>>>>> </div>
>>>>>
>>>>>
>>>>> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or
>>>>> 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/',
>>>>> cwd=C:\glassfish3\glassfish\domains\domain1\config
>>>>> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in
>>>>> classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/',
>>>>> cwd=C:\glassfish3\glassfish\domains\domain1\config at
>>>>> org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoade
>>>>> r.java:268) at
>>>>> org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(
>>>>> SolrVelocityResourceLoader.java:42) at
>>>>> org.apache.velocity.Template.process(Template.java:98) at
>>>>> org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(
>>>>> ResourceManagerImpl.java:446) at
>>>>>
>>>>> Thanks & Regards,
>>>>> Anand
>>>>> Anand Nigam
>>>>> RBS Global Banking & Markets
>>>>> Office: +91 124 492 5506
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de]
>>>>> Sent: 21 October 2011 14:58
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Can Solr handle large text files?
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> highlighting in large text files cannot be fast without dividing the
>>>>> original text into small pieces.
>>>>> So take a look at
>>>>> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
>>>>> and at
>>>>> http://www.lucidimagination.com/blog/2010/09/16/2446/
>>>>>
>>>>> Which means that you should divide your files and use Result Grouping /
>>>>> Field Collapsing to list only one hit per original document.
>>>>>
>>>>> (xtf also would solve your problem "out of the box" but xtf does not use
>>>>> solr).
>>>>>
>>>>> Best regards
>>>>> Karsten
>>>>>
>>>>> -------- Original Message --------
>>>>>> Date: Thu, 20 Oct 2011 17:59:04 -0700
>>>>>> From: Peter Spam <ps...@mac.com>
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: Can Solr handle large text files?
>>>>>
>>>>>> I have about 20k text files, some very small, but some up to 300MB,
>>>>>> and would like to do text searching with highlighting.
>>>>>>
>>>>>> Imagine the text is the contents of your syslog.
>>>>>>
>>>>>> I would like to type in some terms, such as "error" and "mail", and
>>>>>> have Solr return the syslog lines with those terms PLUS two lines of
>>>>>> context.
>>>>>> Pretty much just like Google's highlighting.
>>>>>>
>>>>>> 1) Can Solr handle this? I had extremely long query times when I
>>>>>> tried this with Solr 1.4.1 (yes I was using TermVectors, etc.). I
>>>>>> tried breaking the files into 1MB pieces, but searching would be
>>>>>> wonky => return the wrong number of documents (i.e. if one file had a
>>>>>> term 5 times, and that was the only file that had the term, I want 1
>>>>>> result, not 5 results).
>>>>>>
>>>>>> 2) What sort of tokenizer would be best?
>>>>>> Here's what I'm using:
>>>>>>
>>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>>>        multiValued="false" termVectors="true" termPositions="true"
>>>>>>        termOffsets="true" />
>>>>>>
>>>>>> <fieldType name="text_pl" class="solr.TextField">
>>>>>>   <analyzer>
>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>             generateWordParts="0" generateNumberParts="0" catenateWords="0"
>>>>>>             catenateNumbers="0"
>>>>>>             catenateAll="0" splitOnCaseChange="0"/>
>>>>>>   </analyzer>
>>>>>> </fieldType>
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> Pete
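
For anyone who wants to try the client-side splitting suggested above, here is a minimal sketch in Python, assuming line-oriented log files. It is not Anand's ExtractingDocumentLoader change, just an illustration of the same idea done before the data reaches Solr. The field names `FileName` and `Content` are taken from the query in the thread and `custom_score` from Anand's description; the function names, the id format, and the choice of the first line number as the score are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List


def make_doc(filename: str, first: int, last: int, block: List[str]) -> Dict[str, object]:
    """Build one Solr-style document for lines first..last of a file."""
    return {
        # Unique id: file name plus line range, similar to the ids in the thread.
        "id": f"{filename} (line no. {first}-{last})",
        # Same value for every chunk of a file, so results can be grouped on it.
        "FileName": filename,
        "Content": "".join(block),
        # Assumption: the first line number stands in for "distance from the
        # beginning of the file".
        "custom_score": first,
    }


def chunk_lines(filename: str, lines: Iterable[str],
                lines_per_doc: int = 50) -> Iterator[Dict[str, object]]:
    """Yield one document per block of `lines_per_doc` lines."""
    block: List[str] = []
    start = 1
    for line_no, line in enumerate(lines, start=1):
        block.append(line)
        if len(block) == lines_per_doc:
            yield make_doc(filename, start, line_no, block)
            block = []
            start = line_no + 1
    if block:
        # Final, shorter block at the end of the file.
        yield make_doc(filename, start, start + len(block) - 1, block)


if __name__ == "__main__":
    sample = [f"entry {i}\n" for i in range(1, 121)]
    for doc in chunk_lines("log.txt", sample):
        print(doc["id"], doc["custom_score"])
```

Because the generator reads the file one line at a time, memory use stays at roughly one chunk regardless of file size. The resulting documents can then be posted with whatever Solr client you already use and grouped on `FileName` at query time, as described earlier in the thread.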