Oh, by the way: what analyzer are you using for your log files? Here's what I'm trying:

<fieldType name="text_pl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

Thanks!
Pete

On Oct 31, 2011, at 9:28 PM, anand.ni...@rbs.com wrote:

> Hi,
>
> Basically, I need to index very large log files. I have modified the
> ExtractingDocumentLoader to create a new document for every 50 lines of the
> log file being indexed (the count is configurable through a system
> property). The 'Filename' field is the same for every document created from
> one log file, and the unique id is generated by appending the line numbers
> to the file name, e.g. 'log.txt (line no. 100 -150)'. Each doc is given a
> custom score, stored in a field called 'custom_score', which is directly
> proportional to its distance from the beginning of the file.
>
> I also found 'hitGrouped.vm' on the net. Since I am reading only 50 lines
> for each document, the default max chunk size works for me, but it can
> easily be adjusted depending on the number of lines you are reading per doc.
>
> I group on the 'filename' field and show the results from the docs with the
> highest score; as a result, I am able to show the last matching results
> from each log file. The query parameters that I am using for search are:
>
> http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&defType=dismax&bf=sub(1000,caprice_score)&group=true&group.field=FileName
>
> The results are amazing: I am able to index and search very large log files
> (a few 100 MBs) with very low memory requirements. Highlighting is also
> working fine.
>
> Thanks & Regards,
> Anand
>
> Anand Nigam
> RBS Global Banking & Markets
> Office: +91 124 492 5506
>
> -----Original Message-----
> From: Peter Spam [mailto:ps...@mac.com]
> Sent: 21 October 2011 23:04
> To: solr-user@lucene.apache.org
> Subject: Re: Can Solr handle large text files?
>
> Thanks for your note, Anand. What was the maximum chunk size for you? Could
> you post the relevant portions of your configuration file?
>
> Thanks!
> Pete
>
> On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote:
>
>> Hi,
>>
>> I was also facing the issue of highlighting large text files. I applied
>> the solution proposed here and it worked, but I am getting the following
>> error:
>>
>> Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where
>> can I get this file from? It is referenced in browse.vm:
>>
>> <div class="results">
>>   #if($response.response.get('grouped'))
>>     #foreach($grouping in $response.response.get('grouped'))
>>       #parse("hitGrouped.vm")
>>     #end
>>   #else
>>     #foreach($doc in $response.results)
>>       #parse("hit.vm")
>>     #end
>>   #end
>> </div>
>>
>> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or
>> 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/',
>> cwd=C:\glassfish3\glassfish\domains\domain1\config
>> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in
>> classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/',
>> cwd=C:\glassfish3\glassfish\domains\domain1\config
>>   at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
>>   at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
>>   at org.apache.velocity.Template.process(Template.java:98)
>>   at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
>>   at
>>
>> Thanks & Regards,
>> Anand
>> Anand Nigam
>> RBS Global Banking & Markets
>> Office: +91 124 492 5506
>>
>>
>> -----Original Message-----
>> From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de]
>> Sent: 21 October 2011 14:58
>> To: solr-user@lucene.apache.org
>> Subject: Re: Can Solr handle large text files?
>>
>> Hi Peter,
>>
>> Highlighting in large text files cannot be fast without dividing the
>> original text into small pieces.
>> So take a look at
>> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
>> and at
>> http://www.lucidimagination.com/blog/2010/09/16/2446/
>>
>> This means that you should divide your files and use Result Grouping /
>> Field Collapsing to list only one hit per original document.
>>
>> (XTF would also solve your problem "out of the box", but XTF does not use
>> Solr.)
>>
>> Best regards
>> Karsten
>>
>> -------- Original Message --------
>>> Date: Thu, 20 Oct 2011 17:59:04 -0700
>>> From: Peter Spam <ps...@mac.com>
>>> To: solr-user@lucene.apache.org
>>> Subject: Can Solr handle large text files?
>>
>>> I have about 20k text files, some very small, but some up to 300MB,
>>> and would like to do text searching with highlighting.
>>>
>>> Imagine the text is the contents of your syslog.
>>>
>>> I would like to type in some terms, such as "error" and "mail", and
>>> have Solr return the syslog lines with those terms PLUS two lines of
>>> context. Pretty much just like Google's highlighting.
>>>
>>> 1) Can Solr handle this? I had extremely long query times when I
>>> tried this with Solr 1.4.1 (yes, I was using TermVectors, etc.). I
>>> tried breaking the files into 1MB pieces, but searching would be
>>> wonky, returning the wrong number of documents (i.e., if one file had
>>> a term 5 times, and that was the only file that had the term, I want 1
>>> result, not 5 results).
>>>
>>> 2) What sort of tokenizer would be best?
>>> Here's what I'm using:
>>>
>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>        multiValued="false" termVectors="true" termPositions="true"
>>>        termOffsets="true" />
>>>
>>> <fieldType name="text_pl" class="solr.TextField">
>>>   <analyzer>
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>             generateWordParts="0" generateNumberParts="0"
>>>             catenateWords="0" catenateNumbers="0"
>>>             catenateAll="0" splitOnCaseChange="0"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>>
>>> Thanks!
>>> Pete
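
For anyone following the thread: the chunking scheme Anand describes (one document per 50 lines, an id built from the file name plus the line range, and a score that grows with distance from the start of the file) can be sketched outside Solr roughly as below. This is only an illustration, not the modified ExtractingDocumentLoader from the thread. The function name chunk_log, the dict layout, the exact id format, and the choice of the first line number as the score are all assumptions; the field names FileName, Content and custom_score are borrowed from the messages above.

```python
from typing import Dict, Iterator, List


def chunk_log(
    filename: str, lines: List[str], lines_per_doc: int = 50
) -> Iterator[Dict[str, object]]:
    """Yield one document-like dict per block of `lines_per_doc` lines.

    Hypothetical helper: it mirrors the scheme described in the thread,
    but the names and formats here are assumptions, not the thread's code.
    """
    if lines_per_doc <= 0:
        raise ValueError("lines_per_doc must be positive")
    for start in range(0, len(lines), lines_per_doc):
        block = lines[start:start + lines_per_doc]
        first = start + 1            # 1-based number of the block's first line
        last = start + len(block)    # 1-based number of the block's last line
        yield {
            # Unique id: file name plus line range (format is an assumption).
            "id": f"{filename} (line no. {first}-{last})",
            # Same value for every chunk of one file, so results can be grouped on it.
            "FileName": filename,
            "Content": "\n".join(block),
            # Grows with distance from the beginning of the file (assumed formula).
            "custom_score": first,
        }


if __name__ == "__main__":
    sample = [f"syslog line {n}" for n in range(1, 121)]
    for doc in chunk_log("log.txt", sample):
        print(doc["id"], doc["custom_score"])
```

Each dict would then be sent to Solr as a separate document, and grouping on the shared file-name field (group=true&group.field=FileName, as in the query above) collapses the chunks back to one hit per original file.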