Re: Can Solr handle large text files?
Has the performance of highlighting large text documents been improved in Solr 4? Thanks! Pete

On Nov 5, 2011, at 9:03 AM, Erick Erickson erickerick...@gmail.com wrote: Sure, if you write a custom update handler. But I'm not at all sure this is ideal. You're requiring all that data to be transmitted across the wire and processed by Solr. Assuming you have more than one input source, the Solr server in the background will be handling up to N documents simultaneously, plus the effort to index. I think I'd recommend splitting them up on the client side. Best, Erick

On Fri, Nov 4, 2011 at 3:23 AM, Peter Spam ps...@mac.com wrote: [...]
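A minimal sketch of the client-side splitting Erick recommends, using SolrJ. The 256k chunk size and the filename-plus-chunk-number id scheme come from the thread; the core URL, the field names, and the client class (HttpSolrServer, the SolrJ 4.x name; older releases called it CommonsHttpSolrServer) are assumptions, and the code is untested:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ChunkedIndexer {
    static final int CHUNK = 256 * 1024; // 256k per Solr document (by character count here)

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        Path file = Paths.get(args[0]);
        String text = new String(Files.readAllBytes(file), "UTF-8");

        // One Solr document per 256k slice; the shared "filename" field lets
        // grouping reassemble the slices into one logical file at query time.
        for (int i = 0, n = 0; i < text.length(); i += CHUNK, n++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getFileName() + "-" + n);
            doc.addField("filename", file.getFileName().toString());
            doc.addField("body", text.substring(i, Math.min(i + CHUNK, text.length())));
            solr.add(doc);
        }
        solr.commit();
    }
}

This keeps the splitting cost on the client, so Solr only ever sees small documents. Note that fixed-offset splitting can cut a line (or a query phrase) in half at a chunk boundary, which is the spanning problem discussed later in the thread.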
Proper analyzer / tokenizer for syslog data?
Example data:

01/23/2011 05:12:34 [Test] a=1; hello_there=50; data=[1,5,30%];

I would love to be able to just grep the data - i.e. if I search for ello, it finds and returns ello, and if I search for hello_there=5, it would match too. Here's what I'm using now:

<fieldType name="text_sy" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

The problem with this is that if I search for a substring, I don't get anything back. For example, searching for ello or *ello* returns nothing. Any ideas?

http://localhost:8983/solr/select?q=*ello*&start=0&rows=50&hl.maxAnalyzedChars=2147483647&hl.useFastVectorHighlighter=true&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400

Thanks! Pete
Re: Can Solr handle large text files?
Solr 4.0 (11/1 snapshot)
Data: 80k files, average size 2.5MB, largest is 750MB
Solr: each document is max 256k; total docs = 800k
Machine: Early 2009 Mac Pro, 6GB RAM, 1GB min / 2GB max given to Solr's Java; Admin shows 30% mem usage

I originally tried injecting the entire file into a single Solr document, and this had disastrous results when trying to highlight. I've now tried splitting each file into 256k segments per Solr document, and the results are better, but still not what I was hoping for. Queries are around 2-8 seconds, with some reaching into 30+ second territory. Ideally, I'd like to feed Solr the metadata and the entire file at once, and have the back-end split the file into thousands of pieces. Is this possible? Thanks! Pete

On Nov 1, 2011, at 5:15 PM, Peter Spam wrote: [...]
Re: Proper analyzer / tokenizer for syslog data?
Wow, I tried with minGramSize=1 and maxGramSize=1000 (I want someone to be able to search on any substring, just like grep), and the index is multiple orders of magnitude larger than my data! There's got to be a better way to support full grep-like searching? Thanks! Pete

On Nov 4, 2011, at 1:20 AM, Ahmet Arslan wrote: [...] For sub-string matching, NGramFilterFactory is required at index time:

<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>

Plus you may want to use WhitespaceTokenizer instead of StandardTokenizerFactory. The Analysis admin page displays the behavior of each tokenizer.
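Assembled from Ahmet's two suggestions, a complete field type might look like the sketch below (the name text_grep is hypothetical). Grams are generated only on the index side; the bounded maxGramSize is deliberate, since the unbounded maxGramSize=1000 attempt above is exactly what inflated the index:

<fieldType name="text_grep" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- every 1..15-character substring of each token becomes an indexed term -->
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With this, a plain query for ello matches any token containing that substring (up to the gram length), with no leading-wildcard query needed.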
Re: Can Solr handle large text files?
Wow, 50 lines is tiny! Is that how small you need to go, to get good highlighting performance? I'm looking at documents that can be up to 800MB in size, so I've decided to split them down into 256k chunks. I'm still indexing right now - I'm curious to see how performance is when the injection is finished. Has anyone done analysis on where the knee in the curve is, wrt document size vs. # of documents? Thanks! Pete

On Oct 31, 2011, at 9:28 PM, anand.ni...@rbs.com wrote: Hi, basically I need to index very large log files. I have modified the ExtractingDocumentLoader to create a new document for every 50 lines of the log file being indexed (this is made configurable via a system property). The 'filename' field for documents created from one log file is kept the same, and a unique id is generated by appending the line numbers to the file name, e.g. 'log.txt (line no. 100-150)'. Each doc is given a custom score, stored in a field called 'custom_score', which is directly proportional to its distance from the beginning of the file. I have also found 'hitGrouped.vm' on the net. Since I am reading only 50 lines per document, the default max chunk size works for me, but it can easily be adjusted depending upon the number of lines you read per doc. I then group on the 'filename' field and show the results from the docs with the highest score; as a result, I am able to show the last matching results from the log file. The query parameters I am using for search are:

http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&bf=sub(1000,caprice_score)&group=true&group.field=FileName

Results are amazing - I am able to index and search very large log files (a few 100 MBs) with very low memory requirements. Highlighting is also working fine. Thanks & Regards, Anand

Anand Nigam, RBS Global Banking & Markets, Office: +91 124 492 5506

-----Original Message----- From: Peter Spam [mailto:ps...@mac.com] Sent: 21 October 2011 23:04 Subject: Re: Can Solr handle large text files? [...]
Re: Can Solr handle large text files?
Oh, by the way - what analyzer are you using for your log files? Here's what I'm trying:

<fieldType name="text_pl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

Thanks! Pete

On Oct 31, 2011, at 9:28 PM, anand.ni...@rbs.com wrote: [...]
Re: Can Solr handle large text files?
Thanks for the reminder - I had that set to 214xxx... (the max), but perf was terrible when I injected large files. So what's the max recommended field size in kb? I can try chopping up the syslogs into arbitrarily small pieces, but would love to know where to start. Thanks! Sent from my iPhone

On Oct 23, 2011, at 2:01 PM, Erick Erickson erickerick...@gmail.com wrote: Also be aware that by default Solr is configured to only index the first 10,000 tokens of text. See maxFieldLength in solrconfig.xml. Best, Erick

On Fri, Oct 21, 2011 at 7:34 PM, Peter Spam ps...@mac.com wrote: [...]
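For reference, the setting Erick mentions is a per-field token cap in solrconfig.xml; the "214xxx..." value Pete refers to is presumably Integer.MAX_VALUE. A sketch, valid for the Solr 1.x/3.x config format (the setting was later deprecated in favor of LimitTokenCountFilterFactory):

<!-- solrconfig.xml: index at most this many tokens per field (default 10000) -->
<maxFieldLength>2147483647</maxFieldLength>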
Re: Sorting fields with letters?
Tried using the ord() function, but it was the same as the standard sort. Do I just need to bite the bullet and reindex everything? Thanks! Pete

On Oct 21, 2011, at 5:26 PM, Tomás Fernández Löbbe wrote: I don't know if you'll find exactly what you need, but you can sort by any field or FunctionQuery. See http://wiki.apache.org/solr/FunctionQuery

On Fri, Oct 21, 2011 at 7:03 PM, Peter Spam ps...@mac.com wrote: [...]
Re: Can Solr handle large text files?
Thanks for the response, Karsten. 1) What's the recommended maximum chunk size? 2) Does my tokenizer look reasonable? Thanks! Pete

On Oct 21, 2011, at 2:28 AM, karsten-s...@gmx.de wrote: Hi Peter, highlighting in large text files cannot be fast without dividing the original text into small pieces. So take a look at http://xtf.cdlib.org/documentation/under-the-hood/#Chunking and at http://www.lucidimagination.com/blog/2010/09/16/2446/ Which means that you should divide your files and use Result Grouping / Field Collapsing to list only one hit per original document. (xtf would also solve your problem out of the box, but xtf does not use solr.) Best regards, Karsten

-------- Original message -------- Date: Thu, 20 Oct 2011 17:59:04 -0700 From: Peter Spam ps...@mac.com To: solr-user@lucene.apache.org Subject: Can Solr handle large text files? [...]
Re: Can Solr handle large text files?
Thanks for your note, Anand. What was the maximum chunk size for you? Could you post the relevant portions of your configuration file? Thanks! Pete

On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote: Hi, I was also facing the issue of highlighting large text files. I applied the solution proposed here and it worked, but I am getting the following error: basically 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where can I get this file from? Its reference is present in browse.vm:

<div class="results">
  #if($response.response.get('grouped'))
    #foreach($grouping in $response.response.get('grouped'))
      #parse("hitGrouped.vm")
    #end
  #else
    #foreach($doc in $response.results)
      #parse("hit.vm")
    #end
  #end
</div>

HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config

java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
  at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
  at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
  at org.apache.velocity.Template.process(Template.java:98)
  at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
  at [...]

Thanks & Regards, Anand

Anand Nigam, RBS Global Banking & Markets, Office: +91 124 492 5506

-----Original Message----- From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de] Sent: 21 October 2011 14:58 Subject: Re: Can Solr handle large text files? [...]
Sorting fields with letters?
Hi everyone, I have a field that has a letter in it (for example, 1A1, 2A1, 11C15, etc.). Sorting it seems to work most of the time, except for a few things, like 10A1 sorting lower than 8A100, and 10A100 sorting lower than 10A99. Any ideas? I bet if my data had leading zeros (i.e. 10A099), it would behave better? (But I can't really change my data now, as it would take a few days to re-inject - which is possible, but a hassle.) Thanks! Pete
Re: Sorting fields with letters?
Is there a way to use a custom sorter, to avoid re-indexing? Thanks! Pete

On Oct 21, 2011, at 2:13 PM, Tomás Fernández Löbbe wrote: Well, yes. You probably have a string field for that content, right? So the content is being compared as strings, not as numbers; that's why something like 1000 sorts lower than 2. Leading zeros would be an option. Another option is to separate the field into numeric fields and sort by those (this last option is only recommended if your data always look similar). Something like 11C15 to field1: 11, field2: C, field3: 15, then use sort=field1,field2,field3. Anyway, both of these options require reindexing. Regards, Tomás

On Fri, Oct 21, 2011 at 4:57 PM, Peter Spam ps...@mac.com wrote: [...]
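To make Tomás's leading-zeros option concrete, here is a small hypothetical helper (not from the thread) that pads every digit run to a fixed width, so that plain string order matches numeric order; the padded value would go into a separate string field used only for sorting:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SortKey {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // "10A99" -> "00010A00099": lexicographic order now equals numeric order
    static String padded(String label) {
        Matcher m = DIGITS.matcher(label);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, String.format("%05d", Integer.parseInt(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(padded("8A100")); // 00008A00100
        System.out.println(padded("10A1"));  // 00010A00001, which now sorts after 8A100
    }
}

As Tomás notes, anything along these lines still requires reindexing, since the sort key has to exist in the index.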
Can Solr handle large text files?
I have about 20k text files, some very small, but some up to 300MB, and would like to do text searching with highlighting. Imagine the text is the contents of your syslog. I would like to type in some terms, such as error and mail, and have Solr return the syslog lines with those terms PLUS two lines of context. Pretty much just like Google's highlighting.

1) Can Solr handle this? I had extremely long query times when I tried this with Solr 1.4.1 (yes, I was using TermVectors, etc.). I tried breaking the files into 1MB pieces, but searching would be wonky => returns the wrong number of documents (i.e. if one file had a term 5 times, and that was the only file that had the term, I want 1 result, not 5 results).

2) What sort of tokenizer would be best? Here's what I'm using:

<field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>

<fieldType name="text_pl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

Thanks! Pete
Re: Dismax Request handler and Solrconfig.xml
I'm having the same problem - the standard query returns all my documents, but the dismax one returns 0. Any ideas?

http://server:8983/solr/select?qt=standard&indent=on&q=*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3592</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="qt">standard</str>
      <str name="q">*</str>
    </lst>
  </lst>
  <result name="response" numFound="9108" start="0">
    <doc>
    [...]

http://server:8983/solr/select?qt=dismax&indent=on&q=*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">10</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="qt">dismax</str>
      <str name="q">*</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0" maxScore="0.0"/>
</response>

On Sep 29, 2010, at 2:31 PM, Chris Hostetter wrote:

: In Solrconfig.xml, default request handler is set to standard. I am
: planning to change that to use dismax as the request handler but when I
: set default=true for dismax - Solr does not return any results - I get
: results only when I comment out <str name="defType">dismax</str>.

you need to elaborate on what you mean by "does not return any results" ... doesn't return results for what exactly? what do your requests look like? (ie full URLs with all params) what do you expect to get back? what URLs are you using when you don't use defType=dismax? what do you get back then? not setting defType means you are getting the standard LuceneQParser instead of the DismaxQParser, which means the qf param is being ignored and the defaultSearchField is being used instead. are the terms you are searching for in your default search field but not in your title or pagedescription field? Please note these guidelines: http://wiki.apache.org/solr/UsingMailingLists#Information_useful_for_searching_problems -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
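For what it's worth, one likely explanation for the numFound=0 above (not confirmed in the thread): dismax has no wildcard support, so q=* is treated as a literal term and matches nothing. The conventional match-all form under dismax is the q.alt parameter, which is parsed with standard-query syntax when q is absent, along these lines:

http://server:8983/solr/select?qt=dismax&q.alt=*:*&indent=on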
Re: How to Update Value of One Field of a Document in Index?
My schema: id, name, checksum, body, notes, date. I'd like a user to be able to add notes to the notes field, without re-indexing the whole document (since the body field may contain 100MB of text). Some ideas:

1) How about creating another core which only contains id, checksum, and notes? Then updating (a delete followed by an add) wouldn't be that painful?
2) What about using a multiValued field? Could you just keep adding values as the user enters more notes?

Pete

On Sep 9, 2010, at 11:06 PM, Liam O'Boyle wrote: Hi Savannah, you can only reindex the entire document; if you only have the ID, then do a search to retrieve the rest of the data, then reindex. This assumes that all of the fields you need to index are stored (so that you can retrieve them) and not just indexed. Liam

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett savannah_becket...@yahoo.com wrote: I use nutch to crawl and index to Solr. My code is working. Now, I want to update the value of one of the fields of a document in the solr index after the document was already indexed, and I have only the document id. How do I do that? Thanks.
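A sketch of the fetch-and-reindex cycle Liam describes, in SolrJ (untested; the id "doc42", the core URL, and the client class HttpSolrServer, named CommonsHttpSolrServer in SolrJ of that era, are placeholders, and every field is assumed stored):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class AddNote {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // 1) Fetch the stored copy of the document (all fields must be stored).
        SolrDocument old = solr.query(new SolrQuery("id:doc42")).getResults().get(0);

        // 2) Copy every stored field into a fresh input document.
        SolrInputDocument doc = new SolrInputDocument();
        for (String field : old.getFieldNames()) {
            doc.addField(field, old.getFieldValues(field));
        }

        // 3) Append the new note; with a multiValued notes field (idea #2 above),
        //    each user note simply becomes one more value.
        doc.addField("notes", "user-entered note text");

        // 4) Re-adding a document with an existing uniqueKey replaces the old one.
        solr.add(doc);
        solr.commit();
    }
}

The cost of step 1 and step 4 is dominated by the 100MB body field, which is exactly why the separate notes core in idea #1 may be more attractive.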
Re: Tips for getting unique results?
Thanks for the note, Shaun, but the documentation indicates that the sorting is only in ascending order :-(

facet.sort - This param determines the ordering of the facet field constraints.
• count - sort the constraints by count (highest count first)
• index - return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ASCII range, this will be alphabetically sorted.
The default is count if facet.limit is greater than 0, index otherwise. Prior to Solr 1.4, one needed to use true instead of count and false instead of index. This parameter can be specified on a per-field basis.

-Pete

On Apr 8, 2011, at 2:49 AM, Shaun Campbell wrote: Pete, surely the default sort order for facets is by descending count order - see http://wiki.apache.org/solr/SimpleFacetParameters. If your results are really sorted in ascending order, can't you sort them externally, e.g. in Java? Hope that helps. Shaun
Re: Tips for getting unique results?
The data are fine and not duplicated - however, I want to analyze the data and summarize one field (kind of like faceting), to understand what the largest value is. For example:

Document 1: label=1A1A1; body=adfasdfadsfasf
Document 2: label=5A1B1; body=adfaasdfasdfsdfadsfasf
Document 3: label=1A1A1; body=adasdfasdfasdffaasdfasdfsdfadsfasf
Document 4: label=7A1A1; body=azxzxcvdfaasdfasdfsdfadsfasf
Document 5: label=7A1A1; body=azxzxcvdfaasdfasdfsdasdafadsfasf
Document 6: label=5A1B1; body=adfaasdfasdfsdfadsfasfzzz

How do I get back just ONE of the largest label items? In other words, what query will return the 7A1A1 label just once? If I search for q=* and sort the results, it works, except I get back multiple hits for each label. If I do a facet, I can only sort in increasing order, when what I want is decreasing order. -Pete

On Apr 6, 2011, at 10:22 PM, Otis Gospodnetic wrote: Hi, I think you are saying dupes are the main problem? If so, http://wiki.apache.org/solr/Deduplication ? Otis / Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch / Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----- From: Peter Spam ps...@mac.com To: solr-user@lucene.apache.org Sent: Thu, April 7, 2011 1:13:44 AM Subject: Tips for getting unique results? [...]
Re: Tips for getting unique results?
Would grouping solve this? I'd rather not move to a pre-release solr ... To clarify the problem: [...]

On Apr 7, 2011, at 10:02 AM, Erick Erickson wrote: What version of Solr are you using? And, assuming the version that has it in, have you seen grouping? Which is another way of asking why you want to do this, perhaps it's an XY problem. Best, Erick

On Thu, Apr 7, 2011 at 1:13 AM, Peter Spam ps...@mac.com wrote: [...]
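Assuming a version with result grouping (trunk/pre-release at the time of this thread), a sketch of the query Erick is hinting at: group on label, order the groups by label descending, and take the first five rows, so each row is one group, i.e. one unique label (field names taken from the example above, everything else assumed):

http://localhost:8983/solr/select?q=*:*&group=true&group.field=label&group.limit=1&sort=label+desc&rows=5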
Tips for getting unique results?
Hi, I have documents with a field that has 1A2B3C alphanumeric characters. I can query for * and sort results based on this field, however I'd like to uniq these results (remove duplicates) so that I can get the 5 largest unique values. I can't use the StatsComponent because my values have letters in them too. Faceting (and ignoring the counts) gets me half of the way there, but I can only sort ascending. If I could also sort facet results descending, I'd be done. I'd rather not return all documents and just parse the last few results to work around this. Any ideas? -Pete
Re: Solr searching performance issues, using large documents (now 1MB documents)
This is a very small number of documents (7000), so I am surprised Solr is having such a hard time with it!! I do facet on 3 terms. Subsequent "hello" searches are faster, but still well over a second. This is a very fast Mac Pro, with 6GB of RAM. Thanks, Peter

On Aug 25, 2010, at 9:52 AM, Yonik Seeley wrote:

On Wed, Aug 25, 2010 at 11:29 AM, Peter Spam ps...@mac.com wrote: So, I went through all the effort to break my documents into max 1 MB chunks, and searching for "hello" still takes over 40 seconds (searching across 7433 documents): 8 results (41980 ms). What is going on??? (scroll down for my config).

Are you still faceting on that query also? Breaking your docs into many chunks means inflating the doc count, and will make faceting slower. Also, first-time faceting (as with sorting) is slow... did you try another query after "hello" (and without a commit happening in between) to see if it was faster? -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Solr searching performance issues, using large documents
Still stuck on this - any hints on how to write the JavaScript to split a document? Thanks! -Pete

On Aug 5, 2010, at 8:10 PM, Lance Norskog wrote: You may have to write your own javascript to read in the giant field and split it up.

On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam ps...@mac.com wrote: [...]
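Since the thread never shows a concrete example, here is an untested sketch of what Lance's DIH-plus-JavaScript suggestion might look like. Everything specific is illustrative, not from the thread: the paths, the field names, the splitDoc function, and in particular the assumption that a DIH script transformer may return a list of rows to emit multiple documents from one input:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <script><![CDATA[
    function splitDoc(row) {
      var text  = row.get('plainText');
      var chunk = 262144; /* 256k pieces, as discussed up-thread */
      var rows  = new java.util.ArrayList();
      for (var i = 0, n = 0; i < text.length(); i += chunk, n++) {
        var piece = new java.util.HashMap();
        piece.put('id', row.get('fileAbsolutePath') + '-' + n);
        piece.put('filename', row.get('fileAbsolutePath')); /* common group field */
        piece.put('body', text.substring(i, Math.min(i + chunk, text.length())));
        rows.add(piece);
      }
      return rows; /* assumed: a List of Maps is emitted as multiple documents */
    }
  ]]></script>
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="/var/log/myapp" fileName=".*\.log" rootEntity="false">
      <entity name="log" processor="PlainTextEntityProcessor"
              url="${f.fileAbsolutePath}" transformer="script:splitDoc"/>
    </entity>
  </document>
</dataConfig>

The outer FileListEntityProcessor walks the log directory, PlainTextEntityProcessor reads each file into the implicit plainText field, and the script slices it into 256k documents sharing a common filename field for grouping.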
Re: Solr searching performance issues, using large documents
I've read through the DataImportHandler page a few times, and still can't figure out how to separate a large document into smaller documents. Any hints? :-) Thanks! -Peter

On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote: Spanning won't work - you would have to make overlapping mini-documents if you want to support this. I don't know how big the chunks should be - you'll have to experiment. Lance

On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam ps...@mac.com wrote: [...]
Re: Solr searching performance issues, using large documents
What would happen if a search query phrase spanned separate document chunks? Also, what would the optimal size of chunks be? Thanks! -Peter

On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote: Not that I know of. The DataImportHandler has the ability to create multiple documents from one input stream. It is possible to create a DIH file that reads large log files and splits each one into N documents, with the file name as a common field. The DIH wiki page tells you in general how to make a DIH file: http://wiki.apache.org/solr/DataImportHandler From this, you should be able to make a DIH file that puts log files in as separate documents. As for splitting files up into mini-documents, you might have to write a bit of Javascript to achieve this. There is no data structure or software that implements structured documents.

On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam ps...@mac.com wrote: [...]
-Peter - 4GB RAM server % java -Xms2048M -Xmx3072M -jar start.jar - schema.xml changes: fieldType name=text_pl class=solr.TextField analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.WordDelimiterFilterFactory generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=0/ /analyzer /fieldType ... field name=body type=text_pl indexed=true stored=true multiValued=false termVectors=true termPositions=true termOffsets=true / field name=timestamp type=date indexed=true stored=true default=NOW multiValued=false/ field name=version type=string indexed=true stored=true multiValued=false/ field name=device type=string indexed=true stored=true multiValued=false/ field name=filename type=string indexed=true stored=true multiValued=false/ field name=filesize type=long indexed=true stored=true multiValued=false/ field name=pversion type=int indexed=true stored=true multiValued=false/ field name=first2md5 type=string indexed=false stored=true multiValued=false/ field name=ckey type=string indexed=true stored=true multiValued=false/ ... dynamicField name=* type=ignored multiValued=true / defaultSearchFieldbody/defaultSearchField solrQueryParser defaultOperator
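A minimal sketch of the DIH approach Lance describes, for anyone looking for a starting point. This assumes the FileListEntityProcessor and LineEntityProcessor that ship with the DataImportHandler; the directory, file pattern, and field names here are illustrative, not from the thread:

    <!-- data-config.xml - a sketch, not a tested configuration -->
    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8" />
      <document>
        <!-- Outer entity walks the log directory; rootEntity="false" makes the
             inner, per-line entity produce the actual Solr documents. -->
        <entity name="files" processor="FileListEntityProcessor"
                baseDir="/var/logs" fileName=".*\.txt" recursive="true"
                rootEntity="false">
          <!-- The file name becomes the common field shared by all chunks. -->
          <field column="file" name="filename" />
          <!-- LineEntityProcessor emits one row per line of the file,
               exposing the text in the implicit rawLine column. -->
          <entity name="lines" processor="LineEntityProcessor"
                  url="${files.fileAbsolutePath}">
            <field column="rawLine" name="body" />
          </entity>
        </entity>
      </document>
    </dataConfig>

As written this produces one document per line; grouping N lines per document, and generating a unique id per chunk, would still need a custom transformer - a ScriptTransformer is the "bit of JavaScript" route Lance alludes to.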
Re: Solr searching performance issues, using large documents
Thanks for the pointer, Lance! Is there an example of this somewhere? -Peter

On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:

Ah! You're not just highlighting, you're snippetizing. That makes it easier. Highlighting does not stream - it pulls the entire stored contents into one string and then pulls out the snippet. If you want this to be fast, you have to split the text up into small pieces and only snippetize from the most relevant text. So: separate documents, with a common group id for the document they came from. You might have to do two queries to achieve what you want, but the second query for the same terms will be blindingly fast - often 1ms. Good luck! Lance

[older quoted messages and the original post's configuration snipped; they appear in the entries below]
Re: Solr searching performance issues, using large documents
On Jul 30, 2010, at 7:04 PM, Lance Norskog wrote: Wait - how much text are you highlighting? You say these logfiles are X big - how big are the actual documents you are storing?

I want it to be like Google - I put the entire (sometimes 60MB) document in a field, and then just highlight 2-4 lines of it. Thanks, Peter

[older quoted messages and the original post's configuration snipped; they appear in the entries below]
Re: Solr searching performance issues, using large documents
On Jul 30, 2010, at 1:16 PM, Peter Karich wrote: did you already try other values for hl.maxAnalyzedChars=2147483647?

Yes, I tried dropping it down to 21, but it didn't have much of an impact (one search I just tried went from 17 seconds to 15.8 seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for Java).

Also, regular-expression highlighting is more expensive, I think. What does the 'fuzzy' variable mean? If you use it to query via ~someTerm instead of someTerm, then you should try the trunk of Solr, which is a lot faster for fuzzy and other wildcard searches.

fuzzy could be set to * but isn't right now. Thanks for the tips, Peter - this has been very frustrating! - Peter

[quoted problem statement and configuration snipped; see the original post below]
Re: Solr searching performance issues, using large documents
Correction - it went from 17 seconds to 10 seconds - I was changing hl.regex.maxAnalyzedChars the first time. Thanks! -Peter

[older quoted messages and configuration snipped; see the entries above and the original post below]
Re: Solr searching performance issues, using large documents
However, I do need to search the entire document, or else the highlighting will sometimes be blank :-( Thanks! - Peter

ps. sorry for the many responses - I'm rushing around trying to get this working.

[older quoted messages and configuration snipped; see the entries above and the original post below]
Re: Solr searching performance issues, using large documents
I do store term vectors:

<field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />

-Pete

On Jul 30, 2010, at 7:30 AM, Li Li wrote:

Highlighting's time is mainly spent on getting the field you want to highlight and tokenizing that field (if you don't store the term vector). You can check what's wrong.

2010/7/30 Peter Spam ps...@mac.com:

If I don't do highlighting, it's really fast. Optimize has no effect. -Peter

On Jul 29, 2010, at 11:54 AM, dc tech wrote:

Are you storing the entire log file text in SOLR? That's almost 3GB of text that you are storing. Try to: 1) Is this first-time performance, or on repeat queries with the same fields? 2) Optimize the index and test performance again. 3) Index without storing the text and see what the performance looks like.

On 7/29/10, Peter Spam ps...@mac.com wrote:

Any ideas? I've got 5000 documents with an average size of 850k each, and it sometimes takes 2 minutes for a query to come back when highlighting is turned on! Help! -Pete

On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:

From the mailing list archive, Koji wrote:

1. Provide another field for highlighting and use copyField to copy plainText to the highlighting field.

and Lance wrote: http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html

If you want to highlight field X, doing the termOffsets/termPositions/termVectors will make highlighting that field faster. You should make a separate field and apply these options to that field. Now: doing a copyField adds a value to a multiValued field. For a text field, you get a multi-valued text field. You should only copy one value to the highlighted field, so just copyField the document to your special field. To enforce this, I would add multiValued="false" to that field, just to avoid mistakes. So, all_text should be indexed without the term* attributes, and should not be stored. Then your document is stored in a separate field that you use for highlighting and that has the term* attributes.

I've been experimenting with this, and here's what I've tried:

<field name="body" type="text_pl" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="body_all" type="text_pl" indexed="false" stored="true" multiValued="true" />
<copyField source="body" dest="body_all"/>

... but it's still very slow (10+ seconds). Why is it better to have two fields (one indexed but not stored, and the other not indexed but stored) rather than just one field that's both indexed and stored?

From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors : "If you aren't always using all the stored fields, then enabling lazy field loading can be a huge boon, especially if compressed fields are used." What does this mean? How do you load a field lazily?

Thanks for your time, guys - this has started to become frustrating, since it works so well, but is very slow! -Pete

[original problem statement and configuration snipped; see the original post below]
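To make the direction of that advice concrete, here is a sketch - not from the thread; the field names all_text/body_hl and the copyField direction are my reading of Koji's and Lance's descriptions. Note that the term* options only take effect on indexed fields, which may be why Peter's body_all variant above (stored, but indexed="false" and without termVectors) still forces the highlighter to re-analyze the stored text - exactly the cost Li Li describes:

    <!-- Search field: indexed, not stored, no term vectors. -->
    <field name="all_text" type="text_pl" indexed="true" stored="false" multiValued="false" />

    <!-- Highlight field: stored AND indexed, with the term* options, and
         multiValued="false" as Lance suggests. The client sends the
         document text to this field once. -->
    <field name="body_hl" type="text_pl" indexed="true" stored="true" multiValued="false"
           termVectors="true" termPositions="true" termOffsets="true" />

    <copyField source="body_hl" dest="all_text" />

Queries would then search all_text (e.g. q=all_text:foo) and highlight with hl.fl=body_hl. On the lazy-loading question: it is a solrconfig.xml switch, <enableLazyFieldLoading>true</enableLazyFieldLoading> in the <query> section; with it enabled, stored fields not requested via fl are only read from disk if actually accessed.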
Re: Solr searching performance issues, using large documents
Any ideas? I've got 5000 documents with an average size of 850k each, and it sometimes takes 2 minutes for a query to come back when highlighting is turned on! Help! -Pete

[The remainder of this message is quoted in full in the entry above; snipped.]
Re: Solr searching performance issues, using large documents
If I don't do highlighting, it's really fast. Optimize has no effect. -Peter

[The remainder quotes dc tech's questions and the copyField discussion, both quoted in full in the entries above; snipped.]
Re: Solr searching performance issues, using large documents
From the mailing list archive, Koji wrote: 1. Provide another field for highlighting and use copyField to copy plainText to the highlighting field. [...]

[The rest of this message - Peter's Jul 21 copyField experiment - is quoted in full in the entries above; snipped.]
Count hits per document?
If I search for foo, I get back a list of documents. Any way to get a per-document hit count? Thanks! -Pete
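One avenue worth exploring here (a suggestion, not something tried in the thread): the TermVectorComponent that these queries already invoke via qt=tvrh can return per-document term frequencies, which for a single-term query is effectively a per-document hit count. Something like:

    http://localhost:8983/solr/select?q=body:foo&qt=tvrh&tv=true&tv.tf=true&fl=id

tv.tf asks the component to include the term frequency of each term in each matching document's term vector; this relies on the body field carrying termVectors="true", as it does in the schema quoted elsewhere in this thread.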
Re: Using hl.regex.pattern to print complete lines
Still not working ... any ideas? -Pete

[quoted messages snipped; the exchange appears in the entries below]
Solr searching performance issues, using large documents
Data set: About 4,000 log files (will eventually grow to millions). Average log file is 850k. Largest log file (so far) is about 70MB.

Problem: When I search for common terms, the query time goes from under 2-3 seconds to about 60 seconds. TermVectors etc. are enabled. When I disable highlighting, performance improves a lot, but is still slow for some queries (7 seconds). Thanks in advance for any ideas!

-Peter

- 4GB RAM server
% java -Xms2048M -Xmx3072M -jar start.jar

- schema.xml changes:

<fieldType name="text_pl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>
...
<field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<field name="version" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="device" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="filename" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="filesize" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="pversion" type="int" indexed="true" stored="true" multiValued="false"/>
<field name="first2md5" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="ckey" type="string" indexed="true" stored="true" multiValued="false"/>
...
<dynamicField name="*" type="ignored" multiValued="true" />
<defaultSearchField>body</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>

- solrconfig.xml changes:

<maxFieldLength>2147483647</maxFieldLength>
<ramBufferSizeMB>128</ramBufferSizeMB>

- The query:

rowStr = "&rows=10"
facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
termvectors = "&tv=true&qt=tvrh&tv.all=true"
hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
regexv = '(?m)^.*\n.*\n.*$'
hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, '').gsub(/([:~!=])/, '\1') + fuzzy + minLogSizeStr)
thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : ('&fq=' + p['fq'].to_s)) + justq + rowStr + facet + fields + termvectors + hl + hl_regex
baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
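For orientation, here is roughly what that Ruby renders to for a plain query of foo (fuzzy and minLogSizeStr empty, no fq; assembled by hand, so treat the exact escaping as illustrative):

    /solr/select?timeAllowed=5000&wt=ruby&q=body%3Afoo&rows=10&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version&fl=id,score,filename,version,device,first2md5,filesize,ckey&tv=true&qt=tvrh&tv.all=true&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400&hl.regex.pattern=%28%3Fm%29%5E.%2A%5Cn.%2A%5Cn.%2A%24&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647

The two maxAnalyzedChars=2147483647 settings ask the highlighter to analyze the entire stored body of every candidate document, which is consistent with query time growing with document size as described throughout this thread.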
Re: Using hl.regex.pattern to print complete lines
Any other thoughts, Chris? I've been messing with this a bit, and can't seem to get (?m)^.*$ to do what I want. 1) I don't care how many characters it returns; I'd like entire lines, all the time. 2) I just want it to always return 3 lines: the line before, the actual line, and the line after. 3) This should be like grep -C1. Thanks for your time! -Pete

[quoted messages snipped; the exchange appears in the entries below]
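(The regexv = '(?m)^.*\n.*\n.*$' pattern in Peter's later configuration - see the original post above - is exactly this grep -C1 shape: in multiline mode it spans the line before, the matching line, and the line after.)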
Re: Using hl.regex.pattern to print complete lines
Ah, this makes sense. I've changed my regex to (?m)^.*$, and it works better, but I still get fragments before and after some returns. Thanks for the hint! -Pete

On Jul 8, 2010, at 6:27 PM, Chris Hostetter wrote:

: If you can use the latest branch_3x or trunk, hl.fragListBuilder=single
: is available that is for getting entire field contents with search terms
: highlighted. To use it, set hl.useFastVectorHighlighter to true.

He doesn't want the entire field -- his stored field values contain multi-line strings (using newline characters) and he wants to make fragments per line (ie: bounded by newline characters, or the start/end of the entire field value).

Peter: i haven't looked at the code, but i expect that the problem is that the java regex engine isn't being used in a way that makes ^ and $ match any line boundary -- they are probably only matching the start/end of the field (and . is probably only matching non-newline characters). java regexes support embedded flags (ie: (?xyz)your regex) so you might try that (i don't remember what the correct modifier flag is for the multiline mode off the top of my head)

-Hoss
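To fill in the flag Hoss couldn't recall: in Java regexes the embedded flag for multiline mode is (?m) (Pattern.MULTILINE), which makes ^ and $ match at line boundaries; (?s) (Pattern.DOTALL) is the separate flag that lets . match newlines. A small standalone check, independent of Solr:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MultilineDemo {
        public static void main(String[] args) {
            String text = "line one\nline two\nline three";

            // Without (?m), ^ and $ only anchor to the start/end of the
            // whole input, so ^.*$ finds nothing in a multi-line string.
            System.out.println(Pattern.compile("^.*$").matcher(text).find());   // false

            // With (?m), each line matches separately - the behavior the
            // regex fragmenter needs to produce line-bounded fragments.
            Matcher m = Pattern.compile("(?m)^.*$").matcher(text);
            while (m.find()) {
                System.out.println("fragment: [" + m.group() + "]");
            }
        }
    }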
Re: Using hl.regex.pattern to print complete lines
To clarify, I never want a snippet; I always want a whole line returned. Is this possible? Thanks! -Pete

[original question quoted below in "Using hl.regex.pattern to print complete lines"; snipped]
Re: Using hl.regex.pattern to print complete lines
Thanks for the note, Koji. However, hl.fragsize=0 seems to return the entire document, rather than just one single line. Here's what I tried (what I previously had was commented out):

regexv = '^.*$'
thequery = '/solr/select?facet=true&facet.limit=10&fl=id,score,filename&tv=true&timeAllowed=3000&facet.field=filename&qt=tvrh&wt=ruby' + (p['fq'].empty? ? '' : ('&fq=' + p['fq'].to_s)) + '&q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + "&hl=true&hl.snippets=1&hl.fragsize=0" # + "&hl.regex.slop=.8&hl.fragsize=200&hl.fragmenter=regex&hl.regex.pattern=" + CGI::escape(regexv)

Thanks for your help. -Peter

On Jul 8, 2010, at 3:47 PM, Koji Sekiguchi wrote:

(10/07/09 2:44), Peter Spam wrote:
To clarify, I never want a snippet, I always want a whole line returned. Is this possible? Thanks! -Pete

Hello Pete, Use NullFragmenter. It can be used via GapFragmenter with hl.fragsize=0. Koji -- http://www.rondhuit.com/en/
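Pulling the two configurations in this exchange side by side may help later readers:

    hl.fragmenter=gap + hl.fragsize=0
        -> NullFragmenter behavior, per Koji: the whole stored field comes back as a single fragment
    hl.fragmenter=regex + hl.regex.pattern=(?m)^.*$ + a nonzero hl.fragsize
        -> line-bounded fragments, once the (?m) multiline flag is in place

So hl.fragsize=0 means "never fragment", not "one line": the whole-document result Peter sees is the documented behavior, and the per-line output has to come from the regex fragmenter with a multiline pattern.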
Using hl.regex.pattern to print complete lines
Hi, I have a text file broken apart by carriage returns, and I'd like to only return entire lines. So, I'm trying to use this:

hl.fragmenter=regex
hl.regex.pattern=^.*$

... but I still get fragments, even if I crank up the hl.regex.slop to 3 or so. I also tried a pattern of \n.*\n, which seems to work better, but still isn't right. Any ideas? -Pete
Re: Very basic questions: Faceted front-end?
Wow, thanks Lance - it's really fast now! The last piece of the puzzle is setting up a nice front-end. Are there any pre-built front-ends available that mimic Google (for example), with facets? -Peter

On Jun 29, 2010, at 9:04 PM, Lance Norskog wrote:

To highlight a field, Solr needs some extra Lucene values. If these are not configured for the field in the schema, Solr has to re-analyze the field to highlight it. If you want faster highlighting, you have to add term vectors to the schema. Here is the grand map of such things: http://wiki.apache.org/solr/FieldOptionsByUseCase

On Tue, Jun 29, 2010 at 6:29 PM, Erick Erickson erickerick...@gmail.com wrote:

What are your actual highlighting requirements? You could try things like maxAnalyzedChars, requireFieldMatch, etc. http://wiki.apache.org/solr/HighlightingParameters has a good list, but you've probably already seen that page. Best, Erick

[older quoted messages snipped; they appear in the entries below]
Re: Very basic questions: Faceted front-end?
Ah, I found this: https://issues.apache.org/jira/browse/SOLR-634 ... aka solr-ui. Is there anything else along these lines? Thanks! -Peter

[The remainder quotes the previous message and its thread verbatim; snipped.]
Re: Very basic questions: Indexing text - working, but slow!
Thanks for everyone's help - I have this working now, but sometimes the queries are incredibly slow!! For example, <int name="QTime">461360</int>. Also, I had to bump up the min/max RAM size to 1GB/3.5GB for things to inject without throwing heap memory errors. However, my data set is very small! 36 text files, for a total of 113MB. (It will grow to many TB, but for now, this is a test.) The largest file is 34MB. Therefore, I'm sure I'm doing something wrong :-) Here's my config:

--- For the schema.xml, types is all default. For fields, here are the only lines that aren't commented out:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<field name="build" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="device" type="string" indexed="true" stored="true" multiValued="false"/>
<dynamicField name="*" type="ignored" multiValued="true" />

... then, for the rest:

<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>body</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="AND"/>

--- Invoking:

java -Xmx3584M -Xms1024M -jar start.jar

--- Injecting:

#!/bin/sh
J=0
for i in `find . -name \*.txt`; do
  (( J++ ))
  curl "http://localhost:8983/solr/update/extract?literal.id=doc$J&fmap.content=body" -F myfi...@$i
done
echo - Committing
curl "http://localhost:8983/solr/update/extract?commit=true"

--- Searching:

http://localhost:8983/solr/select?q=testing&hl=true&fl=id,score&hl.snippets=5&hl.mergeContiguous=true

-Pete

On Jun 28, 2010, at 5:22 PM, Erick Erickson wrote:

Try adding hl.fl=text to specify your highlight field. I don't understand why you're only getting the ID field back, though. Do note that the highlighting is after the docs, related by the ID. Try a (non-highlighting) query of just * to verify that you're pointing at the index you think you are. It's possible that you've modified a different index with SolrJ than the one your web server is pointing at. Also, SOLR has no way of knowing you've modified your index with SolrJ, so it may not be automatically reopening an IndexReader, and your recent changes may not be visible until you force the SOLR reader to reopen. HTH, Erick

On Mon, Jun 28, 2010 at 6:49 PM, Peter Spam ps...@mac.com wrote:

On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote:

1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term.

Solr can generate Google-like snippets as you describe: http://wiki.apache.org/solr/HighlightingParameters

Here's how I commit my documents:

J=0
for i in `find . -name \*.txt`; do
  (( J++ ))
  curl "http://localhost:8983/solr/update/extract?literal.id=doc$J" -F myfi...@$i
done
echo - Committing
curl "http://localhost:8983/solr/update/extract?commit=true"

Then, I try to query using http://localhost:8983/solr/select?rows=10&start=0&fl=*,score&hl=true&q=testing but I only get back the document ID rather than the snippet:

<doc>
  <float name="score">0.05030759</float>
  <arr name="content_type"><str>text/plain</str></arr>
  <str name="id">doc16</str>
</doc>

I'm using the schema.xml from the Lucid Imagination "Indexing text and html files" tutorial. -Pete
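A note on Erick's hl.fl suggestion: in this particular schema the extracted content is mapped into body (via fmap.content=body in the injection script), so the highlight field to name is presumably body rather than text. A corrected search URL, assembled from the parameters already used in this message:

    http://localhost:8983/solr/select?q=testing&hl=true&hl.fl=body&fl=id,score&hl.snippets=5&hl.mergeContiguous=true

hl.fl must point at a stored field, which body is here.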
Re: Very basic questions: Indexing text - working, but slow!
To follow up, I've found that my queries are very fast (even with fq=), until I add hl=true. What can I do to speed up highlighting? Should I consider injecting a line at a time, rather than the entire file as a single field?

-Pete

On Jun 29, 2010, at 11:07 AM, Peter Spam wrote:

Thanks for everyone's help - I have this working now, but sometimes the queries are incredibly slow!! For example, <int name="QTime">461360</int>. Also, I had to bump up the min/max RAM size to 1GB/3.5GB for things to inject without throwing heap memory errors. However, my data set is very small! 36 text files, for a total of 113MB. (It will grow to many TB, but for now, this is a test.) The largest file is 34MB. Therefore, I'm sure I'm doing something wrong :-)

Here's my config:

--- For the schema.xml, types is all default. For fields, here are the only lines that aren't commented out:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<field name="build" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="device" type="string" indexed="true" stored="true" multiValued="false"/>
<dynamicField name="*" type="ignored" multiValued="true" />

... then, for the rest:

<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>body</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="AND"/>

--- Invoking:

java -Xmx3584M -Xms1024M -jar start.jar

--- Injecting:

#!/bin/sh
J=0
for i in `find . -name \*.txt`; do
  (( J++ ))
  curl "http://localhost:8983/solr/update/extract?literal.id=doc$J&fmap.content=body" -F myfi...@$i
done
echo "- Committing"
curl "http://localhost:8983/solr/update/extract?commit=true"

--- Searching:

http://localhost:8983/solr/select?q=testing&hl=true&fl=id,score&hl.snippets=5&hl.mergeContiguous=true

-Pete

On Jun 28, 2010, at 5:22 PM, Erick Erickson wrote:

try adding hl.fl=text to specify your highlight field. I don't understand why you're only getting the ID field back, though. Do note that the highlighting section comes after the docs, related by the ID. Try a (non-highlighting) query of just * to verify that you're pointing at the index you think you are. It's possible that you've modified a different index with SolrJ than your web server is pointing at. Also, SOLR has no way of knowing you've modified your index with SolrJ, so it may not be automatically reopening an IndexReader, and your recent changes may not be visible until you force the SOLR reader to reopen.

HTH
Erick

On Mon, Jun 28, 2010 at 6:49 PM, Peter Spam ps...@mac.com wrote:

On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote:

1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term.

Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters

Here's how I commit my documents:

J=0; for i in `find . -name \*.txt`; do (( J++ )); curl "http://localhost:8983/solr/update/extract?literal.id=doc$J" -F myfi...@$i; done; echo - Committing; curl "http://localhost:8983/solr/update/extract?commit=true"

Then, I try to query using

http://localhost:8983/solr/select?rows=10&start=0&fl=*,score&hl=true&q=testing

but I only get back the document ID rather than the snippet:

<doc>
  <float name="score">0.05030759</float>
  <arr name="content_type"><str>text/plain</str></arr>
  <str name="id">doc16</str>
</doc>

I'm using the schema.xml from the Lucid Imagination "Indexing text and html files" tutorial.

-Pete
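One knob bears directly on this question: the standard highlighter only inspects a bounded prefix of each field, controlled by hl.maxAnalyzedChars (51200 characters by default in the example solrconfig). Raising it to cover a 34MB field makes highlighting proportionally slower, which is consistent with the QTime above. A hedged example query, with the body field name carried over from the schema in this thread:

http://localhost:8983/solr/select?q=testing&hl=true&hl.fl=body&hl.snippets=5&hl.maxAnalyzedChars=51200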
Very basic questions: Indexing text
Hi everyone, I'm looking for a way to index a bunch of (potentially large) text files. I would love to see results like Google, so I went through a few tutorials, but I've still got questions: 1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term. 2) There are one or two fields at the beginning of the file that I would like to search on, so these should be indexed differently, right? 3) Is there a nice front-end example anywhere? Something that would return results kind of like Google? Thanks for your time - Solr / Lucene seem to be very powerful. -Pete
Re: Very basic questions: Indexing text
Great, thanks for the pointers.

Thanks,
Peter

On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote:

1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term.

Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters

2) There are one or two fields at the beginning of the file that I would like to search on, so these should be indexed differently, right?

Probably yes.

3) Is there a nice front-end example anywhere? Something that would return results kind of like Google?

http://wiki.apache.org/solr/PublicServers
http://search-lucene.com/
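To make the first pointer concrete, a minimal sketch of a snippet-returning query; hl.snippets and hl.fragsize are standard highlighting parameters, and the body field name is an assumption carried over from the schema discussed elsewhere in this thread:

http://localhost:8983/solr/select?q=error&hl=true&hl.fl=body&hl.snippets=2&hl.fragsize=120&fl=id,score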
Re: Very basic questions: Indexing text
On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote:

1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term.

Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters

Here's how I commit my documents:

J=0; for i in `find . -name \*.txt`; do (( J++ )); curl "http://localhost:8983/solr/update/extract?literal.id=doc$J" -F myfi...@$i; done; echo - Committing; curl "http://localhost:8983/solr/update/extract?commit=true"

Then, I try to query using

http://localhost:8983/solr/select?rows=10&start=0&fl=*,score&hl=true&q=testing

but I only get back the document ID rather than the snippet:

<doc>
  <float name="score">0.05030759</float>
  <arr name="content_type"><str>text/plain</str></arr>
  <str name="id">doc16</str>
</doc>

I'm using the schema.xml from the Lucid Imagination "Indexing text and html files" tutorial.

-Pete
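A note for readers who hit the same symptom: with hl=true but no hl.fl, Solr highlights only the default search field, and snippets come back in a separate <lst name="highlighting"> section keyed by the unique id, not inside each <doc>. A hedged sketch of a query that should surface snippets, assuming the extracted text was mapped to a stored body field (as the fmap.content=body variant of the injection script elsewhere in this thread does):

http://localhost:8983/solr/select?q=testing&rows=10&fl=id,score&hl=true&hl.fl=body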