Analysis page broken on trunk?
Hi - it seems the analysis page is broken on trunk and it looks like our 4.5 and 4.6 builds are unaffected. Can anyone on trunk confirm this? Markus
RE: Analysis page broken on trunk?
Hi - You will see each filter abbreviation on the left side, but nothing in the right container. No terms, positions, offsets, nothing. Markus

-Original message-
From: Stefan Matheis matheis.ste...@gmail.com
Sent: Wednesday 8th January 2014 14:10
To: solr-user@lucene.apache.org
Subject: Re: Analysis page broken on trunk?

Hey Markus, I'm not up to date with the latest changes, but if you can describe how to reproduce it, I can try to verify that? -Stefan
RE: Simple payloads example not working
Check the bytes property: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/util/BytesRef.html#bytes

    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
      if (payload != null) {
        return PayloadHelper.decodeFloat(payload.bytes);
      }
      return 1.0f;
    }

-Original message-
From: michael.boom my_sky...@yahoo.com
Sent: Monday 13th January 2014 14:49
To: solr-user@lucene.apache.org
Subject: Re: Simple payloads example not working

Thanks iorixxx, Actually I've just tried it and I hit a small wall: the tutorial does not look to be up to date with the codebase. When implementing my custom similarity class I should be using PayloadHelper, but the following happens:

in PayloadHelper:
    public static final float decodeFloat(byte[] bytes, int offset)
in DefaultSimilarity:
    public float scorePayload(int doc, int start, int end, BytesRef payload)

So it's BytesRef vs byte[]. How should I proceed in this scenario? - Thanks, Michael
RE: Analysis page broken on trunk?
[excerpt of the analysis response, truncated at the start; the dump repeats the same three tokens for every filter in the chain]

    ..., start:4, end:7, type:word,
    org.apache.lucene.analysis.tokenattributes.KeywordAttribute#keyword:false},
    ...
    org.apache.lucene.analysis.miscellaneous.LengthFilter, [
      { text:bla, raw_bytes:[62 6c 61], position:1,
        positionHistory:[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        start:0, end:3, type:word,
        org.apache.lucene.analysis.tokenattributes.KeywordAttribute#keyword:false },
      { text:bla, raw_bytes:[62 6c 61], position:2,
        positionHistory:[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        start:4, end:7, type:word,
        org.apache.lucene.analysis.tokenattributes.KeywordAttribute#keyword:false },
      { text:bla, raw_bytes:[62 6c 61], position:3,
        positionHistory:[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        start:8, end:11, type:word,
        org.apache.lucene.analysis.tokenattributes.KeywordAttribute#keyword:false }
    ]]}},
    field_names:{}}}

-Original message-
From: Stefan Matheis matheis.ste...@gmail.com
Sent: Friday 10th January 2014 11:35
To: solr-user@lucene.apache.org
Subject: Re: Analysis page broken on trunk?

Sorry for not getting back on this earlier - I've tried several fields w/ values from the example docs and that looks pretty okay to me, no change noticed on that. Can you share a screenshot or something like that? And perhaps the input and field/fieldtype which doesn't work for you? -Stefan
RE: Simple payloads example not working
Strange - is it really floats you are inserting as payloads? We use payloads too, but we write them as floats via PayloadAttribute in custom token filters.

-Original message-
From: michael.boom my_sky...@yahoo.com
Sent: Tuesday 14th January 2014 11:59
To: solr-user@lucene.apache.org
Subject: RE: Simple payloads example not working

Investigating, it looks like the payload.bytes property is where the problem is. payload.toString() outputs correct values, but the .bytes property seems to behave a little weird:

    public class CustomSimilarity extends DefaultSimilarity {
      @Override
      public float scorePayload(int doc, int start, int end, BytesRef payload) {
        if (payload != null) {
          Float pscore = PayloadHelper.decodeFloat(payload.bytes);
          System.out.println("payload: " + payload.toString()
              + ", payload bytes: " + payload.bytes.toString()
              + ", decoded value is " + pscore);
          return pscore;
        }
        return 1.0f;
      }
    }

outputs, on the query http://localhost:8983/solr/collection1/pds-search?q=payloads:testone&wt=json&indent=true&debugQuery=true :

    payload: [41 26 66 66], payload bytes: [B@149c678, decoded value is 10.4
    payload: [41 f0 0 0], payload bytes: [B@149c678, decoded value is 10.4
    payload: [42 4a cc cd], payload bytes: [B@149c678, decoded value is 10.4
    payload: [42 c6 0 0], payload bytes: [B@149c678, decoded value is 10.4
    payload: [41 26 66 66], payload bytes: [B@850fb7, decoded value is 10.4
    payload: [41 f0 0 0], payload bytes: [B@1cad357, decoded value is 10.4
    payload: [42 4a cc cd], payload bytes: [B@f922cf, decoded value is 10.4
    payload: [42 c6 0 0], payload bytes: [B@5c4dc4, decoded value is 10.4

Something doesn't seem right here. Any idea why this behaviour? Is anyone using payloads with Solr 4.6.0? - Thanks, Michael
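The constant 10.4 above has a likely explanation: PayloadHelper.decodeFloat(payload.bytes) ignores payload.offset, so it always decodes the first four bytes of the shared backing array rather than this term's payload (41 26 66 66 happens to decode to 10.4), and [B@... is just the array's identity hash, not its contents. A minimal corrected sketch of the similarity from the message above:

    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.search.similarities.DefaultSimilarity;
    import org.apache.lucene.util.BytesRef;

    public class CustomSimilarity extends DefaultSimilarity {
      @Override
      public float scorePayload(int doc, int start, int end, BytesRef payload) {
        if (payload != null) {
          // Decode at the BytesRef's offset: the byte array is a shared
          // buffer, so bytes[0..3] is usually not this term's payload.
          return PayloadHelper.decodeFloat(payload.bytes, payload.offset);
        }
        return 1.0f;
      }
    }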
RE: Indexing URLs from websites
-Original message-
From: Teague James teag...@insystechinc.com
Sent: Wednesday 15th January 2014 22:01
To: solr-user@lucene.apache.org
Subject: Re: Indexing URLs from websites

I am still unsuccessful in getting this to work. My expectation is that the index-anchor plugin should produce values for the field "anchor". However, this field is not showing up in my Solr index no matter what I try. Here's what I have in my nutch-site.xml for plugins:

    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

I am using the schema-solr4.xml from the Nutch package and I added the _version_ field. Here's the command I'm running:

    bin/nutch crawl urls -solr http://localhost/solr -depth 3 -topN 50

The fields that Solr returns are: content, title, segment, boost, digest, tstamp, id, url, and _version_. Note that the url field is the URL of the page being indexed, not the URL(s) of the documents that may be outlinks on that page. It is the outlinks that I am trying to get into the index. What am I missing? I also tried using the invertlinks command that Markus suggested, but that did not work either, though I do appreciate the suggestion.

That did get you a LinkDB, right? You need to call solrindex and pass the linkdb's location as part of the arguments; only then does Nutch know about it and use the data contained in the LinkDB, together with the index-anchor plugin, to write the anchor field in your Solr index.

Any help is appreciated! Thanks!

Markus Jelsma wrote: You need to use the invertlinks command to build a database with docs with inlinks and anchors. Then use the index-anchor plugin when indexing. Then you will have a multivalued field with anchors pointing to your document.

Teague James wrote: I am trying to index a website that contains links to documents such as PDF, Word, etc. The intent is to be able to store the URLs for the links to the documents. For example, when indexing www.example.com, which has links on the page like "Example Document" pointing to www.example.com/docs/example.pdf, I want Solr to store the text of the link, "Example Document", and the URL for the link, www.example.com/docs/example.pdf, in separate fields. I've tried using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page content, but I am not getting the URLs from the links. There are no document type restrictions in Nutch for PDF or Word. Any suggestions on how I can accomplish this? Should I use a different method than Nutch for crawling the site? I appreciate any help on this!
RE: Indexing URLs from websites
Hi - you cannot use wildcards for segments. You need to give one segment or a -dir segments_dir. Check the usage of your indexer command.

-Original message-
From: Teague James teag...@insystechinc.com
Sent: Thursday 16th January 2014 16:43
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Hello Markus, I do get a linkdb folder in the crawl folder that gets created - but it is created automatically by Nutch at the time that I execute the command. I just tried to use solrindex against yesterday's crawl and did not get any errors, but did not get the anchor field or any of the outlinks. I used this command:

    bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

I then tried:

    bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

This produced the following errors:

    Indexer: org.apache.hadoop.mapred.InvalidInputException:
    Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
    Input path does not exist: file:/.../crawl/linkdb/crawl_parse
    Input path does not exist: file:/.../crawl/linkdb/parse_data
    Input path does not exist: file:/.../crawl/linkdb/parse_text

along with a Java stacktrace. So I tried invertlinks as you had previously suggested. No errors, but the above missing directories were not created. Using the same solrindex command produced the same errors again. When/how are the missing directories supposed to be created? I really appreciate the help! Thank you very much!
RE: Indexing URLs from websites
Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]

You must point to the linkdb via the -linkdb parameter.

-Original message-
From: Teague James teag...@insystechinc.com
Sent: Thursday 16th January 2014 16:57
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Okay. I changed my solrindex to this:

    bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/20140115143147

I got the same errors:

    Indexer: org.apache.hadoop.mapred.InvalidInputException:
    Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
    Input path does not exist: file:/.../crawl/linkdb/crawl_parse
    Input path does not exist: file:/.../crawl/linkdb/parse_data
    Input path does not exist: file:/.../crawl/linkdb/parse_text

along with a Java stacktrace. Those linkdb folders are not being created.
RE: Indexing URLs from websites
-Original message-
From: Teague James teag...@insystechinc.com
Sent: Thursday 16th January 2014 20:23
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Okay. I had used that previously and I just tried it again. The following generated no errors:

    bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/

Solr is still not getting an anchor field, and the outlinks are not appearing in the index anywhere else. To be sure, I deleted the crawl directory and did a fresh crawl using:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

then:

    bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/

No errors, but no anchor fields or outlinks. One thing in the response from the crawl that I found interesting was a line that said: "LinkDb: internal links will be ignored."

Good catch! That is likely the problem.

What does that mean?

    <property>
      <name>db.ignore.internal.links</name>
      <value>true</value>
      <description>If true, when adding new links to a page, links from
      the same host are ignored. This is an effective way to limit the
      size of the link database, keeping only the highest quality links.
      </description>
    </property>

So change the property, rebuild the linkdb and try reindexing once again :)
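To flip it, the property belongs in nutch-site.xml, which overrides nutch-default.xml - a minimal sketch:

    <property>
      <name>db.ignore.internal.links</name>
      <value>false</value>
    </property>

After that the linkdb has to be rebuilt (e.g. bin/nutch invertlinks crawl/linkdb -dir crawl/segments) before reindexing picks up the internal anchors.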
RE: Indexing URLs from websites
Well, it is hard to get a specific anchor because there is usually more than one. The content of the anchors field should be correct. What would you expect if there are multiple anchors?

-Original message-
From: Teague James teag...@insystechinc.com
Sent: Friday 17th January 2014 18:13
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Progress! I changed the value of that property in nutch-default.xml and I am getting the anchor field now. However, the stuff going in there is a bit random and doesn't seem to correlate to the pages I'm crawling. The primary objective is that when there is something on the page that is a link to a file, e.g. <a href="/blah/somefile.pdf">Get the PDF!</a>, I want to capture that URL and the anchor text "Get the PDF!" into field(s). Am I going in the right direction on this? Thank you so much for sticking with me on this - I really appreciate your help!
RE: Solr middle-ware?
Hi - We use Nginx to expose the index to the internet. It comes down to putting some limitations on input parameters and doing on-the-fly rewrites of queries using embedded Perl scripting. Limitations and rewrites are usually just a bunch of regular expressions, so it is not that hard. Cheers, Markus

-Original message-
From: Alexandre Rafalovitch arafa...@gmail.com
Sent: Tuesday 21st January 2014 14:01
To: solr-user@lucene.apache.org
Subject: Solr middle-ware?

Hello, All the Solr documentation talks about not exposing Solr directly to the cloud. But I see people keep asking for a thin, secure layer in front of Solr that they can talk to from JavaScript, perhaps with some basic extension options. Has anybody actually written one? Open source, or as a community part of a larger project? I would love to be able to point people at something. Is there something particularly difficult about writing one? Does anybody have a story of an aborted attempt or mid-point reversal? I would like to know. Regards, Alex.

P.s. Personal context: I am thinking of doing a series of lightweight examples of how to use Solr. Like I did for a book, but with a bit more depth, and something that can actually be exposed to the live web with live data. I don't want to reinvent the wheel of the thin Solr middleware.

P.p.s. Though I keep thinking that Dart could make an interesting option for the middleware, as it could share the same codebase on the server and in the client. Like NodeJS, but with saner syntax.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
RE: Indexing URLs from websites
Hi, are you getting PDFs at all? It sounds like a problem with URL filters; those also work on the linkdb. You should also try dumping the linkdb and inspecting it for URLs. Btw, I noticed this is on the Solr list; it's best to open a new discussion on the Nutch user mailing list. Cheers

Teague James teag...@insystechinc.com wrote:

What I'm getting is just the anchor text. In cases where there are multiple anchors, I am getting a comma-separated list of anchor text - which is fine. However, I am not getting all of the anchors that are on the page, nor am I getting any of the URLs. The anchors I am getting back never include anchors that lead to documents - which is the primary objective. So on a page that looks something like:

    Article 1 text blah blah blah [Read more]
    Article 2 text blah blah blah [Read more]
    Download the [PDF]

where each [Read more] links to a page where the rest of the article is stored and [PDF] links to a PDF document (these are relative links), what I get back in the anchor field is "[Read more],[Read more]". I am not getting the [PDF] anchor, and I am not getting any of the URLs that those anchors point to - like /Article 1, /Article 2, and /documents/Article 1.pdf. How can I get these URLs?
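For reference, the linkdb can be dumped to plain text with Nutch's readlinkdb tool (the output directory name here is illustrative):

    bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump

The dump lists each URL with its inlinks and their anchor text, which makes it easy to check whether the PDF links made it into the linkdb at all.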
AIOOBException on trunk since 21st or 22nd build
Hi - this likely belongs to an existing open issue. We're seeing the stack trace below on a build of the 22nd. Until just now we used builds of the 20th and didn't have the issue. Is this a bug, or did some data format in Zookeeper change? Until now only two cores of the same shard threw the error; all other nodes in the cluster are clean.

    2014-01-22 15:32:48,826 ERROR [solr.core.SolrCore] - [http-8080-exec-5] - : java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.solr.common.cloud.CompositeIdRouter$KeyParser.getHash(CompositeIdRouter.java:291)
        at org.apache.solr.common.cloud.CompositeIdRouter.sliceHash(CompositeIdRouter.java:58)
        at org.apache.solr.common.cloud.HashBasedRouter.getTargetSlice(HashBasedRouter.java:33)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:218)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:961)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)
        at org.apache.solr.handler.loader.XMLLoader.processDelete(XMLLoader.java:347)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:278)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1915)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:785)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:203)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
        at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
RE: AIOOBException on trunk since 21st or 22nd build
Yeah, I can now also reproduce the problem with a build of the 20th! Again the same leader and replica nodes. The problem seems to be in the data we're sending to Solr. I'll check it out and file an issue. Cheers

-Original message-
From: Mark Miller markrmil...@gmail.com
Sent: Wednesday 22nd January 2014 18:56
To: solr-user solr-user@lucene.apache.org
Subject: Re: AIOOBException on trunk since 21st or 22nd build

Looking at the list of changes on the 21st and 22nd, I don't see a smoking gun. - Mark
RE: AIOOBException on trunk since 21st or 22nd build
Filed: "Ignore or throw proper error message for bad delete containing bad composite ID" - https://issues.apache.org/jira/browse/SOLR-5659
RE: Solr Related Search Suggestions
"Query Recommendations using Query Logs in Search Engines": http://personales.dcc.uchile.cl/~churtado/clustwebLNCS.pdf

Very interesting paper, and section 2.1 covers related work plus references. In our first attempt we did it even simpler, by finding for each query the other top queries via our query and click logs. That works very well too; the big problem is normalizing query terms for deduplication - something that is never mentioned in any paper I have read so far ;)

-Original message-
From: kumar pavan2...@gmail.com
Sent: Tuesday 28th January 2014 6:09
To: solr-user@lucene.apache.org
Subject: Re: Solr Related Search Suggestions

These are just keywords
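A minimal sketch of this kind of query normalization for deduplication; the specific rules shown (lowercasing, accent folding, punctuation stripping, term sorting) are illustrative assumptions:

    import java.text.Normalizer;
    import java.util.Arrays;

    public class QueryNormalizer {
      // Collapse superficial variants ("New  York!", "new york") into one
      // canonical key so near-duplicate queries deduplicate in the log.
      public static String normalize(String query) {
        String q = query.toLowerCase().trim();
        // Fold accents: décor -> decor
        q = Normalizer.normalize(q, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        // Strip punctuation and collapse whitespace.
        q = q.replaceAll("[^\\p{Alnum}\\s]", " ").replaceAll("\\s+", " ").trim();
        // Sort terms so "york new" and "new york" map to the same key.
        String[] terms = q.split(" ");
        Arrays.sort(terms);
        return String.join(" ", terms);
      }
    }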
Re: Solr & Nutch
Short answer: you can't.

rashmi maheshwari maheshwari.ras...@gmail.com wrote:

Thanks all for the quick response. Today I crawled a webpage using Nutch. This page has many links, but all anchor tags have href="#" and JavaScript written on the onClick event of each anchor tag to open a new page. So the crawler didn't crawl any of those links that open via the onClick event and have a # href value. How can these links be crawled using Nutch?

On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko ale...@martchenko.com.br wrote:

1) Plus, those files are sometimes binaries with metadata; specific parsers are needed to understand them. HTML is plain text.

2) Yes, different data schemes. Sometimes I replicate the same core and run some A-B tests with different weights, filters, etc. And some people like to create CoreA and CoreB with the same schema, hammer CoreA with updates, commits and optimizes while keeping CoreB available for searches, then swap again. This produces faster searches.

2014-01-28 Jack Krupansky j...@basetechnology.com:

1. Nutch follows the links within HTML web pages to crawl the full graph of a web of pages.

2. Think of a core as an SQL table - each table/core has a different type of data.

3. SolrCloud is all about scaling and availability - multiple shards for larger collections, and multiple replicas for both scaling of query response and availability if nodes go down.

-- Jack Krupansky

-Original Message-
From: rashmi maheshwari
Sent: Tuesday, January 28, 2014 11:36 AM
To: solr-user@lucene.apache.org
Subject: Solr & Nutch

Hi,

Question 1: When Solr can parse HTML and documents like Word, Excel, PDF etc., why do we need Nutch to parse HTML files? What is different?

Question 2: When do we use multiple cores in Solr? Any practical business case where we need multiple cores?

Question 3: When do we go for cloud? What does implementing SolrCloud mean?

-- Rashmi
LUCENE-5388 AbstractMethodError
Hi, We have a development environment running trunk, but our custom analyzers and token filters are built against 4.6.1. The constructors have changed somewhat and stuff breaks. Here's a consumer trying to get a TokenStream from an Analyzer object doing:

    TokenStream stream = analyzer.tokenStream(null, new StringReader(input));

throwing:

    Caused by: java.lang.AbstractMethodError
        at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:140)

Changing the constructors won't work either, because on 4.x we must override that specific method: "analyzer is not abstract and does not override abstract method createComponents(String,Reader) in Analyzer" :) So, any hints on how to deal with this? Wait for a 4.x backport of LUCENE-5388, or something clever like ... fill in the blanks. Many thanks, Markus
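The incompatibility is the createComponents signature: LUCENE-5388 removed the Reader argument on trunk, so an analyzer compiled against the 4.x abstract method no longer overrides anything at runtime. A sketch of the two variants (FooTokenizer stands in for whatever tokenizer the custom analyzer builds):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;

    // 4.x: the framework hands the Reader to createComponents.
    public class MyAnalyzer4x extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        return new TokenStreamComponents(new FooTokenizer(reader));
      }
    }

    // Trunk, after LUCENE-5388: no Reader argument; the tokenizer
    // receives its input later via Tokenizer.setReader().
    public class MyAnalyzerTrunk extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new FooTokenizer());
      }
    }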
RE: Sentence Detection for Highlighting
Boundary scanner using Java's break iterator: http://wiki.apache.org/solr/HighlightingParameters#hl.boundaryScanner

-Original message-
From: Furkan KAMACI furkankam...@gmail.com
Sent: Tuesday 4th February 2014 12:03
To: solr-user@lucene.apache.org
Subject: Sentence Detection for Highlighting

Hi; I want to detect sentences in Turkish documents to generate better highlighting with Solr 4.6.1. What do you suggest for that purpose? Thanks; Furkan KAMACI
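A sketch of the matching solrconfig.xml snippet; the breakIterator boundary scanner is used by the FastVectorHighlighter (the field needs termVectors with positions and offsets), and hl.bs.language=tr is an assumption for Turkish:

    <boundaryScanner name="breakIterator"
                     class="solr.highlight.BreakIteratorBoundaryScanner">
      <lst name="defaults">
        <!-- CHARACTER, WORD, SENTENCE or LINE -->
        <str name="hl.bs.type">SENTENCE</str>
        <str name="hl.bs.language">tr</str>
      </lst>
    </boundaryScanner>

At query time something like hl=true&hl.useFastVectorHighlighter=true&hl.boundaryScanner=breakIterator should then break snippets on sentence boundaries.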
RE: Inconsistency between Leader and replica in solr cloud
Yes, that issue is fixed, but we are on trunk and see it happen again. Kill some nodes while indexing, trigger an OOM, or reload the collection, and you are in trouble again.

-Original message-
From: Yago Riveiro yago.rive...@gmail.com
Sent: Monday 24th February 2014 14:54
To: solr-user@lucene.apache.org
Subject: Re: Inconsistency between Leader and replica in solr cloud

This bug was fixed in Solr 4.6.1 - Yago Riveiro

On Mon, Feb 24, 2014 at 11:56 AM, abhijit das abhijitdas1...@outlook.com wrote: We are currently using SolrCloud version 4.3 with the following setup: a core with two shards, Shard1 and Shard2, each with replication factor 1. We have noticed that in one of the shards a document differs between the leader and the replica. Though the doc exists on both machines, the properties of the doc are not the same. This is causing inconsistent results in subsequent queries; our understanding was that the docs would be replicated and identical on both leader and replica. What could be causing this, and how can it be avoided? Thanks in advance. Regards, Abhijit
RE: How To Test SolrCloud Indexing Limits
Something must be eating your memory in your SolrCloud indexer in Nutch. We have our own SolrCloud indexer in Nutch and it uses extremely little memory. You either have a leak or your batch size is too large.

-Original message-
From: Furkan KAMACI furkankam...@gmail.com
Sent: Thursday 27th February 2014 16:04
To: solr-user@lucene.apache.org
Subject: How To Test SolrCloud Indexing Limits

Hi; I'm trying to index 2 million documents into SolrCloud via map-reduce jobs (a really small number of documents for my system). However, I get the error below in my tasks when I increase the number of documents added at once:

    java.lang.ClassCastException: java.lang.OutOfMemoryError cannot be cast to java.lang.Exception
        at org.apache.solr.client.solrj.impl.CloudSolrServer$RouteException.<init>(CloudSolrServer.java:484)
        at org.apache.solr.client.solrj.impl.CloudSolrServer.directUpdate(CloudSolrServer.java:351)
        at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:510)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at org.apache.nutch.indexwriter.solrcloud.SolrCloudIndexWriter.close(SolrCloudIndexWriter.java:95)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:114)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:649)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:363)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

I use Solr 4.5.1 for this, and I do not get any errors on my SolrCloud nodes. I want to test my indexing capability and have changed some parameters to tune it. Is there any guidance on autocommit/softcommit size, or the maxTime/maxDocs parameters, to test? I don't need exact numbers; I just want to follow a policy, such as: increase autocommit and maxDocs, don't use softcommit and maxTime (or maybe there's no free lunch and I should try everything!). I don't ask this question for production purposes; I know that for that I should test more parameters and tune my system. I just want to test my indexing limits. Thanks; Furkan KAMACI
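A sketch of bounded-batch indexing with SolrJ's CloudSolrServer, so the client never buffers more than one batch on the heap; the batch size, zkHost and collection name are illustrative assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BoundedBatchIndexer {
      private static final int BATCH_SIZE = 1000; // tune against your heap

      public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
        server.setDefaultCollection("collection1");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
        for (int i = 0; i < 2000000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-" + i);
          doc.addField("content", "document body " + i);
          batch.add(doc);
          if (batch.size() >= BATCH_SIZE) {
            server.add(batch); // flush: nothing accumulates beyond one batch
            batch.clear();
          }
        }
        if (!batch.isEmpty()) {
          server.add(batch);
        }
        server.commit();
        server.shutdown();
      }
    }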
RE: Id As URL for Solrj
You are not escaping the Lucene query parser special characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

-Original message-
From: Furkan KAMACI furkankam...@gmail.com
Sent: Tuesday 4th March 2014 16:57
To: solr-user@lucene.apache.org
Subject: Id As URL for Solrj

Hi; This may be a simple question, but when I query from the admin interface, id:am.mobileworld.www:http/ returns one document. However, when I do it from SolrJ with deleteById it does not. Also, when I send the query via SolrJ it returns all documents (for id:am.mobileworld.www:http/). I've escaped the terms, URL-encoded, and so on. What is the most appropriate way to do this? Thanks; Furkan KAMACI
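SolrJ ships a helper for exactly this; a sketch using the id from the question above (note that deleteById takes the raw unique key and needs no query-parser escaping):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapedIdQuery {
      public static void main(String[] args) {
        String id = "am.mobileworld.www:http/";
        // Backslash-escapes the query parser metacharacters, here ':' and '/'.
        String escaped = ClientUtils.escapeQueryChars(id);
        SolrQuery query = new SolrQuery("id:" + escaped);
        System.out.println(query.getQuery()); // id:am.mobileworld.www\:http\/
      }
    }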
RE: IDF maxDocs / numDocs
Hi Steve - it seems most similarities use CollectionStatistics.maxDoc() in idfExplain, but there's also docCount(). We use docCount() in all our custom similarities, also because it allows you to have multiple languages in one index where one is much larger than the other. The small language will get very high IDF scores using maxDoc(), but they are proportional enough using docCount(). Using docCount() also fixes SolrCloud ranking problems, unless one of your replicas becomes inconsistent ;)

https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29

-Original message-
From: Steven Bower smb-apa...@alcyon.net
Sent: Wednesday 12th March 2014 16:08
To: solr-user solr-user@lucene.apache.org
Subject: IDF maxDocs / numDocs

I am noticing that maxDocs is consistently different between replicas, and since it is used in the idf calculation, idf scores for the same query/doc differ between replicas. Obviously an optimize can normalize the maxDocs values, but that is only temporary. Is there a way to have idf use numDocs instead (as it should be consistent across replicas)? thanks, steve
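A minimal sketch of such a similarity against the Lucene 4.x API; docCount() can return -1 when the codec did not record it, hence the fallback:

    import org.apache.lucene.search.CollectionStatistics;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.TermStatistics;
    import org.apache.lucene.search.similarities.DefaultSimilarity;

    public class DocCountSimilarity extends DefaultSimilarity {
      @Override
      public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
        long df = termStats.docFreq();
        // Count of docs that actually have a value for this field,
        // instead of all docs in the index (maxDoc).
        long docCount = collectionStats.docCount();
        long numDocs = docCount == -1 ? collectionStats.maxDoc() : docCount;
        float idf = idf(df, numDocs);
        return new Explanation(idf, "idf(docFreq=" + df + ", docCount=" + numDocs + ")");
      }
    }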
RE: IDF maxDocs / numDocs
Oh yes, i see what you mean. I would try SOLR-1632 and have distributed IDF, but it seems to be broken now. -Original message- From:Steven Bower smb-apa...@alcyon.net Sent: Wednesday 12th March 2014 21:47 To: solr-user solr-user@lucene.apache.org Subject: Re: IDF maxDocs / numDocs My problem is that maxDoc() and docCount() both report documents that have been deleted in their values. Because of merging/etc. those numbers can be different per replica (or at least that is what I'm seeing). I need a value that is consistent across replicas... I see in the comment it makes mention of not using IndexReader.numDocs(), but there doesn't seem to be a way to get hold of the IndexReader within a similarity implementation (as only TermStats and CollectionStats are passed in, and neither contains a ref to the reader). I am contemplating just using a static value for the number of docs as this won't change dramatically often. steve On Wed, Mar 12, 2014 at 11:18 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Steve - it seems most similarities use CollectionStatistics.maxDoc() in idfExplain but there's also a docCount(). We use docCount() in all our custom similarities, also because it allows you to have multiple languages in one index where one is much larger than the other. The small language will have very high IDF scores using maxDoc() but they are proportional enough using docCount(). Using docCount() also fixes SolrCloud ranking problems, unless one of your replicas becomes inconsistent ;) https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29 -Original message- From:Steven Bower smb-apa...@alcyon.net Sent: Wednesday 12th March 2014 16:08 To: solr-user solr-user@lucene.apache.org Subject: IDF maxDocs / numDocs I am noticing the maxDocs between replicas is consistently different, and since it is used in the idf calculation, idf scores for the same query/doc differ between replicas. Obviously an optimize can normalize the maxDoc values, but that is only temporary. Is there a way to have idf use numDocs instead (as it should be consistent across replicas)? thanks, steve
Re: Bug with OpenJDK on Ubuntu - affects Solr users
Hi - as far as i know it has never been a good idea to run Lucene on OpenJDK 6 at all. Use either Oracle Java 6 or higher, or OpenJDK 7. On Wednesday, March 26, 2014 06:54:41 PM Nigel Sheridan-Smith wrote: Hi all, This is a bit of a 'heads up'. We have recently come across this bug on Ubuntu with OpenJDK: https://bugs.launchpad.net/ubuntu/+source/openjdk-6/+bug/1295987 Basically, finalizers are not being run, so effectively all of the commits written in SolrIndexWriter are not garbage collected. If you find that your Java heap memory grows continuously at around 4-8MB per index update, and you are running this version of OpenJDK, and the garbage collector does not recycle much memory from the Old Gen generation, then this is likely to be your problem. We increased our heap space from 1GB to 4GB but the memory usage continued to grow at about the same pace. It was only when we ran 'jmap' and analysed the heap dump with Eclipse MAT that it became obvious that unreferenced objects were not being correctly garbage collected. I hope this helps someone else! Cheers, Nigel Sheridan-Smith
Re: tf and very short text fields
Yes, override TFIDFSimilarity and emit 1f in tf(). You can also use BM25 with k1 set to zero in your schema. Walter Underwood wun...@wunderwood.org wrote: And here is another peculiarity of short text fields. The movie New York, New York should not be twice as relevant for the query new york. Is there a way to use a binary term frequency rather than a count? wunder -- Walter Underwood wun...@wunderwood.org
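A minimal sketch of that override against the Lucene 4.x API (the class name is made up); it can be referenced from a similarity declaration in schema.xml, either directly or via a small SimilarityFactory:

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class BinaryTfSimilarity extends DefaultSimilarity {
  @Override
  public float tf(float freq) {
    // binary term frequency: a term either occurs in the field or it doesn't
    return freq > 0 ? 1f : 0f;
  }
}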
Re: omitNorms and very short text fields
Yes, that will work. And combined with your other question, scores will always be equal even if cinderella or chuck occurs more than once in one document. Walter Underwood wun...@wunderwood.org wrote: Just double-checking my understanding of omitNorms. For very short text fields like personal names or titles, length normalization can give odd results. For example, we might want these two to score the same for the query Cinderella: * Cinderella * Cinderella (Diamond Edition) (Blu-ray + DVD + Digital Copy) (Widescreen) And these two for the query chuck: * Chuck House * Chuck E. Cheese I think that omitNorms=true on those fields will give that behavior. Is that the right approach? wunder -- Walter Underwood wun...@wunderwood.org
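For reference, a hypothetical field declaration with norms omitted (the field and type names are illustrative):

<field name="title" type="text_general" indexed="true" stored="true" omitNorms="true"/>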
Re: Re: tf and very short text fields
Also, if i remember correctly, k1 set to zero for BM25 automatically omits norms in the calculation. So that's easy to play with without reindexing. Markus Jelsma markus.jel...@openindex.io wrote: Yes, override TFIDFSimilarity and emit 1f in tf(). You can also use BM25 with k1 set to zero in your schema. Walter Underwood wun...@wunderwood.org wrote: And here is another peculiarity of short text fields. The movie New York, New York should not be twice as relevant for the query new york. Is there a way to use a binary term frequency rather than a count? wunder -- Walter Underwood wun...@wunderwood.org
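A sketch of the schema.xml change (global similarity; k1=0 is the value discussed, b left at its default):

<similarity class="solr.BM25SimilarityFactory">
  <float name="k1">0.0</float>
  <float name="b">0.75</float>
</similarity>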
Re: Re: solr 4.2.1 index gets slower over time
You may want to increase reclaimDeletesWeight for TieredMergePolicy from 2 to 3 or 4. By default it may keep too many deleted or updated docs in the index. This can increase index size by 50%!! Dmitry Kan solrexp...@gmail.com wrote: Elisabeth, Yes, I believe you are right in that the deletes are part of the optimize process. If you delete often, you may consider (if not already) the TieredMergePolicy, which is suited for this scenario. Check out this relevant discussion I had with Lucene committers: https://twitter.com/DmitryKan/status/399820408444051456 HTH, Dmitry On Tue, Apr 1, 2014 at 11:34 AM, elisabeth benoit elisaelisael...@gmail.com wrote: Thanks a lot for your answers! Shawn, our GC configuration has far fewer parameters defined, so we'll check this out. Dmitry, about the expungeDeletes option, we'll add that in the delete process. But from what I read, this is done in the optimize process (cf. http://lucene.472066.n3.nabble.com/Does-expungeDeletes-need-calling-during-an-optimize-td1214083.html ). Or maybe not? Thanks again, Elisabeth 2014-04-01 7:52 GMT+02:00 Dmitry Kan solrexp...@gmail.com: Hi, We have noticed something like this as well, but with older versions of solr, 3.4. In our setup we delete documents pretty often. Internally in Lucene, when a client requests a document to be deleted, it is not physically deleted, but only marked as deleted. Our original optimization assumption was that the deleted documents would get physically removed on each optimize command issued. We started to suspect it wasn't always true as the shards (especially relatively large shards) became slower over time. So we found out about the expungeDeletes option, which purges the deleted docs and is by default false. We have set it to true. If your solr update lifecycle includes frequent deletes, try this out. This of course does not override working towards finding better GC parameters. https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching On Mon, Mar 31, 2014 at 3:57 PM, elisabeth benoit elisaelisael...@gmail.com wrote: Hello, We are currently using solr 4.2.1. Our index is updated on a daily basis. After noticing solr query time has increased (two times the initial value) without any change in index size or in solr configuration, we tried an optimize on the index but it didn't fix our problem. We checked the garbage collector, but everything seemed fine. What did in fact fix our problem was to delete all documents and reindex from scratch. It looks like over time our index gets corrupted and optimize doesn't fix it. Does anyone have a clue how to investigate further this situation? Elisabeth -- Dmitry Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan -- Dmitry Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan
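A sketch of the solrconfig.xml change (3.0 is an illustrative value):

<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- default is 2.0; higher values favor merging segments with deletes -->
    <double name="reclaimDeletesWeight">3.0</double>
  </mergePolicy>
</indexConfig>

Deleted documents can also be purged explicitly with a commit carrying expungeDeletes, e.g. curl 'http://localhost:8983/solr/collection1/update?commit=true&expungeDeletes=true' (host and core name illustrative).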
RE: tf and very short text fields
Hi - In this case Walter, iirc, was looking for two things: no length normalization and a flat TF (1f for tf(float freq) > 0). We know that k1 controls TF saturation, but in BM25Similarity you can see that k1 is multiplied by the encoded norm value, taking b also into account. So setting k1 to zero effectively disables length normalization and results in a flat or binary TF. Here's example output for k1 = 0 and k1 = 0.2. Norms are enabled on the field and the term occurs three times in the field: 28.203003 = score(doc=0,freq=1.5 = phraseFreq=1.5 ), product of: 6.4 = boost 4.406719 = idf(docFreq=1, docCount=122) 1.0 = tfNorm, computed from: 1.5 = phraseFreq=1.5 0.0 = parameter k1 0.75 = parameter b 8.721312 = avgFieldLength 16.0 = fieldLength 27.813797 = score(doc=0,freq=1.5 = phraseFreq=1.5 ), product of: 6.4 = boost 4.406719 = idf(docFreq=1, docCount=122) 0.98619986 = tfNorm, computed from: 1.5 = phraseFreq=1.5 0.2 = parameter k1 0.75 = parameter b 8.721312 = avgFieldLength 16.0 = fieldLength You can clearly see the final TF norm being 1, despite the term frequency and length. Please correct my wrongs :) Markus -Original message- From:Tom Burton-West tburt...@umich.edu Sent: Thursday 3rd April 2014 20:18 To: solr-user@lucene.apache.org Subject: Re: tf and very short text fields Hi Markus and Wunder, I'm missing the original context, but I don't think BM25 will solve this particular problem. The k1 parameter sets how quickly the contribution of tf to the score falls off with increasing tf. It would be helpful for making sure really long documents don't get too high a score, but I don't think it would help for very short documents without messing up its original design purpose. For BM25, if you want to turn off length normalization, you set b to 0. However, I don't think that will do what you want, since turning off normalization will mean that the score for new york, new york will be twice that of the score for new york, since without normalization the tf in new york new york is twice that of new york. I think the earlier suggestion to override TFIDFSimilarity and emit 1f in tf() is probably the best way to eliminate using tf counts, assuming that is really what you want. Tom On Tue, Apr 1, 2014 at 4:17 PM, Walter Underwood wun...@wunderwood.org wrote: Thanks! We'll try that out and report back. I keep forgetting that I want to try BM25, so this is a good excuse. wunder On Apr 1, 2014, at 12:30 PM, Markus Jelsma markus.jel...@openindex.io wrote: Also, if i remember correctly, k1 set to zero for BM25 automatically omits norms in the calculation. So that's easy to play with without reindexing. Markus Jelsma markus.jel...@openindex.io wrote: Yes, override TFIDFSimilarity and emit 1f in tf(). You can also use BM25 with k1 set to zero in your schema. Walter Underwood wun...@wunderwood.org wrote: And here is another peculiarity of short text fields. The movie New York, New York should not be twice as relevant for the query new york. Is there a way to use a binary term frequency rather than a count? wunder -- Walter Underwood wun...@wunderwood.org -- Walter Underwood wun...@wunderwood.org
RE: Strange relevance scoring
Hi - the thing you describe is possible when your setup uses SpanFirstQuery. But to be sure what's going on you should post the debug output. -Original message- From:John Nielsen j...@mcb.dk Sent: Tuesday 8th April 2014 11:03 To: solr-user@lucene.apache.org Subject: Strange relevance scoring Hi, We are seeing a strange phenomenon with our Solr setup which I have been unable to explain. My Google-fu is clearly not up to the task, so I am trying here. It appears that if i do a freetext search for a single word, say modellering, on a text field, the scoring is massively boosted if the first word of the text field is a hit. For instance, if there is only one occurrence of the word modellering in the text field and that occurrence is the first word of the text, then that document gets a higher relevancy than if the word modellering occurs 5 times in the text and the first word of the text is any other word. Is this normal behavior? Is special attention paid to the first word in a text field? I would think that the latter case would get the highest score. -- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
Re: Fails to index if unique field has special characters
Well, this is somewhat of a problem if you have URLs as uniqueKey that contain exclamation marks. Wouldn't it be an idea to allow those to be escaped and thus ignored by CompositeIdRouter? On Friday, April 11, 2014 11:43:31 AM Cool Techi wrote: Thanks, that was helpful. Regards, Rohit Date: Thu, 10 Apr 2014 08:44:36 -0700 From: iori...@yahoo.com Subject: Re: Fails to index if unique field has special characters To: solr-user@lucene.apache.org Hi Ayush, I think this is it: IBM!12345. The exclamation mark ('!') is critical here, as it distinguishes the prefix used to determine which shard to direct the document to. https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud On Thursday, April 10, 2014 2:35 PM, Cool Techi cooltec...@outlook.com wrote: Hi, We are migrating from Solr 4.6 standalone to Solr 4.7 cloud version, and while reindexing the documents we are getting the following error. This is happening when the unique key has special characters; this was not noticed in version 4.6 standalone mode, so we are not sure if this is a version problem or a cloud issue. An example of the unique key is given below, http://www.mynews.in/Blog/smrity!!**)))!miami_dolphins_vs_dallas_cowboys_live_stream_on_line_nfl_football_free_video_broadcast_B142707.html Exception Stack Trace ERROR - 2014-04-10 10:51:44.361; org.apache.solr.common.SolrException; java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.solr.common.cloud.CompositeIdRouter$KeyParser.getHash(CompositeIdRouter.java:296) at org.apache.solr.common.cloud.CompositeIdRouter.sliceHash(CompositeIdRouter.java:58) at org.apache.solr.common.cloud.HashBasedRouter.getTargetSlice(HashBasedRouter.java:33) at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:218) at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550) at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100) at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247) at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174) at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandle Thanks, Ayush
RE: Topology of Solr use
This may help a bit: https://wiki.apache.org/solr/PublicServers -Original message- From:Olivier Austina olivier.aust...@gmail.com Sent:Thu 17-04-2014 18:16 Subject:Topology of Solr use To:solr-user@lucene.apache.org; Hi All, I would like to have an idea about Solr usage: number of users, industry, countries or any helpful information. Thank you. Regards Olivier
Re: Boost Search results
Hi, replicating full-featured search engine behaviour is not going to work with Nutch and Solr out of the box. You are missing a thousand features such as proper main content extraction, deduplication, classification of content and hub or link pages, and much more. These things are possible to implement, but you may want to start with having your Solr request handler better configured; to begin with, your qf parameter does not have Nutch's default title and content fields selected. A Laxmi a.lakshmi...@gmail.com wrote: Hi, When I started to compare the search results with the two options below, I see a lot of difference in the search results, esp. the *urls that show up on the top* (*relevancy* perspective). (1) Nutch 2.2.1 (with *Solr 4.0*) (2) Bing custom search set-up I wonder how I should tweak the boost parameters to get the best results on the top like Bing, Google does. Please suggest why I see a difference and what parameters are best to configure in Solr to achieve what I see from Bing or Google search relevancy. Here is what i got in solrconfig.xml: <str name="defType">edismax</str> <str name="qf"> text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4 </str> <str name="q.alt">*:*</str> <str name="rows">10</str> <str name="fl">*,score</str> Thanks
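As a starting point, here is a qf built on fields Nutch's stock Solr schema actually populates (the boost values are illustrative guesses, not tuned weights):

<str name="defType">edismax</str>
<str name="qf">title^4.0 content^1.0 anchor^2.0 url^0.5</str>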
Re: Re: PostingHighlighter complains about no offsets
Hello michael, you are not on lucene 4.8? https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-5111 Michael Sokolov msoko...@safaribooksonline.com schreef:For posterity, in case anybody follows this thread, I tracked the problem down to WordDelimiterFilter; apparently it creates an offset of -1 in some case, which PostingsHighlighter rejects. -Mike On 5/2/2014 10:20 AM, Michael Sokolov wrote: I checked using the analysis admin page, and I believe there are offsets being generated (I assume start/end=offsets). So IDK I am going to try reindexing again. Maybe I neglected to reload the config before I indexed last time. -Mike On 05/02/2014 09:34 AM, Michael Sokolov wrote: I've been wanting to try out the PostingsHighlighter, so I added storeOffsetsWithPositions to my field definition, enabled the highlighter in solrconfig.xml, reindexed and tried it out. When I issue a query I'm getting this error: |field 'text' was indexed without offsets, cannot highlight java.lang.IllegalArgumentException: field 'text' was indexed without offsets, cannot highlight at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightDoc(PostingsHighlighter.java:545) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightField(PostingsHighlighter.java:467) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFieldsAsObjects(PostingsHighlighter.java:392) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:293)| I've been trying to figure out why the field wouldn't have offsets indexed, but I just can't see it. Is there something in the analysis chain that could stripping out offsets? This is the field definition: field name=text type=text_en indexed=true stored=true multiValued=false termVectors=true termPositions=true termOffsets=true storeOffsetsWithPositions=true / (Yes I know PH doesn't require term vectors; I'm keeping them around for now while I experiment) fieldType name=text_en class=solr.TextField positionIncrementGap=100 analyzer type=index !-- We are indexing mostly HTML so we need to ignore the tags -- charFilter class=solr.HTMLStripCharFilterFactory/ !--tokenizer class=solr.StandardTokenizerFactory/-- tokenizer class=solr.WhitespaceTokenizerFactory/ !-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -- filter class=solr.LowerCaseFilterFactory/ filter class=solr.WordDelimiterFilterFactory stemEnglishPossessive=1 protected=protwords.txt/ !-- This deals with contractions -- filter class=solr.SynonymFilterFactory synonyms=synonyms.txt expand=true ignoreCase=true/ filter class=solr.HunspellStemFilterFactory dictionary=en_US.dic affix=en_US.aff ignoreCase=true/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer analyzer type=query !--tokenizer class=solr.StandardTokenizerFactory/-- tokenizer class=solr.WhitespaceTokenizerFactory/ !-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -- filter class=solr.LowerCaseFilterFactory/ filter class=solr.WordDelimiterFilterFactory protected=protwords.txt/ !-- setting tokenSeparator= solves issues with compound words and improves phrase search -- filter class=solr.HunspellStemFilterFactory dictionary=en_US.dic affix=en_US.aff ignoreCase=true/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType
RE: permissive mm value and efficient spellchecking
Elisabeth, i think you are looking for SOLR-3211, which introduced spellcheck.collateParam.* to override e.g. dismax settings. Markus -Original message- From:elisabeth benoit elisaelisael...@gmail.com Sent:Wed 14-05-2014 14:01 Subject:permissive mm value and efficient spellchecking To:solr-user@lucene.apache.org; Hello, I'm using solr 4.2.1. I use a very permissive value for mm, to be able to find results even if the request contains non relevant words. At the same time, I'd like to be able to do some efficient spellchecking with DirectSolrSpellChecker. So for instance, if a user searches for rue de Chraonne Paris, where Chraonne is misspelled, because of my permissive mm value I get more than 100 000 results containing the words rue and Paris (de is a stopword), which are very frequent terms in my index, but no spellcheck correction for Chraonne. If I set mm=3, then I get the expected spellcheck correction: rue de Charonne Paris. Is there a way to achieve my two goals in a single solr request? Thanks, Elisabeth
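A sketch of the request parameters (values illustrative; 100%25 is the URL-encoded form of 100%): the permissive mm still applies to the main query, while collations are tested with a strict one:

spellcheck=true
&spellcheck.collate=true
&spellcheck.maxCollationTries=5
&spellcheck.collateParam.mm=100%25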
RE: Solr + SPDY
Hi Harsh, Does SPDY provide lower latency than HTTP/1.1 with KeepAlive or is it encryption that you're after? Markus -Original message- From:harspras prasadta...@outlook.com Sent:Tue 13-05-2014 05:38 Subject:Re: Solr + SPDY To:solr-user@lucene.apache.org; Hi Vinay, I have been trying to setup a similar environment with SPDY being enabled for Solr inter shard communication. Did you happen to have been able to do it? I somehow cannot use SolrCloud with SPDY enabled in jetty. Regards, Harsh Prasad -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-SPDY-tp4097771p4135377.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Edismax should, should not, exact match operators
http://wiki.apache.org/solr/ExtendedDisMax#Query_Syntax -Original message- From:michael.boom my_sky...@yahoo.com Sent:Tue 10-06-2014 13:15 Subject:Edismax should, should not, exact match operators To:solr-user@lucene.apache.org; On google a user can query using operators like + or - and quote the desired term in order to get the desired match. Does something like this come by default with edismax parser ? - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/Edismax-should-should-not-exact-match-operators-tp4140967.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Recommended ZooKeeper topology in Production
Yes, always use three or a higher odd number of machines. It is best to have them on dedicated machines, and unless the cluster is very large, three small VPS machines with 512 MB RAM suffice. -Original message- From:Gili Nachum gilinac...@gmail.com Sent:Tue 10-06-2014 08:58 Subject:Recommended ZooKeeper topology in Production To:solr-user@lucene.apache.org; Is there a recommended ZooKeeper topology for production Solr environments? I was planning: 3 ZK nodes, each on its own dedicated machine. Thinking that dedicated machines, separate from the Solr servers, would keep ZK isolated from resource contention spikes that may occur on Solr. Also, if a Solr machine goes down, there would still be 3 ZK nodes to handle the event properly. If I want to save on resources, is placing each ZK instance on the same box as a Solr instance considered common practice in production environments? Thanks!
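A minimal zoo.cfg sketch for such a 3-node ensemble (hostnames and paths are illustrative):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

Each node also needs a myid file in dataDir containing its server number (1, 2 or 3).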
RE: docFreq coming to be more than 1 for unique id field
Hi - did you perhaps update one of those documents? -Original message- From:Apoorva Gaurav apoorva.gau...@myntra.com Sent: Tuesday 17th June 2014 16:58 To: solr-user@lucene.apache.org Subject: docFreq coming to be more than 1 for unique id field Hello All, We are using solr 4.4.0. We have a uniqueKey of type solr.StrField. We need to extract docs in a pre-defined order if they match a certain condition. Our query is of the format uniqueField:(id1^weight1 OR id2^weight2 ... OR idN^weightN) where weight1 > weight2 > ... > weightN. But the result is not in the desired order. On debugging the query we've found out that for some of the documents docFreq is higher than 1 and hence their tf-idf based score is less than others. What can be the reason behind a unique id field having docFreq greater than 1? How can we prevent it? -- Thanks & Regards, Apoorva
RE: docFreq coming to be more than 1 for unique id field
Yes, it is unique, but deleted documents are not immediately purged; that only happens on optimize/forceMerge or during regular segment merges. The problem is that until then they keep messing with the statistics. -Original message- From:Apoorva Gaurav apoorva.gau...@myntra.com Sent: Tuesday 17th June 2014 17:16 To: solr-user solr-user@lucene.apache.org; Ahmet Arslan iori...@yahoo.com Subject: Re: docFreq coming to be more than 1 for unique id field Yes we have updates on these. Didn't try optimizing, will do. But isn't the unique field supposed to be unique? On Tue, Jun 17, 2014 at 8:37 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi, Just a guess, do you have deletions? What happens when you optimize and re-try? On Tuesday, June 17, 2014 5:58 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello All, We are using solr 4.4.0. We have a uniqueKey of type solr.StrField. We need to extract docs in a pre-defined order if they match a certain condition. Our query is of the format uniqueField:(id1^weight1 OR id2^weight2 ... OR idN^weightN) where weight1 > weight2 > ... > weightN. But the result is not in the desired order. On debugging the query we've found out that for some of the documents docFreq is higher than 1 and hence their tf-idf based score is less than others. What can be the reason behind a unique id field having docFreq greater than 1? How can we prevent it? -- Thanks & Regards, Apoorva -- Thanks & Regards, Apoorva
Re: Unable to start solr 4.8
Hi - remove the lock file in your solr/collection_name/data/index.*/ directory. Markus On Thursday, June 19, 2014 04:10:51 AM atp wrote: Hi experts, i have cnfigured solrcloud, on three machines , zookeeper started with no errors, tomcat log also no errors , solr log alos no errors reported but all the tomcat configured solr clusterstate shows as 'down' ,8870931 [Thread-13] INFO org.apache.solr.common.cloud.ZkStateReader â Updating cloud state from ZooKeeper... 8870934 [Thread-13] INFO org.apache.solr.cloud.Overseer â Update state numShards=2 message={ operation:state, state:down, base_url:http://10.***.***.28:7090/solr;, core:collection1, roles:null, node_name:10.***.***.28:7090_solr, shard:shard2, collection:collection1, numShards:2, core_node_name:10.***.***.28:7090_solr_collection1} 8870939 [main-EventThread] INFO org.apache.solr.cloud.DistributedQueue â LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged 8870942 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader â A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5) 8919667 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader â Updating live nodes... (4) 8933777 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader â Updating live nodes... (3) 8965906 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader â Updating live nodes... (4) 8965994 [main-EventThread] INFO org.apache.solr.cloud.DistributedQueue â LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged 8965997 [Thread-13] INFO org.apache.solr.common.cloud.ZkStateReader â Updating cloud state from ZooKeeper... 8966000 [Thread-13] INFO org.apache.solr.cloud.Overseer â Update state numShards=2 message={ operation:state, state:down, base_url:http://10.***.***.29:7070/solr;, core:collection1, roles:null, node_name:10.***.***.29:7070_solr, shard:shard1, collection:collection1, numShards:2, core_node_name:110.***.***.29:7070_solr_collection1} 8966006 [main-EventThread] INFO org.apache.solr.cloud.DistributedQueue â LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged 8966008 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader â A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4) 8986466 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader â Updating live nodes... (5) 8986648 [main-EventThread] INFO org.apache.solr.cloud.DistributedQueue â LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged 8986652 [Thread-13] INFO org.apache.solr.common.cloud.ZkStateReader â Updating cloud state from ZooKeeper... 
8986654 [Thread-13] INFO org.apache.solr.cloud.Overseer â Update state numShards=2 message={ operation:state, state:down, base_url:http://10.***.***.30:7080/solr;, core:collection1, roles:null, node_name:10.***.***.30:7080_solr, shard:shard1, collection:collection1, numShards:2, core_node_name:10.***.***.30:7080_solr_collection1} 8986661 [main-EventThread] INFO org.apache.solr.cloud.DistributedQueue â LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged 898 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader â A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5) 9008407 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader â Updating live nodes... (6) when i browse the 28,29 and 30th solr url , its throwing error like, HTTP Status 500 - {msg=SolrCore 'collection1' is not available due to init failure: Index locked for write for core collection1,trace=org.apache.solr.common.SolrException: SolrCore 'collection1' is not available due to init failure: Index locked for write for core collection1 at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:753) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java: 347) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java: 207) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application FilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh ain.java:208) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja va:220) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja va:122) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171 ) at
RE: How much free disk space will I need to optimize my index
-Original message- From:johnmu...@aol.com johnmu...@aol.com Sent: Wednesday 25th June 2014 20:13 To: solr-user@lucene.apache.org Subject: How much free disk space will I need to optimize my index Hi, I need to de-fragment my index. My question is, how much free disk space do I need before I can do so? My understanding is, I need 1X the size of my current un-optimized index in free disk space before I can optimize it. Is this true? Yes, 20 GB of FREE space to force merge an existing 20 GB index. That is, let's say my index is 20 GB (un-optimized), then I must have 20 GB of free disk space to make sure the optimization is successful. The reason for this is because during optimization the index is re-written (is this the case?) and if it is already optimized, the re-write will create a new 20 GB index before it deletes the old one (is this true?), which is why there must be at least 20 GB free disk space. Can someone help me with this or point me to a wiki on this topic? Thanks!!! - MJ
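For reference, a force merge can be triggered through the update handler (host and core name are illustrative):

curl 'http://localhost:8983/solr/collection1/update?optimize=true&maxSegments=1'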
RE: unable to start solr instance
(Too many open files) Try raising the limit from probably 1024 to 4k-16k orso. -Original message- From:Niklas Langvig niklas.lang...@globesoft.com Sent: Monday 30th June 2014 17:09 To: solr-user@lucene.apache.org Subject: unable to start solr instance Hello, We havet o solr instances running on linux/tomcat7 Both have been working fine, now only 1 works. The other seems to have crashed or something. SolrCore Initialization Failures * collection1: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error initializing QueryElevationComponent. We havn't changed anything in the setup. Earlier 4 days ago I could see in the logs response lst name=responseHeaderint name=status500/intint name=QTime0/int/lstlst name=errorstr name=msgjava.io.FileNotFoundException: /opt/solr410/document/collection1/data/tlog/tlog.2494137 (Too many open files)/strstr name=traceorg.apache.solr.common.SolrException: java.io.FileNotFoundException: /opt/solr410/document/collection1/data/tlog/tlog.2494137 (Too many open files) at org.apache.solr.update.TransactionLog.lt;initgt;(TransactionLog.java:182) at org.apache.solr.update.TransactionLog.lt;initgt;(TransactionLog.java:140) at org.apache.solr.update.UpdateLog.ensureLog(UpdateLog.java:796) at org.apache.solr.update.UpdateLog.delete(UpdateLog.java:409) at org.apache.solr.update.DirectUpdateHandler2.delete(DirectUpdateHandler2.java:284) at org.apache.solr.update.processor.RunUpdateProcessor.processDelete(RunUpdateProcessorFactory.java:77) at org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55) at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalDelete(DistributedUpdateProcessor.java:460) at org.apache.solr.update.processor.DistributedUpdateProcessor.versionDelete(DistributedUpdateProcessor.java:1036) at org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:721) at org.apache.solr.update.processor.LogUpdateProcessor.processDelete(LogUpdateProcessorFactory.java:121) at org.apache.solr.handler.loader.XMLLoader.processDelete(XMLLoader.java:346) at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:277) at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173) at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:722) Caused by: java.io.FileNotFoundException: /opt/solr410/document/collection1/data/tlog/tlog.2494137
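A sketch of checking and raising the limit on Linux (the user name and values are illustrative):

# show the current per-process limit, often 1024
ulimit -n

# /etc/security/limits.conf: raise it for the user running Tomcat
tomcat  soft  nofile  16384
tomcat  hard  nofile  16384

Log out and back in (or restart the service) for the new limits to take effect.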
RE: NPE when using facets with the MLT handler.
Hi, i don't think this is ever going to work with the MLT Handler, you should use the regular SearchHandler instead. -Original message- From:SafeJava T t...@safejava.com Sent: Monday 30th June 2014 17:52 To: solr-user@lucene.apache.org Subject: NPE when using facets with the MLT handler. I am getting an NPE when using facets with the MLT handler. I googled for other npe errors with facets, but this trace looked different from the ones I found. We are using Solr 4.9-SNAPSHOT. I have reduced the query to the most basic form I can: q=id:XXXmlt.fl=mlt_fieldfacet=truefacet.field=id I changed it to facet on id, to ensure that the field was present in all results. Any ideas on how to work around this? java.lang.NullPointerException at org.apache.solr.search.facet.SimpleFacets.addFacets(SimpleFacets.java:375) at org.apache.solr.handler.MoreLikeThisHandler.handleRequestBody(MoreLikeThisHandler.java:211) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1955) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:769) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:744) Thanks, Tom
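The same per-document MLT results can be had from the regular SearchHandler via the MoreLikeThisComponent, where faceting does work; a sketch reusing the parameters from the question:

/select?q=id:XXX&mlt=true&mlt.fl=mlt_field&mlt.count=10&facet=true&facet.field=id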
RE: Memory Leaks in solr 4.8.1
Hi, you can safely ignore this, it is shutting down anyway. Just don't reload the app a lot of times without actually restarting Tomcat. -Original message- From:Aman Tandon amantandon...@gmail.com Sent: Wednesday 2nd July 2014 7:22 To: solr-user@lucene.apache.org Subject: Memory Leaks in solr 4.8.1 Hi, When i am shutting down the solr i am gettng the Memory Leaks error in logs. Jul 02, 2014 10:49:10 AM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks SEVERE: The web application [/solr] created a ThreadLocal with key of type [org.apache.solr.schema.DateField.ThreadLocalDateFormat] (value [org.apache.solr.schema.DateField$ThreadLocalDateFormat@1d987b2]) and a value of type [org.apache.solr.schema.DateField.ISO8601CanonicalDateFormat] (value [org.apache.solr.schema.DateField$ISO8601CanonicalDateFormat@6b2ed43a]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak. Please check. With Regards Aman Tandon
RE: Disable Regular Expression Support
Hi, you can escape the surrounding slashes in your front-end. Markus -Original message- From:Markus Schuch markus_sch...@web.de Sent: Thursday 3rd July 2014 20:53 To: solr-user@lucene.apache.org Subject: Disable Regular Expression Support Hi Solr Community, we are migrating from solr 1.4 to 4.3 and found out that solr 4.x introduced regular expression support in the query parser. Is it possible to disable this feature to get back to the 1.4 behavior of the query parser? Many thanks in advance, Markus Schuch
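A trivial sketch of such front-end escaping (hypothetical helper; a backslash makes the query parser treat the slash as a literal instead of a regex delimiter):

// escape forward slashes so they are not parsed as regex delimiters
String escapeSlashes(String userQuery) {
  return userQuery.replace("/", "\\/");
}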
RE: Any Solr consultants available??
Hahaha thanks wunder, made me laugh! -Original message- From:Walter Underwood wun...@wunderwood.org Sent: Thursday 24th July 2014 2:07 To: solr-user@lucene.apache.org Subject: Re: Any Solr consultants available?? When I see job postings like this, I have to assume they were written by people who really don’t understand the problem and have never met people with the various skills they are asking for. They are not going to find one person who does all this. This is an opening for zebra unicorn that walks on water. At best, they’ll get a one-horned goat with painted stripes on a life raft. They need to talk to some people, make multiple realistic openings, and expect to grow some of their own expertise. I got an email like this from Goldman Sachs this morning. “... a Senior Application Architect/Developer and DevOps Engineer for a major company initiative. In addition to an effort to build a new cloud infrastructure from the ground up, they are beginning a number of company projects in the areas of cloud-based open source search, Machine Learning/AI, Big Data, Predictive Analytics Low-Latency Trading Algorithm Development.” Good luck, fellas. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ On Jul 23, 2014, at 1:01 PM, Jack Krupansky j...@basetechnology.com wrote: Yeah, I saw that, which is why I suggested not being too picky about specific requirements. If you have at least two or three years of solid Solr experience, that would make you at least worth looking at. -- Jack Krupansky From: Tri Cao Sent: Wednesday, July 23, 2014 3:57 PM To: solr-user@lucene.apache.org Cc: solr-user@lucene.apache.org Subject: Re: Any Solr consultants available?? Well, it's kind of hard to find a person if the requirement is 10 years' experience with Solr given that Solr was created in 2004. On Jul 23, 2014, at 12:45 PM, Jack Krupansky j...@basetechnology.com wrote: I occasionally get pinged by recruiters looking for Solr application developers... here’s the latest. If you are interested, either contact Jessica directly or reply to me and I’ll forward your reply. Even if you don’t strictly meet all the requirements... they are having trouble finding... anyone. All the great Solr guys I know are quite busy. Thanks. -- Jack Krupansky From: Jessica Feigin Sent: Wednesday, July 23, 2014 3:36 PM To: 'Jack Krupansky' Subject: Thank you! Hi Jack, Thanks for your assistance, below is the Solr Consultant job description: Our client, a hospitality Fortune 500 company are looking to update their platform to make accessing information easier for the franchisees. This is the first phase of the project which will take a few years. They want a hands on Solr consultant who has ideally worked in the search space. As you can imagine the company culture is great, everyone is really friendly and there is also an option to become permanent. They are looking for: - 10+ years’ experience with Solr (Apache Lucene), HTML, XML, Java, Tomcat, JBoss, MySQL - 5+ years’ experience implementing Solr builds of indexes, shards, and refined searches across semi-structured data sets to include architectural scaling - Experience in developing a re-usable framework to support web site search; implement rich web site search, including the incorporation of metadata. 
- Experienced in development using Java, Oracle, RedHat, Perl, shell, and clustering - A strong understanding of Data analytics, algorithms, and large data structures - Experienced in architectural design and resource planning for scaling Solr/Lucene capabilities. - Bachelor's degree in Computer Science or related discipline. Jessica Feigin Technical Recruiter Technology Resource Management 30 Vreeland Rd., Florham Park, NJ 07932 Phone 973-377-0040 x 415, Fax 973-377-7064 Email: jess...@trmconsulting.com Web site: www.trmconsulting.com LinkedIn Profile: www.linkedin.com/in/jessicafeigin
RE: crawling all links of same domain in nutch in solr
Hi - use the domain URL filter plugin and list the domains, hosts or TLDs you want to restrict the crawl to. -Original message- From:Vivekanand Ittigi vi...@biginfolabs.com Sent: Tuesday 29th July 2014 7:17 To: solr-user@lucene.apache.org Subject: crawling all links of same domain in nutch in solr Hi, Can anyone tell me how to crawl all other pages of the same domain. For example i'm feeding a website http://www.techcrunch.com/ in seed.txt. The following property is added in nutch-site.xml: <property> <name>db.ignore.internal.links</name> <value>false</value> <description>If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest quality links.</description> </property> And the following is added in regex-urlfilter.txt: # accept anything else +. Note: if i add http://www.tutorialspoint.com/ in seed.txt, I'm able to crawl all other pages but not techcrunch.com's pages, though it has got many other pages too. Please help..? Thanks, Vivek
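A sketch of that setup: add urlfilter-domain to plugin.includes in nutch-site.xml, then list the allowed domains, one per line (file name and domains as an assumed example):

# conf/domain-urlfilter.txt: domains, hosts or TLDs the crawl may touch
techcrunch.com
tutorialspoint.com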
RE: Solr substring search yields all indexed results
Don't use N-grams at query time; configure the ngram filter in an index-time analyzer only (see the sketch below this message). -Original message- From:prem1980 prem1...@gmail.com Sent: Monday 4th August 2014 17:47 To: solr-user@lucene.apache.org Subject: Solr substring search yields all indexed results To do a substring search, I have added a new fieldType - Text with an NgramFilter. It works perfectly, but the downside is this problem. Example name = ['Apple','Samy','And','a'] When I do a search name:a, then all the above items get pulled up. Even when the search changes to App, all the above items are pulled. How can I fix this issue? <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100"/> </analyzer> </fieldType> -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-substring-search-yields-all-indexed-results-tp4151012.html Sent from the Solr - User mailing list archive at Nabble.com.
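A sketch of the corrected fieldType: grams are produced at index time only, so a query for App matches only documents whose indexed grams contain app as a whole token (the added LowerCaseFilterFactory is an assumption, not from the question):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>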
RE: NGramTokenizer influence to length normalization?
All tokens produced still have the same position as their original token, so no. -Original message- From:Johannes Siegert johannes.sieg...@marktjagd.de Sent: Friday 8th August 2014 11:11 To: solr-user@lucene.apache.org Subject: NGramTokenizer influence to length normalization? Hi, does the NGramTokenizer have an influence on length normalization? Thanks. Johannes
RE: Solr cloud performance degradation with billions of documents
Hi - You are running mapred jobs on the same nodes as Solr runs, right? The first thing i would think of is that your OS file buffer cache is abused. The mappers read all data, presumably residing on the same node. The mapper output and shuffling part would take place on the same node; only the reducer output is sent to your nodes, which i assume are on the same machines. Those same machines have a large Lucene index. All this data, written to and read from the same disk, competes for a nice spot in the OS buffer cache. Forgive me if i misread anything, but when you're dealing with serious figures of size, do not abuse your caches. Have a separate mapred and Solr cluster, because they both eat cache space. I assume you can see serious IO WAIT times. Split the stuff and maybe even use smaller hardware, but more of it. M -Original message- From:Wilburn, Scott scott.wilb...@verizonwireless.com.INVALID Sent: Wednesday 13th August 2014 23:09 To: solr-user@lucene.apache.org Subject: Solr cloud performance degradation with billions of documents Hello everyone, I am trying to use SolrCloud to index a very large number of simple documents and have run into some performance and scalability limitations and was wondering what can be done about it. Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the Solr shards and each node has 128GB of memory. The current SolrCloud setup is split into 4 separate and individual clouds of 32 shards each, thereby giving four running shards per cloud or one cloud per eight nodes. Each shard is currently assigned a 6GB heap size. I’d prefer to avoid increasing heap memory for Solr shards to have enough to run other MapReduce jobs on the cluster. The rate of documents that I am currently inserting into these clouds per day is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion into the fourth; however, to account for capacity, the aim is to scale the solution to support double that amount of documents. To index these documents, there are MapReduce jobs that run that generate the Solr XML documents and will then submit these documents via SolrJ's CloudSolrServer interface. In testing, I have found that limiting the number of active parallel inserts to 80 per cloud gave the best performance, as anything higher gave diminishing returns, most likely due to the constant shuffling of documents internally to SolrCloud. From an index perspective, dated collections are being created to hold an entire day's worth of documents and generally the inserting happens primarily on the current day (the previous days are only to allow for searching) and the plan is to keep up to 60 days (or collections) in each cloud. A single shard index in one collection in the busiest cloud currently takes up 30G disk space or 960G for the entire collection. The documents are being auto committed with a hard commit time of 4 minutes (openSearcher = false) and soft commit time of 8 minutes. From a search perspective, the use case is fairly generic and simple searches of the type field:value, so there is no need to tune the system to use any of the more advanced querying features. Therefore, the most important thing for me is to have the indexing performance be able to keep up with the rate of input. In the initial load testing, I was able to achieve a projected indexing rate of 10 Billion documents per cloud per day for a grand total of 40 Billion per day. However, the initial load testing was done on fairly empty clouds with just a few small collections.
Now that there have been several days of documents being indexed, I am starting to see a fairly steep drop-off in indexing performance once the clouds reached about 15 full collections (or about 80-100 Billion documents per cloud) in the two biggest clouds. Based on current application logging I’m seeing a 40% drop off in indexing performance. Because of this, I have concerns on how performance will hold as more collections are added. My question to the community is if anyone else has had any experience in using Solr at this scale (hundreds of Billions) and if anyone has observed such a decline in indexing performance as the number of collections increases. My understanding is that each collection is a separate index and therefore the inserting rate should remain constant. Aside from that, what other tweaks or changes can be done in the SolrCloud configuration to increase the rate of indexing performance? Am I hitting a hard limitation of what Solr can handle? Thanks, Scott
RE: Announcing Splainer -- Open Source Solr Sandbox
Yeah, very cool. Since this is all just client side, how about integrating it in Solr's UI? Also, it seems to assume `id` is the ID field, which is not always true. -Original message- From:david.w.smi...@gmail.com david.w.smi...@gmail.com Sent: Friday 22nd August 2014 19:42 To: solr-user@lucene.apache.org Subject: Re: Announcing quot;Splainerquot; -- Open Source Solr Sandbox Cool Doug! I look forward to digging into this. ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Fri, Aug 22, 2014 at 10:34 AM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Greetings from the OpenSource Connections Team! We're happy to announce we've taken core sandbox of our search relevancy product Quepid and open sourced it as Splainer (http://splainer.io). Splainer is a search sandbox that explains search results in a human readable form as you work. By being a *sandbox* it differs from parsing tools such as explain.solr.pl by letting you tweak and tweak and tweak without leaving the tool itself. In short, it helps you work faster to solve relevancy problems. Simply paste in a Solr URL and Splainer goes to work. Splainer is entirely driven by your browser (there's no backend -- its all static js/html/css and uses HTML local storage to store a few settings for you). So if your browser can see it, Splainer can work with it. Anyway, we've started getting great use out of the tool, and would also like to gather feedback from the community by sharing it. We're open to ideas, bug reports, pull requests, etc. Relevant links: Blog Post announcing Splainer: http://opensourceconnections.com/blog/2014/08/18/introducing-splainer-the-open-source-search-sandbox-that-tells-you-why/ Splainer: http://splainer.io Splainer on Github (open sourced as Apache 2) http://github.com/o19s/splainer These features (and a ton more) are also in our relevancy testing product Quepid: http://quepid.com Bugs/feedback/complaints/ideas/questions/contributions/etc welcome. Thank you for your time! -- Doug Turnbull Search Big Data Architect OpenSource Connections http://o19s.com
RE: Query ReRanking question
Hi - You can already achieve this by boosting on the document's recency. The result set won't be exactly ordered by date but you will get the most relevant and recent documents on top. Markus -Original message- From:Ravi Solr ravis...@gmail.com Sent: Friday 5th September 2014 18:06 To: solr-user@lucene.apache.org Subject: Re: Query ReRanking question Thank you very much for responding. I want to do exactly the opposite of what you said. I want to sort the relevant docs in reverse chronology. If you sort by date beforehand then the relevancy is lost. So I want to get the Top N relevant results and then rerank those Top N to achieve relevant reverse chronological results. If you ask why would I want to do that, let's take an example about the Malaysian airline crash. Several articles might have been published over a period of time. When I search for - malaysia airline crash blackbox - I would want to see relevant results but would also like to see the recent developments on the top, i.e. effectively a reverse chronological order within the relevant results, like telling a story over a period of time. Hope i am clear. Thanks for your help. Thanks, Ravi Kiran Bhaskar On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com wrote: If you want the main query to be sorted by date then the top N docs reranked by a query, that should work. Try something like this: q=foo&sort=date+desc&rq={!rerank reRankDocs=1000 reRankQuery=$myquery}&myquery=blah Joel Bernstein Search Engineer at Heliosearch On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com wrote: Can the ReRanking API be used to sort within docs retrieved by a date field? Can somebody help me understand how to write such a query? Thanks, Ravi Kiran Bhaskar
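A sketch of such a recency boost with edismax (the publish_date field and constants are illustrative; 3.16e-11 is roughly 1 divided by the milliseconds in a year, so the boost halves after about a year):

q=malaysia airline crash blackbox
&defType=edismax
&boost=recip(ms(NOW/HOUR,publish_date),3.16e-11,1,1)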
RE: Problem deploying solr-4.10.0.war in Tomcat
Yes, this is a nasty error. You have not set up logging libraries properly: https://cwiki.apache.org/confluence/display/solr/Configuring+Logging -Original message- From:phi...@free.fr phi...@free.fr Sent: Wednesday 17th September 2014 11:51 To: solr-user@lucene.apache.org Subject: Problem deploying solr-4.10.0.war in Tomcat Hello, I've dropped solr-4.10.0.war in Tomcat 7's webapp directory. When I start the Java web server, the following message appears in catalina.out: --- INFO: Starting Servlet Engine: Apache Tomcat/7.0.55 Sep 17, 2014 11:35:59 AM org.apache.catalina.startup.HostConfig deployWAR INFO: Deploying web application archive /archives/apache-tomcat-7.0.55_solr_8983/webapps/solr-4.10.0.war Sep 17, 2014 11:35:59 AM org.apache.catalina.core.StandardContext startInternal SEVERE: Error filterStart Sep 17, 2014 11:35:59 AM org.apache.catalina.core.StandardContext startInternal SEVERE: Context [/solr-4.10.0] startup failed due to previous errors -- Any help would be much appreciated. Cheers, Philippe
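Since 4.3 the Solr war no longer bundles the SLF4J/Log4j jars, which makes Tomcat fail with exactly this 'Error filterStart'. A sketch of the usual fix (paths are illustrative):

# copy the logging jars shipped with the Solr download into Tomcat's classpath
cp solr-4.10.0/example/lib/ext/*.jar $CATALINA_HOME/lib/
# and provide a log4j configuration
cp solr-4.10.0/example/resources/log4j.properties $CATALINA_HOME/lib/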
RE: How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term
Hi - but this makes no sense, they are scored as equals, except for tiny differences in TF and IDF. What you would need is something like a stemmer that preserves the original token and gives a <1 payload to the stemmed token. The same goes for filters like decompounders and accent folders that change the meaning of words. -Original message- From:Diego Fernandez difer...@redhat.com Sent: Wednesday 17th September 2014 23:37 To: solr-user@lucene.apache.org Subject: Re: How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term I'm not 100% on this, but I imagine this is what happens: (using -> to mean "tokenized to") Suppose that you index "I am running home" -> "am run running home". If you then query "running home" -> "run running home", both "run" and "running" match and thus give a higher score than if you query "runs home" -> "run runs home". - Original Message - The Solr wiki says "A repeated question is how can I have the original term contribute more to the score than the stemmed version? In Solr 4.3, the KeywordRepeatFilterFactory has been added to assist this functionality." https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming (Full section reproduced below.) I can see how in the example from the wiki reproduced below that both the stemmed and original term get indexed, but I don't see how the original term gets more weight than the stemmed term. Wouldn't this require a filter that gives terms with the keyword attribute more weight? What am I missing? Tom - A repeated question is how can I have the original term contribute more to the score than the stemmed version? In Solr 4.3, the KeywordRepeatFilterFactory has been added to assist this functionality. This filter emits two tokens for each input token, one of them is marked with the Keyword attribute. Stemmers that respect keyword attributes will pass through the token so marked without change. So the effect of this filter would be to index both the original word and the stemmed version. The 4 stemmers listed above all respect the keyword attribute. For terms that are not changed by stemming, this will result in duplicate, identical tokens in the document. This can be alleviated by adding the RemoveDuplicatesTokenFilterFactory.
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
-- Diego Fernandez - 爱国 Software Engineer GSS - Diagnostics
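To see what the wiki chain means in concrete tokens, here is a small, self-contained sketch against the Lucene 4.10 analysis API (class and field names are arbitrary) that prints the token stream for "running home":

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public class KeywordRepeatDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(reader);
        TokenStream sink = new KeywordRepeatFilter(source); // duplicates each token, marks one as keyword
        sink = new PorterStemFilter(sink);                  // stems only the non-keyword copy
        sink = new RemoveDuplicatesTokenFilter(sink);       // drops copies the stemmer left identical
        return new TokenStreamComponents(source, sink);
      }
    };
    // Prints: running keyword=true / run keyword=false / home keyword=true
    try (TokenStream ts = analyzer.tokenStream("text", "running home")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      KeywordAttribute kw = ts.addAttribute(KeywordAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString() + " keyword=" + kw.isKeyword());
      }
      ts.end();
    }
  }
}

The original token survives the stemmer untouched and the duplicate is stemmed, but nothing in this chain weights the original higher - which is Tom's point, and why Markus suggests payloads above.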
RE: Best practice for KStemFilter query or index or both?
Hi - most filters should be used on both sides, especially stemmers, accent folding and obviously lowercasing. Synonyms only on one side, depending on how you want to utilize them. Markus -Original message- From:eShard zim...@yahoo.com Sent: Thursday 25th September 2014 22:23 To: solr-user@lucene.apache.org Subject: Best practice for KStemFilter query or index or both? Good afternoon, Here's my configuration for a text field. I have the same configuration for index and query time. Is this valid? What's the best practice for these: query, index or both? As for synonyms, I've read conflicting reports on when to use them, but I'm currently changing them over to indexing time only. Thanks,
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
<filter class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
<filter class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
<analyzer type="select">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
<filter class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
</fieldType>
-- View this message in context: http://lucene.472066.n3.nabble.com/Best-practice-for-KStemFilter-query-or-index-or-both-tp4161201.html Sent from the Solr - User mailing list archive at Nabble.com.
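Two remarks on the pasted config before Markus' advice even applies: an analyzer can have only one tokenizer, and it must come first (StandardTokenizerFactory cannot appear as a filter), and "select" is not an analyzer type Solr defines (only index, query and multiterm are). A cleaned-up sketch along the lines of Markus' advice, with query-time synonyms kept as in the original - the Whitespace+WordDelimiter combination could of course be kept instead of StandardTokenizer:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
</fieldType>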
RE: Flexible search field analyser/tokenizer configuration
Yes, it appeared in 4.8, but you could use PatternReplaceFilterFactory to simulate the same behavior. Markus -Original message- From:PeterKerk petervdk...@hotmail.com Sent: Monday 29th September 2014 21:08 To: solr-user@lucene.apache.org Subject: Re: Flexible search field analyser/tokenizer configuration Hi Ahmet, Am I correct that this is only available in Solr 4.8? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.TruncateTokenFilterFactory Also, do I need to add your lines to both the index and query analyzers, making my definition like so:
<fieldType name="searchtext" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TruncateTokenFilterFactory" prefixLength="3"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TruncateTokenFilterFactory" prefixLength="3"/>
</analyzer>
</fieldType>
Your solution seems much easier to set up than what is proposed by Alexandre... for my understanding, what is the difference? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Flexible-search-field-analyser-tokenizer-configuration-tp4161624p4161778.html Sent from the Solr - User mailing list archive at Nabble.com.
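On versions before 4.8, the PatternReplaceFilterFactory equivalent Markus mentions looks roughly like this (prefix length 3, matching the example above), placed where the TruncateTokenFilterFactory would go, in both analyzers:

<filter class="solr.PatternReplaceFilterFactory" pattern="^(.{3}).*$" replacement="$1"/>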
RE: Solr query field (qf) conditional boost
Hi - you need to use function queries via the bf parameter. The exists() function, and in some cases query(), will do the conditional work, depending on your use case. Markus -Original message- From:Shamik Bandopadhyay sham...@gmail.com Sent: Monday 29th September 2014 21:30 To: solr-user@lucene.apache.org Subject: Solr query field (qf) conditional boost Hi, I'm trying to check if it's possible to include conditional boosting in the Solr qf field. For e.g. I've the following entry in the qf parameter: <str name="qf">text^0.5 title^10.0 ProductLine^5</str> What I'm looking for is to add the ProductLine boosting only for a given Author field, something along the lines of: boost ProductLine^5 if Author:Tom. I've been using a similar filtering in the appends section, but not sure how to do it in qf or whether it's possible. <lst name="appends"> <str name="fq">Author:(Tom +Solution:yes)</str> </lst> Any pointers will be appreciated. Thanks, Shamik
RE: Solr query field (qf) conditional boost
Hi - check the def() and if() functions, they can have embedded functions such as exists() and query(). You can use those to apply the main query to the ProductLine field if Author has some value. I cannot give a concrete example because I don't have an environment to fiddle around with. If the main query is in parameter qq, you can use parameter substitution by using $qq in the function queries. Please check the wiki and cwiki docs on edismax and function queries for examples and references. Markus -Original message- From:shamik sham...@gmail.com Sent: Monday 29th September 2014 22:54 To: solr-user@lucene.apache.org Subject: RE: Solr query field (qf) conditional boost Thanks Markus. Well, I tried using a conditional if-else function, but it doesn't seem to work for boosting a field. What I'm trying to do is boost the ProductLine field by 5 if the result documents contain Author = 'Tom'. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-query-field-qf-conditional-boost-tp4161783p4161797.html Sent from the Solr - User mailing list archive at Nabble.com.
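An untested sketch of what Markus describes, with edismax; the aq parameter and the factor 5 are illustrative:

q=camera&defType=edismax&qf=text^0.5 title^10.0&boost=if(exists(query($aq)),5,1)&aq=Author:Tom

This multiplies the whole document score by 5 when the document matches Author:Tom. To confine the extra weight to ProductLine matches only, swap the constant for a query() over that field, e.g. bf=if(exists(query($aq)),query($plq),0)&plq={!edismax qf=ProductLine^5 v=$q}.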
RE: If I change a field from text_ws to text, do I need to drop and reindex or just reindex?
Hi - you don't need to erase the data directory, you can just reindex, but make sure you overwrite all documents. -Original message- From:Wayne W waynemailingli...@gmail.com Sent: Friday 3rd October 2014 11:55 To: solr-user@lucene.apache.org Subject: If I change a field from text_ws to text, do I need to drop and reindex or just reindex? Hi, I've realized I need to change a particular field from text_ws to text. I realize I need to reindex as the tokens are being stored in a case sensitive manner, which we do not want. However can I just reindex all my documents, or do I need to drop/wipe the /data/index dir and start fresh? I really don't want to drop as the current users will not be able to search, and reindexing could take as long as a week. Many thanks Wayne
RE: search query text field with Comma
Hi - you are probably using the WhitespaceTokenizer without a WordDelimiterFilter. Consider using the StandardTokenizer, or add the WordDelimiterFilter. Markus -Original message- From:EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions) external.ravi.tamin...@us.bosch.com Sent: Monday 6th October 2014 20:57 To: solr-user@lucene.apache.org Subject: search query text field with Comma Hi users, This may be a basic question, but I am facing some trouble. The scenario is: I have a text "Truck Series, 12V and 15V". If the user searches for "Truck Series" it does not get the row, but "Truck Series," works. How can I make the search for "Truck Series" match? Thanks Ravi
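A minimal sketch of the WordDelimiterFilter route (apply the same chain at index and query time):

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>

The token "Series," then also yields the word part "series", so the query "Truck Series" matches.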
Re: Weird Problem (possible bug?) with german stemming and wildcard search
Hi - you should not use wildcards for autocompletion, Lucene has far better tools for building very good autocompletion. Also, since a wildcard query is a multi-term query, it is not passed through your configured query-time analyzer. Some other comments: - you use a Porter stemmer, but you should use one of the German-specific stem filters. - you don't have an index-time tokenizer defined; this should not be possible and the behaviour is undefined as far as I know. On Tuesday 07 October 2014 14:25:27 Thomas Michael Engelke wrote: I have a problem with a stemmed German field. The field definition:
<field name="description" type="text_splitting" indexed="true" stored="true" required="false" multiValued="false"/>
...
<fieldType name="text_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
When we search for a word from an autosuggest kind of component, we always add an asterisk to the word, so when somebody enters something like "Radbremszylinder" and waits for some milliseconds, the autosuggest list is filled with the results of searching for "Radbremszylinder*". This seemed to work quite well. Today we got a bug report from a customer for that exact word. So I made an analysis for the word as Field value (index) and Field value (query), and it looked like this:

Index:                        Query:
ST   Radbremszylinder         WT   Radbremszylinder*
SF   Radbremszylinder         SF   Radbremszylinder*
WDF  Radbremszylinder         SF   Radbremszylinder*
LCF  radbremszylinder         WDF  Radbremszylinder
SKMF radbremszylinder         LCF  radbremszylinder
PSF  radbremszylind           SKMF radbremszylinder

As you can see, the end result looks very much alike. However, records containing that word in their description field aren't reported as results. Strangely enough, records containing "Radbremszylindern" (plural) are reported as results. Removing the asterisk from the end reports all records with "Radbremszylinder", just as we would expect. So the culprit is the asterisk at the end. As far as we can read from the docs, an asterisk is just 0 or more characters, which means that the literal word in front of the asterisk should match the query. Searching further we tried some variations, and it seems that searching for "Radbremszylind*" works. All records with any variation (Radbremszylinder, Radbremszylindern) are reported. So maybe there's a weird interaction with stemming? Any ideas?
RE: NullPointerException for ExternalFileField when key field has no terms
Hi - yes it is worth a ticket, as the javadoc says it is ok: http://lucene.apache.org/solr/4_10_1/solr-core/org/apache/solr/schema/ExternalFileField.html -Original message- From:Matthew Nigl matthew.n...@gmail.com Sent: Wednesday 8th October 2014 14:48 To: solr-user@lucene.apache.org Subject: NullPointerException for ExternalFileField when key field has no terms Hi, I use various ID fields as the keys for various ExternalFileField fields, and I have noticed that I will sometimes get the following error:

ERROR org.apache.solr.servlet.SolrDispatchFilter - null:java.lang.NullPointerException
at org.apache.solr.search.function.FileFloatSource.getFloats(FileFloatSource.java:273)
at org.apache.solr.search.function.FileFloatSource.access$000(FileFloatSource.java:51)
at org.apache.solr.search.function.FileFloatSource$2.createValue(FileFloatSource.java:147)
at org.apache.solr.search.function.FileFloatSource$Cache.get(FileFloatSource.java:190)
at org.apache.solr.search.function.FileFloatSource.getCachedFloats(FileFloatSource.java:141)
at org.apache.solr.search.function.FileFloatSource.getValues(FileFloatSource.java:84)
at org.apache.solr.response.transform.ValueSourceAugmenter.transform(ValueSourceAugmenter.java:95)
at org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:252)
at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:170)
at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:184)
at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:300)
at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:96)
at org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:61)
at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:765)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)

The source code referenced in the error is below (FileFloatSource.java:273): TermsEnum termsEnum = MultiFields.getTerms(reader, idName).iterator(null); So if there are no terms in the index for the key field, then getTerms will return null, and of course trying to call iterator on null will cause the exception. For my use-case, it makes sense that the key field may have no terms (initially) because there are various types of documents
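The straightforward null guard at the quoted line would look roughly like this - an untested fragment against the 4.10 source, assuming the surrounding getFloats() method returns the float[] vals array it pre-fills with the field's default value:

Terms terms = MultiFields.getTerms(reader, idName);
if (terms == null) {
  // the key field has no indexed terms yet: every document keeps the default value
  return vals;
}
TermsEnum termsEnum = terms.iterator(null);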
WhitespaceTokenizer to consider incorrectly encoded c2a0?
Hi, For some crazy reason, some users somehow manage to substitute a perfectly normal space with a badly encoded non-breaking space; properly URL encoded this then becomes %C2%A0, and depending on the encoding you use to view it you probably see Â followed by a space. For example: Because c2a0 is not considered whitespace (indeed, it is not real whitespace, that is 00a0) by the Java Character class, the WhitespaceTokenizer won't split on it, but the WordDelimiterFilter still does, somehow mitigating the problem as it becomes:

HTMLSCF een abonnement
WT      een abonnement
WDF     een eenabonnement abonnement

Should the WhitespaceTokenizer not include this weird edge case? Cheers, Markus
RE: WhitespaceTokenizer to consider incorrectly encoded c2a0?
Alexandre - I am sorry if I was not clear, this is about queries, this all happens at query time. Yes, we can do the substitution with the regex replace filter, but I would propose this weird exception to be added to the WhitespaceTokenizer so Lucene deals with this by itself. Markus -Original message- From:Alexandre Rafalovitch arafa...@gmail.com Sent: Wednesday 8th October 2014 16:12 To: solr-user solr-user@lucene.apache.org Subject: Re: WhitespaceTokenizer to consider incorrectly encoded c2a0? Is this a suggestion for a JIRA ticket, or a question on how to solve it? If the latter, you could probably stick a RegEx replacement in the UpdateRequestProcessor chain and be done with it. As to why? I would look for the rest of the MSWord-generated artifacts, such as smart quotes, extra-long dashes, etc. Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On 8 October 2014 09:59, Markus Jelsma markus.jel...@openindex.io wrote: Hi, For some crazy reason, some users somehow manage to substitute a perfectly normal space with a badly encoded non-breaking space; properly URL encoded this then becomes %C2%A0, and depending on the encoding you use to view it you probably see Â followed by a space. For example: Because c2a0 is not considered whitespace (indeed, it is not real whitespace, that is 00a0) by the Java Character class, the WhitespaceTokenizer won't split on it, but the WordDelimiterFilter still does, somehow mitigating the problem as it becomes: HTMLSCF een abonnement WT een abonnement WDF een eenabonnement abonnement Should the WhitespaceTokenizer not include this weird edge case? Cheers, Markus
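A concrete form of that regex substitution, as a char filter placed before the tokenizer in both the index and query analyzers (the pattern is the U+00A0 code point):

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\u00A0" replacement=" "/>

MappingCharFilterFactory with a mapping file entry ("\u00A0" => " ") would work as well.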
RE: does one need to reindex when changing similarity class
Hi - no you don't have to, although maybe you do if you changed how norms are encoded. Markus -Original message- From:elisabeth benoit elisaelisael...@gmail.com Sent: Thursday 9th October 2014 12:26 To: solr-user@lucene.apache.org Subject: does one need to reindex when changing similarity class I've read somewhere that we do have to reindex when changing similarity class. Is that right? Thanks again, Elisabeth
RE: per field similarity not working with solr 4.2.1
Hi - it should work; not seeing your implementation in the debug output is a known issue. -Original message- From:elisabeth benoit elisaelisael...@gmail.com Sent: Thursday 9th October 2014 12:22 To: solr-user@lucene.apache.org Subject: per field similarity not working with solr 4.2.1 Hello, I am using Solr 4.2.1 and I've tried to use a per field similarity, as described in https://apache.googlesource.com/lucene-solr/+/c5bb5cd921e1ce65e18eceb55e738f40591214f0/solr/core/src/test-files/solr/collection1/conf/schema-sim.xml so in my schema I have <schema name="search" version="1.4"> <similarity class="solr.SchemaSimilarityFactory"/> and a custom similarity in the fieldtype definition <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <similarity class="com.company.lbs.solr.search.similarity.NoTFSimilarity"/> <analyzer type="index"> ... but it is not working: when I send a request with debugQuery=on, instead of [NoTFSimilarity] I see []. To give an example, I have weight(catchall:bretagn in 2575) [] instead of weight(catchall:bretagn in 2575) [NoTFSimilarity]. Anyone has a clue what I am doing wrong? Best regards, Elisabeth
RE: per field similarity not working with solr 4.2.1
Well, you can either verify the output of your score calculation, or write something to System.out. Markus -Original message- From:elisabeth benoit elisaelisael...@gmail.com Sent: Thursday 9th October 2014 13:31 To: solr-user@lucene.apache.org Subject: Re: per field similarity not working with solr 4.2.1 Thanks for the information! I've been struggling with that debug output. Any other way to know for sure my similarity class is being used? Thanks again, Elisabeth 2014-10-09 13:03 GMT+02:00 Markus Jelsma markus.jel...@openindex.io: Hi - it should work; not seeing your implementation in the debug output is a known issue. -Original message- From:elisabeth benoit elisaelisael...@gmail.com Sent: Thursday 9th October 2014 12:22 To: solr-user@lucene.apache.org Subject: per field similarity not working with solr 4.2.1 Hello, I am using Solr 4.2.1 and I've tried to use a per field similarity, as described in https://apache.googlesource.com/lucene-solr/+/c5bb5cd921e1ce65e18eceb55e738f40591214f0/solr/core/src/test-files/solr/collection1/conf/schema-sim.xml so in my schema I have <schema name="search" version="1.4"> <similarity class="solr.SchemaSimilarityFactory"/> and a custom similarity in the fieldtype definition <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <similarity class="com.company.lbs.solr.search.similarity.NoTFSimilarity"/> <analyzer type="index"> ... but it is not working: when I send a request with debugQuery=on, instead of [NoTFSimilarity] I see []. To give an example, I have weight(catchall:bretagn in 2575) [] instead of weight(catchall:bretagn in 2575) [NoTFSimilarity]. Anyone has a clue what I am doing wrong? Best regards, Elisabeth
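For reference, a no-TF similarity for Lucene/Solr 4.x is typically just this; the class and package names come from the thread, and the print line is only there as a crude check that Solr actually loaded the class:

package com.company.lbs.solr.search.similarity;

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class NoTFSimilarity extends DefaultSimilarity {
  public NoTFSimilarity() {
    System.out.println("NoTFSimilarity loaded"); // remove once verified
  }

  // Ignore term frequency: any number of occurrences scores like one.
  @Override
  public float tf(float freq) {
    return freq > 0 ? 1.0f : 0.0f;
  }
}

If the message appears in the log at core load or on the first query, the per-field similarity is active.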
RE: does one need to reindex when changing similarity class
Yes, if the replacing similarity has a different implementation of norms, you should reindex or gradually update all documents within a decent time. -Original message- From:Ahmet Arslan iori...@yahoo.com.INVALID Sent: Thursday 9th October 2014 18:27 To: solr-user@lucene.apache.org Subject: Re: does one need to reindex when changing similarity class How about SweetSpotSimilarity? Length norm is saved at index time? On Thursday, October 9, 2014 5:44 PM, Jack Krupansky j...@basetechnology.com wrote: The similarity class is only invoked at query time, so it doesn't participate in indexing. -- Jack Krupansky -Original Message- From: Markus Jelsma Sent: Thursday, October 9, 2014 6:59 AM To: solr-user@lucene.apache.org Subject: RE: does one need to reindex when changing similarity class Hi - no you don't have to, although maybe you do if you changed how norms are encoded. Markus -Original message- From:elisabeth benoit elisaelisael...@gmail.com Sent: Thursday 9th October 2014 12:26 To: solr-user@lucene.apache.org Subject: does one need to reindex when changing similarity class I've read somewhere that we do have to reindex when changing similarity class. Is that right? Thanks again, Elisabeth
Re: Recovering from Out of Mem
And don't forget to set the proper permissions on the script, the tomcat or jetty user. Markus On Tuesday 14 October 2014 13:47:47 Boogie Shafer wrote: a really simple approach is to have the OOM generate an email e.g. 1) create a simple script (call it java_oom.sh) and drop it in your tomcat bin dir echo `date` | mail -s Java Error: OutOfMemory - $HOSTNAME not...@domain.com 2) configure your java options (in setenv.sh or similar) to trigger heap dump and the email script when OOM occurs # config error behaviors CATALINA_OPTS=$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof -XX:OnError=$TOMCAT_DIR/bin/java_error.sh -XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh -XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log From: Mark Miller markrmil...@gmail.com Sent: Tuesday, October 14, 2014 06:30 To: solr-user@lucene.apache.org Subject: Re: Recovering from Out of Mem Best is to pass the Java cmd line option that kills the process on OOM and setup a supervisor on the process to restart it. You need a somewhat recent release for this to work properly though. - Mark On Oct 14, 2014, at 9:06 AM, Salman Akram salman.ak...@northbaysolutions.net wrote: I know there are some suggestions to avoid OOM issue e.g. setting appropriate Max Heap size etc. However, what's the best way to recover from it as it goes into non-responding state? We are using Tomcat on back end. The scenario is that once we face OOM issue it keeps on taking queries (doesn't give any error) but they just time out. So even though we have a fail over system implemented but we don't have a way to distinguish if these are real time out queries OR due to OOM. -- Regards, Salman Akram
Re: Recovering from Out of Mem
This will do: kill -9 `ps aux | grep -v grep | grep tomcat6 | awk '{print $2}'` pkill should also work On Tuesday 14 October 2014 07:02:03 Yago Riveiro wrote: Boogie, Any example for java_error.sh script? — /Yago Riveiro On Tue, Oct 14, 2014 at 2:48 PM, Boogie Shafer boogie.sha...@proquest.com wrote: a really simple approach is to have the OOM generate an email e.g. 1) create a simple script (call it java_oom.sh) and drop it in your tomcat bin dir echo `date` | mail -s Java Error: OutOfMemory - $HOSTNAME not...@domain.com 2) configure your java options (in setenv.sh or similar) to trigger heap dump and the email script when OOM occurs # config error behaviors CATALINA_OPTS=$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof -XX:OnError=$TOMCAT_DIR/bin/java_error.sh -XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh -XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log From: Mark Miller markrmil...@gmail.com Sent: Tuesday, October 14, 2014 06:30 To: solr-user@lucene.apache.org Subject: Re: Recovering from Out of Mem Best is to pass the Java cmd line option that kills the process on OOM and setup a supervisor on the process to restart it. You need a somewhat recent release for this to work properly though. - Mark On Oct 14, 2014, at 9:06 AM, Salman Akram salman.ak...@northbaysolutions.net wrote: I know there are some suggestions to avoid OOM issue e.g. setting appropriate Max Heap size etc. However, what's the best way to recover from it as it goes into non-responding state? We are using Tomcat on back end. The scenario is that once we face OOM issue it keeps on taking queries (doesn't give any error) but they just time out. So even though we have a fail over system implemented but we don't have a way to distinguish if these are real time out queries OR due to OOM. -- Regards, Salman Akram
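Mark's kill-on-OOM suggestion looks roughly like this; the script name and paths are illustrative:

# in setenv.sh or similar; %p expands to the JVM's pid
CATALINA_OPTS="$CATALINA_OPTS -XX:OnOutOfMemoryError=/opt/tomcat/bin/oom_killer.sh %p"

# /opt/tomcat/bin/oom_killer.sh
#!/bin/bash
kill -9 $1

Combine it with a supervisor (systemd, monit, runit, ...) that restarts the dead process, so you never keep serving from a half-broken JVM.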
RE: update external file
You either need to upload the files to the machines and issue the reload command, or have the machines download them, and then issue the reload command. There is no REST support for it (yet) like there is for the synonym filter, or was it the stop filter? Markus -Original message- From:Michael Sokolov msoko...@safaribooksonline.com Sent: Thursday 23rd October 2014 19:19 To: solr-user solr-user@lucene.apache.org Subject: update external file I've been looking at ExternalFileField to handle popularity boosting, since Solr updatable docvalues (SOLR-5944) isn't quite there yet. My question is whether there is any support for uploading the external file via Solr, or if people do that some other (external, I guess) way? -Mike
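For the reload step there is also a stock helper: ExternalFileFieldReloader can re-read the external_<fieldname> files on searcher events, so a plain commit after replacing the file in the index data directory is enough. In solrconfig.xml:

<listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>

and after copying the new file into place, something like curl 'http://localhost:8983/solr/collection1/update?commit=true' (host and core names are illustrative).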
RE: Stopwords in shingles suggester
You do not want stopwords in your shingles? Then put the stopword filter in front of the shingle filter in the chain. Markus -Original message- From:O. Klein kl...@octoweb.nl Sent: Monday 27th October 2014 13:56 To: solr-user@lucene.apache.org Subject: Stopwords in shingles suggester Is there a way in Solr to filter out stopwords in shingles like ES does? http://www.elasticsearch.org/blog/searching-with-shingles/ -- View this message in context: http://lucene.472066.n3.nabble.com/Stopwords-in-shingles-suggester-tp4166057.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
It is an ancient issue. One of the major contributors to the issue was resolved some versions ago but we are still seeing it sometimes too, there is nothing to see in the logs. We ignore it and just reindex. -Original message- From:S.L simpleliving...@gmail.com Sent: Monday 27th October 2014 16:25 To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Thank Otis, I have checked the logs , in my case the default catalina.out and I dont see any OOMs or , any other exceptions. What others metrics do you suggest ? On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, You may simply be overwhelming your cluster-nodes. Have you checked various metrics to see if that is the case? Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com wrote: Folks, I have posted previously about this , I am using SolrCloud 4.10.1 and have a sharded collection with 6 nodes , 3 shards and a replication factor of 2. I am indexing Solr using a Hadoop job , I have 15 Map fetch tasks , that can each have upto 5 threds each , so the load on the indexing side can get to as high as 75 concurrent threads. I am facing an issue where the replicas of a particular shard(s) are consistently getting out of synch , initially I thought this was beccause I was using a custom component , but I did a fresh install and removed the custom component and reindexed using the Hadoop job , I still see the same behavior. I do not see any exceptions in my catalina.out , like OOM , or any other excepitions, I suspecting thi scould be because of the multi-threaded indexing nature of the Hadoop job . I use CloudSolrServer from my java code to index and initialize the CloudSolrServer using a 3 node ZK ensemble. Does any one know of any known issues with a highly multi-threaded indexing and SolrCloud ? Can someone help ? This issue has been slowing things down on my end for a while now. Thanks and much appreciated!
RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
https://issues.apache.org/jira/browse/SOLR-4260 resolved https://issues.apache.org/jira/browse/SOLR-4924 open -Original message- From:Michael Della Bitta michael.della.bi...@appinions.com Sent: Monday 27th October 2014 16:40 To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. I'm curious, could you elaborate on the issue and the partial fix? Thanks! On 10/27/14 11:31, Markus Jelsma wrote: It is an ancient issue. One of the major contributors to the issue was resolved some versions ago but we are still seeing it sometimes too, there is nothing to see in the logs. We ignore it and just reindex. -Original message- From:S.L simpleliving...@gmail.com Sent: Monday 27th October 2014 16:25 To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Thank Otis, I have checked the logs , in my case the default catalina.out and I dont see any OOMs or , any other exceptions. What others metrics do you suggest ? On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, You may simply be overwhelming your cluster-nodes. Have you checked various metrics to see if that is the case? Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com wrote: Folks, I have posted previously about this , I am using SolrCloud 4.10.1 and have a sharded collection with 6 nodes , 3 shards and a replication factor of 2. I am indexing Solr using a Hadoop job , I have 15 Map fetch tasks , that can each have upto 5 threds each , so the load on the indexing side can get to as high as 75 concurrent threads. I am facing an issue where the replicas of a particular shard(s) are consistently getting out of synch , initially I thought this was beccause I was using a custom component , but I did a fresh install and removed the custom component and reindexed using the Hadoop job , I still see the same behavior. I do not see any exceptions in my catalina.out , like OOM , or any other excepitions, I suspecting thi scould be because of the multi-threaded indexing nature of the Hadoop job . I use CloudSolrServer from my java code to index and initialize the CloudSolrServer using a 3 node ZK ensemble. Does any one know of any known issues with a highly multi-threaded indexing and SolrCloud ? Can someone help ? This issue has been slowing things down on my end for a while now. Thanks and much appreciated!
RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
Hi - if there is a very large discrepancy, you could consider purging the smallest replica; it will then resync from the leader. -Original message- From:S.L simpleliving...@gmail.com Sent: Monday 27th October 2014 16:41 To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Markus, I would like to ignore it too, but what's happening is that there is a lot of discrepancy between the replicas; queries like q=*:*&fq=(id:220a8dce-3b31-4d46-8386-da8405595c47) fail depending on which replica the request goes to, because of the huge amount of discrepancy between the replicas. Thank you for confirming that it is a known issue, I was thinking I was the only one facing this due to my set up. On Mon, Oct 27, 2014 at 11:31 AM, Markus Jelsma markus.jel...@openindex.io wrote: It is an ancient issue. One of the major contributors to the issue was resolved some versions ago but we are still seeing it sometimes too, there is nothing to see in the logs. We ignore it and just reindex. -Original message- From:S.L simpleliving...@gmail.com Sent: Monday 27th October 2014 16:25 To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Thank Otis, I have checked the logs , in my case the default catalina.out and I dont see any OOMs or , any other exceptions. What others metrics do you suggest ? On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, You may simply be overwhelming your cluster-nodes. Have you checked various metrics to see if that is the case? Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com wrote: Folks, I have posted previously about this , I am using SolrCloud 4.10.1 and have a sharded collection with 6 nodes , 3 shards and a replication factor of 2. I am indexing Solr using a Hadoop job , I have 15 Map fetch tasks , that can each have upto 5 threds each , so the load on the indexing side can get to as high as 75 concurrent threads. I am facing an issue where the replicas of a particular shard(s) are consistently getting out of synch , initially I thought this was beccause I was using a custom component , but I did a fresh install and removed the custom component and reindexed using the Hadoop job , I still see the same behavior. I do not see any exceptions in my catalina.out , like OOM , or any other excepitions, I suspecting thi scould be because of the multi-threaded indexing nature of the Hadoop job . I use CloudSolrServer from my java code to index and initialize the CloudSolrServer using a 3 node ZK ensemble. Does any one know of any known issues with a highly multi-threaded indexing and SolrCloud ? Can someone help ? This issue has been slowing things down on my end for a while now. Thanks and much appreciated!
Re: SolrCloud config question and zookeeper
On Tuesday 28 October 2014 10:42:11 Bernd Fehling wrote: Thanks for the explanations. My idea about 4 zookeepers is a result of having the same software (java, zookeeper, solr, ...) installed on all 4 servers. But yes, I don't need to start a zookeeper on the 4th server. 3 other machines outside the cloud for ZK seems a bit oversized. And you have another point of failure with the network between ZK and the cloud. If one of the cloud servers ends up in smoke the ZK system should still work with ZK and cloud on the same servers. So the offline argument says the first thing I start is ZK and the last I shut down is ZK. Good point. While moving from master-slave to cloud I'm aware of the fact that all shards have to be connected to ZK. But how can I tell ZK that on server_1 is leader shard_1 AND replica shard_4 ? You don't, it will elect a leader by itself. Unfortunately the Getting Started with SolrCloud is a bit short on this. Regards Bernd Am 28.10.2014 um 09:15 schrieb Daniel Collins: As Michael says, you really want an odd number of zookeepers in order to meet the quorum requirements (which based on your comments you seem to be aware of). There is nothing wrong with 4 ZKs as such, just that it doesn't buy you anything above having 3, so it's one more that might go wrong and cause you problems. In your case, I would suggest you just pick the first 3 machines to run ZK or even have 3 other machines outside the cloud to house ZK. The offline argument is also a good one, you really want your ZK instances to be longer lived than Solr; whilst you can restart individual Cores within a Solr Instance, it is often (at least for us) more convenient to bounce the whole java instance. In that scenario (again just re-iterating what Michael said), you don't want ZK to be down at the same time. If you are using Solr Cloud, then all your replicas need to be connected to ZK, you can't have the master instances in ZK and the replicas not connected (that's more of the old Master-Slave replication system which is still available but orthogonal to Cloud). On 28 October 2014 07:01, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Yes, garbage collection is a very good argument to have external zookeepers. I haven't thought about that. But does this also mean a separate server for each zookeeper or can they live side by side with solr on the same server? What is the problem with 4 zookeepers beside that I have no real gain against 3 zookeepers (only 1 can fail)? Regards Bernd Am 27.10.2014 um 15:41 schrieb Michael Della Bitta: You want external zookeepers. Partially because you don't want your Solr garbage collections holding up zookeeper availability, but also because you don't want your zookeepers going offline if you have to restart Solr for some reason. Also, you want 3 or 5 zookeepers, not 4 or 8. On 10/27/14 10:35, Bernd Fehling wrote: While starting now with SolrCloud I tried to understand the sense of an external zookeeper. Let's assume I want to split 1 huge collection across 4 servers. My straightforward idea is to set up a cloud with 4 shards (one on each server) and also have a replication of each shard on another server. server_1: shard_1, shard_replication_4 server_2: shard_2, shard_replication_1 server_3: shard_3, shard_replication_2 server_4: shard_4, shard_replication_3 In this configuration I always have all 4 shards available if one server fails. But now to zookeeper. I would start the internal zookeeper for all shards including replicas. Does this make sense?
Or I only start the internal zookeeper for shard 1 to 4 but not the replicas. Should be good enough, one server can fail, or not? Or I follow the recommendations and install on all 4 server an external seperate zookeeper, but what is the advantage against having the internal zookeeper on each server? I really don't get it at this point. Can anyone help me here? Regards Bernd
RE: MoreLikeThis filter by score threshold
Hi - sure you can, using the frange parser as a filter: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser http://lucene.apache.org/solr/4_10_3/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html But this is very much not recommended, at all, so don't do it: https://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_filter_by_score.3F -Original message- From:Ali Nazemian alinazem...@gmail.com Sent: Tuesday 3rd February 2015 16:22 To: solr-user@lucene.apache.org Subject: MoreLikeThis filter by score threshold Hi, I was wondering how can I limit the result of MoreLikeThis query by the score value instead of filtering them by document count? Thank you very much. -- A.Nazemian
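A concrete shape of that filter, for the record - here the main query is reused and 0.8 is an arbitrary cut-off:

q=...&fq={!frange l=0.8}query($q)

This keeps only documents whose score for $q is at least 0.8, which is exactly the score-cutoff behaviour the FAQ above warns against.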
RE: Score results by only the highest scoring term
Either use the MaxScoreQueryParser [1] or set tie to zero when using a DisMax parser. [1]: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-MaxScoreQueryParser -Original message- From:Burgmans, Tom tom.burgm...@wolterskluwer.com Sent: Tuesday 3rd February 2015 16:13 To: solr-user@lucene.apache.org Subject: Score results by only the highest scoring term Hi All, I wonder if it's in some way possible to search for multiple terms like: (term A OR term B OR term C OR term D) and in case a document contains 2 or more of these terms: only the highest scoring term should contribute to the final relevancy score; possibly lower scoring terms should be discarded from the scoring algorithm. Ideally I'd like an operator like ANY: (term A ANY term B ANY term C ANY term D) that has the purpose: return documents, sorted by the score of the highest scoring term. Any thoughts about how to achieve this? _ Tom Burgmans
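For illustration, with the field name and terms made up:

q={!maxscore}text:termA OR text:termB OR text:termC OR text:termD

The maxscore parser scores each document by the maximum of the OR'ed clause scores instead of their sum, which is the ANY behaviour described above.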
RE: low qps with high load averages on solrcloud
We recently upgraded our cloud from 4.8 to 4.10.3, the only config we updated was the luceneMatchVersion. Response times were very stable prior to the upgrade, but are quite erratic since the upgrade, and rising. I still have to check all the resolved issues but something went very wrong between 4.8 and 4.10.3. M. -Original message- From:Toke Eskildsen t...@statsbiblioteket.dk Sent: Wednesday 4th February 2015 20:58 To: solr-user@lucene.apache.org Subject: RE: low qps with high load averages on solrcloud Suchi Amalapurapu [su...@bloomreach.com] wrote: Noticed that a solrcloud cluster doesn't scale linearly with # of nodes unlike the unsharded solr cluster. We are seeing a 10 fold drop in QPS in multi sharded mode. As I understand it, you changed from single to multi shard. Guessing wildly: You have one or more facets with a non-trivial (10K or more) number of unique String values and you have a fairly high facet.limit (50+). If so, what you see might be the penalty for the two-phase faceting with SolrCloud, where the second fine-counting phase can be markedly slower than the first. There are ways to help with that, but let's hear if my guess is correct first. - Toke Eskildsen
RE: Lucene cosine similarity score for more like this query
Hi - MoreLikeThis is not based on cosine similarity. The idea is that rare terms - high IDF - are extracted from the source document, and then used to build a regular Query(). That query follows the same rules as regular queries, the rules of your similarity implementation, which is TFIDF by default. So, as suggested, if you enable debugging, you can clearly see why scores can be above 1, or even much higher if queryNorm is disabled when using BM25 as similarity. If you really need cosine similarity between documents, you have to enable term vectors for the source fields, and use them to calculate the angle. The problem is that this does not scale well: you would need to calculate angles with virtually all other documents. M. -Original message- From:Ali Nazemian alinazem...@gmail.com Sent: Monday 2nd February 2015 21:39 To: solr-user@lucene.apache.org Subject: Re: Lucene cosine similarity score for more like this query Dear Erik, Thank you for your response. Would you please tell me why this score could be higher than 1, while cosine similarity cannot be higher than 1? On Feb 2, 2015 7:32 PM, Erik Hatcher erik.hatc...@gmail.com wrote: The scoring is the same as Lucene. To get deeper insight into how a score is computed, use Solr's debug=true mode to see the explain details in the response. Erik On Feb 2, 2015, at 10:49 AM, Ali Nazemian alinazem...@gmail.com wrote: Hi, I was wondering what the range of the score brought back by a more like this query in Solr is? I know that Lucene uses cosine similarity in the vector space model for calculating similarity between two documents. I also know that cosine similarity is between -1 and 1, but the fact that I don't understand is why the score brought back by a more like this query could be 12, for example?! Would you please explain what the calculation process in Solr is? Thank you very much. Best regards. -- A.Nazemian
RE: Hit Highlighting and More Like This
Hi - you can use the MLT query parser in Solr 5.0 or patch 4.10.x https://issues.apache.org/jira/browse/SOLR-6248 -Original message- From:Tim Hearn timseman...@gmail.com Sent: Saturday 31st January 2015 0:31 To: solr-user@lucene.apache.org Subject: Hit Highlighting and More Like This Hi all, I'm fairly new to Solr. It seems like it should be possible to enable the hit highlighting feature and more like this feature at the same time, with the key words from the MLT query being the terms highlighted. Is this possible? I am trying right now to do this, but I am not having any snippets returned to me. Thanks!
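With the query parser from that issue, MLT becomes a normal query, so highlighting works alongside it; a sketch with illustrative field names and document id:

q={!mlt qf=title,body mintf=1 mindf=3}1234&hl=true&hl.fl=body

where 1234 is the uniqueKey of the source document whose interesting terms drive the query.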
RE: Question regarding SolrIndexSearcher implementation
From memory: there are different methods in SolrIndexSearcher for a reason. It has to do with paging and sorting. Whenever you sort on a simple field, you can easily start at a specific offset. The problem with sorting on score is that the score has to be calculated for all documents matching the query; this is why deep paging is a problem. -Original message- From:Biyyala, Shishir (Contractor) shishir_biyy...@cable.comcast.com Sent: Monday 2nd February 2015 22:22 To: solr-user@lucene.apache.org Cc: java-u...@lucene.apache.org Subject: Question regarding SolrIndexSearcher implementation Hello, I did not know what the right mailing list would be (java-user vs solr-user), so mailing both. My group uses solr/lucene, and we have custom collectors. I stumbled upon the implementation of SolrIndexSearcher.java and saw this: https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java (line 1676) TopDocs topDocs = topCollector.topDocs(0, len); the topDocs start value is always being hardcoded to 0. What that is leading to is the creation of more topDocs than the application actually needs. My application can potentially be faced with deep pagination and we do not use the queryresults cache. If I request docs 200-250, I was expecting start=199, howMany=51; but it turns out that start=0 (always) and howMany=250. Any reasons why the start value is hardcoded to 0? Please suggest. It is potentially impacting the performance of our application. Thanks much, Shishir
RE: More Like This similarity tuning
Well, maxqt is easy, it is just the maximum number of terms that compose your query. MinTF is a strange parameter: rare terms have a low DF and usually not a high TF, so I would keep it at 1. MinDF is more useful; it depends entirely on the size of your corpus. If you have a lot of user-generated input - meaning badly spelled terms - then you have to set MinDF to a value higher than the DF of the most frequent misspellings but low enough to still find rare terms. It depends on your index. -Original message- From:Ali Nazemian alinazem...@gmail.com Sent: Wednesday 4th February 2015 11:15 To: solr-user@lucene.apache.org Subject: More Like This similarity tuning Hi, I am looking for best practices on the More Like This parameters. I really appreciate it if somebody can tell me what the best values for these parameters in an MLT query are, or at least the proper methodology for finding the best value for each of them: mlt.mintf mlt.mindf mlt.maxqt Thank you very much. Best regards. -- A.Nazemian
RE: MoreLikeThis filter by score threshold
Hello Upayavira - Indeed, it works, except ... insert-counter-arguments. It doesn't work after all :) Markus -Original message- From:Upayavira u...@odoko.co.uk Sent: Tuesday 3rd February 2015 21:38 To: solr-user@lucene.apache.org Subject: Re: MoreLikeThis filter by score threshold I've seen this done (encouraged against it, but didn't win). It works. Except, sometimes things change in the index, and the scores change subtly. We get complaints that documents that previously were above the threshold now aren't, and visa-versa. I try to explain that the score has no meaning between two search requests, but unfortunately, there's *enough* similarity between requests to make it work, *sometimes*. But when it doesn't work, people get baffled, and don't accept the truth as an answer (you can't use scores to compare separate sets of search results). Upayavira On Tue, Feb 3, 2015, at 08:01 PM, Ali Nazemian wrote: Dear Markus, Hi, Thank you very much for your response. I did check the reason why it is not recommended to filter by score in search query. But I think it is reasonable to filter by score in case of finding similar documents. I know in both of them (simple search query and mlt query) vsm of tf-idf similarity is used to calculate the score of documents, but suppose you indexed news as document in solr and you want to find all enough similar news for the specific one. In this case I think it is reasonable to filter similar documents by score threshold. Please correct me if I am wrong. Thank you very much. Regards. On Feb 3, 2015 7:00 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - sure you can, using the frange parser as a filter: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser http://lucene.apache.org/solr/4_10_3/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html But this is very much not recommended, at all, so don't do it: https://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_filter_by_score.3F -Original message- From:Ali Nazemian alinazem...@gmail.com Sent: Tuesday 3rd February 2015 16:22 To: solr-user@lucene.apache.org Subject: MoreLikeThis filter by score threshold Hi, I was wondering how can I limit the result of MoreLikeThis query by the score value instead of filtering them by document count? Thank you very much. -- A.Nazemian
RE: MoreLikeThis filter by score threshold
Hello Ali - no it is not reasonable, and it is unnecessary at best. Regardless of the query, you sort by score. This means that the top results are always the most relevant, so what exactly do you need to filter? -Original message- From:Ali Nazemian alinazem...@gmail.com Sent: Tuesday 3rd February 2015 21:02 To: solr-user@lucene.apache.org Subject: RE: MoreLikeThis filter by score threshold Dear Markus, Hi, Thank you very much for your response. I did check the reason why it is not recommended to filter by score in a search query. But I think it is reasonable to filter by score in the case of finding similar documents. I know in both of them (simple search query and mlt query) the vsm of tf-idf similarity is used to calculate the score of documents, but suppose you indexed news as documents in solr and you want to find all sufficiently similar news for a specific one. In this case I think it is reasonable to filter similar documents by a score threshold. Please correct me if I am wrong. Thank you very much. Regards. On Feb 3, 2015 7:00 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - sure you can, using the frange parser as a filter: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser http://lucene.apache.org/solr/4_10_3/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html But this is very much not recommended, at all, so don't do it: https://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_filter_by_score.3F -Original message- From:Ali Nazemian alinazem...@gmail.com Sent: Tuesday 3rd February 2015 16:22 To: solr-user@lucene.apache.org Subject: MoreLikeThis filter by score threshold Hi, I was wondering how can I limit the result of MoreLikeThis query by the score value instead of filtering them by document count? Thank you very much. -- A.Nazemian
Re: OutOfMemoryError for PDF document upload into Solr
Tika 1.6 has PDFBox 1.8.4, which has memory issues, eating excessive RAM! Either upgrade to Tika 1.7 (out now) or manually use the PDFBox 1.8.8 dependency. M. On Friday 16 January 2015 15:21:55 Charlie Hull wrote: On 16/01/2015 04:02, Dan Davis wrote: Why re-write all the document conversion in Java ;) Tika is very slow. 5 GB PDF is very big. Or you can run Tika in a separate process, or even on a separate machine, wrapped with something to cope if it dies due to some horrible input...we generally avoid document format translation within Solr and do it externally before feeding documents to Solr. Charlie If you have a lot of PDFs like that, try pdftotext in HTML and UTF-8 output mode. The HTML mode captures some meta-data that would otherwise be lost. If you need to go faster still, you can also write some stuff linked directly against the poppler library. Before you jump down my throat about Tika being slow - I wrote a PDF indexer that ran at 36 MB/s per core. Different indexer, all C, lots of setjmp/longjmp. But fast... On Thu, Jan 15, 2015 at 1:54 PM, ganesh.ya...@sungard.com wrote: Siegfried and Michael Thank you for your replies and help. -Original Message- From: Siegfried Goeschl [mailto:sgoes...@gmx.at] Sent: Thursday, January 15, 2015 3:45 AM To: solr-user@lucene.apache.org Subject: Re: OutOfMemoryError for PDF document upload into Solr Hi Ganesh, you can increase the heap size but parsing a 4 GB PDF document will very likely consume A LOT OF memory - I think you need to check if that large PDF can be parsed at all :-) Cheers, Siegfried Goeschl On 14.01.15 18:04, Michael Della Bitta wrote: Yep, you'll have to increase the heap size for your Tomcat container. http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly Michael Della Bitta Senior Software Engineer o: +1 646 532 3062 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Wed, Jan 14, 2015 at 12:00 PM, ganesh.ya...@sungard.com wrote: Hello, Can someone pass on the hints to get around the following error? Is there any Heap Size parameter I can set in Tomcat or in the Solr webapp that gets deployed in Solr? I am running the Solr webapp inside Tomcat on my local machine, which has 12 GB of RAM.
I have a PDF document, 4 GB max in size, that needs to be loaded into Solr.

Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
  at java.util.AbstractCollection.toArray(Unknown Source)
  at java.util.ArrayList.<init>(Unknown Source)
  at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
  at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
  at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
  at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
  at ...
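[Editor's note: a minimal sketch of Charlie's separate-process approach, driving the tika-app jar from Java so that an OOM or hang in PDFBox kills a child JVM instead of Solr. The jar path, child heap size and timeout are assumptions, and the timed waitFor requires Java 8:

    import java.io.File;
    import java.util.concurrent.TimeUnit;

    public class ExternalTika {
        // Extract plain text from a PDF in a child JVM; returns false on failure or timeout.
        public static boolean extract(File pdf, File out) throws Exception {
            Process p = new ProcessBuilder(
                    "java", "-Xmx2g", "-jar", "/opt/tika/tika-app-1.7.jar",
                    "--text", pdf.getAbsolutePath())
                .redirectOutput(out)                 // extracted text is written to a file
                .start();
            if (!p.waitFor(10, TimeUnit.MINUTES)) {  // cope with pathological input
                p.destroyForcibly();
                return false;
            }
            return p.exitValue() == 0;
        }
    }

The resulting text file can then be posted to Solr as a plain document, keeping the extraction failure domain outside the search cluster.]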
RE: American/British Dictionary for solr-4.10.2
There are no dictionaries that sum up all possible conjugations; using a heuristics-based normalizer would be more appropriate. There are nevertheless some good sources to start with: Contains lots of useful spelling issues, incl. British/American/Canadian/Australian: http://grammarist.com/spelling Very useful: http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences#Acronyms_and_abbreviations A handy list: http://www.avko.org/free/reference/british-vs-american-spelling.html There are some more lists, but it seems the other tab is no longer open! Good luck -Original message- From:dinesh naik dineshkumarn...@gmail.com Sent: Thursday 12th February 2015 7:17 To: solr-user@lucene.apache.org Subject: American/British Dictionary for solr-4.10.2 Hi, What are the dictionaries available for Solr 4.10.2? We are looking for a dictionary to support American/British English synonyms. -- Best Regards, Dinesh Naik
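[Editor's note: if a curated list is good enough, the usual approach needs no dictionary at all - map the British spellings onto the American ones with a synonym filter. A minimal sketch; the field type name and the pairs in the file are illustrative, extend them from the lists above:

    <!-- schema.xml -->
    <fieldType name="text_en_variants" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- synonyms_en_gb.txt, one mapping per line, e.g.:
             colour => color
             analyse => analyze
             centre => center -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en_gb.txt"
                ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldType>

With expand="false" the British form is collapsed onto the American one at both index and query time, so either spelling matches.]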
RE: unusually high 4.10.2 vs 4.3.1 RAM consumption
We have seen an increase between 4.8.1 and 4.10. -Original message- From:Dmitry Kan solrexp...@gmail.com Sent: Tuesday 17th February 2015 11:06 To: solr-user@lucene.apache.org Subject: unusually high 4.10.2 vs 4.3.1 RAM consumption Hi, We are currently comparing the RAM consumption of two parallel Solr clusters with different Solr versions: 4.10.2 and 4.3.1. For comparable shard index sizes (20G and 26G), we observed a 9G vs 5.6G RAM footprint (reserved RAM as seen by top), 4.3.1 being the winner. We have not changed solrconfig.xml for the upgrade to 4.10.2 and have reindexed the data from scratch. The commits are all controlled on the client, i.e. no auto-commits. Solr: 4.10.2 (high load, mass indexing) Java: 1.7.0_76 (Oracle) -Xmx25600m Solr: 4.3.1 (normal load, no mass indexing) Java: 1.7.0_11 (Oracle) -Xmx25600m The RAM consumption remained the same after the load stopped on the 4.10.2 cluster. Manually triggering garbage collection on a 4.10.2 shard via jvisualvm dropped the used RAM from 8.5G to 0.5G, but the reserved RAM as seen by top remained at the 9G level. This unusual spike happened during mass data indexing. What else could account for such a difference -- Solr or the JVM? Can it only be explained by the mass indexing? What is worrisome is that the 4.10.2 shard reserves 8x what it uses. What can be done about this? -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
RE: unusually high 4.10.2 vs 4.3.1 RAM consumption
I would have shared it if I had one :) -Original message- From:Dmitry Kan solrexp...@gmail.com Sent: Tuesday 17th February 2015 11:40 To: solr-user@lucene.apache.org Subject: Re: unusually high 4.10.2 vs 4.3.1 RAM consumption Have you found an explanation for that? On Tue, Feb 17, 2015 at 12:12 PM, Markus Jelsma markus.jel...@openindex.io wrote: We have seen an increase between 4.8.1 and 4.10. -Original message- From:Dmitry Kan solrexp...@gmail.com Sent: Tuesday 17th February 2015 11:06 To: solr-user@lucene.apache.org Subject: unusually high 4.10.2 vs 4.3.1 RAM consumption Hi, We are currently comparing the RAM consumption of two parallel Solr clusters with different Solr versions: 4.10.2 and 4.3.1. For comparable shard index sizes (20G and 26G), we observed a 9G vs 5.6G RAM footprint (reserved RAM as seen by top), 4.3.1 being the winner. We have not changed solrconfig.xml for the upgrade to 4.10.2 and have reindexed the data from scratch. The commits are all controlled on the client, i.e. no auto-commits. Solr: 4.10.2 (high load, mass indexing) Java: 1.7.0_76 (Oracle) -Xmx25600m Solr: 4.3.1 (normal load, no mass indexing) Java: 1.7.0_11 (Oracle) -Xmx25600m The RAM consumption remained the same after the load stopped on the 4.10.2 cluster. Manually triggering garbage collection on a 4.10.2 shard via jvisualvm dropped the used RAM from 8.5G to 0.5G, but the reserved RAM as seen by top remained at the 9G level. This unusual spike happened during mass data indexing. What else could account for such a difference -- Solr or the JVM? Can it only be explained by the mass indexing? What is worrisome is that the 4.10.2 shard reserves 8x what it uses. What can be done about this? -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Distributed unit tests and SSL doesn't have a valid keystore
Hi - in a small Maven project depending on Solr 4.10.3, unit tests that extend BaseDistributedSearchTestCase randomly fail with "SSL doesn't have a valid keystore" and a lot of zombie threads. We have a solrtest.keystore file lying around, but where do we put it? Thanks, Markus
RE: Extending solr analysis in index time
Hi - You mention having a list of important terms; then using payloads would be the most straightforward, I suppose. You still need a custom similarity and a custom query parser. Payloads work very well for us. M -Original message- From:Ahmet Arslan iori...@yahoo.com.INVALID Sent: Monday 12th January 2015 19:50 To: solr-user@lucene.apache.org Subject: Re: Extending solr analysis in index time Hi Ali, Reading your example, if you could somehow replace the idf component with your importance weight, I think your use case looks like TFIDFSimilarity. The tf component remains the same. https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html I also suggest you ask this on the Lucene mailing list. Someone familiar with the similarity package can give insight on this. Ahmet On Monday, January 12, 2015 6:54 PM, Jack Krupansky jack.krupan...@gmail.com wrote: Could you clarify what you mean by Lucene reverse index? That's not a term I am familiar with. -- Jack Krupansky On Mon, Jan 12, 2015 at 1:01 AM, Ali Nazemian alinazem...@gmail.com wrote: Dear Jack, Thank you very much. Yeah, I was thinking of function queries for sorting, but I have two problems in this case: 1) function queries do the processing at query time, which I don't want. 2) I also want to have the score field for retrieving and showing to users. Dear Alexandre, Here is some more explanation of the business behind the question: I am going to provide a field for each document; let's refer to it as document_score. I am going to fill this field based on information that can be extracted from the Lucene reverse index. Assume I have a list of terms, called important terms, and I am going to extract the term frequency of each term in this list per document. To be honest, I want to use the term frequency for calculating document_score. document_score should be stored since I am going to retrieve this field for each document. I also want to sort on document_score if the user prefers. I hope I did convey my point. Best regards. On Mon, Jan 12, 2015 at 12:53 AM, Jack Krupansky jack.krupan...@gmail.com wrote: Won't function queries do the job at query time? You can add or multiply the tf*idf score by a function of the term frequency of arbitrary terms, using the tf, mul, and add functions. See: https://cwiki.apache.org/confluence/display/solr/Function+Queries -- Jack Krupansky On Sun, Jan 11, 2015 at 10:55 AM, Ali Nazemian alinazem...@gmail.com wrote: Dear Jack, Hi, I think you misunderstood my need. I don't want to change the default scoring behavior of Lucene (tf-idf); I just want to have another field to sort on for some specific queries (not all the search business). However, I am aware of Lucene payloads. Thank you very much. On Sun, Jan 11, 2015 at 7:15 PM, Jack Krupansky jack.krupan...@gmail.com wrote: You would do that with a custom similarity (scoring) class. That's an expert feature. In fact a SUPER-expert feature. Start by completely familiarizing yourself with how TF*IDF similarity already works: http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html And to use your custom similarity class in Solr: https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity -- Jack Krupansky On Sun, Jan 11, 2015 at 9:04 AM, Ali Nazemian alinazem...@gmail.com wrote: Hi everybody, I am going to add some analysis to Solr at index time.
Here is what I am considering in my mind: Suppose I have two different fields in the Solr schema, field a and field b. I am going to use the created reverse index in a way that some terms are considered important ones, and tell Lucene to calculate a value based on these terms' frequency per document. For example, let the word hello be considered an important word with a weight of 2.0. Suppose the term frequency of this word in field a is 3 and in field b is 6 for document 1. Therefore the score value would be 2*3+(2*6)^2. I want to calculate this score based on these fields and put it in the index for retrieval. My question would be how I can do such a thing. First I considered using the terms component to calculate this value from outside and put it back into the Solr index, but that does not seem efficient enough. Thank you very much. Best regards. -- A.Nazemian -- A.Nazemian -- A.Nazemian
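[Editor's note: for reference, Jack's function-query route applied to the example above (weight 2.0 for hello, with tf 3 in field a and 6 in field b) can be sketched as plain request parameters; the alias score_fq is an illustrative name:

    sort=sum(mul(2,tf(a,'hello')),pow(mul(2,tf(b,'hello')),2)) desc
    fl=*,score_fq:sum(mul(2,tf(a,'hello')),pow(mul(2,tf(b,'hello')),2))

This reproduces 2*3+(2*6)^2 per document at query time; Ali's objection stands in that the value is computed per request rather than stored in the index.]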
RE: Distributed unit tests and SSL doesn't have a valid keystore
Thanks, we will suppress it for now! M. -Original message- From:Mark Miller markrmil...@gmail.com Sent: Monday 12th January 2015 19:25 To: solr-user@lucene.apache.org Subject: Re: Distributed unit tests and SSL doesn't have a valid keystore I'd have to do some digging. Hossman might know offhand. You might just want to use @SuppressSSL on the tests :) - Mark On Mon Jan 12 2015 at 8:45:11 AM Markus Jelsma markus.jel...@openindex.io wrote: Hi - in a small Maven project depending on Solr 4.10.3, unit tests that extend BaseDistributedSearchTestCase randomly fail with "SSL doesn't have a valid keystore" and a lot of zombie threads. We have a solrtest.keystore file lying around, but where do we put it? Thanks, Markus
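[Editor's note: a sketch of Mark's suggestion against the 4.10 test framework, where tests implement doTest(); the class name is illustrative:

    import org.apache.solr.BaseDistributedSearchTestCase;
    import org.apache.solr.SolrTestCaseJ4.SuppressSSL;

    // Opt this test out of randomized SSL so it no longer needs a test keystore.
    @SuppressSSL
    public class MyDistribTest extends BaseDistributedSearchTestCase {
        @Override
        public void doTest() throws Exception {
            // distributed assertions go here
        }
    }

]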
RE: multiple patterns in solr.PatternTokenizerFactory
You cannot pass multiple patterns; a single regex has to combine the alternatives, as yours already does with (SKU|Part(\sNumber)?). Note that with group=-1 the pattern acts as a split delimiter rather than extracting a single capture group. -Original message- From:Nivedita nivedita.pa...@tcs.com Sent: Monday 9th February 2015 12:08 To: solr-user@lucene.apache.org Subject: multiple patterns in solr.PatternTokenizerFactory Can I give multiple patterns in tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s([0-9-]+)" group="3"/ -- View this message in context: http://lucene.472066.n3.nabble.com/multiple-patterns-in-solr-PatternTokenizerFactory-tp4184986.html Sent from the Solr - User mailing list archive at Nabble.com.
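[Editor's note: a sketch of the two modes side by side, reusing the pattern from the question (attribute quoting restored; the second pattern is illustrative):

    <!-- group="3": emit only capture group 3 (the number part) as tokens -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="(SKU|Part(\sNumber)?):?\s([0-9-]+)" group="3"/>

    <!-- group="-1": the pattern is a delimiter; the input is split on every match -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[;,]\s*" group="-1"/>

]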
RE: Upgrading Solr 4.7.2 to 4.10.3
Well, CHANGES.txt is filled with just the information you need :) -Original message- From:Elan Palani elan.pal...@kaybus.com Sent: Tuesday 10th February 2015 22:30 To: solr-user@lucene.apache.org Subject: Upgrading Solr 4.7.2 to 4.10.3 Team.. Planning to upgrade Solr from 4.7.2 to 4.10.3. I just went through the documentation; it seems like a straightforward download/install. Any specific issues I should look out for? Any help will be appreciated. Thanks Elan
RE: Relevancy : Keyword stuffing
Hello - setting the (e)dismax tie breaker to 0, or much lower than the default, would `solve` this for now. Markus -Original message- From:Mihran Shahinian slowmih...@gmail.com Sent: Monday 16th March 2015 16:29 To: solr-user@lucene.apache.org Subject: Relevancy : Keyword stuffing Hi all, I have a use case where the data is generated by SEO-minded authors, and more often than not they perfectly guess the synonym expansions for the document titles, skewing results in their favor. At the moment I don't have an offline processing infrastructure to detect these (I can't punish these docs either... just have to level the playing field). I am experimenting with taking the max of the term scores, cutting off scores after a certain number of terms, etc., but would appreciate any hints if anyone has experience dealing with a similar use case in Solr. Much appreciated, Mihran
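[Editor's note: a sketch of zeroing the tie breaker on an edismax request; field names are illustrative:

    q=solar panels&defType=edismax&qf=title^2 keywords body&tie=0.0

With tie=0.0 a term's score is the maximum over the queried fields rather than that maximum plus tie times the sum of the rest, so stuffing the same term into many fields stops compounding.]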
RE: Relevancy : Keyword stuffing
Hello - Chris' suggestion is indeed a good one, but it can be tricky to configure the parameters properly. Regarding position information, you can override dismax to have it use SpanFirstQuery. It allows for setting strict boundaries from the front of the document to a given position. You can also override SpanFirstQuery to incorporate a gradient, decreasing the boost as distance from the front increases. I don't know how you ingest document bodies, but if they are unstructured HTML, you may want to install proper main content extraction if you haven't already. Having decent control over the HTML is a powerful tool. You may also want to look at Lucene's BM25 implementation. It is simple to set up and easier to control. It isn't as rough a tool as TF-IDF with regard to length normalization, and it allows you to smooth tf, which in your case should also help. If you'd like to scrutinize SweetSpotSimilarity and get some proper results, you are more than welcome to share them here :) Markus -Original message- From:Mihran Shahinian slowmih...@gmail.com Sent: Monday 16th March 2015 22:41 To: solr-user@lucene.apache.org Subject: Re: Relevancy : Keyword stuffing Thank you Markus and Chris for the pointers. For SweetSpotSimilarity, I am thinking a set of closed ranges exposed via the similarity config is easier to maintain as data changes than making adjustments to fit a function. Another piece of info that would've been handy is the average position plus the positions of the first few occurrences of each term. This would allow higher boosting for term occurrences earlier in the doc. In my case the extra keywords are towards the end of the doc, but that info does not seem to be propagated into the scorer. Thanks again, Mihran On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter hossman_luc...@fucit.org wrote: You should start by checking out the SweetSpotSimilarity .. it was heavily designed around the idea of dealing with things like excessively verbose titles, and keyword stuffing in summary text ... so you can configure your expectation for what a normal length doc is, and docs will be penalized for being longer than that. Similarly you can say what a 'reasonable' tf is, and docs that exceed that wouldn't get added boost (which in conjunction with the lengthNorm penalty penalizes docs that stuff keywords). https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg -Hoss http://www.lucidworks.com/
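[Editor's note: for the BM25 route, the wiring is small; a sketch for schema.xml with the common default values - lower b to soften length normalization and lower k1 to saturate tf sooner, both of which work against stuffing. Per-fieldType similarities require the global similarity to be solr.SchemaSimilarityFactory:

    <similarity class="solr.SchemaSimilarityFactory"/>

    <fieldType name="text_body" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <similarity class="solr.BM25SimilarityFactory">
        <float name="k1">1.2</float>
        <float name="b">0.75</float>
      </similarity>
    </fieldType>

]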
RE: Distributed IDF performance
Anshum, Jack - don't any of you have a cluster at hand to get some real results on this? After testing the actual functionality for quite some time while the final patch was in development, we have not had the chance to work on performance tests. We are still on Solr 4.10 and have to port lots of Lucene stuff to 5. I would sure like to see some numbers from any of you :) Markus -Original message- From:Anshum Gupta ans...@anshumgupta.net Sent: Friday 13th March 2015 23:33 To: solr-user@lucene.apache.org Subject: Re: Distributed IDF performance np! I forgot to mention that I didn't notice any considerable performance hit in my tests. The QTimes were barely off by 5%. On Fri, Mar 13, 2015 at 3:13 PM, Jack Krupansky jack.krupan...@gmail.com wrote: Oops... I said StatsInfo and that should have been StatsCache (<statsCache .../>). -- Jack Krupansky On Fri, Mar 13, 2015 at 6:04 PM, Anshum Gupta ans...@anshumgupta.net wrote: There's no rough formula or performance data that I know of at this point. About the guidance: if you want to use global stats, my obvious choice would be to use the LRUStatsCache. Before committing, I did run some tests on my MacBook, but as I said back then, they shouldn't be taken entirely at face value. The tests didn't involve any network and were just about 20M docs and synthetic queries. On Fri, Mar 13, 2015 at 2:08 PM, Jack Krupansky jack.krupan...@gmail.com wrote: Does anybody have any actual performance data, or even a rough formula for calculating the overhead, for using the new Solr 5.0 Distributed IDF (SOLR-1632 https://issues.apache.org/jira/browse/SOLR-1632)? And any guidance as far as which StatsInfo plugin is best to use? Are many people now using Distributed IDF as their default? I'm not currently using this, but the existing doc and Jira are too minimal to offer the guidance requested above. Mostly I'm just curious. Thanks. -- Jack Krupansky -- Anshum Gupta -- Anshum Gupta
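[Editor's note: for reference, enabling distributed IDF is a one-line change in solrconfig.xml; a sketch with the LRU implementation Anshum favors (the default, LocalStatsCache, uses per-shard stats only):

    <statsCache class="org.apache.solr.search.stats.LRUStatsCache"/>

]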