Analysis page broken on trunk?

2014-01-08 Thread Markus Jelsma
Hi - it seems the analysis page is broken on trunk and it looks like our 4.5 
and 4.6 builds are unaffected. Can anyone on trunk confirm this? 
Markus


RE: Analysis page broken on trunk?

2014-01-08 Thread Markus Jelsma
Hi - You will see on the left side each filter abbreviation but you won't see 
anything in the right container. No terms, positions, offsets, nothing.

Markus
 
 
-Original message-
 From:Stefan Matheis matheis.ste...@gmail.com
 Sent: Wednesday 8th January 2014 14:10
 To: solr-user@lucene.apache.org
 Subject: Re: Analysis page broken on trunk?
 
 Hey Markus
 
 I'm not up to date with the latest changes, but if you can describe how to 
 reproduce it, I can try to verify it?
 
 -Stefan  
 
 
 On Wednesday, January 8, 2014 at 12:44 PM, Markus Jelsma wrote:
 
  Hi - it seems the analysis page is broken on trunk and it looks like our 
  4.5 and 4.6 builds are unaffected. Can anyone on trunk confirm this? 
  Markus
  
  
 
 
 


RE: Simple payloads example not working

2014-01-13 Thread Markus Jelsma
Check the bytes property:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/util/BytesRef.html#bytes

  @Override
  public float scorePayload(int doc, int start, int end, BytesRef payload) {
if (payload != null) {
  return PayloadHelper.decodeFloat(payload.bytes);
}
return 1.0f;
  }
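
Note that BytesRef.bytes is a view into a shared buffer, so the decode should honour
the offset as well. A minimal sketch, assuming the payload was written with
PayloadHelper.encodeFloat (the class name here is only illustrative):

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.util.BytesRef;

public class FloatPayloadSimilarity extends DefaultSimilarity {
  @Override
  public float scorePayload(int doc, int start, int end, BytesRef payload) {
    if (payload != null) {
      // BytesRef points into a shared byte[]; decode at payload.offset,
      // not at index 0, or every term may appear to carry the same value.
      return PayloadHelper.decodeFloat(payload.bytes, payload.offset);
    }
    return 1.0f;
  }
}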



 
 
-Original message-
 From:michael.boom my_sky...@yahoo.com
 Sent: Monday 13th January 2014 14:49
 To: solr-user@lucene.apache.org
 Subject: Re: Simple payloads example not working
 
 Thanks iorixxx,
 
 Actually I've just tried it and hit a small wall: the tutorial doesn't look
 up to date with the codebase.
 When implementing my custom similarity class I should be using
 PayloadHelper, but the following happens:
 
 in PayloadHelper:
 public static final float decodeFloat(byte [] bytes, int offset)
 
 in DefaultSimilarity:
 public float scorePayload(int doc, int start, int end, BytesRef payload) 
 
 So it's BytesRef vs. byte[].
 How should I proceed in this scenario?
 
 
 
 -
 Thanks,
 Michael
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Simple-payloads-example-not-working-tp4110998p4111040.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


RE: Analysis page broken on trunk?

2014-01-13 Thread Markus Jelsma
:[2,2,2,2,2,2,2,2,2,2],
      "start":4,
      "end":7,
      "type":"word",
      "org.apache.lucene.analysis.tokenattributes.KeywordAttribute#keyword":false},
    {
      "text":"bla",
      "raw_bytes":"[62 6c 61]",
      "position":3,
      "positionHistory":[3,3,3,3,3,3,3,3,3,3],
      "start":8,
      "end":11,
      "type":"word",
      "org.apache.lucene.analysis.tokenattributes.KeywordAttribute#keyword":false}],
    "org.apache.lucene.analysis.miscellaneous.LengthFilter",[{
      "text":"bla",
      "raw_bytes":"[62 6c 61]",
      "position":1,
      "positionHistory":[1,1,1,1,1,1,1,1,1,1,1],
      "start":0,
      "end":3,
      "type":"word",
      "org.apache.lucene.analysis.tokenattributes.KeywordAttribute#keyword":false},
    {
      "text":"bla",
      "raw_bytes":"[62 6c 61]",
      "position":2,
      "positionHistory":[2,2,2,2,2,2,2,2,2,2,2],
      "start":4,
      "end":7,
      "type":"word",
      "org.apache.lucene.analysis.tokenattributes.KeywordAttribute#keyword":false},
    {
      "text":"bla",
      "raw_bytes":"[62 6c 61]",
      "position":3,
      "positionHistory":[3,3,3,3,3,3,3,3,3,3,3],
      "start":8,
      "end":11,
      "type":"word",
      "org.apache.lucene.analysis.tokenattributes.KeywordAttribute#keyword":false}]]}},
  "field_names":{}}}


 
 
-Original message-
 From:Stefan Matheis matheis.ste...@gmail.com
 Sent: Friday 10th January 2014 11:35
 To: solr-user@lucene.apache.org
 Subject: Re: Analysis page broken on trunk?
 
 Sorry for not getting back on this earlier - I've tried several fields with 
 values from the example docs and that looks pretty okay to me; I noticed no 
 change there.
 
 Can you share a screenshot or something like that? And perhaps Input, 
 Fields/Fieldtype which doesn't work for you?
 
 -Stefan 
 
 
 On Wednesday, January 8, 2014 at 2:24 PM, Markus Jelsma wrote:
 
  Hi - You will see on the left side each filter abbreviation but you won't 
  see anything in the right container. No terms, positions, offsets, nothing.
  
  Markus
  
  
  -Original message-
   From:Stefan Matheis matheis.ste...@gmail.com 
   (mailto:matheis.ste...@gmail.com)
   Sent: Wednesday 8th January 2014 14:10
   To: solr-user@lucene.apache.org (mailto:solr-user@lucene.apache.org)
   Subject: Re: Analysis page broken on trunk?
   
   Hey Markus
   
   i'm not up to date with the latest changes, but if you can describe how 
   to reproduce it, i can try to verify that?
   
   -Stefan 
   
   
   On Wednesday, January 8, 2014 at 12:44 PM, Markus Jelsma wrote:
   
Hi - it seems the analysis page is broken on trunk and it looks like 
our 4.5 and 4.6 builds are unaffected. Can anyone on trunk confirm 
this? 
Markus

   
   
  
  
  
 
 
 


RE: Simple payloads example not working

2014-01-14 Thread Markus Jelsma
Strange - is it really floats you are inserting as payloads? We use payloads too, 
but we write them as floats via PayloadAttribute in custom token filters.
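
For reference, a minimal sketch of that kind of token filter, attaching a float
payload via PayloadAttribute; the class name and the fixed weight are only for
illustration:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Attaches the same float payload to every token.
public final class FloatPayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final float weight;

  public FloatPayloadFilter(TokenStream input, float weight) {
    super(input);
    this.weight = weight;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // PayloadHelper.encodeFloat produces the 4-byte encoding that decodeFloat expects.
    payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(weight)));
    return true;
  }
}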
 
-Original message-
 From:michael.boom my_sky...@yahoo.com
 Sent: Tuesday 14th January 2014 11:59
 To: solr-user@lucene.apache.org
 Subject: RE: Simple payloads example not working
 
 Investigating, it looks like the payload.bytes property is where the problem
 is.
 payload.toString() outputs correct values, but the .bytes property seems to
 behave a little weirdly:
 public class CustomSimilarity extends DefaultSimilarity {
 
     @Override
     public float scorePayload(int doc, int start, int end, BytesRef payload) {
         if (payload != null) {
             Float pscore = PayloadHelper.decodeFloat(payload.bytes);
             System.out.println("payload : " + payload.toString()
                 + ", payload bytes: " + payload.bytes.toString()
                 + ", decoded value is " + pscore);
             return pscore;
         }
         return 1.0f;
     }
 }
 
 outputs on query:
 http://localhost:8983/solr/collection1/pds-search?q=payloads:testone&wt=json&indent=true&debugQuery=true
 
 payload : [41 26 66 66], payload bytes: [B@149c678, decoded value is 10.4
 payload : [41 f0 0 0], payload bytes: [B@149c678, decoded value is 10.4
 payload : [42 4a cc cd], payload bytes: [B@149c678, decoded value is 10.4
 payload : [42 c6 0 0], payload bytes: [B@149c678, decoded value is 10.4
 payload : [41 26 66 66], payload bytes: [B@850fb7, decoded value is 10.4
 payload : [41 f0 0 0], payload bytes: [B@1cad357, decoded value is 10.4
 payload : [42 4a cc cd], payload bytes: [B@f922cf, decoded value is 10.4
 payload : [42 c6 0 0], payload bytes: [B@5c4dc4, decoded value is 10.4
 
 
 Something doesn't seem right here. Any idea why this behaviour occurs?
 Is anyone using payloads with Solr 4.6.0?
 
 
 
 
 -
 Thanks,
 Michael
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Simple-payloads-example-not-working-tp4110998p4111214.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


RE: Indexing URLs from websites

2014-01-16 Thread Markus Jelsma
 
-Original message-
 From:Teague James teag...@insystechinc.com
 Sent: Wednesday 15th January 2014 22:01
 To: solr-user@lucene.apache.org
 Subject: Re: Indexing URLs from websites
 
 I am still unsuccessful in getting this to work. My expectation is that the
 index-anchor plugin should produce values for the field anchor. However this
 field is not showing up in my Solr index no matter what I try.
 
 Here's what I have in my nutch-site.xml for plugins:
 <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|urlnormalizer-(pass|reges|basic)</value>
 
 I am using the schema-solr4.xml from the Nutch package and I added the
 _version_ field
 
 Here's the command I'm running:
 Bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50
 
 The fields that Solr returns are:
 Content, title, segment, boost, digest, tstamp, id, url, and _version_
 
 Note that the url field is the url of the page being indexed and not the
 url(s) of the documents that may be outlinks on that page. It is the
 outlinks that I am trying to get into the index.
 
 What am I missing? I also tried using the invertlinks command that Markus
 suggested, but that did not work either, though I do appreciate the
 suggestion.

That did get you a LinkDB, right? You need to call solrindex and pass the 
linkdb's location as part of the arguments; only then does Nutch know about it and 
use the data contained in the LinkDB, together with the index-anchor plugin, 
to write the anchor field in your Solr index.

 
 Any help is appreciated! Thanks!
 
 Markus Jelsma Wrote:
 You need to use the invertlinks command to build a database with docs with
 inlinks and anchors. Then use the index-anchor plugin when indexing. Then
 you will have a multivalued field with anchors pointing to your document. 
 
 Teague James Wrote:
 I am trying to index a website that contains links to documents such as PDF,
 Word, etc. The intent is to be able to store the URLs for the links to the
 documents. 
 
 For example, when indexing www.example.com which has links on the page like
 Example Document which points to www.example.com/docs/example.pdf, I want
 Solr to store the text of the link, Example Document, and the URL for the
 link, www.example.com/docs/example.pdf in separate fields. I've tried
 using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page
 content, but I am not getting the URLs from the links. There are no document
 type restrictions in Nutch for PDF or Word. Any suggestions on how I can
 accomplish this? Should I use a different method than Nutch for crawling the
 site?
 
 I appreciate any help on this!
 
 
 


RE: Indexing URLs from websites

2014-01-16 Thread Markus Jelsma
Hi - you cannot use wildcards for segments. You need to give one segment or a 
-dir segments_dir. Check the usage of your indexer command. 
 
-Original message-
 From:Teague James teag...@insystechinc.com
 Sent: Thursday 16th January 2014 16:43
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
 Hello Markus,
 
 I do get a linkdb folder in the crawl folder that gets created - but it is 
 created automatically by Nutch at the time that I execute the command. I just 
 tried to use solrindex against yesterday's crawl and did not get any errors, 
 but did not get the anchor field or any of the outlinks. I used this command:
 bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb 
 crawl/segments/*
 
 I then tried:
 bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb 
 crawl/segments/*
 This produced the following errors:
 Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not 
 exist: file:/.../crawl/linkdb/crawl_fetch
 Input path does not exist: file:/.../crawl/linkdb/crawl_parse
 Input path does not exist: file:/.../crawl/linkdb/parse_data
 Input path does not exist: file:/.../crawl/linkdb/parse_text
 Along with a Java stacktrace
 
 So I tried invertlinks as you had previously suggested. No errors, but the 
 above missing directories were not created. Using the same solrindex command 
 above this one produced the same errors. 
 
 When/How are the missing directories supposed to be created?
 
 I really appreciate the help! Thank you very much!
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
 Sent: Thursday, January 16, 2014 5:45 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
  
 -Original message-
  From:Teague James teag...@insystechinc.com
  Sent: Wednesday 15th January 2014 22:01
  To: solr-user@lucene.apache.org
  Subject: Re: Indexing URLs from websites
  
  I am still unsuccessful in getting this to work. My expectation is 
  that the index-anchor plugin should produce values for the field 
  anchor. However this field is not showing up in my Solr index no matter 
  what I try.
  
  Here's what I have in my nutch-site.xml for plugins:
   <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|urlnormalizer-(pass|reges|basic)</value>
  
  I am using the schema-solr4.xml from the Nutch package and I added the 
  _version_ field
  
  Here's the command I'm running:
  Bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50
  
  The fields that Solr returns are:
  Content, title, segment, boost, digest, tstamp, id, url, and _version_
  
  Note that the url field is the url of the page being indexed and not 
  the
  url(s) of the documents that may be outlinks on that page. It is the 
  outlinks that I am trying to get into the index.
  
  What am I missing? I also tried using the invertlinks command that 
  Markus suggested, but that did not work either, though I do appreciate 
  the suggestion.
 
 That did get you a LinkDB right? You need to call solrindex and use the 
 linkdb's location as part of the arguments, only then Nutch knows about it 
 and will use the data contained in the LinkDB together with the index-anchor 
 plugin to write the anchor field in your Solrindex.
 
  
  Any help is appreciated! Thanks!
  
  Markus Jelsma Wrote:
  You need to use the invertlinks command to build a database with docs 
  with inlinks and anchors. Then use the index-anchor plugin when 
  indexing. Then you will have a multivalued field with anchors pointing to 
  your document.
  
  Teague James Wrote:
  I am trying to index a website that contains links to documents such 
  as PDF, Word, etc. The intent is to be able to store the URLs for the 
  links to the documents.
  
  For example, when indexing www.example.com which has links on the page 
  like Example Document which points to 
  www.example.com/docs/example.pdf, I want Solr to store the text of the 
  link, Example Document, and the URL for the link, 
  www.example.com/docs/example.pdf in separate fields. I've tried 
  using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page 
  content, but I am not getting the URLs from the links. There are no 
  document type restrictions in Nutch for PDF or Word. Any suggestions 
  on how I can accomplish this? Should I use a different method than Nutch 
  for crawling the site?
  
  I appreciate any help on this!
  
  
  
 
 


RE: Indexing URLs from websites

2014-01-16 Thread Markus Jelsma
Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] 
(<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] 
[-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]

You must point to the linkdb via the -linkdb parameter. 
 
-Original message-
 From:Teague James teag...@insystechinc.com
 Sent: Thursday 16th January 2014 16:57
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
 Okay. I changed my solrindex to this:
 
 bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb 
 crawl/segments/20140115143147
 
 I got the same errors:
 Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not 
 exist: file:/.../crawl/linkdb/crawl_fetch
 Input path does not exist: file:/.../crawl/linkdb/crawl_parse
 Input path does not exist: file:/.../crawl/linkdb/parse_data 
 Input path does not exist: file:/.../crawl/linkdb/parse_text 
 Along with a Java stacktrace
 
 Those linkdb folders are not being created.
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
 Sent: Thursday, January 16, 2014 10:44 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
 Hi - you cannot use wildcards for segments. You need to give one segment or a 
 -dir segments_dir. Check the usage of your indexer command. 
  
 -Original message-
  From:Teague James teag...@insystechinc.com
  Sent: Thursday 16th January 2014 16:43
  To: solr-user@lucene.apache.org
  Subject: RE: Indexing URLs from websites
  
  Hello Markus,
  
  I do get a linkdb folder in the crawl folder that gets created - but it is 
  created at the time that I execute the command automatically by Nutch. I 
  just tried to use solrindex against yesterday's cawl and did not get any 
  errors, but did not get the anchor field or any of the outlinks. I used 
  this command:
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
  crawl/linkdb crawl/segments/*
  
  I then tried:
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb 
  crawl/segments/* This produced the following errors:
  Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path 
  does not exist: file:/.../crawl/linkdb/crawl_fetch
  Input path does not exist: file:/.../crawl/linkdb/crawl_parse
  Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
  path does not exist: file:/.../crawl/linkdb/parse_text Along with a 
  Java stacktrace
  
  So I tried invertlinks as you had previously suggested. No errors, but the 
  above missing directories were not created. Using the same solrindex 
  command above this one produced the same errors. 
  
  When/How are the missing directories supposed to be created?
  
  I really appreciate the help! Thank you very much!
  
  -Original Message-
  From: Markus Jelsma [mailto:markus.jel...@openindex.io]
  Sent: Thursday, January 16, 2014 5:45 AM
  To: solr-user@lucene.apache.org
  Subject: RE: Indexing URLs from websites
  
   
  -Original message-
   From:Teague James teag...@insystechinc.com
   Sent: Wednesday 15th January 2014 22:01
   To: solr-user@lucene.apache.org
   Subject: Re: Indexing URLs from websites
   
   I am still unsuccessful in getting this to work. My expectation is 
   that the index-anchor plugin should produce values for the field 
   anchor. However this field is not showing up in my Solr index no matter 
   what I try.
   
   Here's what I have in my nutch-site.xml for plugins:
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|urlnormalizer-(pass|reges|basic)</value>
   
   I am using the schema-solr4.xml from the Nutch package and I added 
   the _version_ field
   
   Here's the command I'm running:
   Bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50
   
   The fields that Solr returns are:
   Content, title, segment, boost, digest, tstamp, id, url, and 
   _version_
   
   Note that the url field is the url of the page being indexed and not 
   the
   url(s) of the documents that may be outlinks on that page. It is the 
   outlinks that I am trying to get into the index.
   
   What am I missing? I also tried using the invertlinks command that 
   Markus suggested, but that did not work either, though I do 
   appreciate the suggestion.
  
  That did get you a LinkDB right? You need to call solrindex and use the 
  linkdb's location as part of the arguments, only then Nutch knows about it 
  and will use the data contained in the LinkDB together with the 
  index-anchor plugin to write the anchor field in your Solrindex.
  
   
   Any help is appreciated! Thanks!
   
   Markus Jelsma Wrote:
   You need to use the invertlinks command to build a database with 
   docs with inlinks and anchors. Then use the index-anchor plugin when 
   indexing

RE: Indexing URLs from websites

2014-01-17 Thread Markus Jelsma


 
 
-Original message-
 From:Teague James teag...@insystechinc.com
 Sent: Thursday 16th January 2014 20:23
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
 Okay. I had used that previously and I just tried it again. The following 
 generated no errors:
 
 bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb 
 -dir crawl/segments/
 
 Solr is still not getting an anchor field and the outlinks are not appearing 
 in the index anywhere else.
 
 To be sure I deleted the crawl directory and did a fresh crawl using:
 
 bin/nutch crawl urls -dir crawl -depth 3 -topN 50
 
 Then
 
 bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb 
 -dir crawl/segments/
 
 No errors, but no anchor fields or outlinks. One thing in the response from 
 the crawl that I found interesting was a line that said:
 
 LinkDb: internal links will be ignored.

Good catch! That is likely the problem. 

 
 What does that mean?

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

So change the property, rebuild the linkdb and try reindexing once again :)

 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
 Sent: Thursday, January 16, 2014 11:08 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
 Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] 
 (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] 
 [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
 
 You must point to the linkdb via the -linkdb parameter. 
  
 -Original message-
  From:Teague James teag...@insystechinc.com
  Sent: Thursday 16th January 2014 16:57
  To: solr-user@lucene.apache.org
  Subject: RE: Indexing URLs from websites
  
  Okay. I changed my solrindex to this:
  
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb 
  crawl/segments/20140115143147
  
  I got the same errors:
  Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path 
  does not exist: file:/.../crawl/linkdb/crawl_fetch
  Input path does not exist: file:/.../crawl/linkdb/crawl_parse
  Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
  path does not exist: file:/.../crawl/linkdb/parse_text Along with a 
  Java stacktrace
  
  Those linkdb folders are not being created.
  
  -Original Message-
  From: Markus Jelsma [mailto:markus.jel...@openindex.io]
  Sent: Thursday, January 16, 2014 10:44 AM
  To: solr-user@lucene.apache.org
  Subject: RE: Indexing URLs from websites
  
  Hi - you cannot use wildcards for segments. You need to give one segment or 
  a -dir segments_dir. Check the usage of your indexer command. 
   
  -Original message-
   From:Teague James teag...@insystechinc.com
   Sent: Thursday 16th January 2014 16:43
   To: solr-user@lucene.apache.org
   Subject: RE: Indexing URLs from websites
   
   Hello Markus,
   
   I do get a linkdb folder in the crawl folder that gets created - but it 
   is created at the time that I execute the command automatically by Nutch. 
   I just tried to use solrindex against yesterday's cawl and did not get 
   any errors, but did not get the anchor field or any of the outlinks. I 
   used this command:
   bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
   crawl/linkdb crawl/segments/*
   
   I then tried:
   bin/nutch solrindex http://localhost/solr/ crawl/crawldb 
   crawl/linkdb
   crawl/segments/* This produced the following errors:
   Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path 
   does not exist: file:/.../crawl/linkdb/crawl_fetch
   Input path does not exist: file:/.../crawl/linkdb/crawl_parse
   Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
   path does not exist: file:/.../crawl/linkdb/parse_text Along with a 
   Java stacktrace
   
   So I tried invertlinks as you had previously suggested. No errors, but 
   the above missing directories were not created. Using the same solrindex 
   command above this one produced the same errors. 
   
   When/How are the missing directories supposed to be created?
   
   I really appreciate the help! Thank you very much!
   
   -Original Message-
   From: Markus Jelsma [mailto:markus.jel...@openindex.io]
   Sent: Thursday, January 16, 2014 5:45 AM
   To: solr-user@lucene.apache.org
   Subject: RE: Indexing URLs from websites
   

   -Original message-
From:Teague James teag...@insystechinc.com
Sent: Wednesday 15th January 2014 22:01
To: solr-user@lucene.apache.org
Subject: Re: Indexing URLs from websites

I am still unsuccessful in getting this to work. My expectation is 
that the index

RE: Indexing URLs from websites

2014-01-20 Thread Markus Jelsma
Well it is hard to get a specific anchor because there is usually more than 
one. The content of the anchors field should be correct. What would you expect 
if there are multiple anchors? 
 
-Original message-
 From:Teague James teag...@insystechinc.com
 Sent: Friday 17th January 2014 18:13
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
 Progress!
 
 I changed the value of that property in nutch-default.xml and I am getting 
 the anchor field now. However, the stuff going in there is a bit random and 
 doesn't seem to correlate to the pages I'm crawling. The primary objective is 
 that when there is something on the page that is a link to a file 
 ...href=/blah/somefile.pdfGet the PDF!... (using ... to prevent actual 
 code in the email) I want to capture that URL and the anchor text Get the 
 PDF! into field(s).
 
 Am I going in the right direction on this?
 
 Thank you so much for sticking with me on this - I really appreciate your 
 help!
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
 Sent: Friday, January 17, 2014 6:46 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
 
 
  
  
 -Original message-
  From:Teague James teag...@insystechinc.com
  Sent: Thursday 16th January 2014 20:23
  To: solr-user@lucene.apache.org
  Subject: RE: Indexing URLs from websites
  
  Okay. I had used that previously and I just tried it again. The following 
  generated no errors:
  
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
  crawl/linkdb -dir crawl/segments/
  
  Solr is still not getting an anchor field and the outlinks are not 
  appearing in the index anywhere else.
  
  To be sure I deleted the crawl directory and did a fresh crawl using:
  
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50
  
  Then
  
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
  crawl/linkdb -dir crawl/segments/
  
  No errors, but no anchor fields or outlinks. One thing in the response from 
  the crawl that I found interesting was a line that said:
  
  LinkDb: internal links will be ignored.
 
 Good catch! That is likely the problem. 
 
  
  What does that mean?
 
 <property>
   <name>db.ignore.internal.links</name>
   <value>true</value>
   <description>If true, when adding new links to a page, links from
   the same host are ignored.  This is an effective way to limit the
   size of the link database, keeping only the highest quality
   links.
   </description>
 </property>
 
 So change the property, rebuild the linkdb and try reindexing once again :)
 
  
  -Original Message-
  From: Markus Jelsma [mailto:markus.jel...@openindex.io]
  Sent: Thursday, January 16, 2014 11:08 AM
  To: solr-user@lucene.apache.org
  Subject: RE: Indexing URLs from websites
  
  Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] 
  (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] 
  [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
  
  You must point to the linkdb via the -linkdb parameter. 
   
  -Original message-
   From:Teague James teag...@insystechinc.com
   Sent: Thursday 16th January 2014 16:57
   To: solr-user@lucene.apache.org
   Subject: RE: Indexing URLs from websites
   
   Okay. I changed my solrindex to this:
   
   bin/nutch solrindex http://localhost/solr/ crawl/crawldb 
   crawl/linkdb
   crawl/segments/20140115143147
   
   I got the same errors:
   Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path 
   does not exist: file:/.../crawl/linkdb/crawl_fetch
   Input path does not exist: file:/.../crawl/linkdb/crawl_parse
   Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
   path does not exist: file:/.../crawl/linkdb/parse_text Along with a 
   Java stacktrace
   
   Those linkdb folders are not being created.
   
   -Original Message-
   From: Markus Jelsma [mailto:markus.jel...@openindex.io]
   Sent: Thursday, January 16, 2014 10:44 AM
   To: solr-user@lucene.apache.org
   Subject: RE: Indexing URLs from websites
   
   Hi - you cannot use wildcards for segments. You need to give one segment 
   or a -dir segments_dir. Check the usage of your indexer command. 

   -Original message-
From:Teague James teag...@insystechinc.com
Sent: Thursday 16th January 2014 16:43
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Hello Markus,

I do get a linkdb folder in the crawl folder that gets created - but it 
is created at the time that I execute the command automatically by 
Nutch. I just tried to use solrindex against yesterday's cawl and did 
not get any errors, but did not get the anchor field or any of the 
outlinks. I used this command:
bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
crawl/linkdb crawl/segments/*

I then tried:
bin/nutch solrindex http

RE: Solr middle-ware?

2014-01-21 Thread Markus Jelsma
Hi - We use Nginx to expose the index to the internet. It comes down to putting 
some limitations on input parameters and rewriting queries on the fly using 
embedded Perl scripting. The limitations and rewrites are usually just a bunch of 
regular expressions, so it is not that hard.
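
Purely for illustration, the same whitelist idea expressed as a Java servlet filter
instead of Nginx/Perl; the class name and the allowed parameter list are assumptions,
not our actual setup:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Rejects requests that use parameters outside a small whitelist before they reach Solr.
public class QueryWhitelistFilter implements Filter {
  private static final Set<String> ALLOWED = new HashSet<String>(
      Arrays.asList("q", "start", "rows", "fq", "sort", "wt"));

  public void init(FilterConfig config) {}
  public void destroy() {}

  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    for (Object name : request.getParameterMap().keySet()) {
      if (!ALLOWED.contains(name)) {
        ((HttpServletResponse) resp).sendError(HttpServletResponse.SC_BAD_REQUEST,
            "parameter not allowed: " + name);
        return;
      }
    }
    chain.doFilter(req, resp); // forward the sanitized request towards Solr
  }
}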

Cheers
Markus
 
 
-Original message-
 From:Alexandre Rafalovitch arafa...@gmail.com
 Sent: Tuesday 21st January 2014 14:01
 To: solr-user@lucene.apache.org
 Subject: Solr middle-ware?
 
 Hello,
 
 All the Solr documents talk about not running Solr directly to the
 cloud. But I see people keep asking for a thin secure layer in front
 of Solr they can talk from JavaScript to, perhaps with some basic
 extension options.
 
 Has anybody actually written one? Open source or in a community part
 of larger project? I would love to be able to point people at
 something.
 
 Is there something particularly difficult about writing one? Does
 anybody has a story of aborted attempt or mid-point reversal? I would
 like to know.
 
 Regards,
Alex.
 P.s. Personal context: I am thinking of doing a series of lightweight
 examples of how to use Solr. Like I did for a book, but with a bit
 more depth and something that can actually be exposed to the live web
 with live data. I don't want to reinvent the wheel of the thin Solr
 middleware.
 P.p.s. Though I keep thinking that Dart could make an interesting
 option for the middleware as it could have the same codebase on the
 server and in the client. Like NodeJS, but with saner syntax.
 
 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)
 


RE: Indexing URLs from websites

2014-01-21 Thread Markus Jelsma
Hi - are you getting PDFs at all? It sounds like a problem with the URL filters; 
those also work on the linkdb. You should also try dumping the linkdb and 
inspecting it for URLs.

By the way, I noticed this is on the Solr list; it is best to open a new 
discussion on the Nutch user mailing list.

Cheers

Teague James teag...@insystechinc.com wrote:
What I'm getting is just
the anchor text. In cases where there are multiple anchors I am getting a comma 
separated list of anchor text - which is fine. However, I am not getting all of 
the anchors that are on the page, nor am I getting any of the URLs. The anchors 
I am getting back never include anchors that lead to documents - which is the 
primary objective. So on a page that looks something like:

Article 1 text blah blah blah [Read more]
Article 2 text blah blah blah [Read more]
Download the [PDF]

Where each [Read more] links to a page where the rest of the article is stored 
and [PDF] links to a PDF document (these are relative links). What I get back 
in the anchor field is [Read more],[Read more]

I am not getting the [PDF] anchor and I am not getting any of the URLs that 
those anchors point to - like "/Article 1", "/Article 2", and 
"/documents/Article 1.pdf"

How can I get these URLs?

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, January 20, 2014 9:08 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Well it is hard to get a specific anchor because there is usually more than 
one. The content of the anchors field should be correct. What would you expect 
if there are multiple anchors? 

-Original message-
 From:Teague James teag...@insystechinc.com
 Sent: Friday 17th January 2014 18:13
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
 Progress!
 
 I changed the value of that property in nutch-default.xml and I am getting 
 the anchor field now. However, the stuff going in there is a bit random and 
 doesn't seem to correlate to the pages I'm crawling. The primary objective is 
 that when there is something on the page that is a link to a file 
 ...href=/blah/somefile.pdfGet the PDF!... (using ... to prevent actual 
 code in the email) I want to capture that URL and the anchor text Get the 
 PDF! into field(s).
 
 Am I going in the right direction on this?
 
 Thank you so much for sticking with me on this - I really appreciate your 
 help!
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: Friday, January 17, 2014 6:46 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
 
 
  
  
 -Original message-
  From:Teague James teag...@insystechinc.com
  Sent: Thursday 16th January 2014 20:23
  To: solr-user@lucene.apache.org
  Subject: RE: Indexing URLs from websites
  
  Okay. I had used that previously and I just tried it again. The following 
  generated no errors:
  
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
  crawl/linkdb -dir crawl/segments/
  
  Solr is still not getting an anchor field and the outlinks are not 
  appearing in the index anywhere else.
  
  To be sure I deleted the crawl directory and did a fresh crawl using:
  
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50
  
  Then
  
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
  crawl/linkdb -dir crawl/segments/
  
  No errors, but no anchor fields or outlinks. One thing in the response from 
  the crawl that I found interesting was a line that said:
  
  LinkDb: internal links will be ignored.
 
 Good catch! That is likely the problem. 
 
  
  What does that mean?
 
  <property>
    <name>db.ignore.internal.links</name>
    <value>true</value>
    <description>If true, when adding new links to a page, links from
    the same host are ignored.  This is an effective way to limit the
    size of the link database, keeping only the highest quality
    links.
    </description>
  </property>
 
 So change the property, rebuild the linkdb and try reindexing once 
 again :)
 
  
  -Original Message-
  From: Markus Jelsma [mailto:markus.jel...@openindex.io]
  Sent: Thursday, January 16, 2014 11:08 AM
  To: solr-user@lucene.apache.org
  Subject: RE: Indexing URLs from websites
  
   Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] 
   (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] 
   [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
  
  You must point to the linkdb via the -linkdb parameter. 
   
  -Original message-
   From:Teague James teag...@insystechinc.com
   Sent: Thursday 16th January 2014 16:57
   To: solr-user@lucene.apache.org
   Subject: RE: Indexing URLs from websites
   
   Okay. I changed my solrindex to this:
   
   bin/nutch solrindex http://localhost/solr/ crawl/crawldb 
   crawl/linkdb
   crawl/segments/20140115143147
   
   I got the same errors:
   Indexer

AIOOBException on trunk since 21st or 22nd build

2014-01-22 Thread Markus Jelsma
Hi - this likely belongs to an existing open issue. We're seeing the stack trace 
below on a build of the 22nd. Until just now we used builds of the 20th and 
didn't have the issue. Is this a bug, or did some data format in Zookeeper 
change? Until now only two cores of the same shard throw the error; all other 
nodes in the cluster are clean.

2014-01-22 15:32:48,826 ERROR [solr.core.SolrCore] - [http-8080-exec-5] - : 
java.lang.ArrayIndexOutOfBoundsException: 1
at 
org.apache.solr.common.cloud.CompositeIdRouter$KeyParser.getHash(CompositeIdRouter.java:291)
at 
org.apache.solr.common.cloud.CompositeIdRouter.sliceHash(CompositeIdRouter.java:58)
at 
org.apache.solr.common.cloud.HashBasedRouter.getTargetSlice(HashBasedRouter.java:33)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:218)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:961)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)
at 
org.apache.solr.handler.loader.XMLLoader.processDelete(XMLLoader.java:347)
at 
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:278)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1915)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:785)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:203)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at 
org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
at 
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
at 
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)


RE: AIOOBException on trunk since 21st or 22nd build

2014-01-23 Thread Markus Jelsma
Yeah, I can now also reproduce the problem with a build of the 20th! Again the 
same nodes, leader and replica. The problem seems to be in the data we're 
sending to Solr. I'll check it out and file an issue.
Cheers

-Original message-
 From:Mark Miller markrmil...@gmail.com
 Sent: Wednesday 22nd January 2014 18:56
 To: solr-user solr-user@lucene.apache.org
 Subject: Re: AIOOBException on trunk since 21st or 22nd build
 
 Looking at the list of changes on the 21st and 22nd, I don’t see a smoking 
 gun.
 
 - Mark  
 
 
 
 On Jan 22, 2014, 11:13:26 AM, Markus Jelsma markus.jel...@openindex.io 
 wrote: Hi - this likely belongs to an existing open issue. We're seeing the 
 stuff below on a build of the 22nd. Until just now we used builds of the 20th 
 and didn't have the issue. This is either a bug or did some data format in 
 Zookeeper change? Until now only two cores of the same shard through the 
 error, all other nodes in the cluster are clean.
 
 2014-01-22 15:32:48,826 ERROR [solr.core.SolrCore] - [http-8080-exec-5] - : 
 java.lang.ArrayIndexOutOfBoundsException: 1
 at 
 org.apache.solr.common.cloud.CompositeIdRouter$KeyParser.getHash(CompositeIdRouter.java:291)
 at 
 org.apache.solr.common.cloud.CompositeIdRouter.sliceHash(CompositeIdRouter.java:58)
 at 
 org.apache.solr.common.cloud.HashBasedRouter.getTargetSlice(HashBasedRouter.java:33)
 at 
 org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:218)
 at 
 org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:961)
 at 
 org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)
 at org.apache.solr.handler.loader.XMLLoader.processDelete(XMLLoader.java:347)
 at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:278)
 at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
 at 
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
 at 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1915)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:785)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:203)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
 at 
 org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
 at 
 org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
 at 
 org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2282)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:724)
 


RE: AIOOBException on trunk since 21st or 22nd build

2014-01-23 Thread Markus Jelsma
Ignore or throw proper error message for bad delete containing bad composite ID
https://issues.apache.org/jira/browse/SOLR-5659

 
 
-Original message-
 From:Markus Jelsma markus.jel...@openindex.io
 Sent: Thursday 23rd January 2014 12:16
 To: solr-user@lucene.apache.org
 Subject: RE: AIOOBException on trunk since 21st or 22nd build
 
 Yeah, i can now also reproduce the problem with a build of the 20th! Again 
 the same nodes leader and replica. The problem seems to be in the data we're 
 sending to Solr. I'll check it out an file an issue.
 Cheers
 
 -Original message-
  From:Mark Miller markrmil...@gmail.com
  Sent: Wednesday 22nd January 2014 18:56
  To: solr-user solr-user@lucene.apache.org
  Subject: Re: AIOOBException on trunk since 21st or 22nd build
  
  Looking at the list of changes on the 21st and 22nd, I don’t see a smoking 
  gun.
  
  - Mark  
  
  
  
  On Jan 22, 2014, 11:13:26 AM, Markus Jelsma markus.jel...@openindex.io 
  wrote: Hi - this likely belongs to an existing open issue. We're seeing the 
  stuff below on a build of the 22nd. Until just now we used builds of the 
  20th and didn't have the issue. This is either a bug or did some data 
  format in Zookeeper change? Until now only two cores of the same shard 
  through the error, all other nodes in the cluster are clean.
  
  2014-01-22 15:32:48,826 ERROR [solr.core.SolrCore] - [http-8080-exec-5] - : 
  java.lang.ArrayIndexOutOfBoundsException: 1
  at 
  org.apache.solr.common.cloud.CompositeIdRouter$KeyParser.getHash(CompositeIdRouter.java:291)
  at 
  org.apache.solr.common.cloud.CompositeIdRouter.sliceHash(CompositeIdRouter.java:58)
  at 
  org.apache.solr.common.cloud.HashBasedRouter.getTargetSlice(HashBasedRouter.java:33)
  at 
  org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:218)
  at 
  org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:961)
  at 
  org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)
  at 
  org.apache.solr.handler.loader.XMLLoader.processDelete(XMLLoader.java:347)
  at 
  org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:278)
  at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
  at 
  org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
  at 
  org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
  at 
  org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1915)
  at 
  org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:785)
  at 
  org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
  at 
  org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:203)
  at 
  org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
  at 
  org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
  at 
  org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
  at 
  org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
  at 
  org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
  at 
  org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
  at 
  org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
  at 
  org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
  at 
  org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
  at 
  org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
  at 
  org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2282)
  at 
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at 
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:724)
  
 


RE: Solr Related Search Suggestions

2014-01-28 Thread Markus Jelsma
Query Recommendations using Query Logs in Search Engines
http://personales.dcc.uchile.cl/~churtado/clustwebLNCS.pdf

Very interesting paper; section 2.1 covers related work plus references.

In our first attempt we did it even more simply, by finding for each query other 
top queries by inspecting our query and click logs. That works very well too; 
the big problem is normalizing query terms for deduplication - something that is 
never mentioned in any paper I have read so far ;)
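
A toy sketch of the kind of normalization meant here - lowercasing, punctuation
stripping and token sorting; real rules would be language- and log-specific:

import java.util.Arrays;
import java.util.Locale;
import java.util.TreeSet;

// Rough query normalizer for deduplicating query-log entries.
public class QueryNormalizer {
  public static String normalize(String query) {
    String q = query.toLowerCase(Locale.ROOT)
        .replaceAll("\\p{Punct}+", " ")  // drop punctuation
        .replaceAll("\\s+", " ")         // collapse whitespace
        .trim();
    // Sort tokens so "red shoes" and "shoes red" dedupe to the same key.
    StringBuilder key = new StringBuilder();
    for (String token : new TreeSet<String>(Arrays.asList(q.split(" ")))) {
      if (key.length() > 0) key.append(' ');
      key.append(token);
    }
    return key.toString();
  }

  public static void main(String[] args) {
    System.out.println(normalize("  Red   SHOES! "));  // red shoes
    System.out.println(normalize("shoes, red"));       // red shoes
  }
}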
 
-Original message-
 From:kumar pavan2...@gmail.com
 Sent: Tuesday 28th January 2014 6:09
 To: solr-user@lucene.apache.org
 Subject: Re: Solr Related Search Suggestions
 
 These are just key words
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Related-Search-Suggestions-tp4113672p4113882.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


Re: Solr & Nutch

2014-01-28 Thread Markus Jelsma
Short answer: you can't.

rashmi maheshwari maheshwari.ras...@gmail.com wrote:
Thanks all for the quick response.

Today I crawled a webpage using Nutch. This page has many links, but all
anchor tags have href="#" and JavaScript is attached to the onClick event of
each anchor tag to open a new page.

So the crawler didn't crawl any of those links that open via the onClick
event and have a "#" href value.

How can these links be crawled using Nutch?




On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko 
ale...@martchenko.com.br wrote:

 1) Plus, those files are binaries, sometimes with metadata; specific
 crawlers need to understand them. HTML is plain text.

 2) Yes, different data schemes. Sometimes I replicate the same core and
 run some A-B tests with different weights, filters, etc., and some people
 like to create CoreA and CoreB with the same schema and hammer CoreA with
 updates, commits and optimizes while keeping the other core available for
 searches, then swap again. This produces faster searches.


 alexei martchenko
 Facebook http://www.facebook.com/alexeiramone |
 Linkedinhttp://br.linkedin.com/in/alexeimartchenko|
 Steam http://steamcommunity.com/id/alexeiramone/ |
 4sqhttps://pt.foursquare.com/alexeiramone| Skype: alexeiramone |
 Github https://github.com/alexeiramone | (11) 9 7613.0966 |


 2014-01-28 Jack Krupansky j...@basetechnology.com

  1. Nutch follows the links within HTML web pages to crawl the full graph
  of a web of pages.
 
  2. Think of a core as an SQL table - each table/core has a different type
  of data.
 
  3. SolrCloud is all about scaling and availability - multiple shards for
  larger collections and multiple replicas for both scaling of query
 response
  and availability if nodes go down.
 
  -- Jack Krupansky
 
  -Original Message- From: rashmi maheshwari
  Sent: Tuesday, January 28, 2014 11:36 AM
  To: solr-user@lucene.apache.org
  Subject: Solr & Nutch
 
 
  Hi,
 
   Question 1 -- When Solr can parse HTML and documents like Word, Excel, PDF,
   etc., why do we need Nutch to parse HTML files? What is different?

   Question 2: When do we use multiple cores in Solr? Any practical business
   case where we need multiple cores?

   Question 3: When do we go for cloud? What is the meaning of implementing
   SolrCloud?
 
 
  --
  Rashmi
  Be the change that you want to see in this world!
  www.minnal.zor.org
  disha.resolve.at
  www.artofliving.org
 




-- 
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org


LUCENE-5388 AbstractMethodError

2014-01-29 Thread Markus Jelsma
Hi,

We have a development environment running trunk but have custom analyzers and 
token filters built on 4.6.1. Now the constructors have changed somewhat and 
stuff breaks. Here's a consumer trying to get a TokenStream from an Analyzer 
object via TokenStream stream = analyzer.tokenStream(null, new 
StringReader(input)); which throws:

Caused by: java.lang.AbstractMethodError
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:140)

Changing the constructors won't work either, because on 4.x we must override 
that specific method: "analyzer is not abstract and does not override abstract 
method createComponents(String,Reader) in Analyzer" :)

So, any hints on how to deal with this? Wait for the 4.x backport of 
LUCENE-5388, or something clever like ... fill in the blanks.

Many thanks,
Markus
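
For reference, a minimal 4.6.1-shaped analyzer (a whitespace tokenizer plus
lowercasing, chosen only for illustration) showing the method whose signature
changed. A class compiled like this against 4.6.1 and then run on trunk never
implements the new abstract createComponents(String), which is what surfaces as
the AbstractMethodError above:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// 4.6.1 signature: createComponents takes the Reader; on trunk it is createComponents(String).
public class MyAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_46, reader);
    return new TokenStreamComponents(source, new LowerCaseFilter(Version.LUCENE_46, source));
  }
}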


RE: Sentence Detection for Highlighting

2014-02-04 Thread Markus Jelsma
Boundary scanner using Java's break iterator:
http://wiki.apache.org/solr/HighlightingParameters#hl.boundaryScanner
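
Under the hood that boundary scanner uses java.text.BreakIterator, which already
knows about per-locale sentence boundaries. A small standalone sketch for Turkish
(the sample text is made up):

import java.text.BreakIterator;
import java.util.Locale;

public class TurkishSentenceDemo {
  public static void main(String[] args) {
    String text = "Dün İstanbul'a gittim. Hava çok güzeldi! Sen de gelecek misin?";
    BreakIterator it = BreakIterator.getSentenceInstance(new Locale("tr", "TR"));
    it.setText(text);
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      System.out.println(text.substring(start, end).trim());
    }
  }
}

On the Solr side this corresponds to the breakIterator boundary scanner with
hl.bs.type=SENTENCE, hl.bs.language=tr and hl.bs.country=TR.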

 
 
-Original message-
 From:Furkan KAMACI furkankam...@gmail.com
 Sent: Tuesday 4th February 2014 12:03
 To: solr-user@lucene.apache.org
 Subject: Sentence Detection for Highlighting
 
 Hi;
 
 I want to detect sentences in Turkish documents to generate better
 highlighting with Solr 4.6.1. What would you recommend for that purpose?
 
 Thanks;
 Furkan KAMACI
 


RE: Inconsistency between Leader and replica in solr cloud

2014-02-24 Thread Markus Jelsma
Yes, that issue is fixed. We are on trunk and are seeing it happen again. Kill 
some nodes while indexing, trigger an OOM, or reload the collection and you are 
in trouble again.
 
-Original message-
 From:Yago Riveiro yago.rive...@gmail.com
 Sent: Monday 24th February 2014 14:54
 To: solr-user@lucene.apache.org
 Subject: Re: Inconsistency between Leader and replica in solr cloud
 
 This bug was fixed on Solr 4.6.1—
 /Yago Riveiro
 
 On Mon, Feb 24, 2014 at 11:56 AM, abhijit das abhijitdas1...@outlook.com
 wrote:
 
  We are currently using Solr Cloud Version 4.3, with the following set-up, a 
  core with 2 shards - Shard1 and Shard2, each shard has replication factor 1.
  We have noticed that in one of the shards, the document differs between the 
  leader and the replica. Though the doc exists in both the machines, the 
  properties of the doc are not same.
  This is causing inconsistent result in subsequent queries, our 
  understanding is that the docs would be replicated and be identical in both 
  leader and replica.
  What could be causing this and how can this be avoided.
  Thanks in advance.
  Regards,
  Abhijit
  Sent from Windows Mail


RE: How To Test SolrCloud Indexing Limits

2014-02-27 Thread Markus Jelsma
Something must be eating your memory in your SolrCloud indexer in Nutch. We have 
our own SolrCloud indexer in Nutch and it uses very little memory. 
You either have a leak or your batch size is too large.
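
A minimal sketch of bounded batching with SolrJ's CloudSolrServer - not the Nutch
indexer itself; the class name and batch size are assumptions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BoundedBatchIndexer {
  private static final int BATCH_SIZE = 250;  // tune to your heap
  private final CloudSolrServer server;
  private final List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);

  public BoundedBatchIndexer(String zkHost, String collection) throws IOException {
    server = new CloudSolrServer(zkHost);
    server.setDefaultCollection(collection);
  }

  public void write(SolrInputDocument doc) throws SolrServerException, IOException {
    batch.add(doc);
    if (batch.size() >= BATCH_SIZE) {
      flush();  // never let the in-memory buffer grow unbounded
    }
  }

  public void flush() throws SolrServerException, IOException {
    if (!batch.isEmpty()) {
      server.add(batch);
      batch.clear();
    }
  }

  public void close() throws SolrServerException, IOException {
    flush();
    server.shutdown();
  }
}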
 
-Original message-
 From:Furkan KAMACI furkankam...@gmail.com
 Sent: Thursday 27th February 2014 16:04
 To: solr-user@lucene.apache.org
 Subject: How To Test SolrCloud Indexing Limits
 
 Hi;
 
 I'm trying to index 2 million documents into SolrCloud via Map Reduce Jobs
 (really small number of documents for my system). However I get that error
 at tasks when I increase the added document size:
 
 java.lang.ClassCastException: java.lang.OutOfMemoryError cannot be
 cast to java.lang.Exception
   at 
 org.apache.solr.client.solrj.impl.CloudSolrServer$RouteException.init(CloudSolrServer.java:484)
   at 
 org.apache.solr.client.solrj.impl.CloudSolrServer.directUpdate(CloudSolrServer.java:351)
   at 
 org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:510)
   at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
   at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
   at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
   at 
 org.apache.nutch.indexwriter.solrcloud.SolrCloudIndexWriter.close(SolrCloudIndexWriter.java:95)
   at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:114)
   at 
 org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
   at 
 org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:649)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:363)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.mapred.Child.main(Child.java:249)
 
 
 I use Solr 4.5.1 for my purpose. I do not get any error at my
 SolrCloud nodes.. I want to test my indexing capability and I have
 changed some parameters to tune up. Is there any idea for autocommit -
 softcommit size or maxTime - maxDocs parameters to test. I don't need
 the numbers I just want to follow a policy as like: increase
 autocommit and maxDocs, don't use softcommit and maxTime (or maybe no
 free lunch, try everything!).
 
 I don't ask this question for production purpose, I know that I should
 test more parameters and tune up my system for such kind of purpose I
 just want to test my indexing limits.
 
 
 Thanks;
 
 Furkan KAMACI
 


RE: Id As URL for Solrj

2014-03-04 Thread Markus Jelsma
You are not escaping the Lucene query parser special characters:


+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
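
With SolrJ you can let ClientUtils do the escaping before the value ends up in a query string. The id below is taken from your mail, the core URL is made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;

public class EscapeIdExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    String id = "am.mobileworld.www:http/";
    // escapeQueryChars backslash-escapes : and / among others,
    // so the query becomes id:am.mobileworld.www\:http\/
    String q = "id:" + ClientUtils.escapeQueryChars(id);
    System.out.println(server.query(new SolrQuery(q)).getResults().getNumFound());
    server.deleteByQuery(q);   // deleteByQuery is parsed, so it needs the same escaping
    server.commit();
  }
}

As far as I know deleteById() takes the raw key and does not go through the query parser, so escaping only matters wherever the value is embedded in a query.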

 
 
-Original message-
 From:Furkan KAMACI furkankam...@gmail.com
 Sent: Tuesday 4th March 2014 16:57
 To: solr-user@lucene.apache.org
 Subject: Id As URL for Solrj
 
 Hi;
 
 This may be a simple question but when I query from the Admin interface:
 
 id:am.mobileworld.www:http/
 
 returns me one document as well. However when I do it from Solrj with
 deleteById it does not. Also when I send a query via Solrj it returns me
 all documents (for id, id:am.mobileworld.www:http/ ). I've escaped the
 terms, URL encoded  and ...
 
 What is the most appropriate way to do it?
 
 Thanks;
 Furkan KAMACI
 


RE: IDF maxDocs / numDocs

2014-03-12 Thread Markus Jelsma
Hi Steve - it seems most similarities use CollectionStatistics.maxDoc() in 
idfExplain but there's also a docCount(). We use docCount in all our custom 
similarities, also because it allows you to have multiple languages in one 
index where one is much larger than the other. The small language will have 
very high IDF scores using maxDoc but they are proportional enough using 
docCount(). Using docCount() also fixes SolrCloud ranking problems, unless one 
of your replica's becomes inconsistent ;)

https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29
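
A rough sketch of what we do, for illustration only (the class name is made up), registered through the similarity element in schema.xml. Note that docCount() still includes deleted documents, so this helps with the multi-language skew, not with deletes:

import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class DocCountSimilarity extends DefaultSimilarity {
  @Override
  public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
    final long df = termStats.docFreq();
    // docCount() returns -1 when the codec cannot provide it; fall back to maxDoc() then
    long count = collectionStats.docCount();
    if (count == -1) {
      count = collectionStats.maxDoc();
    }
    final float idf = idf(df, count);
    // for phrase queries you would override the TermStatistics[] variant of idfExplain as well
    return new Explanation(idf, "idf(docFreq=" + df + ", docCount=" + count + ")");
  }
}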

 
 
-Original message-
 From:Steven Bower smb-apa...@alcyon.net
 Sent: Wednesday 12th March 2014 16:08
 To: solr-user solr-user@lucene.apache.org
 Subject: IDF maxDocs / numDocs
 
 I am noticing the maxDocs between replicas is consistently different and
 that in the idf calculation it is used which causes idf scores for the same
 query/doc between replicas to be different. obviously an optimize can
 normalize the maxDocs scores, but that is only temporary.. is there a way
 to have idf use numDocs instead (as it should be consistent across
 replicas)?
 
 thanks,
 
 steve
 


RE: IDF maxDocs / numDocs

2014-03-13 Thread Markus Jelsma
Oh yes, I see what you mean. I would try SOLR-1632 and have distributed IDF, 
but it seems to be broken now.
 
-Original message-
 From:Steven Bower smb-apa...@alcyon.net
 Sent: Wednesday 12th March 2014 21:47
 To: solr-user solr-user@lucene.apache.org
 Subject: Re: IDF maxDocs / numDocs
 
 My problem is that both maxDoc() and docCount() both report documents that
 have been deleted in their values. Because of merging/etc.. those numbers
 can be different per replica (or at least that is what I'm seeing). I need
 a value that is consistent across replicas... I see in the comment it makes
 mention of not using IndexReader.numDocs() but there doesn't seem to me a
 way to get ahold of the IndexReader within a similarity implementation (as
 only TermStats, CollectionStats are passed in, and neither contains of ref
 to the reader)
 
 I am contemplating just using a static value for the number of docs as
 this won't change dramatically often..
 
 steve
 
 
 On Wed, Mar 12, 2014 at 11:18 AM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  Hi Steve - it seems most similarities use CollectionStatistics.maxDoc() in
  idfExplain but there's also a docCount(). We use docCount in all our custom
  similarities, also because it allows you to have multiple languages in one
  index where one is much larger than the other. The small language will have
  very high IDF scores using maxDoc but they are proportional enough using
  docCount(). Using docCount() also fixes SolrCloud ranking problems, unless
  one of your replica's becomes inconsistent ;)
 
 
  https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29
 
 
 
  -Original message-
   From:Steven Bower smb-apa...@alcyon.net
   Sent: Wednesday 12th March 2014 16:08
   To: solr-user solr-user@lucene.apache.org
   Subject: IDF maxDocs / numDocs
  
   I am noticing the maxDocs between replicas is consistently different and
   that in the idf calculation it is used which causes idf scores for the
  same
   query/doc between replicas to be different. obviously an optimize can
   normalize the maxDocs scores, but that is only temporary.. is there a way
   to have idf use numDocs instead (as it should be consistent across
   replicas)?
  
   thanks,
  
   steve
  
 
 


Re: Bug with OpenJDK on Ubuntu - affects Solr users

2014-03-26 Thread Markus Jelsma
Hi - as far as I know it has never been a good idea to run Lucene on OpenJDK 6 
at all. Use either Oracle Java 6 or higher, or OpenJDK 7.


On Wednesday, March 26, 2014 06:54:41 PM Nigel Sheridan-Smith wrote:
 Hi all,
 
 This is a bit of a 'heads up'. We have recently come across this bug on
 Ubuntu with OpenJDK:
 
 https://bugs.launchpad.net/ubuntu/+source/openjdk-6/+bug/1295987
 
 Basically, finalizers are not being run, so effectively all of the commits
 written in SolrIndexWriter are not Garbage Collected.
 
 if you find that your Java heap memory grows continuously at around 4-8Mb
 per index update, and you are running this version of OpenJDK, and the
 Garbage Collector does not recycle much memory from the Old Gen
 generation, then this is likely to be your problem.
 
 We increased our heap space from 1Gb to 4Gb but the memory usage continued
 to grow at about the same pace. It was only when we ran 'jmap' and analysed
 the heap dump with Eclipse MAT that it became obvious that unreferenced
 objects were not being correctly Garbage Collected.
 
 i hope this helps someone else!
 
 Cheers,
 
 Nigel Sheridan-Smith



Re: tf and very short text fields

2014-04-01 Thread Markus Jelsma
Yes, override TFIDFSimilarity and return 1f in tf(). You can also use BM25 with 
k1 set to zero in your schema.
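
The override is tiny. A sketch, class name made up, registered through the similarity element in schema.xml:

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class BinaryTFSimilarity extends DefaultSimilarity {
  @Override
  public float tf(float freq) {
    // every additional occurrence of a term counts the same as the first one
    return freq > 0 ? 1.0f : 0.0f;
  }
}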


Walter Underwood wun...@wunderwood.org wrote:
And here is another
peculiarity of short text fields.

The movie New York, New York should not be twice as relevant for the query 
new york. Is there a way to use a binary term frequency rather than a count?

wunder
--
Walter Underwood
wun...@wunderwood.org





Re: omitNorms and very short text fields

2014-04-01 Thread Markus Jelsma
Yes, that will work. And combined with your other question scores will always 
be equal even if cinderella or chuck occur more than once in one document.



Walter Underwood wun...@wunderwood.org wrote:
Just double-checking my
understanding of omitNorms.

For very short text fields like personal names or titles, length normalization 
can give odd results. For example, we might want these two to score the same 
for the query Cinderella.

* Cinderella
* Cinderella (Diamond Edition) (Blu-ray + DVD + Digital Copy) (Widescreen)

And these two for the query chuck:

* Chuck House
* Chuck E. Cheese

I think that omitNorm=true on those fields will give that behavior. Is that the 
right approach?

wunder
--
Walter Underwood
wun...@wunderwood.org





Re: Re: tf and very short text fields

2014-04-01 Thread Markus Jelsma
Also, if I remember correctly, k1 set to zero for BM25 automatically omits 
norms in the calculation. So that's easy to play with without reindexing.


Markus Jelsma markus.jel...@openindex.io wrote:
Yes, override
tfidfsimilarity and emit 1f in tf(). You can also use bm25 with k1 set to zero 
in your schema.


Walter Underwood wun...@wunderwood.org wrote:
And here is another
peculiarity of short text fields.

The movie New York, New York should not be twice as relevant for the query 
new york. Is there a way to use a binary term frequency rather than a count?

wunder
--
Walter Underwood
wun...@wunderwood.org





Re: Re: solr 4.2.1 index gets slower over time

2014-04-01 Thread Markus Jelsma
You may want to increase reclaimDeletesWeight for the TieredMergePolicy from 2 to 3 
or 4. By default it may keep too many deleted or updated docs in the index. 
This can increase index size by 50%!!

Dmitry Kan solrexp...@gmail.com wrote:
Elisabeth,

Yes, I believe you are right in that the deletes are part of the optimize
process. If you delete often, you may consider (if not already) the
TieredMergePolicy, which is suited for this scenario. Check out this
relevant discussion I had with Lucene committers:
https://twitter.com/DmitryKan/status/399820408444051456

HTH,

Dmitry


On Tue, Apr 1, 2014 at 11:34 AM, elisabeth benoit elisaelisael...@gmail.com
 wrote:

 Thanks a lot for your answers!

 Shawn. Our GC configuration has far less parameters defined, so we'll check
 this out.

 Dimitry, about the expungeDeletes option, we'll add that in the delete
 process. But from what I read, this is done in the optimize process (cf.

 http://lucene.472066.n3.nabble.com/Does-expungeDeletes-need-calling-during-an-optimize-td1214083.html
 ).
 Or maybe not?

 Thanks again,
 Elisabeth


 2014-04-01 7:52 GMT+02:00 Dmitry Kan solrexp...@gmail.com:

  Hi,
 
  We have noticed something like this as well, but with older versions of
  solr, 3.4. In our setup we delete documents pretty often. Internally in
  Lucene, when a document is client requested to be deleted, it is not
  physically deleted, but only marked as deleted. Our original
 optimization
  assumption was such that the deleted documents would get physically
  removed on each optimize command issued. We started to suspect it wasn't
  always true as the shards (especially relatively large shards) became
  slower over time. So we found out about the expungeDeletes option, which
  purges the deleted docs and is by default false. We have set it to
 true.
  If your solr update lifecycle includes frequent deletes, try this out.
 
  This of course does not override working towards finding better
  GCparameters.
 
 
 https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching
 
 
  On Mon, Mar 31, 2014 at 3:57 PM, elisabeth benoit 
  elisaelisael...@gmail.com
   wrote:
 
   Hello,
  
   We are currently using solr 4.2.1. Our index is updated on a daily
 basis.
   After noticing solr query time has increased (two times the initial
 size)
   without any change in index size or in solr configuration, we tried an
   optimize on the index but it didn't fix our problem. We checked the
  garbage
   collector, but everything seemed fine. What did in fact fix our problem
  was
   to delete all documents and reindex from scratch.
  
   It looks like over time our index gets corrupted and optimize doesn't
  fix
   it. Does anyone have a clue how to investigate further this situation?
  
  
   Elisabeth
  
 
 
 
  --
  Dmitry
  Blog: http://dmitrykan.blogspot.com
  Twitter: http://twitter.com/dmitrykan
 




-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


RE: tf and very short text fields

2014-04-04 Thread Markus Jelsma
Hi - In this case Walter, iirc, was looking for two things: no length normalization 
and a flat TF (1f for tf(float freq) > 0). We know that k1 controls TF 
saturation but in BM25Similarity you can see that k1 is multiplied by the 
encoded norm value, taking b also into account. So setting k1 to zero 
effectively disables length normalization and results in a flat or binary TF. 

Here's an example output of k1 = 0 and k1 = 0.2. Norms are enabled on the field, 
term occurs three times in the field:

28.203003 = score(doc=0,freq=1.5 = phraseFreq=1.5
), product of:
  6.4 = boost
  4.406719 = idf(docFreq=1, docCount=122)
  1.0 = tfNorm, computed from:
1.5 = phraseFreq=1.5
0.0 = parameter k1
0.75 = parameter b
8.721312 = avgFieldLength
16.0 = fieldLength




27.813797 = score(doc=0,freq=1.5 = phraseFreq=1.5
), product of:
  6.4 = boost
  4.406719 = idf(docFreq=1, docCount=122)
  0.98619986 = tfNorm, computed from:
1.5 = phraseFreq=1.5
0.2 = parameter k1
0.75 = parameter b
8.721312 = avgFieldLength
16.0 = fieldLength


You can clearly see the final TF norm being 1, despite the term frequency and 
length. Please correct my wrongs :)
Markus

 
 
-Original message-
 From:Tom Burton-West tburt...@umich.edu
 Sent: Thursday 3rd April 2014 20:18
 To: solr-user@lucene.apache.org
 Subject: Re: tf and very short text fields
 
 Hi Markus and Wunder,
 
 I'm  missing the original context, but I don't think BM25 will solve this
 particular problem.
 
 The k1 parameter sets how quickly the contribution of tf to the score falls
 off with increasing tf.   It would be helpful for making sure really long
 documents don't get too high a score, but I don't think it would help for
 very short documents without messing up its original design purpose.
 
 For BM25, if you want to turn off length normalization, you set b to 0.
  However, I don't think that will do what you want, since turning off
 normalization will mean that the score for new york, new york  will be
 twice that of the score for new york since without normalization the tf
 in new york new york is twice that of new york.
 
 I think the earlier suggestion to override tfidfsimilarity and emit 1f in
 tf() is probably the best way to switch to eliminate using tf counts,
 assumming that is really what you want.
 
 Tom
 
 
 
 
 
 
 
 
 On Tue, Apr 1, 2014 at 4:17 PM, Walter Underwood wun...@wunderwood.orgwrote:
 
  Thanks! We'll try that out and report back. I keep forgetting that I want
  to try BM25, so this is a good excuse.
 
  wunder
 
  On Apr 1, 2014, at 12:30 PM, Markus Jelsma markus.jel...@openindex.io
  wrote:
 
   Also, if i remember correctly, k1 set to zero for bm25 automatically
  omits norms in the calculation. So thats easy to play with without
  reindexing.
  
  
   Markus Jelsma markus.jel...@openindex.io schreef:Yes, override
  tfidfsimilarity and emit 1f in tf(). You can also use bm25 with k1 set to
  zero in your schema.
  
  
   Walter Underwood wun...@wunderwood.org schreef:And here is another
  peculiarity of short text fields.
  
   The movie New York, New York should not be twice as relevant for the
  query new york. Is there a way to use a binary term frequency rather than
  a count?
  
   wunder
   --
   Walter Underwood
   wun...@wunderwood.org
  
  
  
 
  --
  Walter Underwood
  wun...@wunderwood.org
 
 
 
 
 


RE: Strange relevance scoring

2014-04-08 Thread Markus Jelsma
Hi - the thing you describe is possible when your setup uses SpanFirstQuery. 
But to be sure what's going on you should post the debug output. 
 
-Original message-
 From:John Nielsen j...@mcb.dk
 Sent: Tuesday 8th April 2014 11:03
 To: solr-user@lucene.apache.org
 Subject: Strange relevance scoring
 
 Hi,
 
 We are seeing a strange phenomenon with our Solr setup which I have been
 unable to answer.
 
 My Google-fu is clearly not up to the task, so I am trying here.
 
 It appears that if i do a freetext search for a single word, say modellering
 on a text field, the scoring is massively boosted if the first word of the
 text field is a hit.
 
 For instance if there is only one occurrence of the word modellering in
 the text field and that occurrence is the first word of the text, then that
 document gets a higher relevancy than if the word modelling occurs 5
 times in the text and the first word of the text is any other word.
 
 Is this normal behavior? Is special attention paid to the first word in a
 text field? I would think that the latter case would get the highest score.
 
 
 -- 
 Med venlig hilsen / Best regards
 
 *John Nielsen*
 Programmer
 
 
 
 *MCB A/S*
 Enghaven 15
 DK-7500 Holstebro
 
 Kundeservice: +45 9610 2824
 p...@mcb.dk
 www.mcb.dk
 


Re: Fails to index if unique field has special characters

2014-04-11 Thread Markus Jelsma
Well, this is somewhat of a problem if you have URLs as uniqueKey values that 
contain exclamation marks. Isn't it an idea to allow those to be escaped and 
thus ignored by CompositeIdRouter?

On Friday, April 11, 2014 11:43:31 AM Cool Techi wrote:
 Thanks, that was helpful.
 Regards,Rohit
 
  Date: Thu, 10 Apr 2014 08:44:36 -0700
  From: iori...@yahoo.com
  Subject: Re: Fails to index if unique field has special characters
  To: solr-user@lucene.apache.org
  
  Hi Ayush,
  
  I thinks this
  
  IBM!12345. The exclamation mark ('!') is critical here, as it
  distinguishes the prefix used to determine which shard to direct the
  document to.
  
  https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+
  in+SolrCloud
  
  
  
  
  On Thursday, April 10, 2014 2:35 PM, Cool Techi cooltec...@outlook.com
  wrote: Hi,
  We are migrating from Solr 4.6 standalone to Solr 4.7 cloud version, while
  reindexing the document we are getting the following error. This is
  happening when the unique key has special character, this was not noticed
  in version 4.6 standalone mode, so we are not sure if this is a version
  problem or a cloud issue. Example of the unique key is given below,
  http://www.mynews.in/Blog/smrity!!**)))!miami_dolphins_vs_dallas_cowboys_
  live_stream_on_line_nfl_football_free_video_broadcast_B142707.html
  Exception Stack Trace
  ERROR - 2014-04-10 10:51:44.361; org.apache.solr.common.SolrException;
  java.lang.ArrayIndexOutOfBoundsException: 2   at
  org.apache.solr.common.cloud.CompositeIdRouter$KeyParser.getHash(Composit
  eIdRouter.java:296)   at
  org.apache.solr.common.cloud.CompositeIdRouter.sliceHash(CompositeIdRoute
  r.java:58)   at
  org.apache.solr.common.cloud.HashBasedRouter.getTargetSlice(HashBasedRout
  er.java:33)   at
  org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(
  DistributedUpdateProcessor.java:218)   at
  org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(Di
  stributedUpdateProcessor.java:550)   at
  org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateP
  rocessorFactory.java:100)   at
  org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247
  )   at
  org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)  
  at
  
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.j
 ava:92)   at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Content
 StreamHandlerBase.java:74)   at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBas
 e.java:135)   at
 org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)   at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java
 :780)   at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.jav
 a:427)   at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.jav
 a:217)   at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandl
 er.java:1419)   at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:1
 37)   at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557
 )   at org.eclipse.jetty.server.session.SessionHandle
  Thanks,Ayush  
 
  



RE: Topology of Solr use

2014-04-17 Thread Markus Jelsma
This may help a bit:

https://wiki.apache.org/solr/PublicServers
 
-Original message-
From:Olivier Austina olivier.aust...@gmail.com
Sent:Thu 17-04-2014 18:16
Subject:Topology of Solr use
To:solr-user@lucene.apache.org; 
Hi All,
I would to have an idea about Solr usage: number of users, industry,
countries or any helpful information. Thank you.
Regards
Olivier


Re: Boost Search results

2014-04-18 Thread Markus Jelsma
Hi, replicating full-featured search engine behaviour is not going to work with 
Nutch and Solr out of the box. You are missing a thousand features such as 
proper main content extraction, deduplication, classification of content and 
hub or link pages, and much more. These things are possible to implement, but 
you may want to start with having your Solr request handler better configured; 
to begin with, your qf parameter does not have Nutch's default title and content 
fields selected.


A Laxmi a.lakshmi...@gmail.com wrote:
Hi,


When I started to compare the search results with the two options below, I
see a lot of difference in the search results esp. the* urls that show up
on the top *(*Relevancy *perspective).

(1) Nutch 2.2.1 (with *Solr 4.0*)
(2) Bing custom search set-up

I wonder how should I tweak the boost parameters to get the best results on
the top like how Bing, Google does.

Please suggest why I see a difference and what parameters are best to
configure in Solr to achieve what I see from Bing, or Google search
relevancy.

Here is what i got in solrconfig.xml:

str name=defTypeedismax/str
   str name=qf
 text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
   /str
   str name=q.alt*:*/str
   str name=rows10/str
   str name=fl*,score/str


Thanks


Re: Re: PostingHighlighter complains about no offsets

2014-05-03 Thread Markus Jelsma
Hello Michael, are you not on Lucene 4.8?
https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-5111


Michael Sokolov msoko...@safaribooksonline.com wrote:
For posterity, in case
anybody follows this thread, I tracked the 
problem down to WordDelimiterFilter; apparently it creates an offset of 
-1 in some case, which PostingsHighlighter rejects.

-Mike


On 5/2/2014 10:20 AM, Michael Sokolov wrote:
 I checked using the analysis admin page, and I believe there are 
 offsets being generated (I assume start/end=offsets).  So IDK I am 
 going to try reindexing again.  Maybe I neglected to reload the config 
 before I indexed last time.

 -Mike

 On 05/02/2014 09:34 AM, Michael Sokolov wrote:
 I've been wanting to try out the PostingsHighlighter, so I added 
 storeOffsetsWithPositions to my field definition, enabled the 
 highlighter in solrconfig.xml,  reindexed and tried it out. When I 
 issue a query I'm getting this error:

 |field 'text' was indexed without offsets, cannot highlight


 java.lang.IllegalArgumentException: field 'text' was indexed without 
 offsets, cannot highlight
 at 
 org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightDoc(PostingsHighlighter.java:545)
 at 
 org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightField(PostingsHighlighter.java:467)
 at 
 org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFieldsAsObjects(PostingsHighlighter.java:392)
 at 
 org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:293)|
 I've been trying to figure out why the field wouldn't have offsets 
 indexed, but I just can't see it.  Is there something in the analysis 
 chain that could stripping out offsets?


 This is the field definition:

 field name=text type=text_en indexed=true stored=true 
 multiValued=false termVectors=true termPositions=true 
 termOffsets=true storeOffsetsWithPositions=true /

 (Yes I know PH doesn't require term vectors; I'm keeping them around 
 for now while I experiment)

 fieldType name=text_en class=solr.TextField 
 positionIncrementGap=100
   analyzer type=index
 !-- We are indexing mostly HTML so we need to ignore the 
 tags --
 charFilter class=solr.HTMLStripCharFilterFactory/
 !--tokenizer class=solr.StandardTokenizerFactory/--
 tokenizer class=solr.WhitespaceTokenizerFactory/
 !-- lower casing must happen before WordDelimiterFilter or 
 protwords.txt will not work --
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.WordDelimiterFilterFactory 
 stemEnglishPossessive=1 protected=protwords.txt/
 !-- This deals with contractions --
 filter class=solr.SynonymFilterFactory 
 synonyms=synonyms.txt expand=true ignoreCase=true/
 filter class=solr.HunspellStemFilterFactory 
 dictionary=en_US.dic affix=en_US.aff ignoreCase=true/
 filter class=solr.RemoveDuplicatesTokenFilterFactory/
   /analyzer
   analyzer type=query
 !--tokenizer class=solr.StandardTokenizerFactory/--
 tokenizer class=solr.WhitespaceTokenizerFactory/
 !-- lower casing must happen before WordDelimiterFilter or 
 protwords.txt will not work --
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.WordDelimiterFilterFactory 
 protected=protwords.txt/
 !-- setting tokenSeparator= solves issues with compound 
 words and improves phrase search --
 filter class=solr.HunspellStemFilterFactory 
 dictionary=en_US.dic affix=en_US.aff ignoreCase=true/
 filter class=solr.RemoveDuplicatesTokenFilterFactory/
   /analyzer
 /fieldType




RE: permissive mm value and efficient spellchecking

2014-05-14 Thread Markus Jelsma
Elisabeth, I think you are looking for SOLR-3211, which introduced 
spellcheck.collateParam.* to override e.g. dismax settings for collation testing.
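
Untested sketch with SolrJ to show the idea: the main query keeps your permissive mm while the collations are verified against the index with a strict one. The handler defaults and mm values are assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CollateParamExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("rue de Chraonne Paris");
    q.set("defType", "edismax");
    q.set("mm", "1");                            // permissive mm for the main result set
    q.set("spellcheck", "true");
    q.set("spellcheck.collate", "true");
    q.set("spellcheck.maxCollationTries", "5");  // actually test collations against the index
    q.set("spellcheck.collateParam.mm", "100%"); // but test them with a strict mm
    System.out.println(server.query(q).getSpellCheckResponse().getCollatedResult());
  }
}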

Markus
 
-Original message-
From:elisabeth benoit elisaelisael...@gmail.com
Sent:Wed 14-05-2014 14:01
Subject:permissive mm value and efficient spellchecking
To:solr-user@lucene.apache.org; 
Hello,

I'm using solr 4.2.1.

I use a very permissive value for mm, to be able to find results even if
request contains non relevant words.

At the same time, I'd like to be able to do some efficient spellcheking
with solrdirectspellchecker.

So for instance, if user searches for rue de Chraonne Paris, where
Chraonne is mispelled, because of my permissive mm value I get more than
100 000 results containing words rue and Paris (de is a stopword),
which are very frequent terms in my index, but no spellcheck correction for
Chraonne. If I set mm=3, then I get the expected spellcheck correction
value: rue de Charonne Paris.

Is there a way to achieve my two goals in a single solr request?

Thanks,
Elisabeth


RE: Solr + SPDY

2014-05-15 Thread Markus Jelsma
Hi Harsh,

 
Does SPDY provide lower latency than HTTP/1.1 with KeepAlive or is it 
encryption that you're after?

 
Markus


 
-Original message-
From:harspras prasadta...@outlook.com
Sent:Tue 13-05-2014 05:38
Subject:Re: Solr + SPDY
To:solr-user@lucene.apache.org; 
Hi Vinay,

I have been trying to setup a similar environment with SPDY being enabled
for Solr inter shard communication. Did you happen to have been able to do
it? I somehow cannot use SolrCloud with SPDY enabled in jetty.

Regards,
Harsh Prasad



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-SPDY-tp4097771p4135377.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Edismax should, should not, exact match operators

2014-06-10 Thread Markus Jelsma
http://wiki.apache.org/solr/ExtendedDisMax#Query_Syntax
 
-Original message-
From:michael.boom my_sky...@yahoo.com
Sent:Tue 10-06-2014 13:15
Subject:Edismax should, should not, exact match operators
To:solr-user@lucene.apache.org; 
On google a user can query using operators like + or - and quote the
desired term in order to get the desired match.
Does something like this come by default with edismax parser ?



-
Thanks,
Michael
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Edismax-should-should-not-exact-match-operators-tp4140967.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Recommended ZooKeeper topology in Production

2014-06-10 Thread Markus Jelsma
Yes, always use three or a higher odd number of machines. It is best to have 
them on dedicated machines and, unless the cluster is very large, three small VPS 
machines with 512 MB RAM suffice.
 
-Original message-
From:Gili Nachum gilinac...@gmail.com
Sent:Tue 10-06-2014 08:58
Subject:Recommended ZooKeeper topology in Production
To:solr-user@lucene.apache.org; 
Is there a recommended ZooKeeper topology for production Solr environments?

I was planning: 3 ZK nodes, each on its own dedicated machine.

Thinking that dedicated machines, separate from Solr servers, would keep ZK
isolated from resource contention spikes that may occur on Solr. Also, if a
Solr machine goes down, there would still be 3 ZK nodes to handle the event
properly.

If I want to save on resources, placing each ZK instance on the same box as
Solr instance in considered common practice in production environments?

Thanks!


RE: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Markus Jelsma
Hi - did you perhaps update one of those documents?

 
 
-Original message-
 From:Apoorva Gaurav apoorva.gau...@myntra.com
 Sent: Tuesday 17th June 2014 16:58
 To: solr-user@lucene.apache.org
 Subject: docFreq coming to be more than 1 for unique id field
 
 Hello All,
 
 We are using solr 4.4.0. We have a uniqueKey of type solr.StrField. We need
 to extract docs in a pre-defined order if they match a certain condition.
 Our query is of the format
 
 uniqueField:(id1 ^ weight1 OR id2 ^ weight2 . OR idN ^ weightN)
 where weight1  weight2    weightN
 
 But the result is not in the desired order. On debugging the query we've
 found out that for some of the documents docFreq is higher than 1 and hence
 their tf-idf based score is less than others. What can be the reason behind
 a unique id field having docFreq greater than 1?  How can we prevent it?
 
 -- 
 Thanks  Regards,
 Apoorva
 


RE: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Markus Jelsma
Yes, it is unique, but older versions of a document are not immediately purged, 
only when you `optimize` or forceMerge, or during regular segment merges. The 
problem is that until then they keep messing with the statistics.
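
If you want them gone right away you can force the merge from SolrJ, for example like this (expensive on a large index; a commit with expungeDeletes=true on /update is a lighter alternative):

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ForceMergeExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    // waitFlush, waitSearcher, maxSegments: merging down to one segment drops the stale versions
    server.optimize(true, true, 1);
    server.shutdown();
  }
}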
 
-Original message-
 From:Apoorva Gaurav apoorva.gau...@myntra.com
 Sent: Tuesday 17th June 2014 17:16
 To: solr-user solr-user@lucene.apache.org; Ahmet Arslan iori...@yahoo.com
 Subject: Re: docFreq coming to be more than 1 for unique id field
 
 Yes we have updates on these. Didn't try optimizing will do. But isn't the
 unique field supposed to be unique?
 
 
 On Tue, Jun 17, 2014 at 8:37 PM, Ahmet Arslan iori...@yahoo.com.invalid
 wrote:
 
  Hi,
 
  Just a guess, do you have deletions? What happens when you optimize and
  re-try?
 
 
 
  On Tuesday, June 17, 2014 5:58 PM, Apoorva Gaurav 
  apoorva.gau...@myntra.com wrote:
  Hello All,
 
  We are using solr 4.4.0. We have a uniqueKey of type solr.StrField. We need
  to extract docs in a pre-defined order if they match a certain condition.
  Our query is of the format
 
  uniqueField:(id1 ^ weight1 OR id2 ^ weight2 . OR idN ^ weightN)
  where weight1  weight2    weightN
 
  But the result is not in the desired order. On debugging the query we've
  found out that for some of the documents docFreq is higher than 1 and hence
  their tf-idf based score is less than others. What can be the reason behind
  a unique id field having docFreq greater than 1?  How can we prevent it?
 
  --
  Thanks  Regards,
  Apoorva
 
 
 
 
 -- 
 Thanks  Regards,
 Apoorva
 


Re: Unable to start solr 4.8

2014-06-19 Thread Markus Jelsma
Hi - remove the lock file in your solr/collection_name/data/index.*/ 
directory.

Markus

On Thursday, June 19, 2014 04:10:51 AM atp wrote:
 Hi experts,
 
 i have cnfigured solrcloud, on three machines , zookeeper started with no
 errors, tomcat log also no errors , solr log alos no errors reported but all
 the tomcat configured solr clusterstate shows as 'down'
 
 
 
 ,8870931 [Thread-13] INFO  org.apache.solr.common.cloud.ZkStateReader  â
 Updating cloud state from ZooKeeper...
 8870934 [Thread-13] INFO  org.apache.solr.cloud.Overseer  â Update state
 numShards=2 message={
   operation:state,
   state:down,
   base_url:http://10.***.***.28:7090/solr;,
   core:collection1,
   roles:null,
   node_name:10.***.***.28:7090_solr,
   shard:shard2,
   collection:collection1,
   numShards:2,
   core_node_name:10.***.***.28:7090_solr_collection1}
 8870939 [main-EventThread] INFO  org.apache.solr.cloud.DistributedQueue  â
 LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type
 NodeChildrenChanged
 8870942 [main-EventThread] INFO  org.apache.solr.common.cloud.ZkStateReader
 â A cluster state change: WatchedEvent state:SyncConnected
 type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
 (live nodes size: 5)
 8919667 [main-EventThread] INFO  org.apache.solr.common.cloud.ZkStateReader
 â Updating live nodes... (4)
 8933777 [main-EventThread] INFO  org.apache.solr.common.cloud.ZkStateReader
 â Updating live nodes... (3)
 8965906 [main-EventThread] INFO  org.apache.solr.common.cloud.ZkStateReader
 â Updating live nodes... (4)
 8965994 [main-EventThread] INFO  org.apache.solr.cloud.DistributedQueue  â
 LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type
 NodeChildrenChanged
 8965997 [Thread-13] INFO  org.apache.solr.common.cloud.ZkStateReader  â
 Updating cloud state from ZooKeeper...
 8966000 [Thread-13] INFO  org.apache.solr.cloud.Overseer  â Update state
 numShards=2 message={
   operation:state,
   state:down,
   base_url:http://10.***.***.29:7070/solr;,
   core:collection1,
   roles:null,
   node_name:10.***.***.29:7070_solr,
   shard:shard1,
   collection:collection1,
   numShards:2,
   core_node_name:110.***.***.29:7070_solr_collection1}
 8966006 [main-EventThread] INFO  org.apache.solr.cloud.DistributedQueue  â
 LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type
 NodeChildrenChanged
 8966008 [main-EventThread] INFO  org.apache.solr.common.cloud.ZkStateReader
 â A cluster state change: WatchedEvent state:SyncConnected
 type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
 (live nodes size: 4)
 8986466 [main-EventThread] INFO  org.apache.solr.common.cloud.ZkStateReader
 â Updating live nodes... (5)
 8986648 [main-EventThread] INFO  org.apache.solr.cloud.DistributedQueue  â
 LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type
 NodeChildrenChanged
 8986652 [Thread-13] INFO  org.apache.solr.common.cloud.ZkStateReader  â
 Updating cloud state from ZooKeeper...
 8986654 [Thread-13] INFO  org.apache.solr.cloud.Overseer  â Update state
 numShards=2 message={
   operation:state,
   state:down,
   base_url:http://10.***.***.30:7080/solr;,
   core:collection1,
   roles:null,
   node_name:10.***.***.30:7080_solr,
   shard:shard1,
   collection:collection1,
   numShards:2,
   core_node_name:10.***.***.30:7080_solr_collection1}
 8986661 [main-EventThread] INFO  org.apache.solr.cloud.DistributedQueue  â
 LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type
 NodeChildrenChanged
 898 [main-EventThread] INFO  org.apache.solr.common.cloud.ZkStateReader
 â A cluster state change: WatchedEvent state:SyncConnected
 type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
 (live nodes size: 5)
 9008407 [main-EventThread] INFO  org.apache.solr.common.cloud.ZkStateReader
 â Updating live nodes... (6)
 
 
 
 when i browse the 28,29 and 30th solr url , its throwing error like,
 
 
 HTTP Status 500 - {msg=SolrCore 'collection1' is not available due to init
 failure: Index locked for write for core
 collection1,trace=org.apache.solr.common.SolrException: SolrCore
 'collection1' is not available due to init failure: Index locked for write
 for core collection1 at
 org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:753) at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
 347) at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
 207) at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
 FilterChain.java:241) at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
 ain.java:208) at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
 va:220) at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
 va:122) at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171
 ) at
 

RE: How much free disk space will I need to optimize my index

2014-06-25 Thread Markus Jelsma


 
 
-Original message-
 From:johnmu...@aol.com johnmu...@aol.com
 Sent: Wednesday 25th June 2014 20:13
 To: solr-user@lucene.apache.org
 Subject: How much free disk space will I need to optimize my index
 
 Hi,
 
 
 I need to de-fragment my index.  My question is, how much free disk space I 
 need before I can do so?  My understanding is, I need 1X free disk space of 
 my current index un-optimized index size before I can optimize it.  Is this 
 true?

Yes, 20 GB of FREE space to force merge an existing 20 GB index.

 
 
 That is, let say my index is 20 GB (un-optimized) then I must have 20 GB of 
 free disk space to make sure the optimization is successful.  The reason for 
 this is because during optimization the index is re-written (is this the 
 case?) and if it is already optimized, the re-write will create a new 20 GB 
 index before it deletes the old one (is this true?), thus why there must be 
 at least 20 GB free disk space.
 
 
 Can someone help me with this or point me to a wiki on this topic?
 
 
 Thanks!!!
 
 
 - MJ
 


RE: unable to start solr instance

2014-06-30 Thread Markus Jelsma
(Too many open files)

Try raising the limit from probably 1024 to 4k-16k or so.
 
 
-Original message-
 From:Niklas Langvig niklas.lang...@globesoft.com
 Sent: Monday 30th June 2014 17:09
 To: solr-user@lucene.apache.org
 Subject: unable to start solr instance
 
 Hello,
 We havet o solr instances running on linux/tomcat7
 Both have been working fine, now only 1 works. The other seems to have 
 crashed or something.
 
 SolrCore Initialization Failures
 * collection1: 
 org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
 Error initializing QueryElevationComponent.
 
 We havn't changed anything in the setup.
 
 Earlier 4 days ago I could see in the logs
 response
 lst name=responseHeaderint name=status500/intint 
 name=QTime0/int/lstlst name=errorstr 
 name=msgjava.io.FileNotFoundException: 
 /opt/solr410/document/collection1/data/tlog/tlog.2494137 (Too 
 many open files)/strstr name=traceorg.apache.solr.common.SolrException: 
 java.io.FileNotFoundException: 
 /opt/solr410/document/collection1/data/tlog/tlog.2494137 (Too 
 many open files)
  at 
 org.apache.solr.update.TransactionLog.lt;initgt;(TransactionLog.java:182)
  at 
 org.apache.solr.update.TransactionLog.lt;initgt;(TransactionLog.java:140)
  at 
 org.apache.solr.update.UpdateLog.ensureLog(UpdateLog.java:796)
  at 
 org.apache.solr.update.UpdateLog.delete(UpdateLog.java:409)
  at 
 org.apache.solr.update.DirectUpdateHandler2.delete(DirectUpdateHandler2.java:284)
  at 
 org.apache.solr.update.processor.RunUpdateProcessor.processDelete(RunUpdateProcessorFactory.java:77)
  at 
 org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)
  at 
 org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalDelete(DistributedUpdateProcessor.java:460)
  at 
 org.apache.solr.update.processor.DistributedUpdateProcessor.versionDelete(DistributedUpdateProcessor.java:1036)
  at 
 org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:721)
  at 
 org.apache.solr.update.processor.LogUpdateProcessor.processDelete(LogUpdateProcessorFactory.java:121)
  at 
 org.apache.solr.handler.loader.XMLLoader.processDelete(XMLLoader.java:346)
  at 
 org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:277)
  at 
 org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
  at 
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
  at 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
  at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at 
 org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
  at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
  at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
  at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
  at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
  at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
  at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
  at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
  at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
  at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
  at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
  at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
  at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
  at 
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
  at java.lang.Thread.run(Thread.java:722)
 Caused by: java.io.FileNotFoundException: 
 /opt/solr410/document/collection1/data/tlog/tlog.2494137 

RE: NPE when using facets with the MLT handler.

2014-07-02 Thread Markus Jelsma
Hi, I don't think this is ever going to work with the MLT handler; you should 
use the regular SearchHandler with the MoreLikeThis component instead.
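
If I remember correctly the MoreLikeThis search component is part of the SearchHandler's default component chain, so something like this should work on /select. Field names are taken from your mail, the core URL is made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MltViaSelectExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("id:XXX");
    q.setRequestHandler("/select");    // regular SearchHandler instead of /mlt
    q.set("mlt", "true");              // enables the MoreLikeThis component
    q.set("mlt.fl", "mlt_field");
    q.set("mlt.count", "10");
    q.setFacet(true);
    q.addFacetField("id");
    QueryResponse rsp = server.query(q);
    System.out.println(rsp.getFacetFields());
    System.out.println(rsp.getResponse().get("moreLikeThis"));  // similar docs per matching doc
  }
}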
 
 
-Original message-
 From:SafeJava T t...@safejava.com
 Sent: Monday 30th June 2014 17:52
 To: solr-user@lucene.apache.org
 Subject: NPE when using facets with the MLT handler.
 
 I am getting an NPE when using facets with the MLT handler.  I googled for
 other npe errors with facets, but this trace looked different from the ones
 I found. We are using Solr 4.9-SNAPSHOT.
 
 I have reduced the query to the most basic form I can:
 
 q=id:XXXmlt.fl=mlt_fieldfacet=truefacet.field=id
 I changed it to facet on id, to ensure that the field was present in all
 results.
 
 Any ideas on how to work around this?
 
 
 java.lang.NullPointerException at
 org.apache.solr.search.facet.SimpleFacets.addFacets(SimpleFacets.java:375)
 at
 org.apache.solr.handler.MoreLikeThisHandler.handleRequestBody(MoreLikeThisHandler.java:211)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1955) at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:769)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
 at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
 at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
 at
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:368) at
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
 at
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861) at
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) at
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:744)
 
 Thanks,
 Tom
 


RE: Memory Leaks in solr 4.8.1

2014-07-02 Thread Markus Jelsma
Hi, you can safely ignore this, it is shutting down anyway. Just don't reload 
the app a lot of times without actually restarting Tomcat. 
 
-Original message-
 From:Aman Tandon amantandon...@gmail.com
 Sent: Wednesday 2nd July 2014 7:22
 To: solr-user@lucene.apache.org
 Subject: Memory Leaks in solr 4.8.1
 
 Hi,
 
 When i am shutting down the solr i am gettng the Memory Leaks error in logs.
 
 Jul 02, 2014 10:49:10 AM org.apache.catalina.loader.WebappClassLoader
  checkThreadLocalMapForLeaks
  SEVERE: The web application [/solr] created a ThreadLocal with key of type
  [org.apache.solr.schema.DateField.ThreadLocalDateFormat] (value
  [org.apache.solr.schema.DateField$ThreadLocalDateFormat@1d987b2]) and a
  value of type [org.apache.solr.schema.DateField.ISO8601CanonicalDateFormat]
  (value 
  [org.apache.solr.schema.DateField$ISO8601CanonicalDateFormat@6b2ed43a])
  but failed to remove it when the web application was stopped. Threads are
  going to be renewed over time to try and avoid a probable memory leak.
 
 
 Please check.
 With Regards
 Aman Tandon
 


RE: Disable Regular Expression Support

2014-07-03 Thread Markus Jelsma
Hi, you can escape the slashes as \/ in your front-end so they are not parsed as a regular expression.
Markus

 
 
-Original message-
 From:Markus Schuch markus_sch...@web.de
 Sent: Thursday 3rd July 2014 20:53
 To: solr-user@lucene.apache.org
 Subject: Disable Regular Expression Support
 
 Hi Solr Community,
 
 we migrate from solr 1.4 to 4.3 and found out, that solr 4.x invented regular 
 expression support for the query parser.
 
 Is it possible to disable this feature to get back to the 1.4 behavior of the 
 query parser?
 
 Many thanks in advance,
 Markus Schuch


RE: Any Solr consultants available??

2014-07-24 Thread Markus Jelsma
Hahaha thanks wunder, made me laugh!

 
-Original message-
 From:Walter Underwood wun...@wunderwood.org
 Sent: Thursday 24th July 2014 2:07
 To: solr-user@lucene.apache.org
 Subject: Re: Any Solr consultants available??
 
 When I see job postings like this, I have to assume they were written by 
 people who really don’t understand the problem and have never met people with 
 the various skills they are asking for. They are not going to find one person 
 who does all this.
 
 This is an opening for zebra unicorn that walks on water. At best, they’ll 
 get a one-horned goat with painted stripes on a life raft. They need to talk 
 to some people, make multiple realistic openings, and expect to grow some of 
 their own expertise.
 
 I got an email like this from Goldman Sachs this morning.
 
 “... a Senior Application Architect/Developer and DevOps Engineer for a major 
 company initiative. In addition to an effort to build a new cloud 
 infrastructure from the ground up, they are beginning a number of company 
 projects in the areas of cloud-based open source search, Machine Learning/AI, 
 Big Data, Predictive Analytics  Low-Latency Trading Algorithm Development.”
 
 Good luck, fellas.
 
 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/
 
 
 On Jul 23, 2014, at 1:01 PM, Jack Krupansky j...@basetechnology.com wrote:
 
  Yeah, I saw that, which is why I suggested not being too picky about 
  specific requirements. If you have at least two or three years of solid 
  Solr experience, that would make you at least worth looking at.
  
  -- Jack Krupansky
  
  From: Tri Cao 
  Sent: Wednesday, July 23, 2014 3:57 PM
  To: solr-user@lucene.apache.org 
  Cc: solr-user@lucene.apache.org 
  Subject: Re: Any Solr consultants available??
  
  Well, it's kind of hard to find a person if the requirement is 10 years' 
  experience with Solr given that Solr was created in 2004.
  
  On Jul 23, 2014, at 12:45 PM, Jack Krupansky j...@basetechnology.com 
  wrote:
  
  
   I occasionally get pinged by recruiters looking for Solr application 
  developers... here’s the latest. If you are interested, either contact 
  Jessica directly or reply to me and I’ll forward your reply.
  
   Even if you don’t strictly meet all the requirements... they are having 
  trouble finding... anyone. All the great Solr guys I know are quite busy.
  
   Thanks.
  
   -- Jack Krupansky
  
   From: Jessica Feigin 
   Sent: Wednesday, July 23, 2014 3:36 PM
   To: 'Jack Krupansky' 
   Subject: Thank you!
  
   Hi Jack,
  
  
  
   Thanks for your assistance, below is the Solr Consultant job description:
  
  
  
   Our client, a hospitality Fortune 500 company are looking to update their 
  platform to make accessing information easier for the franchisees. This is 
  the first phase of the project which will take a few years. They want a 
  hands on Solr consultant who has ideally worked in the search space. As you 
  can imagine the company culture is great, everyone is really friendly and 
  there is also an option to become permanent. They are looking for:
  
  
  
   - 10+ years’ experience with Solr (Apache Lucene), HTML, XML, Java, 
  Tomcat, JBoss, MySQL
  
   - 5+ years’ experience implementing Solr builds of indexes, shards, and 
  refined searches across semi-structured data sets to include architectural 
  scaling
  
   - Experience in developing a re-usable framework to support web site 
  search; implement rich web site search, including the incorporation of 
  metadata.
  
   - Experienced in development using Java, Oracle, RedHat, Perl, shell, and 
  clustering
  
   - A strong understanding of Data analytics, algorithms, and large data 
  structures
  
   - Experienced in architectural design and resource planning for scaling 
  Solr/Lucene capabilities.
  
   - Bachelor's degree in Computer Science or related discipline.
  
  
  
  
  
  
  
  
  
   Jessica Feigin 
   Technical Recruiter
  
   Technology Resource Management 
   30 Vreeland Rd., Florham Park, NJ 07932 
   Phone 973-377-0040 x 415, Fax 973-377-7064 
   Email: jess...@trmconsulting.com
  
   Web site: www.trmconsulting.com
  
   LinkedIn Profile: www.linkedin.com/in/jessicafeigin
  
  
 
 


RE: crawling all links of same domain in nutch in solr

2014-07-29 Thread Markus Jelsma
Hi - use the domain URL filter plugin and list the domains, hosts or TLDs you 
want to restrict the crawl to.


 
 
-Original message-
 From:Vivekanand Ittigi vi...@biginfolabs.com
 Sent: Tuesday 29th July 2014 7:17
 To: solr-user@lucene.apache.org
 Subject: crawling all links of same domain in nutch in solr
 
 Hi,
 
 Can anyone tel me how to crawl all other pages of same domain.
 For example i'm feeding a website http://www.techcrunch.com/ in seed.txt.
 
 Following property is added in nutch-site.xml
 
 property
   namedb.ignore.internal.links/name
   valuefalse/value
   descriptionIf true, when adding new links to a page, links from
   the same host are ignored.  This is an effective way to limit the
   size of the link database, keeping only the highest quality
   links.
   /description
 /property
 
 And following is added in regex-urlfilter.txt
 
 # accept anything else
 +.
 
 Note: if i add http://www.tutorialspoint.com/ in seed.txt, I'm able to
 crawl all other pages but not techcrunch.com's pages though it has got many
 other pages too.
 
 Please help..?
 
 Thanks,
 Vivek
 


RE: Solr substring search yields all indexed results

2014-08-04 Thread Markus Jelsma
Don't use N-grams at query time. Give the fieldType a separate query analyzer that omits the EdgeNGramFilterFactory, so grams are only produced at index time.

 
 
-Original message-
 From:prem1980 prem1...@gmail.com
 Sent: Monday 4th August 2014 17:47
 To: solr-user@lucene.apache.org
 Subject: Solr substring search yields all indexed results
 
 To do a substring search, I have added a new fieldType - Text with
 NgramFilter.
 
 It works fine perfectly but downside is this problem
 
 Example
 
 name = ['Apple','Samy','And','a']
 When I do a search name:a, then all the above items gets pulled up. Even
 when search changes to App. All the above items are pulled. How can I fix
 this issue?
 
 fieldType name=text class=solr.TextField positionIncrementGap=100
 analyzer type=index
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.EdgeNGramFilterFactory minGramSize=1
 maxGramSize=100 /
 /analyzer
 /fieldType
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-substring-search-yields-all-indexed-results-tp4151012.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


RE: NGramTokenizer influence to length normalization?

2014-08-08 Thread Markus Jelsma
All tokens produced still have the same position as their initial 
position, so no. 
 
-Original message-
 From:Johannes Siegert johannes.sieg...@marktjagd.de
 Sent: Friday 8th August 2014 11:11
 To: solr-user@lucene.apache.org
 Subject: NGramTokenizer influence to length normalization?
 
 Hi,
 
 does the NGramTokenizer have an influence to the length normalization?
 
 Thanks.
 
 Johannes
 


RE: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Markus Jelsma
Hi - You are running mapred jobs on the same nodes as Solr runs, right? The 
first thing I would think of is that your OS file buffer cache is abused. The 
mappers read all data, presumably residing on the same node. The mapper output 
and shuffling part would take place on the same node, only the reducer output 
is sent to your nodes, which i assume are on the same machines. Those same 
machines have a large Lucene index. All this data, written to and read from the 
same disk, competes for a nice spot in the OS buffer cache.

Forget it if I misread anything, but when you're dealing with sizes this serious, 
then do not abuse your caches. Have a separate mapred and Solr cluster, because 
they both eat cache space. I assume you can see serious IO WAIT times. 

Split the stuff and maybe even use smaller hardware, but more of it.

M 
 
-Original message-
 From:Wilburn, Scott scott.wilb...@verizonwireless.com.INVALID
 Sent: Wednesday 13th August 2014 23:09
 To: solr-user@lucene.apache.org
 Subject: Solr cloud performance degradation with billions of documents
 
 Hello everyone,
 I am trying to use SolrCloud to index a very large number of simple documents 
 and have run into some performance and scalability limitations and was 
 wondering what can be done about it.
 
 Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the 
 Solr shards and each node has 128GB of memory. The current SolrCloud setup is 
 split into 4 separate and individual clouds of 32 shards each thereby giving 
 four running shards per cloud or one cloud per eight nodes. Each shard is 
 currently assigned a 6GB heap size. I’d prefer to avoid increasing heap 
 memory for Solr shards to have enough to run other MapReduce jobs on the 
 cluster.
 
 The rate of documents that I am currently inserting into these clouds per day 
 is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion into 
 the fourth ; however to account for capacity, the aim is to scale the 
 solution to support double that amount of documents. To index these 
 documents, there are MapReduce jobs that run that generate the Solr XML 
 documents and will then submit these documents via SolrJ's CloudSolrServer 
 interface. In testing, I have found that limiting the number of active 
 parallel inserts to 80 per cloud gave the best performance as anything higher 
 gave diminishing returns, most likely due to the constant shuffling of 
 documents internally to SolrCloud. From an index perspective, dated 
 collections are being created to hold an entire day's of documents and 
 generally the inserting happens primarily on the current day (the previous 
 days are only to allow for searching) and the plan is to keep up to 60 days 
 (or collections) in each cloud. A single shard index in one collection in the 
 busiest cloud currently takes up 30G disk space or 960G for the entire collection. 
 The documents are being auto committed with a hard commit time of 4 minutes 
 (opensearcher = false) and soft commit time of 8 minutes.
 
 From a search perspective, the use case is fairly generic and simple searches 
 of the type field:value, so there is no need to tune the system to use any of the more 
 advanced querying features. Therefore, the most important thing for me is to 
 have the indexing performance be able to keep up with the rate of input.
 
 In the initial load testing, I was able to achieve a projected indexing rate 
 of 10 Billion documents per cloud per day for a grand total of 40 Billion per 
 day. However, the initial load testing was done on fairly empty clouds with 
 just a few small collections. Now that there have been several days of 
 documents being indexed, I am starting to see a fairly steep drop-off in 
 indexing performance once the clouds reached about 15 full collections (or 
 about 80-100 Billion documents per cloud) in the two biggest clouds. Based on 
 current application logging I’m seeing a 40% drop off in indexing 
 performance. Because of this, I have concerns on how performance will hold as 
 more collections are added.
 
 My question to the community is if anyone else has had any experience in 
 using Solr at this scale (hundreds of Billions) and if anyone has observed 
 such a decline in indexing performance as the number of collections 
 increases. My understanding is that each collection is a separate index and 
 therefore the inserting rate should remain constant. Aside from that, what 
 other tweaks or changes can be done in the SolrCloud configuration to 
 increase the rate of indexing performance? Am I hitting a hard limitation of 
 what Solr can handle?
 
 Thanks,
 Scott 
 
 


RE: Announcing Splainer -- Open Source Solr Sandbox

2014-08-27 Thread Markus Jelsma
Yeah, very cool. Since this is all just client side, how about integrating it 
in Solr's UI?
Also,  it seems to assume `id` is the ID field, which is not always true.
 
-Original message-
 From:david.w.smi...@gmail.com david.w.smi...@gmail.com
 Sent: Friday 22nd August 2014 19:42
 To: solr-user@lucene.apache.org
  Subject: Re: Announcing "Splainer" -- Open Source Solr Sandbox
 
 Cool Doug!  I look forward to digging into this.
 
 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley
 
 
 On Fri, Aug 22, 2014 at 10:34 AM, Doug Turnbull 
 dturnb...@opensourceconnections.com wrote:
 
  Greetings from the OpenSource Connections Team!
 
  We're happy to announce we've taken core sandbox of our search relevancy
  product Quepid and open sourced it as Splainer (http://splainer.io).
  Splainer is a search sandbox that explains search results in a human
  readable form as you work. By being a *sandbox* it differs from parsing
  tools such as explain.solr.pl by letting you tweak and tweak and tweak
  without leaving the tool itself. In short, it helps you work faster to
  solve relevancy problems.
 
  Simply paste in a Solr URL and Splainer goes to work. Splainer is entirely
  driven by your browser (there's no backend -- its all static js/html/css
  and uses HTML local storage to store a few settings for you). So if your
  browser can see it, Splainer can work with it.
 
  Anyway, we've started getting great use out of the tool, and would also
  like to gather feedback from the community by sharing it. We're open to
  ideas, bug reports, pull requests, etc.
 
  Relevant links:
 
  Blog Post announcing Splainer:
 
  http://opensourceconnections.com/blog/2014/08/18/introducing-splainer-the-open-source-search-sandbox-that-tells-you-why/
 
  Splainer:
  http://splainer.io
 
  Splainer on Github (open sourced as Apache 2)
  http://github.com/o19s/splainer
 
  These features (and a ton more) are also in our relevancy testing product
  Quepid:
  http://quepid.com
 
  Bugs/feedback/complaints/ideas/questions/contributions/etc welcome.
 
  Thank you for your time!
  --
  Doug Turnbull
  Search  Big Data Architect
  OpenSource Connections http://o19s.com
 
 


RE: Query ReRanking question

2014-09-05 Thread Markus Jelsma
Hi - You can already achieve this by boosting on the document's recency. The 
result set won't be exactly ordered by date but you will get the most relevant 
and recent documents on top.
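
For example, with edismax something along these lines (untested; the date field name pubdate_dt is an assumption) boosts recent documents without hard-sorting on date:

q=malaysia airline crash blackbox&defType=edismax&boost=recip(ms(NOW/HOUR,pubdate_dt),3.16e-11,1,1)

The recip() curve gently decays the score of older documents; 3.16e-11 is roughly 1 divided by the milliseconds in a year, so a document that is a year old gets about half the boost. Tune the constant to your content.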

Markus 

-Original message-
  From:Ravi Solr ravis...@gmail.com 
 Sent: Friday 5th September 2014 18:06
  To: solr-user@lucene.apache.org 
 Subject: Re: Query ReRanking question
 
 Thank you very much for responding. I want to do exactly the opposite of
 what you said. I want to sort the relevant docs in reverse chronology. If
 you sort by date before hand then the relevancy is lost. So I want to get
 Top N relevant results and then rerank those Top N to achieve relevant
 reverse chronological results.
 
 If you ask Why would I want to do that ??
 
 Lets take a example about Malaysian airline crash. several articles might
 have been published over a period of time. When I search for - malaysia
 airline crash blackbox - I would want to see relevant results but would
 also like to see the the recent developments on the top i.e. effectively a
 reverse chronological order within the relevant results, like telling a
 story over a period of time
 
 Hope i am clear. Thanks for your help.
 
 Thanks
 
 Ravi Kiran Bhaskar
 
 
  On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com wrote:
 
  If you want the main query to be sorted by date then the top N docs
  reranked by a query, that should work. Try something like this:
 
   q=foo&sort=date+desc&rq={!rerank reRankDocs=1000
   reRankQuery=$myquery}&myquery=blah
 
 
  Joel Bernstein
  Search Engineer at Heliosearch
 
 
   On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com wrote:
 
   Can the ReRanking API be used to sort within docs retrieved by a date
  field
   ? Can somebody help me understand how to write such a query ?
  
   Thanks
  
   Ravi Kiran Bhaskar
  
 
 



RE: Problem deploying solr-4.10.0.war in Tomcat

2014-09-17 Thread Markus Jelsma
Yes, this is a nasty error. You have not set up logging libraries properly:
https://cwiki.apache.org/confluence/display/solr/Configuring+Logging
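
Roughly, for a default Tomcat layout this comes down to copying the logging jars and the log4j.properties that ship with the Solr download onto Tomcat's classpath and restarting (paths are assumptions for 4.10.0):

 cp solr-4.10.0/example/lib/ext/*.jar $CATALINA_HOME/lib/
 cp solr-4.10.0/example/resources/log4j.properties $CATALINA_HOME/lib/

The page linked above describes the details and the alternatives.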
 
 
-Original message-
 From:phi...@free.fr phi...@free.fr
 Sent: Wednesday 17th September 2014 11:51
 To: solr-user@lucene.apache.org
 Subject: Problem deploying solr-4.10.0.war in Tomcat
 
 
 
 Hello,
 
 I've dropped solr-4.10.0.war in Tomcat 7's webapp directory.
 
 When I start the Java web server, the following message appears in 
 catalina.out:
 
 ---
 
 INFO: Starting Servlet Engine: Apache Tomcat/7.0.55
 Sep 17, 2014 11:35:59 AM org.apache.catalina.startup.HostConfig deployWAR
 INFO: Deploying web application archive 
 /archives/apache-tomcat-7.0.55_solr_8983/webapps/solr-4.10.0.war
 Sep 17, 2014 11:35:59 AM org.apache.catalina.core.StandardContext 
 startInternal
 SEVERE: Error filterStart
 Sep 17, 2014 11:35:59 AM org.apache.catalina.core.StandardContext 
 startInternal
 SEVERE: Context [/solr-4.10.0] startup failed due to previous errors
 
 --
 
 Any help would be much appreciated.
 
 Cheers,
 
 Philippe
 
 
 
 


RE: How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term

2014-09-24 Thread Markus Jelsma
Hi - but this makes no sense, they are scored as equals, except for tiny 
differences in TF and IDF. What you would need is something like a stemmer that 
preserves the original token and gives a payload of less than 1 to the stemmed 
token. The same goes for filters like decompounders and accent folders that 
change the meaning of words.
 
 
-Original message-
 From:Diego Fernandez difer...@redhat.com
 Sent: Wednesday 17th September 2014 23:37
 To: solr-user@lucene.apache.org
 Subject: Re: How does KeywordRepeatFilterFactory help giving a higher score 
 to an original term vs a stemmed term
 
 I'm not 100% on this, but I imagine this is what happens:
 
 (using - to mean tokenized to)
 
 Suppose that you index:
 
 I am running home - am run running home
 
  If you then query running home - run running home, it matches more of the indexed
  tokens and thus gives a higher score than if you query runs home - run runs home
 
 
 - Original Message -
  The Solr wiki says "A repeated question is how can I have the
  original term contribute
  more to the score than the stemmed version? In Solr 4.3, the
  KeywordRepeatFilterFactory has been added to assist this
  functionality."
  
  https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming
  
  (Full section reproduced below.)
  I can see how in the example from the wiki reproduced below that both
  the stemmed and original term get indexed, but I don't see how the
  original term gets more weight than the stemmed term.  Wouldn't this
  require a filter that gives terms with the keyword attribute more
  weight?
  
  What am I missing?
  
  Tom
  
  
  
  -
  A repeated question is how can I have the original term contribute
  more to the score than the stemmed version? In Solr 4.3, the
  KeywordRepeatFilterFactory has been added to assist this
  functionality. This filter emits two tokens for each input token, one
  of them is marked with the Keyword attribute. Stemmers that respect
  keyword attributes will pass through the token so marked without
  change. So the effect of this filter would be to index both the
  original word and the stemmed version. The 4 stemmers listed above all
  respect the keyword attribute.
  
  For terms that are not changed by stemming, this will result in
  duplicate, identical tokens in the document. This can be alleviated by
  adding the RemoveDuplicatesTokenFilterFactory.
  
   <fieldType name="text_keyword" class="solr.TextField"
   positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.KeywordRepeatFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
   </fieldType>
  
 
 -- 
 Diego Fernandez - 爱国
 Software Engineer
 GSS - Diagnostics
 
 


RE: Best practice for KStemFilter query or index or both?

2014-09-25 Thread Markus Jelsma
Hi - most filters should be used on both sides, especially stemmers, accent 
folding and obviously lowercasing. Synonyms only on one side, depending on how 
you want to utilize them.

Markus

 
 
-Original message-
 From:eShard zim...@yahoo.com
 Sent: Thursday 25th September 2014 22:23
 To: solr-user@lucene.apache.org
 Subject: Best practice for KStemFilter query or index or both?
 
 Good afternoon,
 Here's my configuration for a text field.
 I have the same configuration for index and query time.
 Is this valid? 
  What's the best practice for these: query, index, or both?
  For synonyms, I've read conflicting reports on when to use them, but I'm
  currently changing them over to indexing time only.
 
 Thanks,
 
  <fieldType name="text_general" class="solr.TextField"
  positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="0"
              catenateNumbers="0"
              catenateAll="0"
              preserveOriginal="1"
      />
      <filter class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt" enablePositionIncrements="true" />

      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KStemFilterFactory" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="0"
              catenateNumbers="0"
              catenateAll="0"
              preserveOriginal="1"
      />
      <filter class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KStemFilterFactory" />
    </analyzer>
    <analyzer type="select">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="0"
              catenateNumbers="0"
              catenateAll="0"
              preserveOriginal="1"
      />
      <filter class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KStemFilterFactory" />
    </analyzer>
  </fieldType>
 
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Best-practice-for-KStemFilter-query-or-index-or-both-tp4161201.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


RE: Flexible search field analyser/tokenizer configuration

2014-09-29 Thread Markus Jelsma
Yes, it appeared in 4.8 but you could use PatternReplaceFilterFactory to 
simulate the same behavior.
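
A rough sketch of that simulation, truncating every token to its first three characters (the equivalent of prefixLength=3):

 <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{3}).*$" replacement="$1"/>

Tokens shorter than three characters do not match the pattern and pass through unchanged.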

Markus

 
 
-Original message-
 From:PeterKerk petervdk...@hotmail.com
 Sent: Monday 29th September 2014 21:08
 To: solr-user@lucene.apache.org
 Subject: Re: Flexible search field analyser/tokenizer configuration
 
 Hi Ahmet,
 
  Am I correct that this is only available in Solr 4.8?
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.TruncateTokenFilterFactory
 
 
 Also, I need to add your lines to both index and query analyzers? making
 my definition like so:
 
  <fieldType name="searchtext" class="solr.TextField"
  positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TruncateTokenFilterFactory"
              prefixLength="3"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TruncateTokenFilterFactory"
              prefixLength="3"/>
    </analyzer>
  </fieldType>
 
  Your solution seems much easier to set up than what is proposed by
  Alexandre... for my understanding, what is the difference?
 
 Thanks!
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Flexible-search-field-analyser-tokenizer-configuration-tp4161624p4161778.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


RE: Solr query field (qf) conditional boost

2014-09-29 Thread Markus Jelsma
Hi - you need to use function queries via the bf parameter. The functions 
exists() and, in some cases, query() will do the conditional work, depending on 
your use case.

Markus

 
 
-Original message-
 From:Shamik Bandopadhyay sham...@gmail.com
 Sent: Monday 29th September 2014 21:30
 To: solr-user@lucene.apache.org
 Subject: Solr query field (qf) conditional boost
 
 Hi,
 
   I'm trying to check if it's possible to include a conditional boosting in
 Solr qf field. For e.g. I've the following entry in qf parameter.
 
  <str name="qf">text^0.5 title^10.0 ProductLine^5</str>
 
 What I'm looking is to add the productline boosting only for a given Author
 field, something in the lines boost ProductLine^5 if Author:Tom.
 
 I've been using a similar filtering in appends section, but not sure how
 to do it in qf or whether it's possible.
 
 
  <lst name="appends">
    <str name="fq">Author:(Tom  +Solution:yes)</str>
  </lst>
 
 Any pointers will be appreciated.
 
 Thanks,
 Shamik
 


RE: Solr query field (qf) conditional boost

2014-09-29 Thread Markus Jelsma
Hi - check the def() and if() functions, they can have embedded functions such 
as exists() and query(). You can use those to apply the main query to the 
productline field if author has some value. I cannot give a concrete example 
because I don't have an environment to fiddle around with. If the main query 
has parameter qq, you can use parameter substitution by using $qq in the 
function queries. Please check the wiki and cwiki docs on edismax and function 
queries for examples and references.
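
Purely as an untested sketch of that idea (assuming the user's query text is passed in as the qq parameter), something along these lines could be tried:

bf=if(exists(query({!v='Author:Tom'})),product(5,query({!dismax qf=ProductLine v=$qq},0)),0)

i.e. only add the ProductLine score (times 5) for documents that also match Author:Tom.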

Markus

 
 
-Original message-
 From:shamik sham...@gmail.com
 Sent: Monday 29th September 2014 22:54
 To: solr-user@lucene.apache.org
 Subject: RE: Solr query field (qf) conditional boost
 
 Thanks Markus. Well, I tried using a conditional if-else function, but it
 doesn't seem to work for boosting field. What I'm trying to do is boost
 ProductLine field by 5, if the result documents contain Author = 'Tom'.
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-query-field-qf-conditional-boost-tp4161783p4161797.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


RE: If I can a field from text_ws to text do I need to drop and reindex or just reindex?

2014-10-03 Thread Markus Jelsma
Hi - you don't need to erase the data directory, you can just reindex, but make 
sure you overwrite all documents.

 
 
-Original message-
 From:Wayne W waynemailingli...@gmail.com
 Sent: Friday 3rd October 2014 11:55
 To: solr-user@lucene.apache.org
 Subject: If I can a field from text_ws to text do I need to drop and reindex 
 or just reindex?
 
 Hi,
 
 I've realized I need to change a particular field from text_ws to text.
 
 I realize I need to reindex as the tokens are being stored in a case
 sensitive manner which we do not want.
 
 However can I just reindex all my documents, or do I need to drop/wipe the
 /data/index dir and start fresh?
 
 I really don't want to drop as the current users will not be able to search
 and reindexing could take as long as a week.
 
 many thanks
 Wayne
 


RE: search query text field with Comma

2014-10-06 Thread Markus Jelsma
Hi - you are probably using the WhitespaceTokenizer without a 
WordDelimiterFilter. Consider using the StandardTokenizer or adding the 
WordDelimiterFilter.
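
For example, a minimal field type where the trailing comma no longer prevents a match (the type name is just an assumption):

 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

The StandardTokenizer drops the comma from Series, so a query for Truck Series matches the indexed text.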

Markus

 
 
-Original message-
 From:EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions) 
 external.ravi.tamin...@us.bosch.com
 Sent: Monday 6th October 2014 20:57
 To: solr-user@lucene.apache.org
 Subject: search query text field with Comma
 
  Hi users, this may be a basic question, but I am facing some trouble.
  
  The scenario is: I have a text Truck Series, 12V and 15V. If the user
  searches for Truck Series it is not getting the row, but Truck Series, (with
  the comma) is working. How can I get the search for Truck Series to work?
 
 Thanks
 
 Ravi
 


Re: Weird Problem (possible bug?) with german stemming and wildcard search

2014-10-07 Thread Markus Jelsma
Hi - you should not use wildcards for autocompletion; Lucene has far better 
tools for building very good autocompletion. Also, since a wildcard query is a 
multi-term query, it is not passed through your configured query-time analyzer.
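
For example, the Suggester component is one of those tools; an untested sketch of a solrconfig.xml entry for this schema (it still needs to be wired to a /suggest request handler, and the exact settings are assumptions):

 <searchComponent name="suggest" class="solr.SuggestComponent">
   <lst name="suggester">
     <str name="name">default</str>
     <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
     <str name="dictionaryImpl">DocumentDictionaryFactory</str>
     <str name="field">description</str>
     <str name="suggestAnalyzerFieldType">text_splitting</str>
   </lst>
 </searchComponent>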

Some other comments:
- you use a Porter stemmer, but you should use one of the German-specific stem 
filters.
- you don't have an index-time tokenizer defined; this should not be possible 
and behaviour is undefined as far as I know.


On Tuesday 07 October 2014 14:25:27 Thomas Michael Engelke wrote:
 I have a problem with a stemmed german field. The field definition:
 
 <field name="description" type="text_splitting" indexed="true"
 stored="true" required="false" multiValued="false"/>
 ...
 <fieldType name="text_splitting" class="solr.TextField"
 positionIncrementGap="100" autoGeneratePhraseQueries="true">
   <analyzer type="index">
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="1"
             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="0"
             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
 </fieldType>
 
 When we search for a word from an autosuggest kind of component, we
 always add an asterisk to a word, so when somebody enters something like
 Radbremszylinder and waits for some milliseconds, the autosuggest list
 is filled with the results of searching for Radbremszylinder*. This
 seemed to work quite well. Today we got a bug report from a customer for
 that exact word.
 
 So I made an analysis for the word as Field value (index) and Field
 value (query), and it looked like this:
 
 ST   RadbremszylinderWT   Radbremszylinder*
 SF   RadbremszylinderSF   Radbremszylinder*
 WDF  RadbremszylinderSF   Radbremszylinder*
 LCF  radbremszylinderWDF  Radbremszylinder
 SKMF radbremszylinderLCF  radbremszylinder
 PSF  radbremszylind  SKMF radbremszylinder
 
 As you can see, the end result looks very much alike. However, records
 containing that word in their description field aren't reported as
 results. Strangely enough, records containing Radbremszylindern
 (plural) are reported as results. Removing the asterisk from the end
 reports all records with Radbremszylinder, just as we would expect. So
 the culprit is the asterisk at the end. As far as we can read from the
 docs, an asterisk is just 0 or more characters, which means that the
 literal word in front of the asterisk should match the query.
 
 Searching further we tried some variations, and it seems that searching
 for Radbremszylind* works. All records with any variation
 (Radbremszylinder, Radbremszylindern) are reported. So maybe there's
 a weird interaction with stemming?
 
 Any ideas?



RE: NullPointerException for ExternalFileField when key field has no terms

2014-10-08 Thread Markus Jelsma
Hi - yes, it is worth a ticket, as the javadoc says this case should be ok:
http://lucene.apache.org/solr/4_10_1/solr-core/org/apache/solr/schema/ExternalFileField.html
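
A minimal sketch of the kind of guard that could go around the source line quoted further below (not an actual patch; the variable names are assumptions):

 Terms terms = MultiFields.getTerms(reader, idName);
 if (terms == null) {
   // the key field has no indexed terms yet, so keep the default values
   return vals;
 }
 TermsEnum termsEnum = terms.iterator(null);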
 
 
-Original message-
 From:Matthew Nigl matthew.n...@gmail.com
 Sent: Wednesday 8th October 2014 14:48
 To: solr-user@lucene.apache.org
 Subject: NullPointerException for ExternalFileField when key field has no 
 terms
 
 Hi,
 
 I use various ID fields as the keys for various ExternalFileField fields,
 and I have noticed that I will sometimes get the following error:
 
 ERROR org.apache.solr.servlet.SolrDispatchFilter  û
 null:java.lang.NullPointerException
 at
 org.apache.solr.search.function.FileFloatSource.getFloats(FileFloatSource.java:273)
 at
 org.apache.solr.search.function.FileFloatSource.access$000(FileFloatSource.java:51)
 at
 org.apache.solr.search.function.FileFloatSource$2.createValue(FileFloatSource.java:147)
 at
 org.apache.solr.search.function.FileFloatSource$Cache.get(FileFloatSource.java:190)
 at
 org.apache.solr.search.function.FileFloatSource.getCachedFloats(FileFloatSource.java:141)
 at
 org.apache.solr.search.function.FileFloatSource.getValues(FileFloatSource.java:84)
 at
 org.apache.solr.response.transform.ValueSourceAugmenter.transform(ValueSourceAugmenter.java:95)
 at
 org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:252)
 at
 org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:170)
 at
 org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:184)
 at
 org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:300)
 at
 org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:96)
 at
 org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:61)
 at
 org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:765)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
 at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
 at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
 at
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:368)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
 at
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
 at
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Unknown Source)
 
 
 
 The source code referenced in the error is below (FileFloatSource.java:273):
 
 TermsEnum termsEnum = MultiFields.getTerms(reader, idName).iterator(null);
 
 So if there are no terms in the index for the key field, then getTerms will
 return null, and of course trying to call iterator on null will cause the
 exception.
 
 For my use-case, it makes sense that the key field may have no terms
 (initially) because there are various types of documents 

WhitespaceTokenizer to consider incorrectly encoded c2a0?

2014-10-08 Thread Markus Jelsma
Hi,

For some crazy reason, some users somehow manage to substitute a perfectly 
normal space with a badly encoded non-breaking space; properly URL encoded this 
then becomes %C2%A0 and depending on the encoding you use to view it you probably 
see Â followed by a space. For example:

Because c2a0 is not considered whitespace by the Java Character class (indeed, 
the non-breaking space 00a0 is not real whitespace to it), the WhitespaceTokenizer 
won't split on it, but the WordDelimiterFilter still does, which somewhat mitigates 
the problem, as it becomes:

HTMLSCF een abonnement
WT een abonnement
WDF een eenabonnement abonnement

Should the WhitespaceTokenizer not include this weird edge case? 

Cheers,
Markus


RE: WhitespaceTokenizer to consider incorrectly encoded c2a0?

2014-10-08 Thread Markus Jelsma
Alexandre - I am sorry if I was not clear, this is about queries; this all 
happens at query time. Yes, we can do the substitution with the regex replace 
filter, but I would propose adding this weird exception to the 
WhitespaceTokenizer so Lucene deals with it by itself.
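
For reference, the substitution could look something like this char filter at the top of the query analyzer (a sketch, mapping the non-breaking space back to a regular space):

 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\u00A0" replacement=" "/>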

Markus
 
-Original message-
 From:Alexandre Rafalovitch arafa...@gmail.com
 Sent: Wednesday 8th October 2014 16:12
 To: solr-user solr-user@lucene.apache.org
 Subject: Re: WhitespaceTokenizer to consider incorrectly encoded c2a0?
 
 Is this a suggestion for JIRA ticket? Or a question on how to solve
 it? If the later, you could probably stick a RegEx replacement in the
 UpdateRequestProcessor chain and be done with it.
 
 As to why? I would look for the rest of the MSWord-generated
 artifacts, such as smart quotes, extra-long dashes, etc.
 
 Regards,
Alex.
 Personal: http://www.outerthoughts.com/ and @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
 
 
 On 8 October 2014 09:59, Markus Jelsma markus.jel...@openindex.io wrote:
  Hi,
 
  For some crazy reason, some users somehow manage to substitute a perfectly 
  normal space with a badly encoded non-breaking space, properly URL encoded 
  this then becomes %c2a0 and depending on the encoding you use to view you 
  probably see  followed by a space. For example:
 
  Because c2a0 is not considered whitespace (indeed, it is not real 
  whitespace, that is 00a0) by the Java Character class, the 
  WhitespaceTokenizer won't split on it, but the WordDelimiterFilter still 
  does, somehow mitigating the problem as it becomes:
 
  HTMLSCF een abonnement
  WT een abonnement
  WDF een eenabonnement abonnement
 
  Should the WhitespaceTokenizer not include this weird edge case?
 
  Cheers,
  Markus
 


RE: does one need to reindex when changing similarity class

2014-10-09 Thread Markus Jelsma
Hi - no, you don't have to, although maybe you do if you changed how norms are 
encoded.
Markus

 
 
-Original message-
 From:elisabeth benoit elisaelisael...@gmail.com
 Sent: Thursday 9th October 2014 12:26
 To: solr-user@lucene.apache.org
 Subject: does one need to reindex when changing similarity class
 
 I've read somewhere that we do have to reindex when changing similarity
 class. Is that right?
 
 Thanks again,
 Elisabeth
 


RE: per field similarity not working with solr 4.2.1

2014-10-09 Thread Markus Jelsma
Hi - it should work; not seeing your implementation in the debug output is a 
known issue.
 
 
-Original message-
 From:elisabeth benoit elisaelisael...@gmail.com
 Sent: Thursday 9th October 2014 12:22
 To: solr-user@lucene.apache.org
 Subject: per field similarity not working with solr 4.2.1
 
 Hello,
 
  I am using Solr 4.2.1 and I've tried to use a per field similarity, as
 described in
 
 https://apache.googlesource.com/lucene-solr/+/c5bb5cd921e1ce65e18eceb55e738f40591214f0/solr/core/src/test-files/solr/collection1/conf/schema-sim.xml
 
 so in my schema I have
 
  <schema name="search" version="1.4">
  <similarity class="solr.SchemaSimilarityFactory"/>
  
  and a custom similarity in fieldtype definition
  
  <fieldType name="text" class="solr.TextField"
  positionIncrementGap="100">
   <similarity
  class="com.company.lbs.solr.search.similarity.NoTFSimilarity"/>
    <analyzer type="index">
  ...
 
 but it is not working
 
 when I send a request with debugQuery=on, instead of [
 NoTFSimilarity], I see []
 
 or to give an example, I have
 
 
 weight(catchall:bretagn in 2575) []
 
 instead of weight(catchall:bretagn in 2575) [NoTFSimilarity]
 
 Anyone has a clue what I am doing wrong?
 
 Best regards,
 Elisabeth
 


RE: per field similarity not working with solr 4.2.1

2014-10-09 Thread Markus Jelsma
Well, you can either check whether the scores match your calculation, or write 
something to System.out from within your similarity class.
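
The actual NoTFSimilarity is not shown in this thread, but as a sketch, such a class typically just overrides tf(), and a temporary print there makes it obvious whether the class is picked up (this is a guess at the implementation, not the poster's code):

 import org.apache.lucene.search.similarities.DefaultSimilarity;

 public class NoTFSimilarity extends DefaultSimilarity {
   @Override
   public float tf(float freq) {
     // ignore term frequency completely; this temporary line will show up
     // in the container logs whenever the class is actually used
     System.out.println("NoTFSimilarity.tf called");
     return freq > 0 ? 1.0f : 0.0f;
   }
 }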
Markus

 
 
-Original message-
 From:elisabeth benoit elisaelisael...@gmail.com
 Sent: Thursday 9th October 2014 13:31
 To: solr-user@lucene.apache.org
 Subject: Re: per field similarity not working with solr 4.2.1
 
 Thanks for the information!
 
 I've been struggling with that debug output. Any other way to know for sure
 my similarity class is being used?
 
 Thanks again,
 Elisabeth
 
 2014-10-09 13:03 GMT+02:00 Markus Jelsma markus.jel...@openindex.io:
 
  Hi - it should work, not seeing your implemenation in the debug output is
  a known issue.
 
 
  -Original message-
   From:elisabeth benoit elisaelisael...@gmail.com
   Sent: Thursday 9th October 2014 12:22
   To: solr-user@lucene.apache.org
   Subject: per field similarity not working with solr 4.2.1
  
   Hello,
  
   I am using Solr 4..2.1 and I've tried to use a per field similarity, as
   described in
  
  
  https://apache.googlesource.com/lucene-solr/+/c5bb5cd921e1ce65e18eceb55e738f40591214f0/solr/core/src/test-files/solr/collection1/conf/schema-sim.xml
  
   so in my schema I have
  
   schema name=search version=1.4
   similarity class=solr.SchemaSimilarityFactory/
  
   and a custom similarity in fieldtype definition
  
   fieldType name=text class=solr.TextField
   positionIncrementGap=100
similarity
   class=com.company.lbs.solr.search.similarity.NoTFSimilarity/
  analyzer type=index
   ...
  
   but it is not working
  
   when I send a request with debugQuery=on, instead of [
   NoTFSimilarity], I see []
  
   or to give an example, I have
  
  
   weight(catchall:bretagn in 2575) []
  
   instead of weight(catchall:bretagn in 2575) [NoTFSimilarity]
  
   Anyone has a clue what I am doing wrong?
  
   Best regards,
   Elisabeth
  
 
 


RE: does one need to reindex when changing similarity class

2014-10-13 Thread Markus Jelsma
Yes, if the replacing similarity has a different implementation of norms, you 
should reindex or gradually update all documents within a decent time frame.

 
 
-Original message-
 From:Ahmet Arslan iori...@yahoo.com.INVALID
 Sent: Thursday 9th October 2014 18:27
 To: solr-user@lucene.apache.org
 Subject: Re: does one need to reindex when changing similarity class
 
 How about SweetSpotSimilarity? Length norm is saved at index time?
 
 
 
 On Thursday, October 9, 2014 5:44 PM, Jack Krupansky 
 j...@basetechnology.com wrote:
 The similarity class is only invoked at query time, so it doesn't 
 participate in indexing.
 
 -- Jack Krupansky
 
 
 
 
 -Original Message- 
 From: Markus Jelsma
 Sent: Thursday, October 9, 2014 6:59 AM
 To: solr-user@lucene.apache.org
 Subject: RE: does one need to reindex when changing similarity class
 
 Hi - no you don't have to, although maybe if you changed on how norms are 
 encoded.
 Markus
 
 
 
 -Original message-
  From:elisabeth benoit elisaelisael...@gmail.com
  Sent: Thursday 9th October 2014 12:26
  To: solr-user@lucene.apache.org
  Subject: does one need to reindex when changing similarity class
 
  I've read somewhere that we do have to reindex when changing similarity
  class. Is that right?
 
  Thanks again,
  Elisabeth
  
 


Re: Recovering from Out of Mem

2014-10-14 Thread Markus Jelsma
And don't forget to set the proper permissions on the script so that the tomcat or 
jetty user can execute it.
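
For example, assuming a Debian-style Tomcat 6 install (user and path are assumptions):

 chown tomcat6:tomcat6 /usr/share/tomcat6/bin/java_oom.sh
 chmod 750 /usr/share/tomcat6/bin/java_oom.sh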

Markus

On Tuesday 14 October 2014 13:47:47 Boogie Shafer wrote:
 a really simple approach is to have the OOM generate an email
 
 e.g.
 
 1) create a simple script (call it java_oom.sh) and drop it in your tomcat
 bin dir
 
 
 echo `date` | mail -s Java Error: OutOfMemory - $HOSTNAME
 not...@domain.com
 
 
 2) configure your java options (in setenv.sh or similar) to trigger heap
 dump and the email script when OOM occurs
 
 # config error behaviors
 CATALINA_OPTS=$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError
 -XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof
 -XX:OnError=$TOMCAT_DIR/bin/java_error.sh
 -XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh
 -XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log
 
 
 
 
 From: Mark Miller markrmil...@gmail.com
 Sent: Tuesday, October 14, 2014 06:30
 To: solr-user@lucene.apache.org
 Subject: Re: Recovering from Out of Mem
 
 Best is to pass the Java cmd line option that kills the process on OOM and
 setup a supervisor on the process to restart it.  You need a somewhat
 recent release for this to work properly though.
 
 - Mark
 
  On Oct 14, 2014, at 9:06 AM, Salman Akram
  salman.ak...@northbaysolutions.net wrote:
  
  I know there are some suggestions to avoid OOM issue e.g. setting
  appropriate Max Heap size etc. However, what's the best way to recover
  from
  it as it goes into non-responding state? We are using Tomcat on back end.
  
  The scenario is that once we face OOM issue it keeps on taking queries
  (doesn't give any error) but they just time out. So even though we have a
  fail over system implemented but we don't have a way to distinguish if
  these are real time out queries OR due to OOM.
  
  --
  Regards,
  
  Salman Akram



Re: Recovering from Out of Mem

2014-10-14 Thread Markus Jelsma
This will do:
kill -9 `ps aux | grep -v grep | grep tomcat6 | awk '{print $2}'`

pkill should also work

On Tuesday 14 October 2014 07:02:03 Yago Riveiro wrote:
 Boogie,
 
 
 
 
 Any example for java_error.sh script?
 
 
 —
 /Yago Riveiro
 
 On Tue, Oct 14, 2014 at 2:48 PM, Boogie Shafer boogie.sha...@proquest.com
 
 wrote:
  a really simple approach is to have the OOM generate an email
  e.g.
  1) create a simple script (call it java_oom.sh) and drop it in your tomcat
  bin dir echo `date` | mail -s Java Error: OutOfMemory - $HOSTNAME
  not...@domain.com 2) configure your java options (in setenv.sh or
  similar) to trigger heap dump and the email script when OOM occurs #
  config error behaviors
  CATALINA_OPTS=$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof
  -XX:OnError=$TOMCAT_DIR/bin/java_error.sh
  -XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh
  -XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log
  
  From: Mark Miller markrmil...@gmail.com
  Sent: Tuesday, October 14, 2014 06:30
  To: solr-user@lucene.apache.org
  Subject: Re: Recovering from Out of Mem
  Best is to pass the Java cmd line option that kills the process on OOM and
  setup a supervisor on the process to restart it.  You need a somewhat
  recent release for this to work properly though. - Mark
  
  On Oct 14, 2014, at 9:06 AM, Salman Akram
  salman.ak...@northbaysolutions.net wrote:
  
  I know there are some suggestions to avoid OOM issue e.g. setting
  appropriate Max Heap size etc. However, what's the best way to recover
  from
  it as it goes into non-responding state? We are using Tomcat on back end.
  
  The scenario is that once we face OOM issue it keeps on taking queries
  (doesn't give any error) but they just time out. So even though we have a
  fail over system implemented but we don't have a way to distinguish if
  these are real time out queries OR due to OOM.
  
  --
  Regards,
  
  Salman Akram



RE: update external file

2014-10-23 Thread Markus Jelsma
You either need to upload the files to the machine and issue the reload command, or 
have the machine download them, and then issue the reload command. There is no REST 
support for it (yet) like there is for the synonym filter, or was it the stop filter?
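
For reference, the usual flow is roughly: put the new file in the core's data directory, named external_<fieldname>, then trigger a reload of the searcher. Reloading the file on every commit can be automated with listeners in solrconfig.xml (a sketch):

 <listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
 <listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>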

Markus
 
-Original message-
 From:Michael Sokolov msoko...@safaribooksonline.com
 Sent: Thursday 23rd October 2014 19:19
 To: solr-user solr-user@lucene.apache.org
 Subject: update external file
 
 I've been looking at ExternalFileField to handle popularity boosting.  
 Since Solr updatable docvalues (SOLR-5944) isn't quite there yet.  My 
 question is whether there is any support for uploading the external file 
 via Solr, or if people do that some other (external, I guess) way?
 
 -Mike
 


RE: Stopwords in shingles suggester

2014-10-27 Thread Markus Jelsma
You do not want stopwords in your shingles? Then put the stopword filter before 
(on top of) the shingle filter in the analyzer chain.
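
For example (a sketch, not a full field type):

 <analyzer>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
   <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
 </analyzer>

Note that position gaps left by the stop filter may still show up as filler tokens in the shingles, depending on the shingle filter settings.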
Markus
 
-Original message-
 From:O. Klein kl...@octoweb.nl
 Sent: Monday 27th October 2014 13:56
 To: solr-user@lucene.apache.org
 Subject: Stopwords in shingles suggester
 
 Is there a way in Solr to filter out stopwords in shingles like ES does?
 
 http://www.elasticsearch.org/blog/searching-with-shingles/
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Stopwords-in-shingles-suggester-tp4166057.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Markus Jelsma
It is an ancient issue. One of the major contributing factors was resolved 
some versions ago, but we are still seeing it sometimes too; there is nothing to 
see in the logs. We ignore it and just reindex.

-Original message-
 From:S.L simpleliving...@gmail.com
 Sent: Monday 27th October 2014 16:25
 To: solr-user@lucene.apache.org
 Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out 
 of synch.
 
 Thank Otis,
 
 I have checked the logs , in my case the default catalina.out and I dont
 see any OOMs or , any other exceptions.
 
 What others metrics do you suggest ?
 
 On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:
 
  Hi,
 
  You may simply be overwhelming your cluster-nodes. Have you checked
  various metrics to see if that is the case?
 
  Otis
  --
  Monitoring * Alerting * Anomaly Detection * Centralized Log Management
  Solr  Elasticsearch Support * http://sematext.com/
 
 
 
   On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com wrote:
  
   Folks,
  
   I have posted previously about this , I am using SolrCloud 4.10.1 and
  have
   a sharded collection with  6 nodes , 3 shards and a replication factor
  of 2.
  
   I am indexing Solr using a Hadoop job , I have 15 Map fetch tasks , that
   can each have upto 5 threds each , so the load on the indexing side can
  get
   to as high as 75 concurrent threads.
  
   I am facing an issue where the replicas of a particular shard(s) are
   consistently getting out of synch , initially I thought this was
  beccause I
   was using a custom component , but I did a fresh install and removed the
   custom component and reindexed using the Hadoop job , I still see the
  same
   behavior.
  
   I do not see any exceptions in my catalina.out , like OOM , or any other
   excepitions, I suspecting thi scould be because of the multi-threaded
   indexing nature of the Hadoop job . I use CloudSolrServer from my java
  code
   to index and initialize the CloudSolrServer using a 3 node ZK ensemble.
  
   Does any one know of any known issues with a highly multi-threaded
  indexing
   and SolrCloud ?
  
   Can someone help ? This issue has been slowing things down on my end for
  a
   while now.
  
   Thanks and much appreciated!
 
 


RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Markus Jelsma
https://issues.apache.org/jira/browse/SOLR-4260 resolved
https://issues.apache.org/jira/browse/SOLR-4924 open

 
 
-Original message-
 From:Michael Della Bitta michael.della.bi...@appinions.com
 Sent: Monday 27th October 2014 16:40
 To: solr-user@lucene.apache.org
 Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out 
 of synch.
 
 I'm curious, could you elaborate on the issue and the partial fix?
 
 Thanks!
 
 On 10/27/14 11:31, Markus Jelsma wrote:
  It is an ancient issue. One of the major contributors to the issue was 
  resolved some versions ago but we are still seeing it sometimes too, there 
  is nothing to see in the logs. We ignore it and just reindex.
 
  -Original message-
  From:S.L simpleliving...@gmail.com
  Sent: Monday 27th October 2014 16:25
  To: solr-user@lucene.apache.org
  Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas 
  out of synch.
 
  Thank Otis,
 
  I have checked the logs , in my case the default catalina.out and I dont
  see any OOMs or , any other exceptions.
 
  What others metrics do you suggest ?
 
  On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
  Hi,
 
  You may simply be overwhelming your cluster-nodes. Have you checked
  various metrics to see if that is the case?
 
  Otis
  --
  Monitoring * Alerting * Anomaly Detection * Centralized Log Management
  Solr  Elasticsearch Support * http://sematext.com/
 
 
 
  On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com wrote:
 
  Folks,
 
  I have posted previously about this , I am using SolrCloud 4.10.1 and
  have
  a sharded collection with  6 nodes , 3 shards and a replication factor
  of 2.
  I am indexing Solr using a Hadoop job , I have 15 Map fetch tasks , that
  can each have upto 5 threds each , so the load on the indexing side can
  get
  to as high as 75 concurrent threads.
 
  I am facing an issue where the replicas of a particular shard(s) are
  consistently getting out of synch , initially I thought this was
  beccause I
  was using a custom component , but I did a fresh install and removed the
  custom component and reindexed using the Hadoop job , I still see the
  same
  behavior.
 
  I do not see any exceptions in my catalina.out , like OOM , or any other
  excepitions, I suspecting thi scould be because of the multi-threaded
  indexing nature of the Hadoop job . I use CloudSolrServer from my java
  code
  to index and initialize the CloudSolrServer using a 3 node ZK ensemble.
 
  Does any one know of any known issues with a highly multi-threaded
  indexing
  and SolrCloud ?
 
  Can someone help ? This issue has been slowing things down on my end for
  a
  while now.
 
  Thanks and much appreciated!
 
 


RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Markus Jelsma
Hi - if there is a very large discrepancy, you could consider purging the 
smallest replica; it will then resync from the leader.
 
 
-Original message-
 From:S.L simpleliving...@gmail.com
 Sent: Monday 27th October 2014 16:41
 To: solr-user@lucene.apache.org
 Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out 
 of synch.
 
 Markus,
 
 I would like to ignore it too, but whats happening is that the there is a
 lot of discrepancy between the replicas , queries like
 q=*:*fq=(id:220a8dce-3b31-4d46-8386-da8405595c47) fail depending on which
 replica the request goes to, because of huge amount of discrepancy between
 the replicas.
 
 Thank you for confirming that it is a know issue , I was thinking I was the
 only one facing this due to my set up.
 
 On Mon, Oct 27, 2014 at 11:31 AM, Markus Jelsma markus.jel...@openindex.io
 wrote:
 
  It is an ancient issue. One of the major contributors to the issue was
  resolved some versions ago but we are still seeing it sometimes too, there
  is nothing to see in the logs. We ignore it and just reindex.
 
  -Original message-
   From:S.L simpleliving...@gmail.com
   Sent: Monday 27th October 2014 16:25
   To: solr-user@lucene.apache.org
   Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas
  out of synch.
  
   Thank Otis,
  
   I have checked the logs , in my case the default catalina.out and I dont
   see any OOMs or , any other exceptions.
  
   What others metrics do you suggest ?
  
   On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic 
   otis.gospodne...@gmail.com wrote:
  
Hi,
   
You may simply be overwhelming your cluster-nodes. Have you checked
various metrics to see if that is the case?
   
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/
   
   
   
 On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com wrote:

 Folks,

 I have posted previously about this , I am using SolrCloud 4.10.1 and
have
 a sharded collection with  6 nodes , 3 shards and a replication
  factor
of 2.

 I am indexing Solr using a Hadoop job , I have 15 Map fetch tasks ,
  that
 can each have upto 5 threds each , so the load on the indexing side
  can
get
 to as high as 75 concurrent threads.

 I am facing an issue where the replicas of a particular shard(s) are
 consistently getting out of synch , initially I thought this was
beccause I
 was using a custom component , but I did a fresh install and removed
  the
 custom component and reindexed using the Hadoop job , I still see the
same
 behavior.

 I do not see any exceptions in my catalina.out , like OOM , or any
  other
 excepitions, I suspecting thi scould be because of the multi-threaded
 indexing nature of the Hadoop job . I use CloudSolrServer from my
  java
code
 to index and initialize the CloudSolrServer using a 3 node ZK
  ensemble.

 Does any one know of any known issues with a highly multi-threaded
indexing
 and SolrCloud ?

 Can someone help ? This issue has been slowing things down on my end
  for
a
 while now.

 Thanks and much appreciated!
   
  
 


Re: SolrCloud config question and zookeeper

2014-10-28 Thread Markus Jelsma
On Tuesday 28 October 2014 10:42:11 Bernd Fehling wrote:
 Thanks for the explanations.
 
 My idea about 4 zookeepers is a result of having the same software
 (java, zookeeper, solr, ...) installed on all 4 servers.
 But yes, I don't need to start a zookeeper on the 4th server.
 
 3 other machines outside the cloud for ZK seams a bit oversized.
 And you have another point of failure with the network between ZK and cloud.
 If one of the cloud servers end up in smoke the ZK system should
 still work with ZK and cloud on the same servers.
 
 So the offline argument says the first thing I start is ZK and
 the last I shutdown is ZK. Good point.
 
  While moving from master-slave to cloud I'm aware of the fact that
 all shards have to be connected to ZK. But how can I tell ZK that
 on server_1 is leader shard_1 AND replica shard_4 ?

You don't, it will elect a leader by itself.

 
 Unfortunately the Getting Started with SolrCloud is a bit short on this.
 
 
 Regards
 Bernd
 
 Am 28.10.2014 um 09:15 schrieb Daniel Collins:
  As Michael says, you really want an odd number of zookeepers in order to
  meet the quorum requirements (which based on your comments you seem to be
  aware of).  There is nothing wrong with 4 ZKs as such, just that it
  doesn't buy you anything above having 3, so its one more that might go
  wrong and cause you problems.  In your case, I would suggest you just pick
  the first 3 machines to run ZK or even have 3 other machines outside the
  cloud to house ZK.
  
  The offline argument is also a good one, you really want your ZK instances
  to be longer lived than Solr, whilst you can restart individual Cores
  within a Solr Instance, it is often (at least for us) more convenient to
  bounce the whole java instance.  In that scenario (again just re-iterating
  what Michael said), you don't want ZK to be down at the same time.
  
  If you are using Solr Cloud, then all your replicas need to be connected
  to
  ZK, you can't have the master instances in ZK, and the replicas not
  connected (that's more of the old Master-Slave replication system which is
  still available but orthogonal to Cloud).
  
  
  On 28 October 2014 07:01, Bernd Fehling bernd.fehl...@uni-bielefeld.de
  
  wrote:
  Yes, garbage collection is a very good argument to have external
  zookeepers. I haven't thought about that.
  But does this also mean seperate server for each zookeeper or
  can they live side by side with solr on the same server?
  
  
  What is the problem with 4 zookeepers beside that I have no real
  gain against 3 zookeepers (only 1 can fail)?
  
  
  Regards
  Bernd
  
  Am 27.10.2014 um 15:41 schrieb Michael Della Bitta:
  You want external zookeepers. Partially because you don't want your
  Solr garbage collections holding up zookeeper availability,
  but also because you don't want your zookeepers going offline if
  you have to restart Solr for some reason.
  
  Also, you want 3 or 5 zookeeepers, not 4 or 8.
  
  On 10/27/14 10:35, Bernd Fehling wrote:
  While starting now with SolrCloud I tried to understand the sense
  of external zookeeper.
  
  Let's assume I want to split 1 huge collection accross 4 server.
  My straight forward idea is to setup a cloud with 4 shards (one
  on each server) and also have a replication of the shard on another
  server.
  server_1: shard_1, shard_replication_4
  server_2: shard_2, shard_replication_1
  server_3: shard_3, shard_replication_2
  server_4: shard_4, shard_replication_3
  
  In this configuration I always have all 4 shards available if
  one server fails.
  
  But now to zookeeper. I would start the internal zookeeper for
  all shards including replicas. Does this make sense?
  
  
  Or I only start the internal zookeeper for shard 1 to 4 but not
  the replicas. Should be good enough, one server can fail, or not?
  
  
  Or I follow the recommendations and install on all 4 server
  an external seperate zookeeper, but what is the advantage against
  having the internal zookeeper on each server?
  
  
  I really don't get it at this point. Can anyone help me here?
  
  Regards
  Bernd



RE: MoreLikeThis filter by score threshold

2015-02-03 Thread Markus Jelsma
Hi - sure you can, using the frange parser as a filter:

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser
http://lucene.apache.org/solr/4_10_3/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html

But this is very much not recommended, at all, so don't do it:

https://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_filter_by_score.3F
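
For completeness, with a regular query request the (discouraged) pattern looks roughly like this, keeping only results whose score for the main query is at least 0.5:

fq={!frange l=0.5}query($q)

Scores are not absolute or comparable across queries, which is exactly why the FAQ entry above advises against using them as a threshold.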
 
-Original message-
 From:Ali Nazemian alinazem...@gmail.com
 Sent: Tuesday 3rd February 2015 16:22
 To: solr-user@lucene.apache.org
 Subject: MoreLikeThis filter by score threshold
 
 Hi,
 I was wondering how can I limit the result of MoreLikeThis query by the
 score value instead of filtering them by document count?
 Thank you very much.
 
 -- 
 A.Nazemian
 


RE: Score results by only the highest scoring term

2015-02-03 Thread Markus Jelsma
Either use the MaxScoreQueryParser [1] or set tie to zero when using a DisMax 
parser.
[1]: 
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-MaxScoreQueryParser
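
For example, with the first option the query would look something like:

q={!maxscore tie=0.0}termA OR termB OR termC OR termD

With tie=0.0 only the best-scoring clause contributes per document; a small non-zero tie would let the other matching clauses add a fraction of their score.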

 
 
-Original message-
 From:Burgmans, Tom tom.burgm...@wolterskluwer.com
 Sent: Tuesday 3rd February 2015 16:13
 To: solr-user@lucene.apache.org
 Subject: Score results by only the highest scoring term
 
 Hi All,
 
 I wonder if it's in some way possible to search for multiple terms like:
 
 (term A OR term B OR term C OR term D)
 
 and in case a document contains 2 or more of these terms: only the highest 
 scoring term should contribute to the final relevancy score; possibly lower 
 scoring  terms should be discarded from the scoring algorithm.
 
 Ideally I'd like an operator like ANY:
 
 (term A ANY term B ANY term C ANY term D)
 
 that has the purpose: return documents, sorted by the score of the highest 
 scoring term.
 
 Any thoughts about how to achieve this?
 
 _
 Tom Burgmans
 
 


RE: low qps with high load averages on solrcloud

2015-02-04 Thread Markus Jelsma
We recently upgraded our cloud from 4.8 to 4.10.3, the only config we updated 
was the luceneMatchVersion. Response times were very stable prior to the 
upgrade, but are quite erratic since the upgrade, and rising. I still have to 
check all the resolved issues but something went very wrong between 4.8 and 
4.10.3.

M. 
 
-Original message-
 From:Toke Eskildsen t...@statsbiblioteket.dk
 Sent: Wednesday 4th February 2015 20:58
 To: solr-user@lucene.apache.org
 Subject: RE: low qps with high load averages on solrcloud
 
 Suchi Amalapurapu [su...@bloomreach.com] wrote:
  Noticed that a solrcloud cluster doesn't scale linearly with # of nodes
  unlike the unsharded solr cluster. We are seeing a 10 fold drop in QPS in
  multi sharded mode.
 
 As I understand it, you changed from single to multi shard.
 
 Guessing wildly: You have one or more facets with a non-trivial (10K or more) 
 number of unique String values and you have a fairly high facet.limit (50+). 
 If so, what you see might be the penalty for the two-phase faceting with 
 SolrCloud, where the second fine-counting phase can be markedly slower than 
 the first. There are ways to help with that, but let's hear if my guess is 
 correct first.
 
 - Toke Eskildsen
 


RE: Lucene cosine similarity score for more like this query

2015-02-02 Thread Markus Jelsma
Hi - MoreLikeThis is not based on cosine similarity. The idea is that rare 
terms - high IDF - are extracted from the source document and then used to 
build a regular Query(). That query follows the same rules as regular queries, 
i.e. the rules of your similarity implementation, which is TFIDF by default. 
So, as suggested, if you enable debugging, you can clearly see why scores can be 
above 1, or even much higher if queryNorm is disabled when using BM25 as the similarity.

If you really need cosine similarity between documents, you have to enable term 
vectors on the source fields and use them to calculate the angle. The problem 
is that this does not scale well: you would need to calculate angles with 
virtually all other documents.
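
For illustration only, a self-contained sketch of the cosine computation described above, operating on raw term-frequency maps that would be built from stored term vectors; it is not tied to any Solr or Lucene API:

import java.util.Map;

public class CosineSketch {
  // Cosine similarity between two raw term-frequency vectors.
  public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      Integer other = b.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * (double) other;   // shared terms contribute to the dot product
      }
      normA += e.getValue() * (double) e.getValue();
    }
    for (int v : b.values()) {
      normB += v * (double) v;
    }
    if (normA == 0 || normB == 0) {
      return 0.0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}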

M.
 
-Original message-
 From:Ali Nazemian alinazem...@gmail.com
 Sent: Monday 2nd February 2015 21:39
 To: solr-user@lucene.apache.org
 Subject: Re: Lucene cosine similarity score for more like this query
 
 Dear Erik,
 Thank you for your response. Would you please tell me why this score could
 be higher than 1? While cosine similarity cannot be higher than 1.
 On Feb 2, 2015 7:32 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
 
  The scoring is the same as Lucene.  To get deeper insight into how a score
  is computed, use Solr’s debug=true mode to see the explain details in the
  response.
 
  Erik
 
   On Feb 2, 2015, at 10:49 AM, Ali Nazemian alinazem...@gmail.com wrote:
  
   Hi,
   I was wondering what is the range of score is brought by more like this
   query in Solr? I know that the Lucene uses cosine similarity in vector
   space model for calculating similarity between two documents. I also know
   that cosine similarity is between -1 and 1 but the fact that I dont
   understand is why the score which is brought by more like this query
  could
   be 12 for example?! Would you please explain what is the calculation
   process is Solr?
   Thank you very much.
  
   Best regards.
  
   --
   A.Nazemian
 
 
 


RE: Hit Highlighting and More Like This

2015-02-02 Thread Markus Jelsma
Hi - you can use the MLT query parser in Solr 5.0 or patch 4.10.x
https://issues.apache.org/jira/browse/SOLR-6248
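
A hedged SolrJ sketch of combining the {!mlt} parser with highlighting once SOLR-6248 is in place; the field names, MLT parameters and document id are placeholders:

import org.apache.solr.client.solrj.SolrQuery;

public class MltHighlightSketch {
  public static SolrQuery build(String sourceDocId) {
    SolrQuery q = new SolrQuery();
    // Build a query from the interesting terms of the source document;
    // qf, mintf and mindf are illustrative values.
    q.setQuery("{!mlt qf=title,body mintf=1 mindf=5}" + sourceDocId);
    // Highlight those terms in the matching documents.
    q.setHighlight(true);
    q.addHighlightField("body");
    return q;
  }
}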


 
-Original message-
 From:Tim Hearn timseman...@gmail.com
 Sent: Saturday 31st January 2015 0:31
 To: solr-user@lucene.apache.org
 Subject: Hit Highlighting and More Like This
 
 Hi all,
 
 I'm fairly new to Solr.  It seems like it should be possible to enable the
 hit highlighting feature and more like this feature at the same time, with
 the key words from the MLT query being the terms highlighted.  Is this
 possible?  I am trying right now to do this, but I am not having any
 snippets returned to me.
 
 Thanks!
 


RE: Question regarding SolrIndexSearcher implementation

2015-02-02 Thread Markus Jelsma
From memory: there are different methods in SolrIndexSearcher for a reason. It 
has to do with paging and sorting. Whenever you sort on a simple field, you can 
easily start at a specific offset. The problem with sorting on score is that 
the score has to be calculated for all documents matching the query before the 
ranking is known. This means that deep paging is a problem, and indeed it is.
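
To make the cost concrete, here is a small sketch (not Solr's actual code) of score-sorted paging against a plain Lucene IndexSearcher; whether you slice with topDocs(0, len) or topDocs(offset, rows), the collector still has to rank offset + rows hits:

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;

public class DeepPagingSketch {
  public static TopDocs page(IndexSearcher searcher, Query query,
                             int offset, int rows) throws IOException {
    // A document's rank is only known once all matches are scored, so the
    // collector must keep the full top (offset + rows) hits in its heap.
    TopScoreDocCollector collector = TopScoreDocCollector.create(offset + rows, true);
    searcher.search(query, collector);
    // topDocs(start, howMany) merely slices the already collected heap;
    // passing 0 or offset here does not change the collection cost.
    return collector.topDocs(offset, rows);
  }
}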
 
-Original message-
 From:Biyyala, Shishir (Contractor) shishir_biyy...@cable.comcast.com
 Sent: Monday 2nd February 2015 22:22
 To: solr-user@lucene.apache.org
 Cc: java-u...@lucene.apache.org
 Subject: Question regarding SolrIndexSearcher implementation
 
 Hello, 
 
 I did not know what the right mailing list would be (java-user vs solr-user), 
 so mailing both.
 
 My group uses solr/lucene, and we have custom collectors.
 
 I stumbled upon the implementation of SolrIndexSearcher.java and saw this :
 
 https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java
   (line 1676)
 
  TopDocs topDocs = topCollector.topDocs(0, len); the topDocs start value 
 is always being hardcoded to 0;
 
 What this leads to is the creation of more topDocs than the application 
 actually needs. My application can potentially face deep pagination, and we 
 do not use the queryResults cache. 
 
 If I request for 200-250 docs,
 
 I was expecting start=199, howMany=51;
 But turns out that start=0 (always) and howMany=250
 
 Any reasons why start value is hardcoded to 0? Please suggest. It is 
 potentially impacting performance of our application.
 
 Thanks much,
 Shishir


RE: More Like This similarity tuning

2015-02-04 Thread Markus Jelsma
Well, maxqt is easy: it is just the number of terms that compose your query. 
MinTF is a strange parameter; rare terms have a low DF and usually not a high 
TF either, so I would keep it at 1. MinDF is more useful, and it depends 
entirely on the size of your corpus. If you have a lot of user-generated input 
- meaning badly spelled terms - then you have to set MinDF to a value higher 
than the document frequency of the most frequent misspellings, but low enough 
to still find rare terms.

It depends on your index.
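
For illustration, a hedged SolrJ sketch of passing these knobs to a MoreLikeThis handler; the handler path, field names and values are assumptions to be tuned per index:

import org.apache.solr.client.solrj.SolrQuery;

public class MltTuningSketch {
  public static SolrQuery build(String sourceDocId) {
    SolrQuery q = new SolrQuery("id:" + sourceDocId);
    q.setRequestHandler("/mlt");       // assumes a MoreLikeThisHandler mapped to /mlt
    q.set("mlt.fl", "title,body");     // fields to mine for interesting terms
    q.set("mlt.mintf", 1);             // keep rare terms even if they occur only once
    q.set("mlt.mindf", 5);             // drop terms appearing in fewer than 5 documents
    q.set("mlt.maxqt", 25);            // number of terms composing the generated query
    return q;
  }
}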

-Original message-
 From:Ali Nazemian alinazem...@gmail.com
 Sent: Wednesday 4th February 2015 11:15
 To: solr-user@lucene.apache.org
 Subject: More Like This similarity tuning
 
 Hi,
 I am looking for a best practice on More Like This parameters. I really
 appreciate if somebody can tell me what is the best value for these
 parameters in MLT query? Or at lease the proper methodology for finding the
 best value for each of these parameters:
 mlt.mintf
 mlt.mindf
 mlt.maxqt
 
 Thank you very much.
 Best regards.
 
 -- 
 A.Nazemian
 


RE: MoreLikeThis filter by score threshold

2015-02-04 Thread Markus Jelsma
Hello Upayavira - Indeed, it works, except ... insert-counter-arguments. It 
doesn't work after all :) 
Markus

-Original message-
 From:Upayavira u...@odoko.co.uk
 Sent: Tuesday 3rd February 2015 21:38
 To: solr-user@lucene.apache.org
 Subject: Re: MoreLikeThis filter by score threshold
 
 I've seen this done (encouraged against it, but didn't win). It works.
 Except, sometimes things change in the index, and the scores change
 subtly. We get complaints that documents that previously were above the
 threshold now aren't, and visa-versa. I try to explain that the score
 has no meaning between two search requests, but unfortunately, there's
 *enough* similarity between requests to make it work, *sometimes*. But
 when it doesn't work, people get baffled, and don't accept the truth as
 an answer (you can't use scores to compare separate sets of search
 results).
 
 Upayavira
 
 On Tue, Feb 3, 2015, at 08:01 PM, Ali Nazemian wrote:
  Dear Markus,
  Hi,
  Thank you very much for your response. I did check the reason why it is
  not
  recommended to filter by score in search query. But I think it is
  reasonable to filter by score in case of finding similar documents. I
  know
  in both of them (simple search query and mlt query) vsm of tf-idf
  similarity is used to calculate the score of documents, but suppose you
  indexed news as document in solr and you want to find all enough similar
  news for the specific one. In this case I think it is reasonable to
  filter
  similar documents by score threshold. Please correct me if I am wrong.
  Thank you very much.
  Regards.
  On Feb 3, 2015 7:00 PM, Markus Jelsma markus.jel...@openindex.io
  wrote:
  
   Hi - sure you can, using the frange parser as a filter:
  
  
   https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser
  
   http://lucene.apache.org/solr/4_10_3/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html
  
   But this is very much not recommended, at all, so don't do it:
  
   https://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_filter_by_score.3F
  
   -Original message-
From:Ali Nazemian alinazem...@gmail.com
Sent: Tuesday 3rd February 2015 16:22
To: solr-user@lucene.apache.org
Subject: MoreLikeThis filter by score threshold
   
Hi,
I was wondering how can I limit the result of MoreLikeThis query by the
score value instead of filtering them by document count?
Thank you very much.
   
--
A.Nazemian
   
  
 


RE: MoreLikeThis filter by score threshold

2015-02-04 Thread Markus Jelsma
Hello Ali - no, it is not reasonable, and it is unnecessary at best. Regardless 
of the query, you sort by score. This means that the top documents are always the 
most relevant, so what exactly do you need to filter?
 
-Original message-
 From:Ali Nazemian alinazem...@gmail.com
 Sent: Tuesday 3rd February 2015 21:02
 To: solr-user@lucene.apache.org
 Subject: RE: MoreLikeThis filter by score threshold
 
 Dear Markus,
 Hi,
 Thank you very much for your response. I did check the reason why it is not
 recommended to filter by score in search query. But I think it is
 reasonable to filter by score in case of finding similar documents. I know
 in both of them (simple search query and mlt query) vsm of tf-idf
 similarity is used to calculate the score of documents, but suppose you
 indexed news as document in solr and you want to find all enough similar
 news for the specific one. In this case I think it is reasonable to filter
 similar documents by score threshold. Please correct me if I am wrong.
 Thank you very much.
 Regards.
 On Feb 3, 2015 7:00 PM, Markus Jelsma markus.jel...@openindex.io wrote:
 
  Hi - sure you can, using the frange parser as a filter:
 
 
  https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser
 
  http://lucene.apache.org/solr/4_10_3/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html
 
  But this is very much not recommended, at all, so don't do it:
 
  https://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_filter_by_score.3F
 
  -Original message-
   From:Ali Nazemian alinazem...@gmail.com
   Sent: Tuesday 3rd February 2015 16:22
   To: solr-user@lucene.apache.org
   Subject: MoreLikeThis filter by score threshold
  
   Hi,
   I was wondering how can I limit the result of MoreLikeThis query by the
   score value instead of filtering them by document count?
   Thank you very much.
  
   --
   A.Nazemian
  
 
 


Re: OutOfMemoryError for PDF document upload into Solr

2015-01-16 Thread Markus Jelsma
Tika 1.6 has PDFBox 1.8.4, which has memory issues, eating excessive RAM! 
Either upgrade to Tika 1.7 (out now) or manually use the PDFBox 1.8.8 
dependency.

M.

On Friday 16 January 2015 15:21:55 Charlie Hull wrote:
 On 16/01/2015 04:02, Dan Davis wrote:
  Why re-write all the document conversion in Java ;)  Tika is very slow.  
  5
  GB PDF is very big.
 
 Or you can run Tika in a separate process, or even on a separate
 machine, wrapped with something to cope if it dies due to some horrible
 input...we generally avoid document format translation within Solr and
 do it externally before feeding documents to Solr.
 
 Charlie
 
  If you have a lot of PDF like that try pdftotext in HTML and UTF-8 output
  mode.   The HTML mode captures some meta-data that would otherwise be
  lost.
  
  
  If you need to go faster still, you can  also write some stuff linked
  directly against poppler library.
  
   Before you jump down my throat about Tika being slow - I wrote a PDF
   indexer that ran at 36 MB/s per core.   Different indexer, all C, lots of
   setjmp/longjmp.   But fast...
  
  On Thu, Jan 15, 2015 at 1:54 PM, ganesh.ya...@sungard.com wrote:
  Siegfried and Michael Thank you for your replies and help.
  
  -Original Message-
  From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
  Sent: Thursday, January 15, 2015 3:45 AM
  To: solr-user@lucene.apache.org
  Subject: Re: OutOfMemoryError for PDF document upload into Solr
  
  Hi Ganesh,
  
  you can increase the heap size but parsing a 4 GB PDF document will very
  likely consume A LOT OF memory - I think you need to check if that large
  PDF can be parsed at all :-)
  
  Cheers,
  
  Siegfried Goeschl
  
  On 14.01.15 18:04, Michael Della Bitta wrote:
  Yep, you'll have to increase the heap size for your Tomcat container.
  
  http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial
  -heap-size-correctly
  
  Michael Della Bitta
  
  Senior Software Engineer
  
  o: +1 646 532 3062
  
  appinions inc.
  
  “The Science of Influence Marketing”
  
  18 East 41st Street
  
  New York, NY 10017
  
  t: @appinions https://twitter.com/Appinions | g+:
  plus.google.com/appinions
  https://plus.google.com/u/0/b/112002776285509593336/11200277628550959
  3336/posts
  w: appinions.com http://www.appinions.com/
  
  On Wed, Jan 14, 2015 at 12:00 PM, ganesh.ya...@sungard.com wrote:
  Hello,
  
  Can someone pass on the hints to get around following error? Is there
  any Heap Size parameter I can set in Tomcat or in Solr webApp that
  gets deployed in Solr?
  
  I am running Solr webapp inside Tomcat on my local machine which has
  RAM of 12 GB. I have PDF document which is 4 GB max in size that
  needs to be loaded into Solr
  
  
  
  
  Exception in thread http-apr-8983-exec-6 java.lang.OutOfMemoryError: Java heap space
  
at java.util.AbstractCollection.toArray(Unknown Source)
at java.util.ArrayList.init(Unknown Source)
at
  
  org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
  
at
  
  org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
  
at
  
  org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
  
at
  
  org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
  
at
  
  org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
  
at
  
  org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
  
at
  
  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  
at
  
  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  
at
  
  org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120
  )
  
at
  
  org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(Extracti
  ngDocumentLoader.java:219) 
at
  
  org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Conten
  tStreamHandlerBase.java:74) 
at
  
  org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBa
  se.java:135) 
at
  
  org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequ
  est(RequestHandlers.java:246) 
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at
  
  org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.jav
  a:777) 
at
  
  org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.ja
  va:418) 
at
  
  org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.ja
  va:207) 
at
  
  org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applicat
  ionFilterChain.java:241) 
at
  
  org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilte
  rChain.java:208) 
at
  
  org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve
  .java:220) 
at
  
  

RE: American /British Dictionary for solr-4.10.2

2015-02-12 Thread Markus Jelsma
There are no dictionaries that sum up all possible conjugations; a 
heuristics-based normalizer would be more appropriate. There are nevertheless 
some good sources to start with:

Contains lots of useful spelling issues, incl. 
british/american/canadian/australian
http://grammarist.com/spelling

Very useful
http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences#Acronyms_and_abbreviations

A handy list
http://www.avko.org/free/reference/british-vs-american-spelling.html

There are some more lists but it seems the other one's tab is no longer open!
Good luck

-Original message-
 From:dinesh naik dineshkumarn...@gmail.com
 Sent: Thursday 12th February 2015 7:17
 To: solr-user@lucene.apache.org
 Subject: American /British Dictionary for solr-4.10.2
 
 Hi,
 
 What are the dictionaries available for Solr 4.10.2?
 We are looking for a dictionary to support American/British English synonym.
 
 
 -- 
 Best Regards,
 Dinesh Naik
 


RE: unusually high 4.10.2 vs 4.3.1 RAM consumption

2015-02-17 Thread Markus Jelsma
We have seen an increase between 4.8.1 and 4.10. 
 
-Original message-
 From:Dmitry Kan solrexp...@gmail.com
 Sent: Tuesday 17th February 2015 11:06
 To: solr-user@lucene.apache.org
 Subject: unusually high 4.10.2 vs 4.3.1 RAM consumption
 
 Hi,
 
 We are currently comparing the RAM consumption of two parallel Solr
 clusters with different solr versions: 4.10.2 and 4.3.1.
 
 For comparable index sizes of a shard (20G and 26G), we observed 9G vs 5.6G
 RAM footprint (reserved RAM as seen by top), 4.3.1 being the winner.
 
 We have not changed the solrconfig.xml to upgrade to 4.10.2 and have
 reindexed data from scratch. The commits are all controlled on the client,
 i.e. not auto-commits.
 
 Solr: 4.10.2 (high load, mass indexing)
 Java: 1.7.0_76 (Oracle)
 -Xmx25600m
 
 
 Solr: 4.3.1 (normal load, no mass indexing)
 Java: 1.7.0_11 (Oracle)
 -Xmx25600m
 
 The RAM consumption remained the same after the load has stopped on the
 4.10.2 cluster. Manually collecting the memory on a 4.10.2 shard via
 jvisualvm dropped the used RAM from 8,5G to 0,5G. But the reserved RAM as
 seen by top remained at 9G level.
 
 This unusual spike happened during mass data indexing.
 
 What else could be the artifact of such a difference -- Solr or JVM? Can it
 only be explained by the mass indexing? What is worrisome is that the
 4.10.2 shard reserves 8x times it uses.
 
 What can be done about this?
 
 -- 
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info
 


RE: unusually high 4.10.2 vs 4.3.1 RAM consumption

2015-02-17 Thread Markus Jelsma
I would have shared it if I had one :)  
 
-Original message-
 From:Dmitry Kan solrexp...@gmail.com
 Sent: Tuesday 17th February 2015 11:40
 To: solr-user@lucene.apache.org
 Subject: Re: unusually high 4.10.2 vs 4.3.1 RAM consumption
 
 Have you found an explanation to that?
 
 On Tue, Feb 17, 2015 at 12:12 PM, Markus Jelsma markus.jel...@openindex.io
 wrote:
 
  We have seen an increase between 4.8.1 and 4.10.
 
  -Original message-
   From:Dmitry Kan solrexp...@gmail.com
   Sent: Tuesday 17th February 2015 11:06
   To: solr-user@lucene.apache.org
   Subject: unusually high 4.10.2 vs 4.3.1 RAM consumption
  
   Hi,
  
   We are currently comparing the RAM consumption of two parallel Solr
   clusters with different solr versions: 4.10.2 and 4.3.1.
  
   For comparable index sizes of a shard (20G and 26G), we observed 9G vs
  5.6G
   RAM footprint (reserved RAM as seen by top), 4.3.1 being the winner.
  
   We have not changed the solrconfig.xml to upgrade to 4.10.2 and have
   reindexed data from scratch. The commits are all controlled on the
  client,
   i.e. not auto-commits.
  
   Solr: 4.10.2 (high load, mass indexing)
   Java: 1.7.0_76 (Oracle)
   -Xmx25600m
  
  
   Solr: 4.3.1 (normal load, no mass indexing)
   Java: 1.7.0_11 (Oracle)
   -Xmx25600m
  
   The RAM consumption remained the same after the load has stopped on the
   4.10.2 cluster. Manually collecting the memory on a 4.10.2 shard via
   jvisualvm dropped the used RAM from 8,5G to 0,5G. But the reserved RAM as
   seen by top remained at 9G level.
  
   This unusual spike happened during mass data indexing.
  
   What else could be the artifact of such a difference -- Solr or JVM? Can
  it
   only be explained by the mass indexing? What is worrisome is that the
   4.10.2 shard reserves 8x times it uses.
  
   What can be done about this?
  
   --
   Dmitry Kan
   Luke Toolbox: http://github.com/DmitryKey/luke
   Blog: http://dmitrykan.blogspot.com
   Twitter: http://twitter.com/dmitrykan
   SemanticAnalyzer: www.semanticanalyzer.info
  
 
 
 
 
 -- 
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info
 


Distributed unit tests and SSL doesn't have a valid keystore

2015-01-12 Thread Markus Jelsma
Hi - in a small Maven project depending on Solr 4.10.3, unit tests that extend 
BaseDistributedSearchTestCase randomly fail with "SSL doesn't have a valid 
keystore" and a lot of zombie threads. We have a solrtest.keystore file lying 
around, but where to put it?

Thanks,
Markus


RE: Extending solr analysis in index time

2015-01-12 Thread Markus Jelsma
Hi - since you mention having a list of important terms, using payloads would be 
the most straightforward approach, I suppose. You still need a custom similarity 
and a custom query parser. Payloads work very well for us.

M

 
 
-Original message-
 From:Ahmet Arslan iori...@yahoo.com.INVALID
 Sent: Monday 12th January 2015 19:50
 To: solr-user@lucene.apache.org
 Subject: Re: Extending solr analysis in index time
 
 Hi Ali,
 
 Reading your example, if you could somehow replace idf component with your 
 importance weight,
 I think your use case looks like TFIDFSimilarity. Tf component remains same.
 
 https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
 
 I also suggest you ask this in lucene mailing list. Someone familiar with 
 similarity package can give insight on this.
 
 Ahmet
 
 
 
 On Monday, January 12, 2015 6:54 PM, Jack Krupansky 
 jack.krupan...@gmail.com wrote:
 Could you clarify what you mean by Lucene reverse index? That's not a
 term I am familiar with.
 
 -- Jack Krupansky
 
 
 On Mon, Jan 12, 2015 at 1:01 AM, Ali Nazemian alinazem...@gmail.com wrote:
 
  Dear Jack,
  Thank you very much.
  Yeah I was thinking of function query for sorting, but I have to problems
  in this case, 1) function query do the process at query time which I dont
  want to. 2) I also want to have the score field for retrieving and showing
  to users.
 
  Dear Alexandre,
  Here is some more explanation about the business behind the question:
  I am going to provide a field for each document, lets refer it as
  document_score. I am going to fill this field based on the information
  that could be extracted from Lucene reverse index. Assume I have a list of
  terms, called important terms and I am going to extract the term frequency
  for each of the terms inside this list per each document. To be honest I
  want to use the term frequency for calculating document_score.
  document_score should be storable since I am going to retrieve this field
  for each document. I also want to do sorting on document_store in case of
  preferred by user.
  I hope I did convey my point.
  Best regards.
 
 
  On Mon, Jan 12, 2015 at 12:53 AM, Jack Krupansky jack.krupan...@gmail.com
  
  wrote:
 
   Won't function queries do the job at query time? You can add or multiply
   the tf*idf score by a function of the term frequency of arbitrary terms,
   using the tf, mul, and add functions.
  
   See:
   https://cwiki.apache.org/confluence/display/solr/Function+Queries
  
   -- Jack Krupansky
  
   On Sun, Jan 11, 2015 at 10:55 AM, Ali Nazemian alinazem...@gmail.com
   wrote:
  
Dear Jack,
Hi,
I think you misunderstood my need. I dont want to change the default
scoring behavior of Lucene (tf-idf) I just want to have another field
  to
   do
sorting for some specific queries (not all the search business),
  however
   I
am aware of Lucene payload.
Thank you very much.
   
On Sun, Jan 11, 2015 at 7:15 PM, Jack Krupansky 
   jack.krupan...@gmail.com
wrote:
   
 You would do that with a custom similarity (scoring) class. That's an
 expert feature. In fact a SUPER-expert feature.

 Start by completely familiarizing yourself with how TF*IDF
  similarity
 already works:


   
  
  http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

 And to use your custom similarity class in Solr:


   
  
  https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity


 -- Jack Krupansky

 On Sun, Jan 11, 2015 at 9:04 AM, Ali Nazemian alinazem...@gmail.com
  
 wrote:

  Hi everybody,
 
  I am going to add some analysis to Solr at the index time. Here is
what I
  am considering in my mind:
  Suppose I have two different fields for Solr schema, field a and
field
  b. I am going to use the created reverse index in a way that some
terms
  are considered as important ones and tell lucene to calculate a
  value
 based
  on these terms frequency per each document. For example let the
  word
  hello considered as important word with the weight of 2.0.
   Suppose
 the
  term frequency for this word at field a is 3 and at field b is
  6
for
  document 1. Therefor the score value would be 2*3+(2*6)^2. I want
  to
  calculate this score based on these fields and put it in the index
   for
  retrieving. My question would be how can I do such thing? First I
  did
  consider using term component for calculating this value from
  outside
and
  put it back to Solr index, but it seems it is not efficient enough.
 
  Thank you very much.
  Best regards.
 
  --
  A.Nazemian
 

   
   
   
--
A.Nazemian
   
  
 
 
 
  --
  A.Nazemian
 
 


RE: Distributed unit tests and SSL doesn't have a valid keystore

2015-01-13 Thread Markus Jelsma
Thanks, we will suppress it for now!
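
For reference, a minimal sketch of how the annotation is typically applied; the class name and test body are hypothetical:

import org.apache.solr.BaseDistributedSearchTestCase;
import org.apache.solr.SolrTestCaseJ4.SuppressSSL;

// Disables the randomized SSL setup that otherwise needs a valid keystore.
@SuppressSSL
public class MyDistributedSearchTest extends BaseDistributedSearchTestCase {
  @Override
  public void doTest() throws Exception {
    // index a few documents, then compare shard and control responses here
  }
}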
M. 
 
-Original message-
 From:Mark Miller markrmil...@gmail.com
 Sent: Monday 12th January 2015 19:25
 To: solr-user@lucene.apache.org
 Subject: Re: Distributed unit tests and SSL doesn't have a valid keystore
 
 I'd have to do some digging. Hossman might know offhand. You might just
 want to use @SupressSSL on the tests :)
 
 - Mark
 
 On Mon Jan 12 2015 at 8:45:11 AM Markus Jelsma markus.jel...@openindex.io
 wrote:
 
  Hi - in a small Maven project depending on Solr 4.10.3, running unit tests
  that extend BaseDistributedSearchTestCase randomly fail with SSL doesn't
  have a valid keystore, and a lot of zombie threads. We have a
  solrtest.keystore file laying around, but where to put it?
 
  Thanks,
  Markus
 
 


RE: multiple patterns in solr.PatternTokenizerFactory

2015-02-09 Thread Markus Jelsma
You can split into all groups by specifying group=-1. 
 
-Original message-
 From:Nivedita nivedita.pa...@tcs.com
 Sent: Monday 9th February 2015 12:08
 To: solr-user@lucene.apache.org
 Subject: multiple patterns in solr.PatternTokenizerFactory
 
 Can I give multiple patterns in
 
  <tokenizer class="solr.PatternTokenizerFactory"
 pattern="(SKU|Part(\sNumber)?):?\s(\[0-9-\]+)" group="3"/>
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/multiple-patterns-in-solr-PatternTokenizerFactory-tp4184986.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


RE: Upgrading Solr 4.7.2 to 4.10.3

2015-02-10 Thread Markus Jelsma
Well, the CHANGES.txt is filled with just the right information you need :) 
 
-Original message-
 From:Elan Palani elan.pal...@kaybus.com
 Sent: Tuesday 10th February 2015 22:30
 To: solr-user@lucene.apache.org
 Subject: Upgrading Solr 4.7.2 to 4.10.3
 
 Team.. 
 
 Planning to upgrade Solr from 4.7.2 to 4.10.3. I just went through the 
 documentation; it seems like a straightforward download/install. 
 
 Anything specifically issues I should look for?
 
 Any help will be appreciated.
 
 Thanks 
 
 Elan
 
 
 
 
 


RE: Relevancy : Keyword stuffing

2015-03-16 Thread Markus Jelsma
Hello - setting (e)dismax's tie breaker to 0, or much lower than the default, 
would `solve` this for now.
Markus 
 
-Original message-
 From:Mihran Shahinian slowmih...@gmail.com
 Sent: Monday 16th March 2015 16:29
 To: solr-user@lucene.apache.org
 Subject: Relevancy : Keyword stuffing
 
 Hi all,
 I have a use case where the data is generated by SEO minded authors and
 more often than not
 they perfectly guess the synonym expansions for the document titles skewing
 results in their favor.
 At the moment I don't have an offline processing infrastructure to detect
 these (I can't punish these docs either... just have to level the playing
 field).
 I am experimenting with taking the max of the term scores, cutting off
 scores after certain number of terms,etc but would appreciate any hints if
 anyone has experience dealing with a similar use case in solr.
 
 Much appreciated,
 Mihran
 


RE: Relevancy : Keyword stuffing

2015-03-16 Thread Markus Jelsma
Hello - Chris' suggestion is indeed a good one but it can be tricky to properly 
configure the parameters. Regarding position information, you can override 
dismax to have it use SpanFirstQuery. It allows for setting strict boundaries 
from the front of the document to a given position. You can also override 
SpanFirstQuery to incorporate a gradient, to decrease boosting as distance from 
the front increases.
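
As a rough illustration, a plain Lucene sketch of the SpanFirstQuery idea; the field, term and the position window are assumptions:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanFirstSketch {
  // Match the term only within the first `window` positions of the field, so
  // terms stuffed near the end of a document do not contribute to the score.
  public static Query withinFirstPositions(String field, String text, int window) {
    return new SpanFirstQuery(new SpanTermQuery(new Term(field, text)), window);
  }
}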

I don't know how you ingest document bodies, but if they are unstructured HTML, 
you may want to install proper main content extraction if you haven't already. 
Having decent control over HTML is a powerful tool.

You may also want to look at Lucene's BM25 implementation. It is simple to set 
up and easier to control. It isn't as rough a tool as TFIDF with regard to 
length normalization. Plus it allows you to smooth TF, which in your case 
should also help.

If you like to scrutinize SSS and get some proper results, you are more than 
welcome to share them here :)

Markus
 
-Original message-
 From:Mihran Shahinian slowmih...@gmail.com
 Sent: Monday 16th March 2015 22:41
 To: solr-user@lucene.apache.org
 Subject: Re: Relevancy : Keyword stuffing
 
 Thank you Markus and Chris, for pointers.
 For SweetSpotSimilarity I am thinking perhaps a set of closed ranges
 exposed via similarity config is easier to maintain as data changes than
 making adjustments to fit a
 function. Another piece of info would've been handy is to know the average
 position info + position info for the first few occurrences for each term.
 This would allow
 perhaps higher boosting for term occurrences earlier in the doc. In my case
 extra keywords are towards the end of the doc,but that info does not seem
 to be propagated into scorer.
 Thanks again,
 Mihran
 
 
 
 On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter hossman_luc...@fucit.org
 wrote:
 
 
  You should start by checking out the SweetSpotSimilarity .. it was
  heavily designed arround the idea of dealing with things like excessively
  verbose titles, and keyword stuffing in summary text ... so you can
  configure your expectation for what a normal length doc is, and they
  will be penalized for being longer then that.  similarly you can say what
  a 'resaonable' tf is, and docs that exceed that would't get added boost
  (which in conjunction with teh lengthNorm penality penalizes docs that
  stuff keywords)
 
 
  https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
 
 
  https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
 
  https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
 
 
  -Hoss
  http://www.lucidworks.com/
 
 


RE: Distributed IDF performance

2015-03-18 Thread Markus Jelsma
Anshum, Jack - don't any of you have a cluster at hand to get some real results 
on this? After testing the actual functionality for quite some time while the 
final patch was in development, we have not had the chance to work on 
performance tests. We are still on Solr 4.10 and have to port lots of Lucene 
stuff to 5. I would sure like to see some numbers from any of you :)

Markus
 
 
-Original message-
 From:Anshum Gupta ans...@anshumgupta.net
 Sent: Friday 13th March 2015 23:33
 To: solr-user@lucene.apache.org
 Subject: Re: Distributed IDF performance
 
 np!
 
 I forgot to mention that I didn't notice any considerable performance hit
 in my tests. The QTimes were barely off by 5%.
 
 On Fri, Mar 13, 2015 at 3:13 PM, Jack Krupansky jack.krupan...@gmail.com
 wrote:
 
  Oops... I said StatsInfo and that should have been StatsCache
  (statsCache .../).
 
  -- Jack Krupansky
 
  On Fri, Mar 13, 2015 at 6:04 PM, Anshum Gupta ans...@anshumgupta.net
  wrote:
 
   There's no rough formula or performance data that I know of at this
  point.
   About he guidance, if you want to use Global stats, my obvious choice
  would
   be to use the LRUStatsCache.
   Before committing, I did run some tests on my macbook but as I said back
   then, they shouldn't be totally taken at face value. The tests didn't
   involve any network and were just about 20mn docs and synthetic queries.
  
   On Fri, Mar 13, 2015 at 2:08 PM, Jack Krupansky 
  jack.krupan...@gmail.com
   wrote:
  
Does anybody have any actual performance data or even a rough formula
  for
calculating the overhead for using the new Solr 5.0 Distributed IDF (
SOLR-1632 https://issues.apache.org/jira/browse/SOLR-1632)?
   
And any guidance as far as which StatsInfo plugin is best to use?
   
Are many people now using Distributed IDF as their default?
   
I'm not currently using this, but the existing doc and Jira is too
   minimal
to offer guidance as requested above. Mostly I'm just curious.
   
Thanks.
   
-- Jack Krupansky
   
  
  
  
   --
   Anshum Gupta
  
 
 
 
 
 -- 
 Anshum Gupta
 

