Re: Dead node, but clusterstate.json says active, won't sync on restart

2014-01-29 Thread Greg Preston
"If you removed the tlog and index and restart it should resync, or
something is really crazy."

It doesn't, or at least if it tries, it's somehow failing.  I'd be ok with
the sync failing for some reason if the node wasn't also serving queries.


-Greg


On Tue, Jan 28, 2014 at 11:10 AM, Mark Miller markrmil...@gmail.com wrote:

 Sounds like a bug. 4.6.1 is out any minute - you might try that. There was
 a replication bug that may be involved.

 If you removed the tlog and index and restart it should resync, or
 something is really crazy.

 The clusterstate.json is a red herring. You have to merge the live nodes
 info with the state to know the real state.
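
 A rough SolrJ sketch of that merge, for illustration only (Solr 4.x API; the
 collection name is hypothetical and it assumes an existing CloudSolrServer):

   ZkStateReader reader = cloudSolrServer.getZkStateReader();
   ClusterState cs = reader.getClusterState();
   for (Slice slice : cs.getSlices("collection1")) {
     for (Replica replica : slice.getReplicas()) {
       boolean active = "active".equals(replica.getStr(ZkStateReader.STATE_PROP));
       boolean live = cs.liveNodesContain(replica.getStr(ZkStateReader.NODE_NAME_PROP));
       // the replica is only really usable when both active and live are true
     }
   }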

 - Mark

 http://www.about.me/markrmiller

  On Jan 28, 2014, at 12:31 PM, Greg Preston gpres...@marinsoftware.com
 wrote:
 
  ** Using solrcloud 4.4.0 **
 
  I had to kill a running solrcloud node.  There is still a replica for
 that
  shard, so everything is functional.  We've done some indexing while the
  node was killed.
 
  I'd like to bring back up the downed node and have it resync from the
 other
  replica.  But when I restart the downed node, it joins back up as active
  immediately, and doesn't resync.  I even wiped the data directory on the
  downed node, hoping that would force it to sync on restart, but it
 doesn't.
 
  I'm assuming this is related to the state still being listed as active in
  clusterstate.json for the downed node?  Since it comes back as active,
 it's
  serving queries and giving old results.
 
  How can I force this node to do a recovery on restart?
 
  Thanks.
 
 
  -Greg



Re: Dead node, but clusterstate.json says active, won't sync on restart

2014-01-29 Thread Greg Preston
I've attached the log of the downed node (truffle-solr-4).
This is the relevant log entry from the node it should replicate from
(truffle-solr-5):

[29 Jan 2014 19:31:29] [qtp1614415528-74] ERROR
(org.apache.solr.common.SolrException) -
org.apache.solr.common.SolrException: I was asked to wait on state
recovering for truffle-solr-4:8983_solr but I still do not see the
requested state. I see state: active live:true
at
org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:966)
at
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:191)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:611)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)

You can see that 4 is serving queries.  It appears that 4 tries to recover
from 5, but 5 is confused about the state of 4?  4 had an empty index and
tlog when it was started.

We will eventually upgrade to 4.6.x or 4.7.x, but we've got a pretty
extensive regression testing cycle, so there is some delay in upgrading
versions.



-Greg


On Wed, Jan 29, 2014 at 9:08 AM, Mark Miller markrmil...@gmail.com wrote:

 What's in the logs of the node that won't recover on restart after
 clearing the index and tlog

 - Mark

 On Jan 29, 2014, at 11:41 AM, Greg Preston gpres...@marinsoftware.com
 wrote:

  If you removed the tlog and index and restart it should resync, or
  something is really crazy.
 
  It doesn't, or at least if it tries, it's somehow failing.  I'd be ok
 with
  the sync failing for some reason if the node wasn't also serving queries.
 
 
  -Greg
 
 
  On Tue, Jan 28, 2014 at 11:10 AM, Mark Miller markrmil...@gmail.com
 wrote:
 
  Sounds like a bug. 4.6.1 is out any minute - you might try that. There
 was
  a replication bug that may be involved.
 
  If you removed the tlog and index and restart it should resync, or
  something is really crazy.
 
  The clusterstate.json is a red herring. You have to merge the live nodes
  info with the state to know the real state.
 
  - Mark
 
  http://www.about.me/markrmiller
 
  On Jan 28, 2014, at 12:31 PM, Greg Preston 
 gpres...@marinsoftware.com
  wrote:
 
  ** Using solrcloud 4.4.0 **
 
  I had to kill a running solrcloud node.  There is still a replica for
  that
  shard, so everything is functional.  We've done some indexing while the
  node was killed.
 
  I'd like to bring back up the downed node and have it resync from the
  other
  replica.  But when I restart the downed node, it joins back up as
 active
  immediately, and doesn't resync.  I even wiped the data directory on
 the
  downed node, hoping that would force it to sync on restart, but it
  doesn't.
 
  I'm assuming this is related to the state still being listed as active
 in
  clusterstate.json for the downed node?  Since it comes back as active,
  it's
  serving queries and giving old results.
 
  How can I force this node to do a recovery on restart?
 
  Thanks.
 
 
  -Greg
 

[29 Jan 2014 19:28:57] [main] INFO  (org.eclipse.jetty.server.Server) - jetty-8.1.10.v20130312
[29 Jan 2014 19:28:57] [main] INFO  (org.eclipse.jetty.deploy.providers.ScanningAppProvider) - Deployment monitor /home/solr/solr/solr-4.4.0/example/contexts at interval 0
[29 Jan 2014 19:28:57] [main] INFO  (org.eclipse.jetty.deploy.DeploymentManager) - Deployable added: /home/solr/solr/solr-4.4.0/example/contexts/solr-jetty-context.xml
[29 Jan 2014 19:28:58] [main] INFO

Dead node, but clusterstate.json says active, won't sync on restart

2014-01-28 Thread Greg Preston
** Using solrcloud 4.4.0 **

I had to kill a running solrcloud node.  There is still a replica for that
shard, so everything is functional.  We've done some indexing while the
node was killed.

I'd like to bring back up the downed node and have it resync from the other
replica.  But when I restart the downed node, it joins back up as active
immediately, and doesn't resync.  I even wiped the data directory on the
downed node, hoping that would force it to sync on restart, but it doesn't.

I'm assuming this is related to the state still being listed as active in
clusterstate.json for the downed node?  Since it comes back as active, it's
serving queries and giving old results.

How can I force this node to do a recovery on restart?

Thanks.


-Greg


Re: Dead node, but clusterstate.json says active, won't sync on restart

2014-01-28 Thread Greg Preston
Thanks for the idea.  I tried it, and the state for the bad node, even
after an orderly shutdown, is still active in clusterstate.json.  I see
this in the logs on restart:

[28 Jan 2014 18:25:29] [RecoveryThread] ERROR
(org.apache.solr.common.SolrException) - Error while trying to recover.
core=marin:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
I was asked to wait on state recovering for truffle-solr-4:8983_solr but I
still do not see the requested state. I see state: active live:true
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
at
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
at
org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)





-Greg


On Tue, Jan 28, 2014 at 9:53 AM, Shawn Heisey s...@elyograg.org wrote:

 On 1/28/2014 10:31 AM, Greg Preston wrote:

 ** Using solrcloud 4.4.0 **

 I had to kill a running solrcloud node.  There is still a replica for that
 shard, so everything is functional.  We've done some indexing while the
 node was killed.

 I'd like to bring back up the downed node and have it resync from the
 other
 replica.  But when I restart the downed node, it joins back up as active
 immediately, and doesn't resync.  I even wiped the data directory on the
 downed node, hoping that would force it to sync on restart, but it
 doesn't.

 I'm assuming this is related to the state still being listed as active in
 clusterstate.json for the downed node?  Since it comes back as active,
 it's
 serving queries and giving old results.

 How can I force this node to do a recovery on restart?


 This might be completely wrong, but hopefully it will help you: Perhaps a
 graceful stop of that node will result in the proper clusterstate so it
 will work the next time it's started? That may already be what you've done,
 so this may not help at all ... but you did say "kill" which might mean
 that it wasn't a clean shutdown of Solr.

 Thanks,
 Shawn




Re: Possible memory leak after segment merge? (related to DocValues?)

2013-12-30 Thread Greg Preston
That was it.  Setting omitNorms=true on all fields fixed my problem.
 I left it indexing all weekend, and heap usage still looks great.

I'm still not clear why bouncing the solr instance freed up memory,
unless the in-memory structure for this norms data is lazily loaded
somehow.
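
For reference, norms are a per-field (or per-fieldType) setting in schema.xml;
a minimal sketch with a hypothetical field name (a reindex is needed for the
change to take effect):

  <field name="title" type="text_general" indexed="true" stored="true"
         omitNorms="true"/>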

Anyway, thank you very much for the suggestion.

-Greg


On Fri, Dec 27, 2013 at 4:25 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 Likely this is for field norms, which use doc values under the hood.

 Mike McCandless

 http://blog.mikemccandless.com


 On Thu, Dec 26, 2013 at 5:03 PM, Greg Preston
 gpres...@marinsoftware.com wrote:
 Does anybody with knowledge of solr internals know why I'm seeing
 instances of Lucene42DocValuesProducer when I don't have any fields
 that are using DocValues?  Or am I misunderstanding what this class is
 for?

 -Greg


 On Mon, Dec 23, 2013 at 12:07 PM, Greg Preston
 gpres...@marinsoftware.com wrote:
 Hello,

 I'm loading up our solr cloud with data (from a solrj client) and
 running into a weird memory issue.  I can reliably reproduce the
 problem.

 - Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
 - 24 solr nodes (one shard each), spread across 3 physical hosts, each
 host has 256G of memory
 - index and tlogs on ssd
 - Xmx=7G, G1GC
 - Java 1.7.0_25
 - schema and solrconfig.xml attached

 I'm using composite routing to route documents with the same clientId
 to the same shard.  After several hours of indexing, I occasionally
 see an IndexWriter go OOM.  I think that's a symptom.  When that
 happens, indexing continues, and that node's tlog starts to grow.
 When I notice this, I stop indexing, and bounce the problem node.
 That's where it gets interesting.

 Upon bouncing, the tlog replays, and then segments merge.  Once the
 merging is complete, the heap is fairly full, and forced full GC only
 helps a little.  But if I then bounce the node again, the heap usage
 goes way down, and stays low until the next segment merge.  I believe
 segment merges are also what causes the original OOM.

 More details:

 Index on disk for this node is ~13G, tlog is ~2.5G.
 See attached mem1.png.  This is a jconsole view of the heap during the
 following:

 (Solr cloud node started at the left edge of this graph)

 A) One CPU core pegged at 100%.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
 at 
 org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
 at 
 org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
 at 
 org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
 at 
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
 at 
 org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
 at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

 B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
 memory freed.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
 at 
 org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
 at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
 at 
 org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
 at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

 C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
 freed.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
 at 
 org.apache.lucene.codecs.lucene42

Re: Possible memory leak after segment merge? (related to DocValues?)

2013-12-27 Thread Greg Preston
Interesting.  I'm not using score at all (all searches have an
explicit sort defined).  I'll try setting omit norms on all my fields
and see if I can reproduce.

Thanks.

-Greg


On Fri, Dec 27, 2013 at 4:25 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 Likely this is for field norms, which use doc values under the hood.

 Mike McCandless

 http://blog.mikemccandless.com


 On Thu, Dec 26, 2013 at 5:03 PM, Greg Preston
 gpres...@marinsoftware.com wrote:
 Does anybody with knowledge of solr internals know why I'm seeing
 instances of Lucene42DocValuesProducer when I don't have any fields
 that are using DocValues?  Or am I misunderstanding what this class is
 for?

 -Greg


 On Mon, Dec 23, 2013 at 12:07 PM, Greg Preston
 gpres...@marinsoftware.com wrote:
 Hello,

 I'm loading up our solr cloud with data (from a solrj client) and
 running into a weird memory issue.  I can reliably reproduce the
 problem.

 - Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
 - 24 solr nodes (one shard each), spread across 3 physical hosts, each
 host has 256G of memory
 - index and tlogs on ssd
 - Xmx=7G, G1GC
 - Java 1.7.0_25
 - schema and solrconfig.xml attached

 I'm using composite routing to route documents with the same clientId
 to the same shard.  After several hours of indexing, I occasionally
 see an IndexWriter go OOM.  I think that's a symptom.  When that
 happens, indexing continues, and that node's tlog starts to grow.
 When I notice this, I stop indexing, and bounce the problem node.
 That's where it gets interesting.

 Upon bouncing, the tlog replays, and then segments merge.  Once the
 merging is complete, the heap is fairly full, and forced full GC only
 helps a little.  But if I then bounce the node again, the heap usage
 goes way down, and stays low until the next segment merge.  I believe
 segment merges are also what causes the original OOM.

 More details:

 Index on disk for this node is ~13G, tlog is ~2.5G.
 See attached mem1.png.  This is a jconsole view of the heap during the
 following:

 (Solr cloud node started at the left edge of this graph)

 A) One CPU core pegged at 100%.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
 at 
 org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
 at 
 org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
 at 
 org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
 at 
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
 at 
 org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
 at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

 B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
 memory freed.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
 at 
 org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
 at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
 at 
 org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
 at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

 C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
 freed.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField

Re: Possible memory leak after segment merge? (related to DocValues?)

2013-12-26 Thread Greg Preston
Does anybody with knowledge of solr internals know why I'm seeing
instances of Lucene42DocValuesProducer when I don't have any fields
that are using DocValues?  Or am I misunderstanding what this class is
for?

-Greg


On Mon, Dec 23, 2013 at 12:07 PM, Greg Preston
gpres...@marinsoftware.com wrote:
 Hello,

 I'm loading up our solr cloud with data (from a solrj client) and
 running into a weird memory issue.  I can reliably reproduce the
 problem.

 - Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
 - 24 solr nodes (one shard each), spread across 3 physical hosts, each
 host has 256G of memory
 - index and tlogs on ssd
 - Xmx=7G, G1GC
 - Java 1.7.0_25
 - schema and solrconfig.xml attached

 I'm using composite routing to route documents with the same clientId
 to the same shard.  After several hours of indexing, I occasionally
 see an IndexWriter go OOM.  I think that's a symptom.  When that
 happens, indexing continues, and that node's tlog starts to grow.
 When I notice this, I stop indexing, and bounce the problem node.
 That's where it gets interesting.

 Upon bouncing, the tlog replays, and then segments merge.  Once the
 merging is complete, the heap is fairly full, and forced full GC only
 helps a little.  But if I then bounce the node again, the heap usage
 goes way down, and stays low until the next segment merge.  I believe
 segment merges are also what causes the original OOM.

 More details:

 Index on disk for this node is ~13G, tlog is ~2.5G.
 See attached mem1.png.  This is a jconsole view of the heap during the
 following:

 (Solr cloud node started at the left edge of this graph)

 A) One CPU core pegged at 100%.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
 at 
 org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
 at 
 org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
 at 
 org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
 at 
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
 at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

 B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
 memory freed.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
 at 
 org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
 at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
 at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

 C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
 freed.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
 at 
 org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
 at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
 at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376

Possible memory leak after segment merge? (related to DocValues?)

2013-12-23 Thread Greg Preston
Hello,

I'm loading up our solr cloud with data (from a solrj client) and
running into a weird memory issue.  I can reliably reproduce the
problem.

- Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
- 24 solr nodes (one shard each), spread across 3 physical hosts, each
host has 256G of memory
- index and tlogs on ssd
- Xmx=7G, G1GC
- Java 1.7.0_25
- schema and solrconfig.xml attached

I'm using composite routing to route documents with the same clientId
to the same shard.  After several hours of indexing, I occasionally
see an IndexWriter go OOM.  I think that's a symptom.  When that
happens, indexing continues, and that node's tlog starts to grow.
When I notice this, I stop indexing, and bounce the problem node.
That's where it gets interesting.

Upon bouncing, the tlog replays, and then segments merge.  Once the
merging is complete, the heap is fairly full, and forced full GC only
helps a little.  But if I then bounce the node again, the heap usage
goes way down, and stays low until the next segment merge.  I believe
segment merges are also what causes the original OOM.

More details:

Index on disk for this node is ~13G, tlog is ~2.5G.
See attached mem1.png.  This is a jconsole view of the heap during the
following:

(Solr cloud node started at the left edge of this graph)

A) One CPU core pegged at 100%.  Thread dump shows:
Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
nid=0x7a74 runnable [0x7f5a41c5f000]
   java.lang.Thread.State: RUNNABLE
at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
at 
org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
at org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
at 
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
memory freed.  Thread dump shows:
Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
nid=0x7a74 runnable [0x7f5a41c5f000]
   java.lang.Thread.State: RUNNABLE
at 
org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
at 
org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
at 
org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
at 
org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
at 
org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
freed.  Thread dump shows:
Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
nid=0x7a74 runnable [0x7f5a41c5f000]
   java.lang.Thread.State: RUNNABLE
at 
org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
at 
org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
at 
org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
at 
org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
at 
org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

D) One CPU core pegged at 100%.  Thread dump shows:
Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
nid=0x7a74 runnable [0x7f5a41c5f000]
   java.lang.Thread.State: RUNNABLE
 

Re: Possible memory leak after segment merge? (related to DocValues?)

2013-12-23 Thread Greg Preston
Hi Joel,

Thanks for the suggestion.  I could see how decreasing autoCommit time
would reduce tlog size, and how that could possibly be related to the
original OOM error.  I'm not seeing how that would make any difference
once a tlog exists, though?

I have a saved off copy of my data dir that has the 13G index and 2.5G
tlog.  So I can reproduce the replay - merge - memory usage issue
very quickly.  Changing the autoCommit to possibly avoid the initial
OOM will take a good bit longer to try to reproduce.  I may try that
later in the week.

-Greg


On Mon, Dec 23, 2013 at 12:20 PM, Joel Bernstein joels...@gmail.com wrote:
 Hi Greg,

 I have a suspicion that the problem might be related to or exacerbated by
 overly large tlogs. Can you adjust your autoCommits to 15 seconds? Leave
 openSearcher = false. I would remove the maxDoc as well. If you try
 rerunning under those commit settings it's possible the OOM errors will stop
 occurring.
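
 In solrconfig.xml that commit policy would look roughly like this (a sketch
 only; maxTime is in milliseconds):

   <autoCommit>
     <maxTime>15000</maxTime>
     <openSearcher>false</openSearcher>
   </autoCommit>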

 Joel

 Joel Bernstein
 Search Engineer at Heliosearch


 On Mon, Dec 23, 2013 at 3:07 PM, Greg Preston 
 gpres...@marinsoftware.com wrote:

 Hello,

 I'm loading up our solr cloud with data (from a solrj client) and
 running into a weird memory issue.  I can reliably reproduce the
 problem.

 - Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
 - 24 solr nodes (one shard each), spread across 3 physical hosts, each
 host has 256G of memory
 - index and tlogs on ssd
 - Xmx=7G, G1GC
 - Java 1.7.0_25
 - schema and solrconfig.xml attached

 I'm using composite routing to route documents with the same clientId
 to the same shard.  After several hours of indexing, I occasionally
 see an IndexWriter go OOM.  I think that's a symptom.  When that
 happens, indexing continues, and that node's tlog starts to grow.
 When I notice this, I stop indexing, and bounce the problem node.
 That's where it gets interesting.

 Upon bouncing, the tlog replays, and then segments merge.  Once the
 merging is complete, the heap is fairly full, and forced full GC only
 helps a little.  But if I then bounce the node again, the heap usage
 goes way down, and stays low until the next segment merge.  I believe
 segment merges are also what causes the original OOM.

 More details:

 Index on disk for this node is ~13G, tlog is ~2.5G.
 See attached mem1.png.  This is a jconsole view of the heap during the
 following:

 (Solr cloud node started at the left edge of this graph)

 A) One CPU core pegged at 100%.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
 at
 org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
 at
 org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
 at
 org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
 at
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
 at
 org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
 at
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
 at
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
 at
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

 B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
 memory freed.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at
 org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
 at
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
 at
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
 at
 org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
 at
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
 at
 org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
 at
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
 at
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
 at
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

 C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
 freed.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE

Re: adding a node to SolrCloud

2013-12-23 Thread Greg Preston
"Yes, I'm well aware of the performance implications, many of which are
mitigated by 2TB of SSD and 512GB RAM"

I've got a very similar setup in production.  2TB SSD, 256G RAM (128G
heaps), and 1 - 1.5 TB of index per node.  We're in the process of
splitting that to multiple JVMs per host.  GC pauses were causing ZK
timeouts (you should up that in solr.xml).  And resyncs after the
timeouts took long enough that a large tlog built up (we have near
continuous indexing), and we couldn't replay the tlog fast enough to
catch up to current.
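
For reference, in the legacy-style solr.xml that ZK timeout is an attribute on
the <cores> element; a rough sketch (the value is illustrative only, not a
recommendation):

  <cores adminPath="/admin/cores" hostPort="${jetty.port:8983}"
         zkClientTimeout="60000">
    ...
  </cores>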

If you're going to have a mostly static index, then it may be less of an issue.

-Greg


On Mon, Dec 23, 2013 at 2:31 AM, David Santamauro
david.santama...@gmail.com wrote:
 On 12/22/2013 09:48 PM, Shawn Heisey wrote:

 On 12/22/2013 2:10 PM, David Santamauro wrote:

 My goal is to have a redundant copy of all 8 currently running, but
 non-redundant shards. This setup (8 nodes with no replicas) was a test
 and it has proven quite functional from a performance perspective.
 Loading, though, takes almost 3 weeks so I'm really not in a position to
 redesign the distribution, though I can add nodes.

 I have acquired another resource, a very large machine that I'd like to
 use to hold the replicas of the currently deployed 8-nodes.

 I realize I can run 8 jetty/tomcats and accomplish my goal but that is a
 maintenance headache and is really a last resort. I really would just
 like to be able to deploy this big machine with 'numShards=8'.

 Is that possible or do I really need to have 8 other nodes running?


 You don't want to run more than one container or Solr instance per
 machine.  Things can get very confused, and it's too much overhead.



 With existing collections, you can simply run the CoreAdmin CREATE

 action on the new node with more resources.

 So you'd do something like this, once for each of the 8 existing parts:


 http://newnode:port/solr/admin/cores?action=CREATE&name=collname_shard1_replica2&collection=collname&shard=shard1

 It will automatically replicate the shard from its current leader.


 Fantastic! Clearly my understanding of collection, vs core vs shard
 was lacking but now I see the relationship better.



 One thing to be aware of: With 1.4TB of index data, it might be
 impossible to keep enough of the index in RAM for good performance,
 unless the machine has a terabyte or more of RAM.


 Yes, I'm well aware of the performance implications, many of which are
 mitigated by 2TB of SSD and 512GB RAM.

 Thanks for the nudge in the right direction. The first node/shard1 is
 replicating right now.

 David





Re: Possible memory leak after segment merge? (related to DocValues?)

2013-12-23 Thread Greg Preston
Interesting.  In my original post, the memory growth (during restart)
occurs after the tlog is done replaying, but during the merge.

-Greg


On Mon, Dec 23, 2013 at 2:06 PM, Joel Bernstein joels...@gmail.com wrote:
 Greg,

 There is a memory component to the tlog, which supports realtime gets. This
 memory component grows until there is a commit, so it will appear like a
 leak. I suspect that replaying a tlog that was big enough to possibly cause
 OOM is also problematic.

 One thing you might want to try is going to 15 second commits, and then
 kill the Solr instance between the commits. Then watch the memory as the
 replaying occurs with the smaller tlog.

 Joel




 Joel Bernstein
 Search Engineer at Heliosearch


 On Mon, Dec 23, 2013 at 4:17 PM, Greg Preston 
 gpres...@marinsoftware.com wrote:

 Hi Joel,

 Thanks for the suggestion.  I could see how decreasing autoCommit time
 would reduce tlog size, and how that could possibly be related to the
 original OOM error.  I'm not seeing how that would make any difference
 once a tlog exists, though?

 I have a saved off copy of my data dir that has the 13G index and 2.5G
 tlog.  So I can reproduce the replay - merge - memory usage issue
 very quickly.  Changing the autoCommit to possibly avoid the initial
 OOM will take a good bit longer to try to reproduce.  I may try that
 later in the week.

 -Greg


 On Mon, Dec 23, 2013 at 12:20 PM, Joel Bernstein joels...@gmail.com
 wrote:
  Hi Greg,
 
  I have a suspicion that the problem might be related to or exacerbated by
  overly large tlogs. Can you adjust your autoCommits to 15 seconds? Leave
  openSearcher = false. I would remove the maxDoc as well. If you try
  rerunning under those commit settings it's possible the OOM errors will
 stop
  occurring.
 
  Joel
 
  Joel Bernstein
  Search Engineer at Heliosearch
 
 
  On Mon, Dec 23, 2013 at 3:07 PM, Greg Preston 
 gpres...@marinsoftware.com wrote:
 
  Hello,
 
  I'm loading up our solr cloud with data (from a solrj client) and
  running into a weird memory issue.  I can reliably reproduce the
  problem.
 
  - Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
  - 24 solr nodes (one shard each), spread across 3 physical hosts, each
  host has 256G of memory
  - index and tlogs on ssd
  - Xmx=7G, G1GC
  - Java 1.7.0_25
  - schema and solrconfig.xml attached
 
  I'm using composite routing to route documents with the same clientId
  to the same shard.  After several hours of indexing, I occasionally
  see an IndexWriter go OOM.  I think that's a symptom.  When that
  happens, indexing continues, and that node's tlog starts to grow.
  When I notice this, I stop indexing, and bounce the problem node.
  That's where it gets interesting.
 
  Upon bouncing, the tlog replays, and then segments merge.  Once the
  merging is complete, the heap is fairly full, and forced full GC only
  helps a little.  But if I then bounce the node again, the heap usage
  goes way down, and stays low until the next segment merge.  I believe
  segment merges are also what causes the original OOM.
 
  More details:
 
  Index on disk for this node is ~13G, tlog is ~2.5G.
  See attached mem1.png.  This is a jconsole view of the heap during the
  following:
 
  (Solr cloud node started at the left edge of this graph)
 
  A) One CPU core pegged at 100%.  Thread dump shows:
  Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
  nid=0x7a74 runnable [0x7f5a41c5f000]
 java.lang.Thread.State: RUNNABLE
  at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
  at
 
 org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
  at
  org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
  at
  org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
  at
  org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
  at
  org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
  at
  org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
  at
 org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
  at
 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
  at
 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
 
  B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
  memory freed.  Thread dump shows:
  Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
  nid=0x7a74 runnable [0x7f5a41c5f000]
 java.lang.Thread.State: RUNNABLE
  at
 
 org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
  at
 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
  at
 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField

Re: adding a node to SolrCloud

2013-12-23 Thread Greg Preston
I believe you can just define multiple cores:

<core default="true" instanceDir="shard1/"
      name="collectionName_shard1" shard="shard1"/>
<core default="true" instanceDir="shard2/"
      name="collectionName_shard2" shard="shard2"/>
...

(this is the old style solr.xml.  I don't know how to do it in the newer style)

Also, make sure you don't define a non-relative dataDir in
solrconfig.xml, or you may run into issues with cores trying to use
the same data dir.
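
For what it's worth, in the newer discovery-style layout the rough equivalent
would be one core.properties file per core directory (a sketch; paths and names
are illustrative):

  collectionName_shard1/core.properties:
    name=collectionName_shard1
    collection=collectionName
    shard=shard1

  collectionName_shard2/core.properties:
    name=collectionName_shard2
    collection=collectionName
    shard=shard2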






-Greg


On Mon, Dec 23, 2013 at 2:16 PM, David Santamauro
david.santama...@gmail.com wrote:
 On 12/23/2013 05:03 PM, Greg Preston wrote:

 Yes, I'm well aware of the performance implications, many of which are
 mitigated by 2TB of SSD and 512GB RAM


 I've got a very similar setup in production.  2TB SSD, 256G RAM (128G
 heaps), and 1 - 1.5 TB of index per node.  We're in the process of
 splitting that to multiple JVMs per host.  GC pauses were causing ZK
 timeouts (you should up that in solr.xml).  And resync's after the
 timeouts took long enough that a large tlog built up (we have near
 continuous indexing), and we couldn't replay the tlog fast enough to
 catch up to current.


 GC pauses are a huge issue in our current production environment (monolithic
 index) and general performance was meager, hence the move to a distributed
 design. We will have 8 nodes with ~ 200GB per node, one shard each and
 performance for single and most multi-term queries has become sub-second and
 throughput has increased 10-fold. Larger boolean queries can still take 2-3s
 but we can live with that.

 At any rate, I still can't figure out what my solr.xml is supposed to look
 like on the node with all 8 redundant shards.

 David



 On Mon, Dec 23, 2013 at 2:31 AM, David Santamauro
 david.santama...@gmail.com wrote:

 On 12/22/2013 09:48 PM, Shawn Heisey wrote:


 On 12/22/2013 2:10 PM, David Santamauro wrote:


 My goal is to have a redundant copy of all 8 currently running, but
 non-redundant shards. This setup (8 nodes with no replicas) was a test
 and it has proven quite functional from a performance perspective.
 Loading, though, takes almost 3 weeks so I'm really not in a position
 to
 redesign the distribution, though I can add nodes.

 I have acquired another resource, a very large machine that I'd like to
 use to hold the replicas of the currently deployed 8-nodes.

 I realize I can run 8 jetty/tomcats and accomplish my goal but that is
 a
 maintenance headache and is really a last resort. I really would just
 like to be able to deploy this big machine with 'numShards=8'.

 Is that possible or do I really need to have 8 other nodes running?



 You don't want to run more than one container or Solr instance per
 machine.  Things can get very confused, and it's too much overhead.




 With existing collections, you can simply run the CoreAdmin CREATE

 action on the new node with more resources.

 So you'd do something like this, once for each of the 8 existing parts:



  http://newnode:port/solr/admin/cores?action=CREATE&name=collname_shard1_replica2&collection=collname&shard=shard1

 It will automatically replicate the shard from its current leader.



 Fantastic! Clearly my understanding of collection, vs core vs shard
 was lacking but now I see the relationship better.



 One thing to be aware of: With 1.4TB of index data, it might be
 impossible to keep enough of the index in RAM for good performance,
 unless the machine has a terabyte or more of RAM.



 Yes, I'm well aware of the performance implications, many of which are
 mitigated by 2TB of SSD and 512GB RAM.

 Thanks for the nudge in the right direction. The first node/shard1 is
 replicating right now.

 David






How to always tokenize on underscore?

2013-09-25 Thread Greg Preston
[Using SolrCloud 4.4.0]

I have a text field where the data will sometimes be delimited by
whitespace, and sometimes by underscore.  For example, both of the
following are possible input values:

Group_EN_1000232142_blah_1000232142abc_foo
Group EN 1000232142 blah 1000232142abc foo

What I'd like to do is have underscores treated as spaces for
tokenization purposes.  I've tried using a PatternReplaceFilterFactory
with:

<fieldType name="text_general" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="_"
        replacement=" " replace="all"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="_"
        replacement=" " replace="all"/>
  </analyzer>
</fieldType>

but that seems to do the pattern replacement on each token, rather
than splitting tokens into multiple tokens based on the pattern.  So
with the input "Group_EN_1000232142_blah_1000232142abc_foo" I end up
with a single token of "group en 1000232142 blah 1000232142abc foo"
rather than what I want, which is 6 tokens: "group", "en",
"1000232142", "blah", "1000232142abc", "foo".

Is there a way to configure for the behavior I'm looking for, or would
I need to write a custom tokenizer?

Thanks!

-Greg


Re: How to always tokenize on underscore?

2013-09-25 Thread Greg Preston
This is exactly what I needed.  Thank you!

-Greg
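
For reference, the char-filter variant of that analyzer would look roughly like
the sketch below (the exact config used isn't shown in the thread):

  <fieldType name="text_general" class="solr.TextField"
      positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.PatternReplaceCharFilterFactory"
          pattern="_" replacement=" "/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>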


On Wed, Sep 25, 2013 at 2:48 PM, Jack Krupansky j...@basetechnology.com wrote:
 Use the char filter instead:
 http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html

 -- Jack Krupansky

 -Original Message- From: Greg Preston
 Sent: Wednesday, September 25, 2013 5:43 PM
 To: solr-user@lucene.apache.org
 Subject: How to always tokenize on underscore?


 [Using SolrCloud 4.4.0]

 I have a text field where the data will sometimes be delimited by
 whitespace, and sometimes by underscore.  For example, both of the
 following are possible input values:

 Group_EN_1000232142_blah_1000232142abc_foo
 Group EN 1000232142 blah 1000232142abc foo

 What I'd like to do is have underscores treated as spaces for
 tokenization purposes.  I've tried using a PatternReplaceFilterFactory
 with:

 <fieldType name="text_general" class="solr.TextField"
     positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="_"
         replacement=" " replace="all"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="_"
         replacement=" " replace="all"/>
   </analyzer>
 </fieldType>

 but that seems to do the pattern replacement on each token, rather
 than splitting tokens into multiple tokens based on the pattern.  So
 with the input "Group_EN_1000232142_blah_1000232142abc_foo" I end up
 with a single token of "group en 1000232142 blah 1000232142abc foo"
 rather than what I want, which is 6 tokens: "group", "en",
 "1000232142", "blah", "1000232142abc", "foo".

 Is there a way to configure for the behavior I'm looking for, or would
 I need to write a custom tokenizer?

 Thanks!

 -Greg


Re: Solr 4.3: Recovering from Too many values for UnInvertedField faceting on field

2013-09-03 Thread Greg Preston
Our index is too large to uninvert on the fly, so we've been looking
into using DocValues to keep a particular field uninverted at index
time.  See http://wiki.apache.org/solr/DocValues

I don't know if this will solve your problem, but it might be worth
trying it out.
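
Enabling it is a per-field flag in schema.xml; a minimal sketch using the field
from the error above (attribute values are illustrative, and a full reindex is
required after the change):

  <field name="author_exact" type="string" indexed="true" stored="true"
         multiValued="true" docValues="true"/>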

-Greg


On Tue, Sep 3, 2013 at 7:04 AM, Dennis Schafroth den...@indexdata.com wrote:
 We are harvesting and indexing bibliographic data, thus having many distinct 
 author names in our index. While testing Solr 4 I believe I had pushed a 
 single core to 100 million records (91GB of data) and everything was working 
 fine and fast. After adding a little more to the index, then following 
 started to happen:

 17328668 [searcherExecutor-4-thread-1] WARN org.apache.solr.core.SolrCore – 
 Approaching too many values for UnInvertedField faceting on field 
 'author_exact' : bucket size=16726546
 17328701 [searcherExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – 
 UnInverted multi-valued field 
 {field=author_exact,memSize=336715415,tindexSize=5001903,time=31595,phase1=31465,nTerms=12048027,bigTerms=0,termInstances=57751332,uses=0}
 18103757 [searcherExecutor-4-thread-1] ERROR org.apache.solr.core.SolrCore – 
 org.apache.solr.common.SolrException: Too many values for UnInvertedField 
 faceting on field author_exact
 at org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:181)
 at 
 org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:664)

 I can see that we reached a limit of bucket size. Is there a way to adjust
 this? The index also seems to explode in size (217GB).

 Thinking that I had reached a limit for what a single core could handle in 
 terms of facet, I deleted records in the index, but even now at 1/3 (32 
 million) it still fails with the above error. I have optimised with
 expungeDeleted=true. The index is somewhat larger (76GB) than I would have
 expected.

 While we can still use the index and get facets back using enum method on 
 that field, I would still like a way to fix the index if possible. Any 
 suggestions?

 cheers,
 :-Dennis


Re: Question about SOLR-5017 - Allow sharding based on the value of a field

2013-08-28 Thread Greg Preston
I don't know about SOLR-5017, but why don't you want to use parent_id
as a shard key?

So if you've got a doc with a key of "abc123" and a parent_id of "456",
just use a key of "456!abc123" and all docs with the same parent_id
will go to the same shard.
We're doing something similar and limiting queries to the single shard
that hosts the relevant docs by setting shard.keys=456! on queries.
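
A rough SolrJ sketch of both halves, for illustration only (names and values
are made up; assumes an existing CloudSolrServer):

  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "456!abc123");   // parent_id "456" becomes the route key
  cloudSolrServer.add(doc);

  SolrQuery q = new SolrQuery("title:foo");
  q.set("shard.keys", "456!");        // limit the query to the shard holding 456!* docs
  QueryResponse rsp = cloudSolrServer.query(q);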

-Greg


On Wed, Aug 28, 2013 at 10:04 AM, adfel70 adfe...@gmail.com wrote:
 Hi
 I'm looking into allowing query joins in solr cloud.
 This has the limitation of having to index all the documents that are
 joinable together to the same shard.
 I'm wondering if  SOLR-5017
 https://issues.apache.org/jira/browse/SOLR-5017   would give me the
 ability to do so without implementing my own routing mechanism?

 If I add a field named parent_id and give that field the same value in all
 the documents that I want to join, it seems, theoretically, that it will be
 enough.

 Am I correct?

 Thanks.





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Question-about-SOLR-5017-Allow-sharding-based-on-the-value-of-a-field-tp4087050.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Shard splitting error: cannot uncache file=_1.nvm

2013-08-27 Thread Greg Preston
I haven't been able to successfully split a shard with Solr 4.4.0

If I have an empty index, or all documents would go to one side of the
split, I hit SOLR-5144.  But if I avoid that case, I consistently get
this error:

290391 [qtp243983770-60] INFO
org.apache.solr.update.processor.LogUpdateProcessor  –
[marin_shard1_1_replica1] webapp=/solr path=/update
params={waitSearcher=true&openSearcher=false&commit=true&wt=javabin&commit_end_point=true&version=2&softCommit=false}
{} 0 2
290392 [qtp243983770-60] ERROR org.apache.solr.core.SolrCore  –
java.io.IOException: cannot uncache file=_1.nvm: it was separately
also created in the delegate directory
at 
org.apache.lucene.store.NRTCachingDirectory.unCache(NRTCachingDirectory.java:297)
at 
org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:216)
at 
org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4109)
at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2809)
at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2897)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2872)
at 
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:549)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)


I've seen LUCENE-4238, but that was closed as a test error.


-Greg


Re: SOLR Prevent solr of modifying fields when update doc

2013-08-24 Thread Greg Preston
But there is an API for sending a delta over the wire, and server side it
does a read, overlay, delete, and insert.  And only the fields you sent
will be changed.

*Might require your unchanged fields to all be stored, though.
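
A minimal sketch of such an atomic update via the JSON update handler (id and
field names are made up):

  curl 'http://localhost:8983/solr/collection1/update?commit=true' \
    -H 'Content-type:application/json' \
    -d '[{"id":"abc123", "title":{"set":"new title value"}}]'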


-Greg


On Fri, Aug 23, 2013 at 7:08 PM, Lance Norskog goks...@gmail.com wrote:

 Solr does not by default generate unique IDs. It uses what you give as
 your unique field, usually called 'id'.

 What software do you use to index data from your RSS feeds? Maybe that is
 creating a new 'id' field?

 There is no partial update, Solr (Lucene) always rewrites the complete
 document.


 On 08/23/2013 09:03 AM, Greg Preston wrote:

 Perhaps an atomic update that only changes the fields you want to change?

 -Greg


 On Fri, Aug 23, 2013 at 4:16 AM, Luís Portela Afonso
 meligalet...@gmail.com wrote:

 Hi, thanks for the answer, but the uniqueId is generated by me. When
 Solr indexes and there is an update to a doc, it deletes the doc and
 creates a new one, so it generates a new UUID.
 It is not suitable for me, because I want Solr to update just some
 fields, since the UUID is the key that I use to map it to a user in my
 database.

 Right now I'm using information that comes from the source and never
 changes, as my uniqueId, for example the guid that exists in some rss
 feeds, or the link if the guid doesn't exist.

 I don't think there is a simple solution for me, because from what I have
 read, when an update to a doc happens, Solr deletes the old one and creates a
 new one, right?

 On Aug 23, 2013, at 12:07 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Well, not much in the way of help because you can't do what you
 want AFAIK. I don't think UUID is suitable for your use-case. Why not
 use your uniqueId?

 Or generate something yourself...

 Best
 Erick


 On Thu, Aug 22, 2013 at 5:56 PM, Luís Portela Afonso 
 meligalet...@gmail.com

 wrote:
 Hi,

 How can I prevent Solr from updating some fields when updating a doc?
 The problem is, i have an uuid with the field name uuid, but it is not
 an
 unique key. When a rss source updates a feed, solr will update the doc
 with
 the same link but it generates a new uuid. This is not the desired
 because
 this id is used by me to relate feeds with an user.

 Can someone help me?

 Many Thanks





Re: SOLR Prevent solr of modifying fields when update doc

2013-08-23 Thread Greg Preston
Perhaps an atomic update that only changes the fields you want to change?

-Greg


On Fri, Aug 23, 2013 at 4:16 AM, Luís Portela Afonso
meligalet...@gmail.com wrote:
 Hi, thanks for the answer, but the uniqueId is generated by me. But when Solr
 indexes and there is an update to a doc, it deletes the doc and creates a new
 one, so it generates a new UUID.
 That is not suitable for me, because I want Solr to update just some fields,
 since the UUID is the key that I use to map it to a user in my database.

 Right now I'm using information that comes from the source and never changes
 as my uniqueId, for example the guid that exists in some RSS feeds, or
 if it doesn't exist, I use the link.

 I don't think there is any simple solution for me, because from what I have read,
 when an update to a doc happens, Solr deletes the old one and creates a new
 one, right?

 On Aug 23, 2013, at 12:07 PM, Erick Erickson erickerick...@gmail.com wrote:

 Well, not much in the way of help because you can't do what you
 want AFAIK. I don't think UUID is suitable for your use-case. Why not
 use your uniqueId?

 Or generate something yourself...

 Best
 Erick


 On Thu, Aug 22, 2013 at 5:56 PM, Luís Portela Afonso meligalet...@gmail.com
 wrote:

 Hi,

 How can I prevent Solr from updating some fields when updating a doc?
 The problem is, I have a uuid with the field name uuid, but it is not a
 unique key. When an RSS source updates a feed, Solr will update the doc with
 the same link but generate a new uuid. This is not desired because
 this id is used by me to relate feeds to a user.

 Can someone help me?

 Many Thanks



Autosuggest on very large index

2013-08-20 Thread Greg Preston
Using 4.4.0 -

I would like to be able to do an autosuggest query against one of the
fields in our index and have the results be limited by an fq.

I can get exactly the results I want with a facet query using a
facet.prefix, but the first query takes ~5 minutes to run on our QA
env (~240M docs).  I'm afraid to attempt it on prod (~2B docs).
Subsequent queries are sufficiently fast (~500ms).
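
(For concreteness, the kind of facet.prefix query I mean, as a SolrJ sketch;
the field names title and client_id are illustrative, not our real schema:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

SolrQuery q = new SolrQuery("*:*");
q.setRows(0);                           // only the facet counts are wanted
q.addFilterQuery("client_id:12345");    // restrict suggestions to the logged-in client
q.setFacet(true);
q.addFacetField("title");
q.setFacetPrefix("title", "user inp");  // whatever the user has typed so far
q.setFacetLimit(10);
QueryResponse rsp = new HttpSolrServer("http://localhost:8983/solr/collection1").query(q);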

I'm assuming the first query is uninverting the field.  Is there any
way to mark that field so that an uninverted copy is maintained as
updates come in?  We plan to soft commit every 5 minutes, and we'd
prefer to not be continuously uninverting this one field.

Or is there a better way to do what I'm trying to do?  I've looked at
the spellcheck component a little bit, but it looks like I can't
filter results by fq.  The fq I'm using is based on which client is
logged in, and we can't autosuggest terms from one client to another.

Thanks.

-Greg


Re: Autosuggest on very large index

2013-08-20 Thread Greg Preston
The filter query would be on a different field (clientId) than the
field we want to autosuggest on (title).

Or are you proposing we index a compound field that would be
clientId+titleTokens so we would then prefix the suggester with
clientId+userInput ?

Interesting idea.
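
(Index-time, I imagine that would look something like the sketch below; the
field names and the separator are made up:)

import org.apache.solr.common.SolrInputDocument;

String clientId = "12345";
String title = "summer running shoes";
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", clientId + "-" + title);
doc.addField("title", title);
doc.addField("suggest", clientId + "|" + title);  // query side would then use the prefix clientId + "|" + userInput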

-Greg


On Tue, Aug 20, 2013 at 11:21 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 I am not entirely sure but the Suggester's FST uses prefixes so you may be 
 able to prefix the value you otherwise use for the filter query when you 
 build the suggester.

 -Original message-
 From:Greg Preston gpres...@marinsoftware.com
 Sent: Tuesday 20th August 2013 20:00
 To: solr-user@lucene.apache.org
 Subject: Autosuggest on very large index

 Using 4.4.0 -

 I would like to be able to do an autosuggest query against one of the
 fields in our index and have the results be limited by an fq.

 I can get exactly the results I want with a facet query using a
 facet.prefix, but the first query takes ~5 minutes to run on our QA
 env (~240M docs).  I'm afraid to attempt it on prod (~2B docs).
 Subsequent queries are sufficiently fast (~500ms).

 I'm assuming the first query is uninverting the field.  Is there any
 way to mark that field so that an uninverted copy is maintained as
 updates come in?  We plan to soft commit every 5 minutes, and we'd
 prefer to not be continuously uninverting this one field.

 Or is there a better way to do what I'm trying to do?  I've looked at
 the spellcheck component a little bit, but it looks like I can't
 filter results by fq.  The fq I'm using is based on which client is
 logged in, and we can't autosuggest terms from one client to another.

 Thanks.

 -Greg


Re: Autosuggest on very large index

2013-08-20 Thread Greg Preston
DocValues looks interesting, a non-inverted field.  I'll play with it
a bit and see how it works.  Thanks for the suggestion.

I don't know how many total terms we've got, but each document is
only 2-5 words/terms on average, and there is a TON of overlap between
docs.



-Greg


On Tue, Aug 20, 2013 at 11:38 AM, Jack Krupansky
j...@basetechnology.com wrote:
 Sounds like a problem for DocValues - assuming the number of unique values
 fits reasonably in memory to avoid I/O.

 How many unique values do you have or contemplate for your two billion
 documents?

 Two possibilities:

 1. You need a lot more hardware.
 2. You need to scale back your ambitions.

 -- Jack Krupansky

 -Original Message- From: Greg Preston
 Sent: Tuesday, August 20, 2013 2:00 PM

 To: solr-user@lucene.apache.org
 Subject: Autosuggest on very large index

 Using 4.4.0 -

 I would like to be able to do an autosuggest query against one of the
 fields in our index and have the results be limited by an fq.

 I can get exactly the results I want with a facet query using a
 facet.prefix, but the first query takes ~5 minutes to run on our QA
 env (~240M docs).  I'm afraid to attempt it on prod (~2B docs).
 Subsequent queries are sufficiently fast (~500ms).

 I'm assuming the first query is uninverting the field.  Is there any
 way to mark that field so that an uninverted copy is maintained as
 updates come in?  We plan to soft commit every 5 minutes, and we'd
 prefer to not be continuously uninverting this one field.

 Or is there a better way to do what I'm trying to do?  I've looked at
 the spellcheck component a little bit, but it looks like I can't
 filter results by fq.  The fq I'm using is based on which client is
 logged in, and we can't autosuggest terms from one client to another.

 Thanks.

 -Greg


Re: Getting the shard a document lives on in resultset

2013-08-20 Thread Greg Preston
I know I've done this in a search via the admin console, but I can't
remember/find the exact syntax right now...

-Greg


On Tue, Aug 20, 2013 at 12:56 PM, AdamP adamph...@gmail.com wrote:
 Hi,

 We have several shards which we're querying across using distributed search.
 This initial search only returns basic information to the user.  When a user
 requests more information about a document, we do a separate query using
 only the uniqueID for that document.  The problem is, I don't know how to
 tell which shard a document lives on which means I have to do another
 distributed search instead of going directly to the shard with the data.  Is
 there a way to get the shardID as part of the resultset?

 I've found this old ticket (https://issues.apache.org/jira/browse/SOLR-705),
 but it's not clear what parameters you need to pass in to get the shardID.
 From a quick glance at the code, I'm not sure these changes are present in
 the current versions of Solr.

 We're currently on 4.3.0.

 Thanks



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Getting-the-shard-a-document-lives-on-in-resultset-tp4085731.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Getting the shard a document lives on in resultset

2013-08-20 Thread Greg Preston
Found it.  Add [shard] to your fl.
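
(Via SolrJ that's roughly the following; the id value is made up:)

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery q = new SolrQuery("id:some-unique-id");
q.setFields("id", "[shard]");  // the [shard] transformer adds the owning shard to each returned doc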

-Greg


On Tue, Aug 20, 2013 at 1:24 PM, Greg Preston
gpres...@marinsoftware.com wrote:
 I know I've done this in a search via the admin console, but I can't
 remember/find the exact syntax right now...

 -Greg


 On Tue, Aug 20, 2013 at 12:56 PM, AdamP adamph...@gmail.com wrote:
 Hi,

 We have several shards which we're querying across using distributed search.
 This initial search only returns basic information to the user.  When a user
 requests more information about a document, we do a separate query using
 only the uniqueID for that document.  The problem is, I don't know how to
 tell which shard a document lives on which means I have to do another
 distributed search instead of going directly to the shard with the data.  Is
 there a way to get the shardID as part of the resultset?

 I've found this old ticket (https://issues.apache.org/jira/browse/SOLR-705),
 but it's not clear what parameters you need to pass in to get the shardID.
 From a quick glance at the code, I'm not sure these changes are present in
 the current versions of Solr.

 We're currently on 4.3.0.

 Thanks



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Getting-the-shard-a-document-lives-on-in-resultset-tp4085731.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Shard splitting at 23 million documents - OOM

2013-08-16 Thread Greg Preston
Have you tried it with a smaller number of documents?  I haven't been able
to successfully split a shard with 4.4.0 with even a handful of docs.


-Greg


On Fri, Aug 16, 2013 at 7:09 AM, Harald Kirsch harald.kir...@raytion.com wrote:

 Hi all.

 Using the example setup of solr-4.4.0, I was able to easily feed 23
 million documents from ClueWeb09.

 Then I tried to split the one shard into two. The size on disk is:

 % du -sh collection1
 118G    collection1

 I started Solr with 8GB for the JVM:

 java -Xmx8000m -DzkRun -DnumShards=2 
 -Dbootstrap_confdir=./solr/collection1/conf
 -Dcollection.configName=myconf -jar start.jar

 Then I asked for the split

 http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1

 After a while I got the OOM in the logs:

 841168 [qtp614872954-17] ERROR org.apache.solr.servlet.SolrDispatchFilter
  – null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java
 heap space

 My question: is it to be expected that the split needs huge amounts of RAM
 or is there a chance that some configuration or procedure change could get
 me past this?

 Regards,
 Harald.
 --
 Harald Kirsch
 Raytion GmbH
 Kaiser-Friedrich-Ring 74
 40547 Duesseldorf
 Fon +49-211-550266-0
 Fax +49-211-550266-19
 http://www.raytion.com



Re: Split Shard Error - maxValue must be non-negative

2013-08-13 Thread Greg Preston
I'm running into the same issue using composite routing keys when all of
the shard keys end up in one of the subshards.

-Greg


On Tue, Aug 13, 2013 at 9:34 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Scratch that. I obviously didn't pay attention to the stack trace.
 There is no workaround until 4.5 for this issue because we split the
 range by half and thus cannot guarantee that all segments will have
 numDocs > 0.

 On Tue, Aug 13, 2013 at 9:25 PM, Shalin Shekhar Mangar
 shalinman...@gmail.com wrote:
  On Tue, Aug 13, 2013 at 9:15 PM, Robert Muir rcm...@gmail.com wrote:
  On Tue, Aug 13, 2013 at 11:39 AM, Shalin Shekhar Mangar
  shalinman...@gmail.com wrote:
  The splitting code calls commit before it starts the splitting. It
 creates
  a LiveDocsReader using a bitset created by the split. This reader is
 merged
  to an index using addIndexes.
 
  Shouldn't the addIndexes code then ignore all such 0-document segments?
 
 
 
  Not in 4.4: https://issues.apache.org/jira/browse/LUCENE-5116
 
 
  Sorry, I didn't notice that. So 4.4 users must call commit/optimize
  with expungeDeletes=true until 4.5 is released if they run into this
  problem.
 
  --
  Regards,
  Shalin Shekhar Mangar.



 --
 Regards,
 Shalin Shekhar Mangar.
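
(A SolrJ sketch of that interim 4.4 workaround, i.e. an explicit commit with
expungeDeletes=true before retrying the split; the server URL is a placeholder:)

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;

HttpSolrServer server = new HttpSolrServer("http://host0:8983/solr/coll");
UpdateRequest req = new UpdateRequest();
req.setParam("commit", "true");           // equivalent to hitting /update?commit=true...
req.setParam("expungeDeletes", "true");   // ...&expungeDeletes=true, merging away deleted docs
req.process(server);
server.shutdown();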



[4.4.0] Shard splitting failure (simplified case)

2013-08-12 Thread Greg Preston
I've simplified things from my previous email, and I'm still seeing errors.

Using solr 4.4.0 with two nodes, starting with a single shard.  Collection
is named marin, host names are dumbo and solrcloud1.  I bring up an empty
cloud and index 50 documents.  I can query them and everything looks fine.
 This is clusterstate.json at that point:

{"marin":{
    "shards":{"shard1":{
        "range":"80000000-7fffffff",
        "state":"active",
        "replicas":{
          "dumbo:8983_solr_marin":{
            "state":"active",
            "core":"marin",
            "node_name":"dumbo:8983_solr",
            "base_url":"http://dumbo:8983/solr",
            "leader":"true"},
          "solrcloud1:8983_solr_marin":{
            "state":"active",
            "core":"marin",
            "node_name":"solrcloud1:8983_solr",
            "base_url":"http://solrcloud1:8983/solr"}}}},
    "router":"compositeId"}}

I attempt to split with
http://dumbo:8983/solr/admin/collections?action=SPLITSHARD&collection=marin&shard=shard1

After 127559ms, that call returns with
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:I was
asked to wait on state active for solrcloud1:8983_solr but I still do not
see the requested state. I see state: recovering live:true

clusterstate.json at this point:

{"marin":{
    "shards":{
      "shard1":{
        "range":"80000000-7fffffff",
        "state":"active",
        "replicas":{
          "dumbo:8983_solr_marin":{
            "state":"active",
            "core":"marin",
            "node_name":"dumbo:8983_solr",
            "base_url":"http://dumbo:8983/solr",
            "leader":"true"},
          "solrcloud1:8983_solr_marin":{
            "state":"active",
            "core":"marin",
            "node_name":"solrcloud1:8983_solr",
            "base_url":"http://solrcloud1:8983/solr"}}},
      "shard1_0":{
        "range":"80000000-ffffffff",
        "state":"construction",
        "replicas":{
          "dumbo:8983_solr_marin_shard1_0_replica1":{
            "state":"active",
            "core":"marin_shard1_0_replica1",
            "node_name":"dumbo:8983_solr",
            "base_url":"http://dumbo:8983/solr",
            "leader":"true"},
          "solrcloud1:8983_solr_marin_shard1_0_replica2":{
            "state":"active",
            "core":"marin_shard1_0_replica2",
            "node_name":"solrcloud1:8983_solr",
            "base_url":"http://solrcloud1:8983/solr"}}},
      "shard1_1":{
        "range":"0-7fffffff",
        "state":"construction",
        "replicas":{
          "dumbo:8983_solr_marin_shard1_1_replica1":{
            "state":"active",
            "core":"marin_shard1_1_replica1",
            "node_name":"dumbo:8983_solr",
            "base_url":"http://dumbo:8983/solr",
            "leader":"true"},
          "solrcloud1:8983_solr_marin_shard1_1_replica2":{
            "state":"recovering",
            "core":"marin_shard1_1_replica2",
            "node_name":"solrcloud1:8983_solr",
            "base_url":"http://solrcloud1:8983/solr"}}}},
    "router":"compositeId"}}


In the logs on dumbo, I see several of these:

290391 [qtp243983770-60] INFO
 org.apache.solr.update.processor.LogUpdateProcessor  –
[marin_shard1_1_replica1] webapp=/solr path=/update
params={waitSearcher=true&openSearcher=false&commit=true&wt=javabin&commit_end_point=true&version=2&softCommit=false}
{} 0 2
290392 [qtp243983770-60] ERROR org.apache.solr.core.SolrCore  –
java.io.IOException: cannot uncache file=_1.nvm: it was separately also
created in the delegate directory
at
org.apache.lucene.store.NRTCachingDirectory.unCache(NRTCachingDirectory.java:297)
at
org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:216)
at
org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4109)
at
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2809)
at
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2897)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2872)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:549)
at
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)

and then finally this:

406671 [qtp243983770-22] ERROR org.apache.solr.core.SolrCore  –
org.apache.solr.common.SolrException: I was asked to wait on state active
for solrcloud1:8983_solr but I still do not see the requested state. I see
state: recovering live:true
at
org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:966)
at
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:191)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:611)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
at

Re: What gets written to the other shards?

2013-08-12 Thread Greg Preston
Are you manually setting the shard on each document?  If not, documents
will be hashed across all the shards.
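
(If you do want to control placement, the compositeId router lets you prefix
the unique key with a routing key; a rough sketch, with made-up field values:)

import org.apache.solr.common.SolrInputDocument;

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "2010!event-98765");  // everything sharing the "2010!" prefix hashes to the same shard
doc.addField("year", 2010);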

-Greg


On Mon, Aug 12, 2013 at 3:50 PM, Thierry Thelliez 
thierry.thelliez.t...@gmail.com wrote:

 Hello,  I am trying to set up a four-shard system for the first time.  I do
 not understand why all the shards' data is growing at about the same rate
 when I push the documents to only one shard.

 The four shards represent four calendar years.  And for now, on a
 development machine, these four shards run on four different ports.

 The first shard is started with Zookeeper.

 The log of the other shards is filled with something like:

 7882051 [qtp1154079020-1245] INFO
 org.apache.solr.update.processor.LogUpdateProcessor – [collection1]
 webapp=/solr path=/update params={distrib.from=
 http://x.y.z.4:50121/solr/collection1/update&update.distrib=TOLEADER&wt=javabin&version=2}
 {add=[14939-96467-304 (1443204912169091072), 14939-96467-308
 (1443204912179576832), 14939-96467-310 (1443204912185868288),
 14939-96467-311 (1443204912192159744), 14939-96467-313
 (1443204912204742656), 14939-96467-314 (1443204912220471296),
 14939-96467-318 (1443204912239345664), 14939-96467-319
 (144320491225088), 14939-96467-322 (1443204912257171456),
 14939-96467-324 (1443204912263462912)]} 0 282

 What is getting written to the other shards? Is a separate index computed
 on all four shards?  I thought that when pushing a document to one shard,
 only that shard would update its index.


 Thanks,
 Thierry



Re: Shard splitting failure, with and without composite hashing

2013-08-11 Thread Greg Preston
Oops, I somehow forgot to mention that.  The errors I'm seeing are with the
release version of Solr 4.4.0.  I mentioned 4.1.0 as that's what we
currently have in prod, and we want to upgrade to 4.4.0 so we can do shard
splitting.  Towards that end, I'm testing shard splitting in 4.4.0 and
seeing these errors.

-Greg


On Sun, Aug 11, 2013 at 7:51 AM, Erick Erickson erickerick...@gmail.com wrote:

 The very first thing I'd do is go to Solr 4.4. There have been
 a lot of improvements in this code in the intervening 3
 versions.

 If the problem still occurs in 4.4, it'll get a lot more attention
 than 4.1

 FWIW,
 Erick


 On Fri, Aug 9, 2013 at 7:32 PM, Greg Preston gpres...@marinsoftware.com
 wrote:

  Howdy,
 
  I'm trying to test shard splitting, and it's not working for me.  I've
 got
  a 4 node cloud with a single collection and 2 shards.
 
  I've indexed 170k small documents, and I'm using the compositeId router,
  with an internal client id as the shard key, with 4 distinct values
  across the data set.  For my testing, the values of the shard keys are 1
  through 4.  Before splitting, shard1 contains 100k docs (all of the docs
  for shard keys 1 and 4) and shard2 contains 70k docs (all of the docs for
  shard keys 2 and 3).
 
  In prod, we're going to have thousands of unique shard keys, but for now,
  I'm testing at a smaller scale.  I attempt to split shard2 with
 
 
 http://host0:8983/solr/admin/collections?action=SPLITSHARD&collection=coll&shard=shard2
 
  I understand the shard splitting is on hash range, not document count,
 and
  it shouldn't split up documents within a single shard key, so I'm ok with
  it if both shard keys end up in the same sub-shard.
 
  I see the following in the logs:
 
  689524 [qtp259549756-119] ERROR
 org.apache.solr.servlet.SolrDispatchFilter
   – null:java.lang.RuntimeException: java.lang.IllegalArgumentException:
  maxValue must be non-negative (got: -1)
  at
 
 
 org.apache.solr.handler.admin.CoreAdminHandler.handleSplitAction(CoreAdminHandler.java:290)
  at
 
 
 org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
  at
 
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:611)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
  at
 
 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
  at
 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
  at
 
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
  at
 
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
  at
 
 
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
  at
 
 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
  at
  org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
  at
 
 
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
  at
 
 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
  at
 
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
  at
 
 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
  at
 
 
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
  at
 
 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
  at org.eclipse.jetty.server.Server.handle(Server.java:368)
  at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
  at
 
 
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
  at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
  at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
  at
 org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
  at
  org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
  at
 
 
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
  at
 
 
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
  at
 
 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
  at
 
 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
  at java.lang.Thread.run

Shard splitting failure, with and without composite hashing

2013-08-09 Thread Greg Preston
Howdy,

I'm trying to test shard splitting, and it's not working for me.  I've got
a 4 node cloud with a single collection and 2 shards.

I've indexed 170k small documents, and I'm using the compositeId router,
with an internal client id as the shard key, with 4 distinct values
across the data set.  For my testing, the values of the shard keys are 1
through 4.  Before splitting, shard1 contains 100k docs (all of the docs
for shard keys 1 and 4) and shard2 contains 70k docs (all of the docs for
shard keys 2 and 3).

In prod, we're going to have thousands of unique shard keys, but for now,
I'm testing at a smaller scale.  I attempt to split shard2 with
http://host0:8983/solr/admin/collections?action=SPLITSHARD&collection=coll&shard=shard2

I understand the shard splitting is on hash range, not document count, and
it shouldn't split up documents within a single shard key, so I'm ok with
it if both shard keys end up in the same sub-shard.

I see the following in the logs:

689524 [qtp259549756-119] ERROR org.apache.solr.servlet.SolrDispatchFilter
 – null:java.lang.RuntimeException: java.lang.IllegalArgumentException:
maxValue must be non-negative (got: -1)
at
org.apache.solr.handler.admin.CoreAdminHandler.handleSplitAction(CoreAdminHandler.java:290)
at
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:611)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
at
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException: maxValue must be
non-negative (got: -1)
at
org.apache.lucene.util.packed.PackedInts.bitsRequired(PackedInts.java:1184)
at
org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:140)
at
org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
at
org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
at
org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
at
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
at
org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2488)
at
org.apache.solr.update.SolrIndexSplitter.split(SolrIndexSplitter.java:125)
at
org.apache.solr.update.DirectUpdateHandler2.split(DirectUpdateHandler2.java:766)