[ https://issues.apache.org/jira/browse/SOLR-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759784#action_12759784 ]
Artem Russakovskii edited comment on SOLR-1458 at 9/25/09 3:32 PM:
-------------------------------------------------------------------

I haven't changed any configs yet, and this probably doesn't come as a shock to you guys, but the master just ran out of space. Upon inspection, I found 30+ snapshot dirs sitting around in /data.

Paul, adding your deletionPolicy fix didn't delete the files, even after optimize. Is that expected?

{code}
drwxrwxr-x 2 bla bla 4096 Sep 23 18:42 snapshot.20090923064214
drwxrwxr-x 2 bla bla 4096 Sep 23 19:15 snapshot.20090923071530
drwxrwxr-x 2 bla bla 4096 Sep 23 19:45 snapshot.20090923074535
drwxrwxr-x 2 bla bla 4096 Sep 23 20:15 snapshot.20090923081531
drwxrwxr-x 2 bla bla 4096 Sep 23 21:15 snapshot.20090923091531
drwxrwxr-x 2 bla bla 4096 Sep 23 22:15 snapshot.20090923101532
drwxrwxr-x 2 bla bla 4096 Sep 23 23:15 snapshot.20090923111533
drwxrwxr-x 2 bla bla 4096 Sep 24 01:15 snapshot.20090924011501
drwxrwxr-x 2 bla bla 4096 Sep 24 13:15 snapshot.20090924011535
drwxrwxr-x 2 bla bla 4096 Sep 24 02:15 snapshot.20090924021501
drwxrwxr-x 2 bla bla 4096 Sep 24 14:15 snapshot.20090924021534
drwxrwxr-x 2 bla bla 4096 Sep 24 15:15 snapshot.20090924031501
drwxrwxr-x 2 bla bla 4096 Sep 24 03:15 snapshot.20090924031502
drwxrwxr-x 2 bla bla 4096 Sep 24 04:15 snapshot.20090924041501
drwxrwxr-x 2 bla bla 4096 Sep 24 16:15 snapshot.20090924041536
drwxrwxr-x 2 bla bla 4096 Sep 24 05:15 snapshot.20090924051501
drwxrwxr-x 2 bla bla 4096 Sep 24 17:15 snapshot.20090924051537
drwxrwxr-x 2 bla bla 4096 Sep 24 06:15 snapshot.20090924061501
drwxrwxr-x 2 bla bla 4096 Sep 24 18:15 snapshot.20090924061534
drwxrwxr-x 2 bla bla 4096 Sep 24 07:15 snapshot.20090924071501
drwxrwxr-x 2 bla bla 4096 Sep 24 19:15 snapshot.20090924071533
drwxrwxr-x 2 bla bla 4096 Sep 24 08:15 snapshot.20090924081534
drwxrwxr-x 2 bla bla 4096 Sep 24 20:15 snapshot.20090924081535
drwxrwxr-x 2 bla bla 4096 Sep 24 09:15 snapshot.20090924091501
drwxrwxr-x 2 bla bla 4096 Sep 24 21:15 snapshot.20090924091532
drwxrwxr-x 2 bla bla 4096 Sep 24 10:15 snapshot.20090924101501
drwxrwxr-x 2 bla bla 4096 Sep 24 22:15 snapshot.20090924101533
drwxrwxr-x 2 bla bla 4096 Sep 24 11:15 snapshot.20090924111501
drwxrwxr-x 2 bla bla 4096 Sep 24 23:15 snapshot.20090924111532
drwxrwxr-x 2 bla bla 4096 Sep 24 12:15 snapshot.20090924121532
drwxrwxr-x 2 bla bla 4096 Sep 24 00:15 snapshot.20090924121533
drwxrwxr-x 2 bla bla 4096 Sep 25 01:15 snapshot.20090925011533
drwxrwxr-x 2 bla bla 4096 Sep 25 13:15 snapshot.20090925011540
drwxrwxr-x 2 bla bla 4096 Sep 25 02:15 snapshot.20090925021534
drwxrwxr-x 2 bla bla 4096 Sep 25 14:15 snapshot.20090925021540
drwxrwxr-x 2 bla bla 4096 Sep 25 03:15 snapshot.20090925031535
drwxrwxr-x 2 bla bla 4096 Sep 25 15:15 snapshot.20090925031540
drwxrwxr-x 2 bla bla 4096 Sep 25 15:29 snapshot.20090925032931
drwxrwxr-x 2 bla bla 4096 Sep 25 04:15 snapshot.20090925041535
drwxrwxr-x 2 bla bla 4096 Sep 25 05:15 snapshot.20090925051539
drwxrwxr-x 2 bla bla 4096 Sep 25 06:15 snapshot.20090925061538
drwxrwxr-x 2 bla bla 4096 Sep 25 07:15 snapshot.20090925071539
drwxrwxr-x 2 bla bla 4096 Sep 25 08:15 snapshot.20090925081539
drwxrwxr-x 2 bla bla 4096 Sep 25 09:15 snapshot.20090925091538
drwxrwxr-x 2 bla bla 4096 Sep 25 09:52 snapshot.20090925095213
drwxrwxr-x 2 bla bla 4096 Sep 25 10:15 snapshot.20090925101540
drwxrwxr-x 2 bla bla 4096 Sep 25 11:15 snapshot.20090925111538
drwxrwxr-x 2 bla bla 4096 Sep 25 00:15 snapshot.20090925121534
drwxrwxr-x 2 bla bla 4096 Sep 25 12:15 snapshot.20090925121538
{code}

was (Author: archon810):
I haven't changed any configs yet, and this probably doesn't come as a shock to you guys, but the master just ran out of space. Upon inspection, I found 30+ snapshot dirs sitting around in /data.

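For anyone hitting the same disk-space problem, here is a minimal Python sketch of how leftover snapshot directories could be listed for manual cleanup. It is not part of Solr and it is not what the deletionPolicy patch does; the function name `snapshots_to_prune` and the `keep` parameter are made up for illustration. It orders by modification time rather than by name because, in the listing above, the names appear to carry a 12-hour clock (the directory created at 18:42 is named snapshot.20090923064214), so name order alone looks ambiguous.

```python
import re
from pathlib import Path

# Matches names like snapshot.20090923064214 (14 digits), as in the listing above.
SNAPSHOT_RE = re.compile(r"^snapshot\.\d{14}$")


def snapshots_to_prune(data_dir, keep=2):
    """Return snapshot dirs under data_dir, oldest first, minus the `keep` newest.

    Hypothetical helper: it only reports candidates and deletes nothing.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    snaps = [
        p for p in Path(data_dir).iterdir()
        if p.is_dir() and SNAPSHOT_RE.match(p.name)
    ]
    # Sort by mtime, not by name, since the names look like 12-hour timestamps.
    snaps.sort(key=lambda p: p.stat().st_mtime)
    return snaps[:max(len(snaps) - keep, 0)]


# Example use (the path is an assumption; adjust to the instance's data dir):
# for path in snapshots_to_prune("/data", keep=2):
#     print(path)
```

The sketch deliberately stops at printing candidates; whether it is safe to remove a given snapshot while replication is running is a separate question.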
> Java Replication error: NullPointerException SEVERE: SnapPull failed on 2009-09-22 nightly
> ------------------------------------------------------------------------------------------
>
> Key: SOLR-1458
> URL: https://issues.apache.org/jira/browse/SOLR-1458
> Project: Solr
> Issue Type: Bug
> Components: replication (java)
> Affects Versions: 1.4
> Environment: CentOS x64
> 8GB RAM
> Tomcat, running with 7G max memory; memory usage is <2GB, so it's not the problem
> Host a: master
> Host b: slave
> Multiple single core Solr instances, using JNDI.
> Java replication
> Reporter: Artem Russakovskii
> Assignee: Noble Paul
> Fix For: 1.4
>
> Attachments: SOLR-1458.patch, SOLR-1458.patch, SOLR-1458.patch, SOLR-1458.patch, SolrDeletionPolicy.patch, SolrDeletionPolicy.patch
>
>
> After finally figuring out the new Java based replication, we have started both the slave and the master and issued optimize to all master Solr instances. This triggered some replication to go through just fine, but it looks like some of it is failing.
> Here's what I'm getting in the slave logs, repeatedly for each shard:
> {code}
> SEVERE: SnapPull failed
> java.lang.NullPointerException
>     at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:271)
>     at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:258)
>     at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>     at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
> {code}
> If I issue an optimize again on the master to one of the shards, it then triggers a replication and replicates OK. I have a feeling that these SnapPull failures appear later on but right now I don't have enough to form a pattern.
> Here's replication.properties on one of the failed slave instances.
> {code}
> cat data/replication.properties
> #Replication details
> #Wed Sep 23 19:35:30 PDT 2009
> replicationFailedAtList=1253759730020,1253759700018,1253759670019,1253759640018,1253759610018,1253759580022,1253759550019,1253759520016,1253759490026,1253759460016
> previousCycleTimeInSeconds=0
> timesFailed=113
> indexReplicatedAtList=1253759730020,1253759700018,1253759670019,1253759640018,1253759610018,1253759580022,1253759550019,1253759520016,1253759490026,1253759460016
> indexReplicatedAt=1253759730020
> replicationFailedAt=1253759730020
> lastCycleBytesDownloaded=0
> timesIndexReplicated=113
> {code}
> and another
> {code}
> cat data/replication.properties
> #Replication details
> #Wed Sep 23 18:42:01 PDT 2009
> replicationFailedAtList=1253756490034,1253756460169
> previousCycleTimeInSeconds=1
> timesFailed=2
> indexReplicatedAtList=1253756521284,1253756490034,1253756460169
> indexReplicatedAt=1253756521284
> replicationFailedAt=1253756490034
> lastCycleBytesDownloaded=22932293
> timesIndexReplicated=3
> {code}
> Some relevant configs:
> In solrconfig.xml:
> {code}
> <!-- For docs see http://wiki.apache.org/solr/SolrReplication -->
> <requestHandler name="/replication" class="solr.ReplicationHandler" >
>   <lst name="master">
>     <str name="enable">${enable.master:false}</str>
>     <str name="replicateAfter">optimize</str>
>     <str name="backupAfter">optimize</str>
>     <str name="commitReserveDuration">00:00:20</str>
>   </lst>
>   <lst name="slave">
>     <str name="enable">${enable.slave:false}</str>
>     <!-- url of master, from properties file -->
>     <str name="masterUrl">${master.url}</str>
>     <!-- how often to check master -->
>     <str name="pollInterval">00:00:30</str>
>   </lst>
> </requestHandler>
> {code}
> The slave then has this in solrcore.properties:
> {code}
> enable.slave=true
> master.url=URLOFMASTER/replication
> {code}
> and the master has
> {code}
> enable.master=true
> {code}
> I'd be glad to provide more details but I'm not sure what else I can do.
> SOLR-926 may be relevant.
> Thanks.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
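
The timestamps in the replication.properties dumps quoted in the issue description look like epoch milliseconds. The following Python sketch makes them readable; the function names are hypothetical, and treating indexReplicatedAtList as the list of attempts with replicationFailedAtList as the failed subset is an assumption inferred from the two samples, not from the Solr documentation.

```python
from datetime import datetime, timedelta, timezone


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def millis_to_utc(ms):
    """Convert an epoch-milliseconds value (str or int) to an aware UTC datetime."""
    ms = int(ms)
    # Integer seconds plus a millisecond delta avoids float rounding.
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc) + timedelta(
        milliseconds=ms % 1000
    )


def replication_attempts(props):
    """Return (utc_time, failed) pairs, in file order, for each listed attempt.

    Assumption: an attempt counts as failed when its timestamp also appears
    in replicationFailedAtList.
    """
    failed = {v for v in props.get("replicationFailedAtList", "").split(",") if v}
    attempts = [v for v in props.get("indexReplicatedAtList", "").split(",") if v]
    return [(millis_to_utc(v), v in failed) for v in attempts]
```

Read this way, every attempt listed in the first file is also marked as failed, while the second file shows two failures followed by one success.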