[ 
https://issues.apache.org/jira/browse/SOLR-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759784#action_12759784
 ] 

Artem Russakovskii edited comment on SOLR-1458 at 9/25/09 3:32 PM:
-------------------------------------------------------------------

I haven't changed any configs yet, and this probably doesn't come as a shock to 
you guys, but the master just ran out of space. Upon inspection, I found 30+ 
snapshot dirs sitting around in /data.

Paul, applying your deletionPolicy fix didn't delete the files, even after an 
optimize. Is that expected?

{code}
drwxrwxr-x  2 bla bla  4096 Sep 23 18:42 snapshot.20090923064214
drwxrwxr-x  2 bla bla  4096 Sep 23 19:15 snapshot.20090923071530
drwxrwxr-x  2 bla bla  4096 Sep 23 19:45 snapshot.20090923074535
drwxrwxr-x  2 bla bla  4096 Sep 23 20:15 snapshot.20090923081531
drwxrwxr-x  2 bla bla  4096 Sep 23 21:15 snapshot.20090923091531
drwxrwxr-x  2 bla bla  4096 Sep 23 22:15 snapshot.20090923101532
drwxrwxr-x  2 bla bla  4096 Sep 23 23:15 snapshot.20090923111533
drwxrwxr-x  2 bla bla  4096 Sep 24 01:15 snapshot.20090924011501
drwxrwxr-x  2 bla bla  4096 Sep 24 13:15 snapshot.20090924011535
drwxrwxr-x  2 bla bla  4096 Sep 24 02:15 snapshot.20090924021501
drwxrwxr-x  2 bla bla  4096 Sep 24 14:15 snapshot.20090924021534
drwxrwxr-x  2 bla bla  4096 Sep 24 15:15 snapshot.20090924031501
drwxrwxr-x  2 bla bla  4096 Sep 24 03:15 snapshot.20090924031502
drwxrwxr-x  2 bla bla  4096 Sep 24 04:15 snapshot.20090924041501
drwxrwxr-x  2 bla bla  4096 Sep 24 16:15 snapshot.20090924041536
drwxrwxr-x  2 bla bla  4096 Sep 24 05:15 snapshot.20090924051501
drwxrwxr-x  2 bla bla  4096 Sep 24 17:15 snapshot.20090924051537
drwxrwxr-x  2 bla bla  4096 Sep 24 06:15 snapshot.20090924061501
drwxrwxr-x  2 bla bla  4096 Sep 24 18:15 snapshot.20090924061534
drwxrwxr-x  2 bla bla  4096 Sep 24 07:15 snapshot.20090924071501
drwxrwxr-x  2 bla bla  4096 Sep 24 19:15 snapshot.20090924071533
drwxrwxr-x  2 bla bla  4096 Sep 24 08:15 snapshot.20090924081534
drwxrwxr-x  2 bla bla  4096 Sep 24 20:15 snapshot.20090924081535
drwxrwxr-x  2 bla bla  4096 Sep 24 09:15 snapshot.20090924091501
drwxrwxr-x  2 bla bla  4096 Sep 24 21:15 snapshot.20090924091532
drwxrwxr-x  2 bla bla  4096 Sep 24 10:15 snapshot.20090924101501
drwxrwxr-x  2 bla bla  4096 Sep 24 22:15 snapshot.20090924101533
drwxrwxr-x  2 bla bla  4096 Sep 24 11:15 snapshot.20090924111501
drwxrwxr-x  2 bla bla  4096 Sep 24 23:15 snapshot.20090924111532
drwxrwxr-x  2 bla bla  4096 Sep 24 12:15 snapshot.20090924121532
drwxrwxr-x  2 bla bla  4096 Sep 24 00:15 snapshot.20090924121533
drwxrwxr-x  2 bla bla  4096 Sep 25 01:15 snapshot.20090925011533
drwxrwxr-x  2 bla bla  4096 Sep 25 13:15 snapshot.20090925011540
drwxrwxr-x  2 bla bla  4096 Sep 25 02:15 snapshot.20090925021534
drwxrwxr-x  2 bla bla  4096 Sep 25 14:15 snapshot.20090925021540
drwxrwxr-x  2 bla bla  4096 Sep 25 03:15 snapshot.20090925031535
drwxrwxr-x  2 bla bla  4096 Sep 25 15:15 snapshot.20090925031540
drwxrwxr-x  2 bla bla  4096 Sep 25 15:29 snapshot.20090925032931
drwxrwxr-x  2 bla bla  4096 Sep 25 04:15 snapshot.20090925041535
drwxrwxr-x  2 bla bla  4096 Sep 25 05:15 snapshot.20090925051539
drwxrwxr-x  2 bla bla  4096 Sep 25 06:15 snapshot.20090925061538
drwxrwxr-x  2 bla bla  4096 Sep 25 07:15 snapshot.20090925071539
drwxrwxr-x  2 bla bla  4096 Sep 25 08:15 snapshot.20090925081539
drwxrwxr-x  2 bla bla  4096 Sep 25 09:15 snapshot.20090925091538
drwxrwxr-x  2 bla bla  4096 Sep 25 09:52 snapshot.20090925095213
drwxrwxr-x  2 bla bla  4096 Sep 25 10:15 snapshot.20090925101540
drwxrwxr-x  2 bla bla  4096 Sep 25 11:15 snapshot.20090925111538
drwxrwxr-x  2 bla bla  4096 Sep 25 00:15 snapshot.20090925121534
drwxrwxr-x  2 bla bla  4096 Sep 25 12:15 snapshot.20090925121538
{code}
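
As a stopgap until the snapshots are cleaned up automatically, here is a minimal sketch of a pruning script. It is not part of Solr; the function names, the `keep` parameter, and the example path are all hypothetical. The names in the listing above appear to use a 12-hour clock (the directory modified at 13:15 on Sep 24 is named snapshot.20090924011535), so sorting by name would not give chronological order. The sketch orders by modification time instead and defaults to a dry run.

```python
import os
import shutil


def stale_snapshots(data_dir, keep=3):
    """Return snapshot.* dirs under data_dir, oldest first, minus the `keep` newest.

    Ordering uses modification time, not the directory name, because the
    timestamp embedded in the name looks ambiguous between AM and PM.
    """
    snaps = [
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if name.startswith("snapshot.")
        and os.path.isdir(os.path.join(data_dir, name))
    ]
    snaps.sort(key=os.path.getmtime)
    return snaps[:-keep] if keep > 0 else snaps


def prune(data_dir, keep=3, dry_run=True):
    """Delete stale snapshot dirs; with dry_run=True only report them."""
    for path in stale_snapshots(data_dir, keep):
        print(("would remove " if dry_run else "removing ") + path)
        if not dry_run:
            shutil.rmtree(path)


if __name__ == "__main__":
    # Hypothetical location; point this at the instance's real data dir.
    if os.path.isdir("/data"):
        prune("/data", keep=3, dry_run=True)
```

Running it with dry_run=True first prints what would be removed, which seems safer on a master that is already out of space.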

      was (Author: archon810):
    I haven't changed any configs yet, and this probably doesn't come as a 
shock to you guys, but the master just ran out of space. Upon inspection, I 
found 30+ snapshot dirs sitting around in /data.
  
> Java Replication error: NullPointerException SEVERE: SnapPull failed on 
> 2009-09-22 nightly
> ------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1458
>                 URL: https://issues.apache.org/jira/browse/SOLR-1458
>             Project: Solr
>          Issue Type: Bug
>          Components: replication (java)
>    Affects Versions: 1.4
>         Environment: CentOS x64
> 8GB RAM
> Tomcat, running with 7G max memory; memory usage is <2GB, so it's not the 
> problem
> Host a: master
> Host b: slave
> Multiple single-core Solr instances, using JNDI.
> Java replication
>            Reporter: Artem Russakovskii
>            Assignee: Noble Paul
>             Fix For: 1.4
>
>         Attachments: SOLR-1458.patch, SOLR-1458.patch, SOLR-1458.patch, 
> SOLR-1458.patch, SolrDeletionPolicy.patch, SolrDeletionPolicy.patch
>
>
> After finally figuring out the new Java-based replication, we started both 
> the slave and the master and issued an optimize to all master Solr 
> instances. This triggered some replications that went through just fine, but 
> it looks like some of them are failing.
> Here's what I'm getting in the slave logs, repeatedly for each shard:
> {code} 
> SEVERE: SnapPull failed 
> java.lang.NullPointerException
>         at 
> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:271)
>         at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:258)
>         at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at 
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> {code} 
> If I issue an optimize again on the master to one of the shards, it then 
> triggers a replication and replicates OK. I have a feeling that these 
> SnapPull failures appear later on, but right now I don't have enough data to 
> see a pattern.
> Here's replication.properties on one of the failed slave instances.
> {code}
> cat data/replication.properties 
> #Replication details
> #Wed Sep 23 19:35:30 PDT 2009
> replicationFailedAtList=1253759730020,1253759700018,1253759670019,1253759640018,1253759610018,1253759580022,1253759550019,1253759520016,1253759490026,1253759460016
> previousCycleTimeInSeconds=0
> timesFailed=113
> indexReplicatedAtList=1253759730020,1253759700018,1253759670019,1253759640018,1253759610018,1253759580022,1253759550019,1253759520016,1253759490026,1253759460016
> indexReplicatedAt=1253759730020
> replicationFailedAt=1253759730020
> lastCycleBytesDownloaded=0
> timesIndexReplicated=113
> {code}
> and another
> {code}
> cat data/replication.properties 
> #Replication details
> #Wed Sep 23 18:42:01 PDT 2009
> replicationFailedAtList=1253756490034,1253756460169
> previousCycleTimeInSeconds=1
> timesFailed=2
> indexReplicatedAtList=1253756521284,1253756490034,1253756460169
> indexReplicatedAt=1253756521284
> replicationFailedAt=1253756490034
> lastCycleBytesDownloaded=22932293
> timesIndexReplicated=3
> {code}
> Some relevant configs:
> In solrconfig.xml:
> {code}
> <!-- For docs see http://wiki.apache.org/solr/SolrReplication -->
>   <requestHandler name="/replication" class="solr.ReplicationHandler" >
>     <lst name="master">
>         <str name="enable">${enable.master:false}</str>
>         <str name="replicateAfter">optimize</str>
>         <str name="backupAfter">optimize</str>
>         <str name="commitReserveDuration">00:00:20</str>
>     </lst>
>     <lst name="slave">
>         <str name="enable">${enable.slave:false}</str>
>         <!-- url of master, from properties file -->
>         <str name="masterUrl">${master.url}</str>
>         <!-- how often to check master -->
>         <str name="pollInterval">00:00:30</str>
>     </lst>
>   </requestHandler>
> {code}
> The slave then has this in solrcore.properties:
> {code}
> enable.slave=true
> master.url=URLOFMASTER/replication
> {code}
> and the master has
> {code}
> enable.master=true
> {code}
> I'd be glad to provide more details, but I'm not sure what else I can do.  
> SOLR-926 may be relevant.
> Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
