On 17/05/16 17:43, Mark Kirkwood wrote:
(snippage)

I'm seeing some replication errors in the object server log:

May 17 05:27:36 markir-dev-ostor001 object-server: Starting object replication pass.
May 17 05:27:36 markir-dev-ostor001 object-server: 1/1 (100.00%) partitions replicated in 0.03s (38.19/sec, 0s remaining)
May 17 05:27:36 markir-dev-ostor001 object-server: 2 successes, 0 failures
May 17 05:27:36 markir-dev-ostor001 object-server: 1 suffixes checked - 0.00% hashed, 0.00% synced
May 17 05:27:36 markir-dev-ostor001 object-server: Partition times: max 0.0210s, min 0.0210s, med 0.0210s
May 17 05:27:36 markir-dev-ostor001 object-server: Object replication complete. (0.00 minutes)
May 17 05:27:36 markir-dev-ostor001 object-server: Replication sleeping for 30 seconds.
May 17 05:27:40 markir-dev-ostor001 object-server: Begin object audit "forever" mode (ALL)
May 17 05:27:40 markir-dev-ostor001 object-server: Begin object audit "forever" mode (ZBF)
May 17 05:27:40 markir-dev-ostor001 object-server: Object audit (ZBF). Since Tue May 17 05:27:40 2016: Locally: 1 passed, 0 quarantined, 0 errors, files/sec: 83.24, bytes/sec: 0.00, Total time: 0.01, Auditing time: 0.00, Rate: 0.00
May 17 05:27:40 markir-dev-ostor001 object-server: Object audit (ZBF) "forever" mode completed: 0.01s. Total quarantined: 0, Total errors: 0, Total files/sec: 66.89, Total bytes/sec: 0.00, Auditing time: 0.01, Rate: 0.75
May 17 05:27:45 markir-dev-ostor001 object-server: ::ffff:10.0.3.242 - - [17/May/2016:05:27:45 +0000] "REPLICATE /1/899" 200 56 "-" "-" "object-replicator 18131" 0.0014 "-" 29108 0
May 17 05:27:45 markir-dev-ostor001 object-server: ::ffff:10.0.3.242 - - [17/May/2016:05:27:45 +0000] "REPLICATE /1/899" 200 56 "-" "-" "object-replicator 18131" 0.0016 "-" 29109 0
May 17 05:28:06 markir-dev-ostor001 object-server: Starting object replication pass.
May 17 05:28:06 markir-dev-ostor001 object-server: 1/1 (100.00%) partitions replicated in 0.02s (49.85/sec, 0s remaining)
May 17 05:28:06 markir-dev-ostor001 object-server: 2 successes, 6 failures <==============================
May 17 05:28:06 markir-dev-ostor001 object-server: 1 suffixes checked - 0.00% hashed, 0.00% synced
May 17 05:28:06 markir-dev-ostor001 object-server: Partition times: max 0.0155s, min 0.0155s, med 0.0155s
May 17 05:28:06 markir-dev-ostor001 object-server: Object replication complete. (0.00 minutes)
May 17 05:28:06 markir-dev-ostor001 object-server: Replication sleeping for 30 seconds.



I've figured out one case:

Adding some debugging code and traceback logging gives more interesting output (see the attached diff):

May 18 04:31:17 markir-dev-ostor002 object-server: object replication failure 4, detail Traceback (most recent call last):
  File "/opt/cat/openstack/swift/local/lib/python2.7/site-packages/swift/obj/replicator.py", line 622, in build_replication_jobs
    int(partition))
ValueError: invalid literal for int() with base 10: 'auditor_status_ALL.json'

The code is doing:

    try:
        job_path = join(obj_path, partition)
        part_nodes = policy.object_ring.get_part_nodes(
            int(partition))                              <=== 622
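
That int() gets fed every entry in the objects directory, so any non-numeric name blows it up. It's easy to reproduce in isolation (plain Python, nothing Swift-specific):

    >>> int('899')                        # a real partition name parses fine
    899
    >>> int('auditor_status_ALL.json')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for int() with base 10: 'auditor_status_ALL.json'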

Looking at what is in my object dirs:

    $ ls /srv/node/2/objects/
    899  auditor_status_ALL.json

Yep, that's gotta hurt! We either shouldn't be writing the audit json file there, or we should make the replicator code ignore it. Shall I raise an issue?
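
If we go the "ignore it" route, something like this might do it - an untested sketch, assuming the surrounding loop in build_replication_jobs walks the directory with os.listdir:

    for partition in os.listdir(obj_path):
        if not partition.isdigit():
            # Not a partition directory - e.g. the auditor's
            # auditor_status_ALL.json status file - so skip it rather
            # than letting int(partition) raise ValueError.
            continue
        job_path = join(obj_path, partition)
        part_nodes = policy.object_ring.get_part_nodes(int(partition))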

Cheers

Mark
diff --git a/swift/obj/replicator.py b/swift/obj/replicator.py
index ff89b32..f799157 100644
--- a/swift/obj/replicator.py
+++ b/swift/obj/replicator.py
@@ -15,6 +15,7 @@
 
 import os
 import errno
+import traceback
 from os.path import isdir, isfile, join, dirname
 import random
 import shutil
@@ -361,6 +362,7 @@ class ObjectReplicator(Daemon):
                                      target_dev['device'])
                                     for target_dev in job['nodes']])
             self.stats['success'] += len(target_devs_info - failure_devs_info)
+            self.logger.warning('object replication failure 0')
             self._add_failure_stats(failure_devs_info)
             if not handoff_partition_deleted:
                 self.handoffs_remaining += 1
@@ -492,6 +494,7 @@ class ObjectReplicator(Daemon):
             self.logger.exception(_("Error syncing partition"))
         finally:
             self.stats['success'] += len(target_devs_info - failure_devs_info)
+            self.logger.warning('object replication failure 1')
             self._add_failure_stats(failure_devs_info)
             self.partition_times.append(time.time() - begin)
             self.logger.timing_since('partition.update.timing', begin)
@@ -591,6 +594,7 @@ class ObjectReplicator(Daemon):
             obj_path = join(dev_path, data_dir)
             tmp_path = join(dev_path, get_tmp_dir(policy))
             if self.mount_check and not ismount(dev_path):
+                self.logger.warning('object replication failure 2')
                 self._add_failure_stats(
                     [(failure_dev['replication_ip'],
                       failure_dev['device'])
@@ -628,12 +632,15 @@ class ObjectReplicator(Daemon):
                              partition=partition,
                              region=local_dev['region']))
                 except ValueError:
+                    trace = traceback.format_exc()
                     if part_nodes:
+                        self.logger.warning('object replication failure 3')
                         self._add_failure_stats(
                             [(failure_dev['replication_ip'],
                               failure_dev['device'])
                              for failure_dev in nodes])
                     else:
+                        self.logger.warning('object replication failure 4, detail %s', trace)
                         self._add_failure_stats(
                             [(failure_dev['replication_ip'],
                               failure_dev['device'])
@@ -711,6 +718,7 @@ class ObjectReplicator(Daemon):
                     continue
                 dev_path = join(self.devices_dir, job['device'])
                 if self.mount_check and not ismount(dev_path):
+                    self.logger.warning('object replication failure 5')
                     self._add_failure_stats([(failure_dev['replication_ip'],
                                               failure_dev['device'])
                                              for failure_dev in job['nodes']])
@@ -749,10 +757,12 @@ class ObjectReplicator(Daemon):
                 self.run_pool.waitall()
         except (Exception, Timeout):
             if current_nodes:
+                self.logger.warning('object replication failure 6')
                 self._add_failure_stats([(failure_dev['replication_ip'],
                                           failure_dev['device'])
                                          for failure_dev in current_nodes])
             else:
+                self.logger.warning('object replication failure 7')
                 self._add_failure_stats(self.all_devs_info)
             self.logger.exception(_("Exception in top-level replication loop"))
             self.kill_coros()