Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread David Turner
I sent you the output of all of the files, including the logs.  Thank you
for your help so far.

On Thu, Sep 7, 2017 at 4:48 PM Yehuda Sadeh-Weinraub 
wrote:

> On Thu, Sep 7, 2017 at 11:37 PM, David Turner 
> wrote:
> > I'm pretty sure I'm using the cluster admin user/keyring.  Is there any
> > output that would be helpful?  Period, zonegroup get, etc?
>
>  - radosgw-admin period get
>  - radosgw-admin zone list
>  - radosgw-admin zonegroup list
>
> For each zone, zonegroup in result:
>  - radosgw-admin zone get --rgw-zone=<zone>
>  - radosgw-admin zonegroup get --rgw-zonegroup=<zonegroup>
>
>  - rados lspools
>
> Also, create a user with --debug-rgw=20 --debug-ms=1, need to look at the
> log.
>
> Yehuda
>
>
> >
> > On Thu, Sep 7, 2017 at 4:27 PM Yehuda Sadeh-Weinraub 
> > wrote:
> >>
> >> On Thu, Sep 7, 2017 at 11:02 PM, David Turner 
> >> wrote:
> >> > I created a test user named 'ice' and then used it to create a bucket
> >> > named
> >> > ice.  The bucket ice can be found in the second dc, but not the user.
> >> > `mdlog list` showed ice for the bucket, but not for the user.  I
> >> > performed
> >> > the same test in the internal realm and it showed the user and bucket
> >> > both
> >> > in `mdlog list`.
> >> >
> >>
> >> Maybe your radosgw-admin command is running with a ceph user that
> >> doesn't have permissions to write to the log pool? (probably not,
> >> because you are able to run the sync init commands).
> >> Another very slim explanation would be if you had for some reason
> >> overlapping zones configuration that shared some of the config but not
> >> all of it, having radosgw running against the correct one and
> >> radosgw-admin against the bad one. I don't think it's the second
> >> option.
> >>
> >> Yehuda
> >>
> >> >
> >> >
> >> > On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub <
> yeh...@redhat.com>
> >> > wrote:
> >> >>
> >> >> On Thu, Sep 7, 2017 at 10:04 PM, David Turner  >
> >> >> wrote:
> >> >> > One realm is called public with a zonegroup called public-zg with a
> >> >> > zone
> >> >> > for
> >> >> > each datacenter.  The second realm is called internal with a
> >> >> > zonegroup
> >> >> > called internal-zg with a zone for each datacenter.  they each have
> >> >> > their
> >> >> > own rgw's and load balancers.  The needs of our public facing rgw's
> >> >> > and
> >> >> > load
> >> >> > balancers vs internal use ones was different enough that we split
> >> >> > them
> >> >> > up
> >> >> > completely.  We also have a local realm that does not use multisite
> >> >> > and
> >> >> > a
> >> >> > 4th realm called QA that mimics the public realm as much as
> possible
> >> >> > for
> >> >> > staging configuration stages for the rgw daemons.  All 4 realms
> have
> >> >> > their
> >> >> > own buckets, users, etc and that is all working fine.  For all of
> the
> >> >> > radosgw-admin commands I am using the proper identifiers to make
> sure
> >> >> > that
> >> >> > each datacenter and realm are running commands on exactly what I
> >> >> > expect
> >> >> > them
> >> >> > to (--rgw-realm=public --rgw-zonegroup=public-zg
> >> >> > --rgw-zone=public-dc1
> >> >> > --source-zone=public-dc2).
> >> >> >
> >> >> > The data sync issue was in the internal realm but running a data
> sync
> >> >> > init
> >> >> > and kickstarting the rgw daemons in each datacenter fixed the data
> >> >> > discrepancies (I'm thinking it had something to do with a power
> >> >> > failure
> >> >> > a
> >> >> > few months back that I just noticed recently).  The metadata sync
> >> >> > issue
> >> >> > is
> >> >> > in the public realm.  I have no idea what is causing this to not
> sync
> >> >> > properly since running a `metadata sync init` catches it back up to
> >> >> > the
> >> >> > primary zone, but then it doesn't receive any new users created
> after
> >> >> > that.
> >> >> >
> >> >>
> >> >> Sounds like an issue with the metadata log in the primary master
> zone.
> >> >> Not sure what could go wrong there, but maybe the master zone doesn't
> >> >> know that it is a master zone, or it's set to not log metadata. Or
> >> >> maybe there's a problem when the secondary is trying to fetch the
> >> >> metadata log. Maybe some kind of # of shards mismatch (though not
> >> >> likely).
> >> >> Try to see if the master logs any changes: should use the
> >> >> 'radosgw-admin mdlog list' command.
> >> >>
> >> >> Yehuda
> >> >>
> >> >> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub
> >> >> > 
> >> >> > wrote:
> >> >> >>
> >> >> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner <
> drakonst...@gmail.com>
> >> >> >> wrote:
> >> >> >> > Ok, I've been testing, investigating, researching, etc for the
> >> >> >> > last
> >> >> >> > week
> >> >> >> > and
> >> >> >> > I don't have any problems with data syncing.  The clients on one
> >> >> >> > side
> >> >> >> > are
> >> >> >> > creating multipart objects while the 

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread Yehuda Sadeh-Weinraub
On Thu, Sep 7, 2017 at 11:37 PM, David Turner  wrote:
> I'm pretty sure I'm using the cluster admin user/keyring.  Is there any
> output that would be helpful?  Period, zonegroup get, etc?

 - radosgw-admin period get
 - radosgw-admin zone list
 - radosgw-admin zonegroup list

For each zone, zonegroup in result:
 - radosgw-admin zone get --rgw-zone=<zone>
 - radosgw-admin zonegroup get --rgw-zonegroup=<zonegroup>

 - rados lspools

Also, create a user with --debug-rgw=20 --debug-ms=1; we need to look at the log.
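
(For example, something along these lines -- the uid, display name, and log
path below are only placeholders:)

    radosgw-admin user create --uid=sync-debug-test --display-name="Sync Debug Test" \
        --rgw-realm=public --rgw-zone=public-dc1 \
        --debug-rgw=20 --debug-ms=1 2>&1 | tee /tmp/rgw-user-create.log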

Yehuda


>
> On Thu, Sep 7, 2017 at 4:27 PM Yehuda Sadeh-Weinraub 
> wrote:
>>
>> On Thu, Sep 7, 2017 at 11:02 PM, David Turner 
>> wrote:
>> > I created a test user named 'ice' and then used it to create a bucket
>> > named
>> > ice.  The bucket ice can be found in the second dc, but not the user.
>> > `mdlog list` showed ice for the bucket, but not for the user.  I
>> > performed
>> > the same test in the internal realm and it showed the user and bucket
>> > both
>> > in `mdlog list`.
>> >
>>
>> Maybe your radosgw-admin command is running with a ceph user that
>> doesn't have permissions to write to the log pool? (probably not,
>> because you are able to run the sync init commands).
>> Another very slim explanation would be if you had for some reason
>> overlapping zones configuration that shared some of the config but not
>> all of it, having radosgw running against the correct one and
>> radosgw-admin against the bad one. I don't think it's the second
>> option.
>>
>> Yehuda
>>
>> >
>> >
>> > On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub 
>> > wrote:
>> >>
>> >> On Thu, Sep 7, 2017 at 10:04 PM, David Turner 
>> >> wrote:
>> >> > One realm is called public with a zonegroup called public-zg with a
>> >> > zone
>> >> > for
>> >> > each datacenter.  The second realm is called internal with a
>> >> > zonegroup
>> >> > called internal-zg with a zone for each datacenter.  they each have
>> >> > their
>> >> > own rgw's and load balancers.  The needs of our public facing rgw's
>> >> > and
>> >> > load
>> >> > balancers vs internal use ones was different enough that we split
>> >> > them
>> >> > up
>> >> > completely.  We also have a local realm that does not use multisite
>> >> > and
>> >> > a
>> >> > 4th realm called QA that mimics the public realm as much as possible
>> >> > for
>> >> > staging configuration stages for the rgw daemons.  All 4 realms have
>> >> > their
>> >> > own buckets, users, etc and that is all working fine.  For all of the
>> >> > radosgw-admin commands I am using the proper identifiers to make sure
>> >> > that
>> >> > each datacenter and realm are running commands on exactly what I
>> >> > expect
>> >> > them
>> >> > to (--rgw-realm=public --rgw-zonegroup=public-zg
>> >> > --rgw-zone=public-dc1
>> >> > --source-zone=public-dc2).
>> >> >
>> >> > The data sync issue was in the internal realm but running a data sync
>> >> > init
>> >> > and kickstarting the rgw daemons in each datacenter fixed the data
>> >> > discrepancies (I'm thinking it had something to do with a power
>> >> > failure
>> >> > a
>> >> > few months back that I just noticed recently).  The metadata sync
>> >> > issue
>> >> > is
>> >> > in the public realm.  I have no idea what is causing this to not sync
>> >> > properly since running a `metadata sync init` catches it back up to
>> >> > the
>> >> > primary zone, but then it doesn't receive any new users created after
>> >> > that.
>> >> >
>> >>
>> >> Sounds like an issue with the metadata log in the primary master zone.
>> >> Not sure what could go wrong there, but maybe the master zone doesn't
>> >> know that it is a master zone, or it's set to not log metadata. Or
>> >> maybe there's a problem when the secondary is trying to fetch the
>> >> metadata log. Maybe some kind of # of shards mismatch (though not
>> >> likely).
>> >> Try to see if the master logs any changes: should use the
>> >> 'radosgw-admin mdlog list' command.
>> >>
>> >> Yehuda
>> >>
>> >> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub
>> >> > 
>> >> > wrote:
>> >> >>
>> >> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
>> >> >> wrote:
>> >> >> > Ok, I've been testing, investigating, researching, etc for the
>> >> >> > last
>> >> >> > week
>> >> >> > and
>> >> >> > I don't have any problems with data syncing.  The clients on one
>> >> >> > side
>> >> >> > are
>> >> >> > creating multipart objects while the multisite sync is creating
>> >> >> > them
>> >> >> > as
>> >> >> > whole objects and one of the datacenters is slower at cleaning up
>> >> >> > the
>> >> >> > shadow
>> >> >> > files.  That's the big discrepancy between object counts in the
>> >> >> > pools
>> >> >> > between datacenters.  I created a tool that goes through for each
>> >> >> > bucket
>> >> >> > in
>> >> >> > a realm and does a recursive listing of all objects in it for both
>> >> 

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread David Turner
I'm pretty sure I'm using the cluster admin user/keyring.  Is there any
output that would be helpful?  Period, zonegroup get, etc?

On Thu, Sep 7, 2017 at 4:27 PM Yehuda Sadeh-Weinraub 
wrote:

> On Thu, Sep 7, 2017 at 11:02 PM, David Turner 
> wrote:
> > I created a test user named 'ice' and then used it to create a bucket
> named
> > ice.  The bucket ice can be found in the second dc, but not the user.
> > `mdlog list` showed ice for the bucket, but not for the user.  I
> performed
> > the same test in the internal realm and it showed the user and bucket
> both
> > in `mdlog list`.
> >
>
> Maybe your radosgw-admin command is running with a ceph user that
> doesn't have permissions to write to the log pool? (probably not,
> because you are able to run the sync init commands).
> Another very slim explanation would be if you had for some reason
> overlapping zones configuration that shared some of the config but not
> all of it, having radosgw running against the correct one and
> radosgw-admin against the bad one. I don't think it's the second
> option.
>
> Yehuda
>
> >
> >
> > On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub 
> > wrote:
> >>
> >> On Thu, Sep 7, 2017 at 10:04 PM, David Turner 
> >> wrote:
> >> > One realm is called public with a zonegroup called public-zg with a
> zone
> >> > for
> >> > each datacenter.  The second realm is called internal with a zonegroup
> >> > called internal-zg with a zone for each datacenter.  they each have
> >> > their
> >> > own rgw's and load balancers.  The needs of our public facing rgw's
> and
> >> > load
> >> > balancers vs internal use ones was different enough that we split them
> >> > up
> >> > completely.  We also have a local realm that does not use multisite
> and
> >> > a
> >> > 4th realm called QA that mimics the public realm as much as possible
> for
> >> > staging configuration stages for the rgw daemons.  All 4 realms have
> >> > their
> >> > own buckets, users, etc and that is all working fine.  For all of the
> >> > radosgw-admin commands I am using the proper identifiers to make sure
> >> > that
> >> > each datacenter and realm are running commands on exactly what I
> expect
> >> > them
> >> > to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1
> >> > --source-zone=public-dc2).
> >> >
> >> > The data sync issue was in the internal realm but running a data sync
> >> > init
> >> > and kickstarting the rgw daemons in each datacenter fixed the data
> >> > discrepancies (I'm thinking it had something to do with a power
> failure
> >> > a
> >> > few months back that I just noticed recently).  The metadata sync
> issue
> >> > is
> >> > in the public realm.  I have no idea what is causing this to not sync
> >> > properly since running a `metadata sync init` catches it back up to
> the
> >> > primary zone, but then it doesn't receive any new users created after
> >> > that.
> >> >
> >>
> >> Sounds like an issue with the metadata log in the primary master zone.
> >> Not sure what could go wrong there, but maybe the master zone doesn't
> >> know that it is a master zone, or it's set to not log metadata. Or
> >> maybe there's a problem when the secondary is trying to fetch the
> >> metadata log. Maybe some kind of # of shards mismatch (though not
> >> likely).
> >> Try to see if the master logs any changes: should use the
> >> 'radosgw-admin mdlog list' command.
> >>
> >> Yehuda
> >>
> >> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub <
> yeh...@redhat.com>
> >> > wrote:
> >> >>
> >> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
> >> >> wrote:
> >> >> > Ok, I've been testing, investigating, researching, etc for the last
> >> >> > week
> >> >> > and
> >> >> > I don't have any problems with data syncing.  The clients on one
> side
> >> >> > are
> >> >> > creating multipart objects while the multisite sync is creating
> them
> >> >> > as
> >> >> > whole objects and one of the datacenters is slower at cleaning up
> the
> >> >> > shadow
> >> >> > files.  That's the big discrepancy between object counts in the
> pools
> >> >> > between datacenters.  I created a tool that goes through for each
> >> >> > bucket
> >> >> > in
> >> >> > a realm and does a recursive listing of all objects in it for both
> >> >> > datacenters and compares the 2 lists for any differences.  The data
> >> >> > is
> >> >> > definitely in sync between the 2 datacenters down to the modified
> >> >> > time
> >> >> > and
> >> >> > byte of each file in s3.
> >> >> >
> >> >> > The metadata is still not syncing for the other realm, though.  If
> I
> >> >> > run
> >> >> > `metadata sync init` then the second datacenter will catch up with
> >> >> > all
> >> >> > of
> >> >> > the new users, but until I do that newly created users on the
> primary
> >> >> > side
> >> >> > don't exist on the secondary side.  `metadata sync status`, `sync
> >> >> > status`,
> >> >> 

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread Yehuda Sadeh-Weinraub
On Thu, Sep 7, 2017 at 11:02 PM, David Turner  wrote:
> I created a test user named 'ice' and then used it to create a bucket named
> ice.  The bucket ice can be found in the second dc, but not the user.
> `mdlog list` showed ice for the bucket, but not for the user.  I performed
> the same test in the internal realm and it showed the user and bucket both
> in `mdlog list`.
>

Maybe your radosgw-admin command is running with a ceph user that
doesn't have permissions to write to the log pool? (probably not,
because you are able to run the sync init commands).
Another very slim possibility is that you somehow have overlapping zone
configurations that share some of the config but not all of it, with
radosgw running against the correct one and radosgw-admin against the bad
one. I don't think it's the second option, though.
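
(One quick way to rule out the permissions theory, assuming the default admin
keyring -- the client names here are only examples:)

    ceph auth get client.admin            # caps radosgw-admin uses by default
    ceph auth get client.rgw.public-dc1   # compare with the key the radosgw daemon uses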

Yehuda

>
>
> On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub 
> wrote:
>>
>> On Thu, Sep 7, 2017 at 10:04 PM, David Turner 
>> wrote:
>> > One realm is called public with a zonegroup called public-zg with a zone
>> > for
>> > each datacenter.  The second realm is called internal with a zonegroup
>> > called internal-zg with a zone for each datacenter.  they each have
>> > their
>> > own rgw's and load balancers.  The needs of our public facing rgw's and
>> > load
>> > balancers vs internal use ones was different enough that we split them
>> > up
>> > completely.  We also have a local realm that does not use multisite and
>> > a
>> > 4th realm called QA that mimics the public realm as much as possible for
>> > staging configuration stages for the rgw daemons.  All 4 realms have
>> > their
>> > own buckets, users, etc and that is all working fine.  For all of the
>> > radosgw-admin commands I am using the proper identifiers to make sure
>> > that
>> > each datacenter and realm are running commands on exactly what I expect
>> > them
>> > to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1
>> > --source-zone=public-dc2).
>> >
>> > The data sync issue was in the internal realm but running a data sync
>> > init
>> > and kickstarting the rgw daemons in each datacenter fixed the data
>> > discrepancies (I'm thinking it had something to do with a power failure
>> > a
>> > few months back that I just noticed recently).  The metadata sync issue
>> > is
>> > in the public realm.  I have no idea what is causing this to not sync
>> > properly since running a `metadata sync init` catches it back up to the
>> > primary zone, but then it doesn't receive any new users created after
>> > that.
>> >
>>
>> Sounds like an issue with the metadata log in the primary master zone.
>> Not sure what could go wrong there, but maybe the master zone doesn't
>> know that it is a master zone, or it's set to not log metadata. Or
>> maybe there's a problem when the secondary is trying to fetch the
>> metadata log. Maybe some kind of # of shards mismatch (though not
>> likely).
>> Try to see if the master logs any changes: should use the
>> 'radosgw-admin mdlog list' command.
>>
>> Yehuda
>>
>> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub 
>> > wrote:
>> >>
>> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
>> >> wrote:
>> >> > Ok, I've been testing, investigating, researching, etc for the last
>> >> > week
>> >> > and
>> >> > I don't have any problems with data syncing.  The clients on one side
>> >> > are
>> >> > creating multipart objects while the multisite sync is creating them
>> >> > as
>> >> > whole objects and one of the datacenters is slower at cleaning up the
>> >> > shadow
>> >> > files.  That's the big discrepancy between object counts in the pools
>> >> > between datacenters.  I created a tool that goes through for each
>> >> > bucket
>> >> > in
>> >> > a realm and does a recursive listing of all objects in it for both
>> >> > datacenters and compares the 2 lists for any differences.  The data
>> >> > is
>> >> > definitely in sync between the 2 datacenters down to the modified
>> >> > time
>> >> > and
>> >> > byte of each file in s3.
>> >> >
>> >> > The metadata is still not syncing for the other realm, though.  If I
>> >> > run
>> >> > `metadata sync init` then the second datacenter will catch up with
>> >> > all
>> >> > of
>> >> > the new users, but until I do that newly created users on the primary
>> >> > side
>> >> > don't exist on the secondary side.  `metadata sync status`, `sync
>> >> > status`,
>> >> > `metadata sync run` (only left running for 30 minutes before I ctrl+c
>> >> > it),
>> >> > etc don't show any problems... but the new users just don't exist on
>> >> > the
>> >> > secondary side until I run `metadata sync init`.  I created a new
>> >> > bucket
>> >> > with the new user and the bucket shows up in the second datacenter,
>> >> > but
>> >> > no
>> >> > objects because the objects don't have a valid owner.
>> >> >
>> >> > Thank you all for the 

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread David Turner
I created a test user named 'ice' and then used it to create a bucket named
ice.  The bucket ice can be found in the second dc, but not the user.
 `mdlog list` showed ice for the bucket, but not for the user.  I performed
the same test in the internal realm and it showed the user and bucket both
in `mdlog list`.
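
(For reference, that check can be reproduced with something like the following
on the master zone -- a sketch only:)

    radosgw-admin mdlog list --rgw-realm=public --rgw-zone=public-dc1 | grep -i ice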



On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub 
wrote:

> On Thu, Sep 7, 2017 at 10:04 PM, David Turner 
> wrote:
> > One realm is called public with a zonegroup called public-zg with a zone
> for
> > each datacenter.  The second realm is called internal with a zonegroup
> > called internal-zg with a zone for each datacenter.  they each have their
> > own rgw's and load balancers.  The needs of our public facing rgw's and
> load
> > balancers vs internal use ones was different enough that we split them up
> > completely.  We also have a local realm that does not use multisite and a
> > 4th realm called QA that mimics the public realm as much as possible for
> > staging configuration stages for the rgw daemons.  All 4 realms have
> their
> > own buckets, users, etc and that is all working fine.  For all of the
> > radosgw-admin commands I am using the proper identifiers to make sure
> that
> > each datacenter and realm are running commands on exactly what I expect
> them
> > to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1
> > --source-zone=public-dc2).
> >
> > The data sync issue was in the internal realm but running a data sync
> init
> > and kickstarting the rgw daemons in each datacenter fixed the data
> > discrepancies (I'm thinking it had something to do with a power failure a
> > few months back that I just noticed recently).  The metadata sync issue
> is
> > in the public realm.  I have no idea what is causing this to not sync
> > properly since running a `metadata sync init` catches it back up to the
> > primary zone, but then it doesn't receive any new users created after
> that.
> >
>
> Sounds like an issue with the metadata log in the primary master zone.
> Not sure what could go wrong there, but maybe the master zone doesn't
> know that it is a master zone, or it's set to not log metadata. Or
> maybe there's a problem when the secondary is trying to fetch the
> metadata log. Maybe some kind of # of shards mismatch (though not
> likely).
> Try to see if the master logs any changes: should use the
> 'radosgw-admin mdlog list' command.
>
> Yehuda
>
> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub 
> > wrote:
> >>
> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
> >> wrote:
> >> > Ok, I've been testing, investigating, researching, etc for the last
> week
> >> > and
> >> > I don't have any problems with data syncing.  The clients on one side
> >> > are
> >> > creating multipart objects while the multisite sync is creating them
> as
> >> > whole objects and one of the datacenters is slower at cleaning up the
> >> > shadow
> >> > files.  That's the big discrepancy between object counts in the pools
> >> > between datacenters.  I created a tool that goes through for each
> bucket
> >> > in
> >> > a realm and does a recursive listing of all objects in it for both
> >> > datacenters and compares the 2 lists for any differences.  The data is
> >> > definitely in sync between the 2 datacenters down to the modified time
> >> > and
> >> > byte of each file in s3.
> >> >
> >> > The metadata is still not syncing for the other realm, though.  If I
> run
> >> > `metadata sync init` then the second datacenter will catch up with all
> >> > of
> >> > the new users, but until I do that newly created users on the primary
> >> > side
> >> > don't exist on the secondary side.  `metadata sync status`, `sync
> >> > status`,
> >> > `metadata sync run` (only left running for 30 minutes before I ctrl+c
> >> > it),
> >> > etc don't show any problems... but the new users just don't exist on
> the
> >> > secondary side until I run `metadata sync init`.  I created a new
> bucket
> >> > with the new user and the bucket shows up in the second datacenter,
> but
> >> > no
> >> > objects because the objects don't have a valid owner.
> >> >
> >> > Thank you all for the help with the data sync issue.  You pushed me
> into
> >> > good directions.  Does anyone have any insight as to what is
> preventing
> >> > the
> >> > metadata from syncing in the other realm?  I have 2 realms being sync
> >> > using
> >> > multi-site and it's only 1 of them that isn't getting the metadata
> >> > across.
> >> > As far as I can tell it is configured identically.
> >>
> >> What do you mean you have two realms? Zones and zonegroups need to
> >> exist in the same realm in order for meta and data sync to happen
> >> correctly. Maybe I'm misunderstanding.
> >>
> >> Yehuda
> >>
> >> >
> >> > On Thu, Aug 31, 2017 at 12:46 PM David Turner 
> >> > wrote:
> >> >>
> >> >> All of the messages from sync error 

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread Yehuda Sadeh-Weinraub
On Thu, Sep 7, 2017 at 10:04 PM, David Turner  wrote:
> One realm is called public with a zonegroup called public-zg with a zone for
> each datacenter.  The second realm is called internal with a zonegroup
> called internal-zg with a zone for each datacenter.  they each have their
> own rgw's and load balancers.  The needs of our public facing rgw's and load
> balancers vs internal use ones was different enough that we split them up
> completely.  We also have a local realm that does not use multisite and a
> 4th realm called QA that mimics the public realm as much as possible for
> staging configuration stages for the rgw daemons.  All 4 realms have their
> own buckets, users, etc and that is all working fine.  For all of the
> radosgw-admin commands I am using the proper identifiers to make sure that
> each datacenter and realm are running commands on exactly what I expect them
> to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1
> --source-zone=public-dc2).
>
> The data sync issue was in the internal realm but running a data sync init
> and kickstarting the rgw daemons in each datacenter fixed the data
> discrepancies (I'm thinking it had something to do with a power failure a
> few months back that I just noticed recently).  The metadata sync issue is
> in the public realm.  I have no idea what is causing this to not sync
> properly since running a `metadata sync init` catches it back up to the
> primary zone, but then it doesn't receive any new users created after that.
>

Sounds like an issue with the metadata log in the primary master zone.
Not sure what could go wrong there, but maybe the master zone doesn't
know that it is a master zone, or it's set to not log metadata. Or
maybe there's a problem when the secondary is trying to fetch the
metadata log. Maybe some kind of mismatch in the number of shards (though
not likely).
Try to see if the master logs any changes; you should use the
'radosgw-admin mdlog list' command.
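
(If it helps, the master and logging flags can be eyeballed straight from the
zonegroup JSON -- names below follow the examples earlier in this thread:)

    radosgw-admin zonegroup get --rgw-zonegroup=public-zg --rgw-realm=public \
        | grep -E '"(is_master|master_zone|log_meta|log_data)"'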

Yehuda

> On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub 
> wrote:
>>
>> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
>> wrote:
>> > Ok, I've been testing, investigating, researching, etc for the last week
>> > and
>> > I don't have any problems with data syncing.  The clients on one side
>> > are
>> > creating multipart objects while the multisite sync is creating them as
>> > whole objects and one of the datacenters is slower at cleaning up the
>> > shadow
>> > files.  That's the big discrepancy between object counts in the pools
>> > between datacenters.  I created a tool that goes through for each bucket
>> > in
>> > a realm and does a recursive listing of all objects in it for both
>> > datacenters and compares the 2 lists for any differences.  The data is
>> > definitely in sync between the 2 datacenters down to the modified time
>> > and
>> > byte of each file in s3.
>> >
>> > The metadata is still not syncing for the other realm, though.  If I run
>> > `metadata sync init` then the second datacenter will catch up with all
>> > of
>> > the new users, but until I do that newly created users on the primary
>> > side
>> > don't exist on the secondary side.  `metadata sync status`, `sync
>> > status`,
>> > `metadata sync run` (only left running for 30 minutes before I ctrl+c
>> > it),
>> > etc don't show any problems... but the new users just don't exist on the
>> > secondary side until I run `metadata sync init`.  I created a new bucket
>> > with the new user and the bucket shows up in the second datacenter, but
>> > no
>> > objects because the objects don't have a valid owner.
>> >
>> > Thank you all for the help with the data sync issue.  You pushed me into
>> > good directions.  Does anyone have any insight as to what is preventing
>> > the
>> > metadata from syncing in the other realm?  I have 2 realms being sync
>> > using
>> > multi-site and it's only 1 of them that isn't getting the metadata
>> > across.
>> > As far as I can tell it is configured identically.
>>
>> What do you mean you have two realms? Zones and zonegroups need to
>> exist in the same realm in order for meta and data sync to happen
>> correctly. Maybe I'm misunderstanding.
>>
>> Yehuda
>>
>> >
>> > On Thu, Aug 31, 2017 at 12:46 PM David Turner 
>> > wrote:
>> >>
>> >> All of the messages from sync error list are listed below.  The number
>> >> on
>> >> the left is how many times the error message is found.
>> >>
>> >>1811 "message": "failed to sync bucket instance:
>> >> (16) Device or resource busy"
>> >>   7 "message": "failed to sync bucket instance:
>> >> (5) Input\/output error"
>> >>  65 "message": "failed to sync object"
>> >>
>> >> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman 
>> >> wrote:
>> >>>
>> >>>
>> >>> Hi David,
>> >>>
>> >>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner 

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread David Turner
One realm is called public, with a zonegroup called public-zg and a zone
for each datacenter.  The second realm is called internal, with a zonegroup
called internal-zg and a zone for each datacenter.  They each have their
own RGWs and load balancers.  The needs of our public-facing RGWs and load
balancers vs. the internal ones were different enough that we split them up
completely.  We also have a local realm that does not use multisite and a
4th realm called QA that mimics the public realm as much as possible, for
staging configuration changes for the rgw daemons.  All 4 realms have their
own buckets, users, etc, and that is all working fine.  For all of the
radosgw-admin commands I am using the proper identifiers to make sure that
each datacenter and realm are running commands on exactly what I expect
them to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1
--source-zone=public-dc2).
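
(As a sanity check of those identifiers, something like the following on each
cluster -- a sketch only, names as above:)

    radosgw-admin realm list
    radosgw-admin zonegroup list --rgw-realm=public
    radosgw-admin zone list --rgw-realm=public
    radosgw-admin period get --rgw-realm=public   # lists the zonegroups and their master zones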

The data sync issue was in the internal realm but running a data sync init
and kickstarting the rgw daemons in each datacenter fixed the data
discrepancies (I'm thinking it had something to do with a power failure a
few months back that I just noticed recently).  The metadata sync issue is
in the public realm.  I have no idea what is causing this to not sync
properly since running a `metadata sync init` catches it back up to the
primary zone, but then it doesn't receive any new users created after that.

On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub 
wrote:

> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
> wrote:
> > Ok, I've been testing, investigating, researching, etc for the last week
> and
> > I don't have any problems with data syncing.  The clients on one side are
> > creating multipart objects while the multisite sync is creating them as
> > whole objects and one of the datacenters is slower at cleaning up the
> shadow
> > files.  That's the big discrepancy between object counts in the pools
> > between datacenters.  I created a tool that goes through for each bucket
> in
> > a realm and does a recursive listing of all objects in it for both
> > datacenters and compares the 2 lists for any differences.  The data is
> > definitely in sync between the 2 datacenters down to the modified time
> and
> > byte of each file in s3.
> >
> > The metadata is still not syncing for the other realm, though.  If I run
> > `metadata sync init` then the second datacenter will catch up with all of
> > the new users, but until I do that newly created users on the primary
> side
> > don't exist on the secondary side.  `metadata sync status`, `sync
> status`,
> > `metadata sync run` (only left running for 30 minutes before I ctrl+c
> it),
> > etc don't show any problems... but the new users just don't exist on the
> > secondary side until I run `metadata sync init`.  I created a new bucket
> > with the new user and the bucket shows up in the second datacenter, but
> no
> > objects because the objects don't have a valid owner.
> >
> > Thank you all for the help with the data sync issue.  You pushed me into
> > good directions.  Does anyone have any insight as to what is preventing
> the
> > metadata from syncing in the other realm?  I have 2 realms being sync
> using
> > multi-site and it's only 1 of them that isn't getting the metadata
> across.
> > As far as I can tell it is configured identically.
>
> What do you mean you have two realms? Zones and zonegroups need to
> exist in the same realm in order for meta and data sync to happen
> correctly. Maybe I'm misunderstanding.
>
> Yehuda
>
> >
> > On Thu, Aug 31, 2017 at 12:46 PM David Turner 
> wrote:
> >>
> >> All of the messages from sync error list are listed below.  The number
> on
> >> the left is how many times the error message is found.
> >>
> >>1811 "message": "failed to sync bucket instance:
> >> (16) Device or resource busy"
> >>   7 "message": "failed to sync bucket instance:
> >> (5) Input\/output error"
> >>  65 "message": "failed to sync object"
> >>
> >> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman 
> >> wrote:
> >>>
> >>>
> >>> Hi David,
> >>>
> >>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner 
> >>> wrote:
> 
>  The vast majority of the sync error list is "failed to sync bucket
>  instance: (16) Device or resource busy".  I can't find anything on
> Google
>  about this error message in relation to Ceph.  Does anyone have any
> idea
>  what this means? and/or how to fix it?
> >>>
> >>>
> >>> Those are intermediate errors resulting from several radosgw trying to
> >>> acquire the same sync log shard lease. It doesn't affect the sync
> progress.
> >>> Are there any other errors?
> >>>
> >>> Orit
> 
> 
>  On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley 
> wrote:
> >
> > Hi David,
> >
> > The 'data sync 

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread Yehuda Sadeh-Weinraub
On Thu, Sep 7, 2017 at 7:44 PM, David Turner  wrote:
> Ok, I've been testing, investigating, researching, etc for the last week and
> I don't have any problems with data syncing.  The clients on one side are
> creating multipart objects while the multisite sync is creating them as
> whole objects and one of the datacenters is slower at cleaning up the shadow
> files.  That's the big discrepancy between object counts in the pools
> between datacenters.  I created a tool that goes through for each bucket in
> a realm and does a recursive listing of all objects in it for both
> datacenters and compares the 2 lists for any differences.  The data is
> definitely in sync between the 2 datacenters down to the modified time and
> byte of each file in s3.
>
> The metadata is still not syncing for the other realm, though.  If I run
> `metadata sync init` then the second datacenter will catch up with all of
> the new users, but until I do that newly created users on the primary side
> don't exist on the secondary side.  `metadata sync status`, `sync status`,
> `metadata sync run` (only left running for 30 minutes before I ctrl+c it),
> etc don't show any problems... but the new users just don't exist on the
> secondary side until I run `metadata sync init`.  I created a new bucket
> with the new user and the bucket shows up in the second datacenter, but no
> objects because the objects don't have a valid owner.
>
> Thank you all for the help with the data sync issue.  You pushed me into
> good directions.  Does anyone have any insight as to what is preventing the
> metadata from syncing in the other realm?  I have 2 realms being sync using
> multi-site and it's only 1 of them that isn't getting the metadata across.
> As far as I can tell it is configured identically.

What do you mean you have two realms? Zones and zonegroups need to
exist in the same realm in order for meta and data sync to happen
correctly. Maybe I'm misunderstanding.
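
(A quick way to confirm both sites agree on the realm and period, assuming the
public realm from earlier in the thread -- run on each site and compare:)

    radosgw-admin period get --rgw-realm=public | grep -E '"(id|epoch|realm_id)"'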

Yehuda

>
> On Thu, Aug 31, 2017 at 12:46 PM David Turner  wrote:
>>
>> All of the messages from sync error list are listed below.  The number on
>> the left is how many times the error message is found.
>>
>>1811 "message": "failed to sync bucket instance:
>> (16) Device or resource busy"
>>   7 "message": "failed to sync bucket instance:
>> (5) Input\/output error"
>>  65 "message": "failed to sync object"
>>
>> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman 
>> wrote:
>>>
>>>
>>> Hi David,
>>>
>>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner 
>>> wrote:

 The vast majority of the sync error list is "failed to sync bucket
 instance: (16) Device or resource busy".  I can't find anything on Google
 about this error message in relation to Ceph.  Does anyone have any idea
 what this means? and/or how to fix it?
>>>
>>>
>>> Those are intermediate errors resulting from several radosgw trying to
>>> acquire the same sync log shard lease. It doesn't affect the sync progress.
>>> Are there any other errors?
>>>
>>> Orit


 On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley  wrote:
>
> Hi David,
>
> The 'data sync init' command won't touch any actual object data, no.
> Resetting the data sync status will just cause a zone to restart a full 
> sync
> of the --source-zone's data changes log. This log only lists which
> buckets/shards have changes in them, which causes radosgw to consider them
> for bucket sync. So while the command may silence the warnings about data
> shards being behind, it's unlikely to resolve the issue with missing 
> objects
> in those buckets.
>
> When data sync is behind for an extended period of time, it's usually
> because it's stuck retrying previous bucket sync failures. The 'sync error
> list' may help narrow down where those failures are.
>
> There is also a 'bucket sync init' command to clear the bucket sync
> status. Following that with a 'bucket sync run' should restart a full sync
> on the bucket, pulling in any new objects that are present on the
> source-zone. I'm afraid that those commands haven't seen a lot of polish 
> or
> testing, however.
>
> Casey
>
>
> On 08/24/2017 04:15 PM, David Turner wrote:
>
> Apparently the data shards that are behind go in both directions, but
> only one zone is aware of the problem.  Each cluster has objects in their
> data pool that the other doesn't have.  I'm thinking about initiating a
> `data sync init` on both sides (one at a time) to get them back on the 
> same
> page.  Does anyone know if that command will overwrite any local data that
> the zone has that the other doesn't if you run `data sync init` on it?
>
> On Thu, Aug 24, 2017 at 1:51 PM David Turner 

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread David Turner
Ok, I've been testing, investigating, researching, etc for the last week
and I don't have any problems with data syncing.  The clients on one side
are creating multipart objects while the multisite sync is creating them as
whole objects and one of the datacenters is slower at cleaning up the
shadow files.  That accounts for the big discrepancy in object counts
between the datacenters' pools.  I created a tool that goes through each
bucket in a realm and does a recursive listing of all objects in it for
both datacenters and compares the 2 lists for any differences.  The data is
definitely in sync between the 2 datacenters down to the modified time and
byte of each file in s3.
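
(Roughly the same comparison can be sketched with the aws CLI -- the endpoints,
profiles, and use of jq below are placeholders, not the actual tool:)

    for b in $(radosgw-admin bucket list --rgw-realm=public | jq -r '.[]'); do
      for dc in dc1 dc2; do
        aws --profile "$dc" --endpoint-url "http://rgw-${dc}.example.com" \
            s3api list-objects-v2 --bucket "$b" \
            --query 'Contents[].[Key,LastModified,Size]' --output text | sort > "/tmp/${b}.${dc}"
      done
      diff -u "/tmp/${b}.dc1" "/tmp/${b}.dc2" || echo "bucket $b differs"
    done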

The metadata is still not syncing for the other realm, though.  If I run
`metadata sync init` then the second datacenter will catch up with all of
the new users, but until I do that newly created users on the primary side
don't exist on the secondary side.  `metadata sync status`, `sync status`,
`metadata sync run` (only left running for 30 minutes before I ctrl+c it),
etc don't show any problems... but the new users just don't exist on the
secondary side until I run `metadata sync init`.  I created a new bucket
with the new user and the bucket shows up in the second datacenter, but no
objects because the objects don't have a valid owner.
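
(For reference, the sequence on the secondary zone is roughly the following --
realm name as above, and the daemons still need a restart afterwards:)

    radosgw-admin metadata sync status --rgw-realm=public
    radosgw-admin metadata sync init --rgw-realm=public
    # restart the radosgw daemons in the secondary zone, then re-check:
    radosgw-admin metadata sync status --rgw-realm=public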

Thank you all for the help with the data sync issue.  You pushed me into
good directions.  Does anyone have any insight as to what is preventing the
metadata from syncing in the other realm?  I have 2 realms being synced using
multisite, and it's only 1 of them that isn't getting the metadata across.
As far as I can tell it is configured identically.

On Thu, Aug 31, 2017 at 12:46 PM David Turner  wrote:

> All of the messages from sync error list are listed below.  The number on
> the left is how many times the error message is found.
>
>1811 "message": "failed to sync bucket instance:
> (16) Device or resource busy"
>   7 "message": "failed to sync bucket instance:
> (5) Input\/output error"
>  65 "message": "failed to sync object"
>
> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman 
> wrote:
>
>>
>> Hi David,
>>
>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner 
>> wrote:
>>
>>> The vast majority of the sync error list is "failed to sync bucket
>>> instance: (16) Device or resource busy".  I can't find anything on Google
>>> about this error message in relation to Ceph.  Does anyone have any idea
>>> what this means? and/or how to fix it?
>>>
>>
>> Those are intermediate errors resulting from several radosgw trying to
>> acquire the same sync log shard lease. It doesn't affect the sync progress.
>> Are there any other errors?
>>
>> Orit
>>
>>>
>>> On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley  wrote:
>>>
 Hi David,

 The 'data sync init' command won't touch any actual object data, no.
 Resetting the data sync status will just cause a zone to restart a full
 sync of the --source-zone's data changes log. This log only lists which
 buckets/shards have changes in them, which causes radosgw to consider them
 for bucket sync. So while the command may silence the warnings about data
 shards being behind, it's unlikely to resolve the issue with missing
 objects in those buckets.

 When data sync is behind for an extended period of time, it's usually
 because it's stuck retrying previous bucket sync failures. The 'sync error
 list' may help narrow down where those failures are.

 There is also a 'bucket sync init' command to clear the bucket sync
 status. Following that with a 'bucket sync run' should restart a full sync
 on the bucket, pulling in any new objects that are present on the
 source-zone. I'm afraid that those commands haven't seen a lot of polish or
 testing, however.

 Casey

 On 08/24/2017 04:15 PM, David Turner wrote:

 Apparently the data shards that are behind go in both directions, but
 only one zone is aware of the problem.  Each cluster has objects in their
 data pool that the other doesn't have.  I'm thinking about initiating a
 `data sync init` on both sides (one at a time) to get them back on the same
 page.  Does anyone know if that command will overwrite any local data that
 the zone has that the other doesn't if you run `data sync init` on it?

 On Thu, Aug 24, 2017 at 1:51 PM David Turner 
 wrote:

> After restarting the 2 RGW daemons on the second site again,
> everything caught up on the metadata sync.  Is there something about 
> having
> 2 RGW daemons on each side of the multisite that might be causing an issue
> with the sync getting stale?  I have another realm set up the same way 
> that
> is having a hard time with its data shards 

Re: [ceph-users] RGW Multisite metadata sync init

2017-08-31 Thread David Turner
All of the messages from sync error list are listed below.  The number on
the left is how many times the error message is found.

   1811 "message": "failed to sync bucket instance:
(16) Device or resource busy"
  7 "message": "failed to sync bucket instance: (5)
Input\/output error"
 65 "message": "failed to sync object"

On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman  wrote:

>
> Hi David,
>
> On Mon, Aug 28, 2017 at 8:33 PM, David Turner 
> wrote:
>
>> The vast majority of the sync error list is "failed to sync bucket
>> instance: (16) Device or resource busy".  I can't find anything on Google
>> about this error message in relation to Ceph.  Does anyone have any idea
>> what this means? and/or how to fix it?
>>
>
> Those are intermediate errors resulting from several radosgw trying to
> acquire the same sync log shard lease. It doesn't affect the sync progress.
> Are there any other errors?
>
> Orit
>
>>
>> On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley  wrote:
>>
>>> Hi David,
>>>
>>> The 'data sync init' command won't touch any actual object data, no.
>>> Resetting the data sync status will just cause a zone to restart a full
>>> sync of the --source-zone's data changes log. This log only lists which
>>> buckets/shards have changes in them, which causes radosgw to consider them
>>> for bucket sync. So while the command may silence the warnings about data
>>> shards being behind, it's unlikely to resolve the issue with missing
>>> objects in those buckets.
>>>
>>> When data sync is behind for an extended period of time, it's usually
>>> because it's stuck retrying previous bucket sync failures. The 'sync error
>>> list' may help narrow down where those failures are.
>>>
>>> There is also a 'bucket sync init' command to clear the bucket sync
>>> status. Following that with a 'bucket sync run' should restart a full sync
>>> on the bucket, pulling in any new objects that are present on the
>>> source-zone. I'm afraid that those commands haven't seen a lot of polish or
>>> testing, however.
>>>
>>> Casey
>>>
>>> On 08/24/2017 04:15 PM, David Turner wrote:
>>>
>>> Apparently the data shards that are behind go in both directions, but
>>> only one zone is aware of the problem.  Each cluster has objects in their
>>> data pool that the other doesn't have.  I'm thinking about initiating a
>>> `data sync init` on both sides (one at a time) to get them back on the same
>>> page.  Does anyone know if that command will overwrite any local data that
>>> the zone has that the other doesn't if you run `data sync init` on it?
>>>
>>> On Thu, Aug 24, 2017 at 1:51 PM David Turner 
>>> wrote:
>>>
 After restarting the 2 RGW daemons on the second site again, everything
 caught up on the metadata sync.  Is there something about having 2 RGW
 daemons on each side of the multisite that might be causing an issue with
 the sync getting stale?  I have another realm set up the same way that is
 having a hard time with its data shards being behind.  I haven't told them
 to resync, but yesterday I noticed 90 shards were behind.  It's caught back
 up to only 17 shards behind, but the oldest change not applied is 2 months
 old and no order of restarting RGW daemons is helping to resolve this.

 On Thu, Aug 24, 2017 at 10:59 AM David Turner 
 wrote:

> I have a RGW Multisite 10.2.7 set up for bi-directional syncing.  This
> has been operational for 5 months and working fine.  I recently created a
> new user on the master zone, used that user to create a bucket, and put in
> a public-acl object in there.  The Bucket created on the second site, but
> the user did not and the object errors out complaining about the 
> access_key
> not existing.
>
> That led me to think that the metadata isn't syncing, while bucket and
> data both are.  I've also confirmed that data is syncing for other buckets
> as well in both directions. The sync status from the second site was this.
>
>
>   metadata sync syncing
>     full sync: 0/64 shards
>     incremental sync: 64/64 shards
>     metadata is caught up with master
>   data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>     syncing
>     full sync: 0/128 shards
>     incremental sync: 128/128 shards
>     data is caught up with source
>
>
>
> Sync status leads me to think that the second site 

Re: [ceph-users] RGW Multisite metadata sync init

2017-08-29 Thread Orit Wasserman
Hi David,

On Mon, Aug 28, 2017 at 8:33 PM, David Turner  wrote:

> The vast majority of the sync error list is "failed to sync bucket
> instance: (16) Device or resource busy".  I can't find anything on Google
> about this error message in relation to Ceph.  Does anyone have any idea
> what this means? and/or how to fix it?
>

Those are intermediate errors resulting from several radosgw trying to
acquire the same sync log shard lease. It doesn't affect the sync progress.
Are there any other errors?

Orit

>
> On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley  wrote:
>
>> Hi David,
>>
>> The 'data sync init' command won't touch any actual object data, no.
>> Resetting the data sync status will just cause a zone to restart a full
>> sync of the --source-zone's data changes log. This log only lists which
>> buckets/shards have changes in them, which causes radosgw to consider them
>> for bucket sync. So while the command may silence the warnings about data
>> shards being behind, it's unlikely to resolve the issue with missing
>> objects in those buckets.
>>
>> When data sync is behind for an extended period of time, it's usually
>> because it's stuck retrying previous bucket sync failures. The 'sync error
>> list' may help narrow down where those failures are.
>>
>> There is also a 'bucket sync init' command to clear the bucket sync
>> status. Following that with a 'bucket sync run' should restart a full sync
>> on the bucket, pulling in any new objects that are present on the
>> source-zone. I'm afraid that those commands haven't seen a lot of polish or
>> testing, however.
>>
>> Casey
>>
>> On 08/24/2017 04:15 PM, David Turner wrote:
>>
>> Apparently the data shards that are behind go in both directions, but
>> only one zone is aware of the problem.  Each cluster has objects in their
>> data pool that the other doesn't have.  I'm thinking about initiating a
>> `data sync init` on both sides (one at a time) to get them back on the same
>> page.  Does anyone know if that command will overwrite any local data that
>> the zone has that the other doesn't if you run `data sync init` on it?
>>
>> On Thu, Aug 24, 2017 at 1:51 PM David Turner 
>> wrote:
>>
>>> After restarting the 2 RGW daemons on the second site again, everything
>>> caught up on the metadata sync.  Is there something about having 2 RGW
>>> daemons on each side of the multisite that might be causing an issue with
>>> the sync getting stale?  I have another realm set up the same way that is
>>> having a hard time with its data shards being behind.  I haven't told them
>>> to resync, but yesterday I noticed 90 shards were behind.  It's caught back
>>> up to only 17 shards behind, but the oldest change not applied is 2 months
>>> old and no order of restarting RGW daemons is helping to resolve this.
>>>
>>> On Thu, Aug 24, 2017 at 10:59 AM David Turner 
>>> wrote:
>>>
 I have a RGW Multisite 10.2.7 set up for bi-directional syncing.  This
 has been operational for 5 months and working fine.  I recently created a
 new user on the master zone, used that user to create a bucket, and put in
 a public-acl object in there.  The Bucket created on the second site, but
 the user did not and the object errors out complaining about the access_key
 not existing.

 That led me to think that the metadata isn't syncing, while bucket and
 data both are.  I've also confirmed that data is syncing for other buckets
 as well in both directions. The sync status from the second site was this.


   metadata sync syncing
     full sync: 0/64 shards
     incremental sync: 64/64 shards
     metadata is caught up with master
   data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
     syncing
     full sync: 0/128 shards
     incremental sync: 128/128 shards
     data is caught up with source



 Sync status leads me to think that the second site believes it is up to
 date, even though it is missing a freshly created user.  I restarted all of
 the rgw daemons for the zonegroup, but it didn't trigger anything to fix
 the missing user in the second site.  I did some googling and found the
 sync init commands mentioned in a few ML posts and used metadata sync init
 and now have this as the sync status.


   metadata sync preparing for full sync
     full sync: 64/64 shards
     full sync: 0 entries to sync

Re: [ceph-users] RGW Multisite metadata sync init

2017-08-28 Thread David Turner
The vast majority of the sync error list is "failed to sync bucket
instance: (16) Device or resource busy".  I can't find anything on Google
about this error message in relation to Ceph.  Does anyone have any idea
what this means? and/or how to fix it?

On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley  wrote:

> Hi David,
>
> The 'data sync init' command won't touch any actual object data, no.
> Resetting the data sync status will just cause a zone to restart a full
> sync of the --source-zone's data changes log. This log only lists which
> buckets/shards have changes in them, which causes radosgw to consider them
> for bucket sync. So while the command may silence the warnings about data
> shards being behind, it's unlikely to resolve the issue with missing
> objects in those buckets.
>
> When data sync is behind for an extended period of time, it's usually
> because it's stuck retrying previous bucket sync failures. The 'sync error
> list' may help narrow down where those failures are.
>
> There is also a 'bucket sync init' command to clear the bucket sync
> status. Following that with a 'bucket sync run' should restart a full sync
> on the bucket, pulling in any new objects that are present on the
> source-zone. I'm afraid that those commands haven't seen a lot of polish or
> testing, however.
>
> Casey
>
> On 08/24/2017 04:15 PM, David Turner wrote:
>
> Apparently the data shards that are behind go in both directions, but only
> one zone is aware of the problem.  Each cluster has objects in their data
> pool that the other doesn't have.  I'm thinking about initiating a `data
> sync init` on both sides (one at a time) to get them back on the same
> page.  Does anyone know if that command will overwrite any local data that
> the zone has that the other doesn't if you run `data sync init` on it?
>
> On Thu, Aug 24, 2017 at 1:51 PM David Turner 
> wrote:
>
>> After restarting the 2 RGW daemons on the second site again, everything
>> caught up on the metadata sync.  Is there something about having 2 RGW
>> daemons on each side of the multisite that might be causing an issue with
>> the sync getting stale?  I have another realm set up the same way that is
>> having a hard time with its data shards being behind.  I haven't told them
>> to resync, but yesterday I noticed 90 shards were behind.  It's caught back
>> up to only 17 shards behind, but the oldest change not applied is 2 months
>> old and no order of restarting RGW daemons is helping to resolve this.
>>
>> On Thu, Aug 24, 2017 at 10:59 AM David Turner 
>> wrote:
>>
>>> I have a RGW Multisite 10.2.7 set up for bi-directional syncing.  This
>>> has been operational for 5 months and working fine.  I recently created a
>>> new user on the master zone, used that user to create a bucket, and put in
>>> a public-acl object in there.  The Bucket created on the second site, but
>>> the user did not and the object errors out complaining about the access_key
>>> not existing.
>>>
>>> That led me to think that the metadata isn't syncing, while bucket and
>>> data both are.  I've also confirmed that data is syncing for other buckets
>>> as well in both directions. The sync status from the second site was this.
>>>
>>>
>>>1.
>>>
>>>  metadata sync syncing
>>>
>>>2.
>>>
>>>full sync: 0/64 shards
>>>
>>>3.
>>>
>>>incremental sync: 64/64 shards
>>>
>>>4.
>>>
>>>metadata is caught up with master
>>>
>>>5.
>>>
>>>  data sync source: f4c12327-4721-47c9-a365-86332d84c227 
>>> (public-atl01)
>>>
>>>6.
>>>
>>>syncing
>>>
>>>7.
>>>
>>>full sync: 0/128 shards
>>>
>>>8.
>>>
>>>incremental sync: 128/128 shards
>>>
>>>9.
>>>
>>>data is caught up with source
>>>
>>>
>>>
>>> Sync status leads me to think that the second site believes it is up to
>>> date, even though it is missing a freshly created user.  I restarted all of
>>> the rgw daemons for the zonegroup, but it didn't trigger anything to fix
>>> the missing user in the second site.  I did some googling and found the
>>> sync init commands mentioned in a few ML posts and used metadata sync init
>>> and now have this as the sync status.
>>>
>>>
>>>   metadata sync preparing for full sync
>>>     full sync: 64/64 shards
>>>     full sync: 0 entries to sync
>>>     incremental sync: 0/64 shards
>>>     metadata is behind on 70 shards
>>>     oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
>>>   data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>>>     ...
>>>
>>>

Re: [ceph-users] RGW Multisite metadata sync init

2017-08-25 Thread Casey Bodley

Hi David,

The 'data sync init' command won't touch any actual object data, no. 
Resetting the data sync status will just cause a zone to restart a full 
sync of the --source-zone's data changes log. This log only lists which 
buckets/shards have changes in them, which causes radosgw to consider 
them for bucket sync. So while the command may silence the warnings 
about data shards being behind, it's unlikely to resolve the issue with 
missing objects in those buckets.
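
A rough sketch of that reset, with the zone names and the gateway service
unit left as placeholders rather than anything taken from this cluster:

   # on the zone that should re-pull the data changes log from its peer
   radosgw-admin data sync init --rgw-zone=<local-zone> --source-zone=<source-zone>

   # the running gateways only notice the cleared status after a restart,
   # however they happen to be managed (systemd target shown here)
   systemctl restart ceph-radosgw.target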


When data sync is behind for an extended period of time, it's usually 
because it's stuck retrying previous bucket sync failures. The 'sync 
error list' may help narrow down where those failures are.
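
For example, run against the zone that is reporting shards behind (realm
and zone arguments are placeholders, and optional if the defaults already
point at the right zone):

   # entries name the shard, the bucket/object and the error that stopped sync
   radosgw-admin sync error list --rgw-realm=<realm> --rgw-zone=<local-zone>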


There is also a 'bucket sync init' command to clear the bucket sync 
status. Following that with a 'bucket sync run' should restart a full 
sync on the bucket, pulling in any new objects that are present on the 
source-zone. I'm afraid that those commands haven't seen a lot of polish 
or testing, however.
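
Something along these lines, one bucket at a time (bucket and zone names
are placeholders):

   # clear the per-bucket sync status, then walk the bucket again from the source
   radosgw-admin bucket sync init --bucket=<bucket> --source-zone=<source-zone>
   radosgw-admin bucket sync run  --bucket=<bucket> --source-zone=<source-zone>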


Casey


On 08/24/2017 04:15 PM, David Turner wrote:
Apparently the data shards that are behind go in both directions, but 
only one zone is aware of the problem. Each cluster has objects in 
their data pool that the other doesn't have.  I'm thinking about 
initiating a `data sync init` on both sides (one at a time) to get 
them back on the same page.  Does anyone know if that command will 
overwrite any local data that the zone has that the other doesn't if 
you run `data sync init` on it?


On Thu, Aug 24, 2017 at 1:51 PM David Turner wrote:


After restarting the 2 RGW daemons on the second site again,
everything caught up on the metadata sync.  Is there something
about having 2 RGW daemons on each side of the multisite that
might be causing an issue with the sync getting stale?  I have
another realm set up the same way that is having a hard time with
its data shards being behind.  I haven't told them to resync, but
yesterday I noticed 90 shards were behind.  It's caught back up to
only 17 shards behind, but the oldest change not applied is 2
months old and no order of restarting RGW daemons is helping to
resolve this.
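
The usual places to look when shards stay behind like that, sketched here
with realm and zone names as placeholders:

   # overall view: metadata state plus, per source zone, how many data shards are behind
   radosgw-admin sync status --rgw-realm=<realm> --rgw-zone=<local-zone>

   # per-source detail for the data log sync
   radosgw-admin data sync status --source-zone=<source-zone>

   # any errors recorded while retrying those shards
   radosgw-admin sync error list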

On Thu, Aug 24, 2017 at 10:59 AM David Turner wrote:

I have a RGW Multisite 10.2.7 set up for bi-directional
syncing.  This has been operational for 5 months and working
fine.  I recently created a new user on the master zone, used
that user to create a bucket, and put in a public-acl object
in there.  The bucket was created on the second site, but the user
was not, and the object errors out complaining that the
access_key does not exist.
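
One way to confirm that it is the user metadata itself that never arrived
on the second site (uid and zone names are placeholders):

   # list the user metadata entries the secondary actually has
   radosgw-admin metadata list user --rgw-zone=<secondary-zone>

   # or fetch the specific entry; a user that never synced returns an error here
   radosgw-admin metadata get user:<uid> --rgw-zone=<secondary-zone>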

That led me to think that the metadata isn't syncing, while
bucket and data both are.  I've also confirmed that data is
syncing for other buckets as well in both directions. The sync
status from the second site was this.

      metadata sync syncing
        full sync: 0/64 shards
        incremental sync: 64/64 shards
        metadata is caught up with master
      data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
        syncing
        full sync: 0/128 shards
        incremental sync: 128/128 shards
        data is caught up with source


Sync status leads me to think that the second site believes it
is up to date, even though it is missing a freshly created
user.  I restarted all of the rgw daemons for the zonegroup,
but it didn't trigger anything to fix the missing user in the
second site. I did some googling and found the sync init
commands mentioned in a few ML posts and used metadata sync
init and now have this as the sync status.

      metadata sync preparing for full sync
        full sync: 64/64 shards
        full sync: 0 entries to sync
        incremental sync: 0/64 shards
        metadata is behind on 70 shards
        oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
      data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
        syncing
        full sync: 0/128 shards
        incremental sync: 128/128 shards
        data is caught up with source


It definitely triggered a fresh sync and told it to forget
about what it's previously applied as the date of the oldest
change not applied is the day we initially set up multisite
for this zone.  The problem is that was over 12 hours ago and
the sync stat hasn't caught up on any shards yet.

  

Re: [ceph-users] RGW Multisite metadata sync init

2017-08-25 Thread Casey Bodley

Hi David,

The 'radosgw-admin sync error list' command may be useful in debugging 
sync failures for specific entries. For users, we've seen some sync 
failures caused by conflicting user metadata that was only present on 
the secondary site. For example, a user that had the same access key or 
email address, which we require to be unique.
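
Checking for that sort of conflict can come down to comparing the user's
metadata as each zone sees it, and looking up whoever already owns the key
or email on the secondary (uid, key, email and zone names are placeholders):

   # compare the entry between the two zones
   radosgw-admin metadata get user:<uid> --rgw-zone=<master-zone>
   radosgw-admin metadata get user:<uid> --rgw-zone=<secondary-zone>

   # find the existing user that already holds the conflicting key or email
   radosgw-admin user info --access-key=<access-key>
   radosgw-admin user info --email=<email>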


Running multiple gateways on the same zone is fully supported, and 
unlikely to cause these kinds of issues.



On 08/24/2017 01:51 PM, David Turner wrote:
After restarting the 2 RGW daemons on the second site again, 
everything caught up on the metadata sync.  Is there something about 
having 2 RGW daemons on each side of the multisite that might be 
causing an issue with the sync getting stale?  I have another realm 
set up the same way that is having a hard time with its data shards 
being behind.  I haven't told them to resync, but yesterday I noticed 
90 shards were behind. It's caught back up to only 17 shards behind, 
but the oldest change not applied is 2 months old and no order of 
restarting RGW daemons is helping to resolve this.


On Thu, Aug 24, 2017 at 10:59 AM David Turner wrote:


I have a RGW Multisite 10.2.7 set up for bi-directional syncing. 
This has been operational for 5 months and working fine.  I
recently created a new user on the master zone, used that user to
create a bucket, and put in a public-acl object in there.  The
bucket was created on the second site, but the user was not, and the
object errors out complaining that the access_key does not exist.

That led me to think that the metadata isn't syncing, while bucket
and data both are.  I've also confirmed that data is syncing for
other buckets as well in both directions. The sync status from the
second site was this.

      metadata sync syncing
        full sync: 0/64 shards
        incremental sync: 64/64 shards
        metadata is caught up with master
      data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
        syncing
        full sync: 0/128 shards
        incremental sync: 128/128 shards
        data is caught up with source


Sync status leads me to think that the second site believes it is
up to date, even though it is missing a freshly created user.  I
restarted all of the rgw daemons for the zonegroup, but it didn't
trigger anything to fix the missing user in the second site.  I
did some googling and found the sync init commands mentioned in a
few ML posts and used metadata sync init and now have this as the
sync status.

      metadata sync preparing for full sync
        full sync: 64/64 shards
        full sync: 0 entries to sync
        incremental sync: 0/64 shards
        metadata is behind on 70 shards
        oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
      data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
        syncing
        full sync: 0/128 shards
        incremental sync: 128/128 shards
        data is caught up with source


It definitely triggered a fresh sync and told it to forget about
what it's previously applied as the date of the oldest change not
applied is the day we initially set up multisite for this zone. 
The problem is that was over 12 hours ago and the sync stat hasn't
caught up on any shards yet.
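
Whether the restarted full sync is actually moving can be watched with
something like the following (the realm name is a placeholder, and the
counters generally only start moving once the gateways have been restarted
after the init):

   radosgw-admin metadata sync status --rgw-realm=<realm>
   radosgw-admin sync status --rgw-realm=<realm>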

Does anyone have any suggestions other than blast the second site
and set it back up with a fresh start (the only option I can think
of at this point)?

Thank you,
David Turner



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] RGW Multisite metadata sync init

2017-08-24 Thread David Turner
Apparently the data shards that are behind go in both directions, but only
one zone is aware of the problem.  Each cluster has objects in their data
pool that the other doesn't have.  I'm thinking about initiating a `data
sync init` on both sides (one at a time) to get them back on the same
page.  Does anyone know if that command will overwrite any local data that
the zone has that the other doesn't if you run `data sync init` on it?

On Thu, Aug 24, 2017 at 1:51 PM David Turner  wrote:

> After restarting the 2 RGW daemons on the second site again, everything
> caught up on the metadata sync.  Is there something about having 2 RGW
> daemons on each side of the multisite that might be causing an issue with
> the sync getting stale?  I have another realm set up the same way that is
> having a hard time with its data shards being behind.  I haven't told them
> to resync, but yesterday I noticed 90 shards were behind.  It's caught back
> up to only 17 shards behind, but the oldest change not applied is 2 months
> old and no order of restarting RGW daemons is helping to resolve this.
>
> On Thu, Aug 24, 2017 at 10:59 AM David Turner 
> wrote:
>
>> I have a RGW Multisite 10.2.7 set up for bi-directional syncing.  This
>> has been operational for 5 months and working fine.  I recently created a
>> new user on the master zone, used that user to create a bucket, and put in
>> a public-acl object in there.  The bucket was created on the second site, but
>> the user was not, and the object errors out complaining that the access_key
>> does not exist.
>>
>> That led me to think that the metadata isn't syncing, while bucket and
>> data both are.  I've also confirmed that data is syncing for other buckets
>> as well in both directions. The sync status from the second site was this.
>>
>>
>>   metadata sync syncing
>>     full sync: 0/64 shards
>>     incremental sync: 64/64 shards
>>     metadata is caught up with master
>>   data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>>     syncing
>>     full sync: 0/128 shards
>>     incremental sync: 128/128 shards
>>     data is caught up with source
>>
>>
>>
>> Sync status leads me to think that the second site believes it is up to
>> date, even though it is missing a freshly created user.  I restarted all of
>> the rgw daemons for the zonegroup, but it didn't trigger anything to fix
>> the missing user in the second site.  I did some googling and found the
>> sync init commands mentioned in a few ML posts and used metadata sync init
>> and now have this as the sync status.
>>
>>
>>   metadata sync preparing for full sync
>>     full sync: 64/64 shards
>>     full sync: 0 entries to sync
>>     incremental sync: 0/64 shards
>>     metadata is behind on 70 shards
>>     oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
>>   data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>>     syncing
>>     full sync: 0/128 shards
>>     incremental sync: 128/128 shards
>>     data is caught up with source
>>
>>
>>
>> It definitely triggered a fresh sync and told it to forget about what
>> it's previously applied as the date of the oldest change not applied is the
>> day we initially set up multisite for this zone.  The problem is that was
>> over 12 hours ago and the sync stat hasn't caught up on any shards yet.
>>
>> Does anyone have any suggestions other than blast the second site and set
>> it back up with a fresh start (the only option I can think of at this
>> point)?
>>
>> Thank you,
>> David Turner
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Multisite metadata sync init

2017-08-24 Thread David Turner
After restarting the 2 RGW daemons on the second site again, everything
caught up on the metadata sync.  Is there something about having 2 RGW
daemons on each side of the multisite that might be causing an issue with
the sync getting stale?  I have another realm set up the same way that is
having a hard time with its data shards being behind.  I haven't told them
to resync, but yesterday I noticed 90 shards were behind.  It's caught back
up to only 17 shards behind, but the oldest change not applied is 2 months
old and no order of restarting RGW daemons is helping to resolve this.

On Thu, Aug 24, 2017 at 10:59 AM David Turner  wrote:

> I have a RGW Multisite 10.2.7 set up for bi-directional syncing.  This has
> been operational for 5 months and working fine.  I recently created a new
> user on the master zone, used that user to create a bucket, and put in a
> public-acl object in there.  The bucket was created on the second site, but the
> user was not, and the object errors out complaining that the access_key does not
> exist.
>
> That led me to think that the metadata isn't syncing, while bucket and
> data both are.  I've also confirmed that data is syncing for other buckets
> as well in both directions. The sync status from the second site was this.
>
>
>   metadata sync syncing
>     full sync: 0/64 shards
>     incremental sync: 64/64 shards
>     metadata is caught up with master
>   data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>     syncing
>     full sync: 0/128 shards
>     incremental sync: 128/128 shards
>     data is caught up with source
>
>
>
> Sync status leads me to think that the second site believes it is up to
> date, even though it is missing a freshly created user.  I restarted all of
> the rgw daemons for the zonegroup, but it didn't trigger anything to fix
> the missing user in the second site.  I did some googling and found the
> sync init commands mentioned in a few ML posts and used metadata sync init
> and now have this as the sync status.
>
>
>   metadata sync preparing for full sync
>     full sync: 64/64 shards
>     full sync: 0 entries to sync
>     incremental sync: 0/64 shards
>     metadata is behind on 70 shards
>     oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
>   data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>     syncing
>     full sync: 0/128 shards
>     incremental sync: 128/128 shards
>     data is caught up with source
>
>
>
> It definitely triggered a fresh sync and told it to forget about what it's
> previously applied as the date of the oldest change not applied is the day
> we initially set up multisite for this zone.  The problem is that was over
> 12 hours ago and the sync stat hasn't caught up on any shards yet.
>
> Does anyone have any suggestions other than blast the second site and set
> it back up with a fresh start (the only option I can think of at this
> point)?
>
> Thank you,
> David Turner
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com