Re: [ceph-users] scrub error on firefly

2014-07-12 Thread Samuel Just
When you see another one, can you include the xattrs on the files as
well (you can use the attr(1) utility)?
-Sam
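
For anyone gathering these, a minimal sketch of how the object copies and
their xattrs could be collected, assuming the replica lives under the usual
/var/lib/ceph/osd tree (the path below reuses the example object from later
in this thread and is illustrative only):

  # one replica of the suspect object (illustrative path)
  OBJ=/var/lib/ceph/osd/ceph-1/current/3.c6_head/DIR_6/DIR_C/DIR_5/rb.0.b0ce3.238e1f29.000b__head_34DC35C6__3

  attr -l "$OBJ"                          # list xattr names (e.g. ceph._, ceph.snapset)
  attr -q -g ceph._ "$OBJ" > ceph._.bin   # dump one attribute's raw value
  getfattr -d -m '.' -e hex "$OBJ"        # or dump all of them hex-encoded
  tar czf osd.tar.gz "$OBJ" ceph._.bin    # bundle everything for the list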

On Sat, Jul 12, 2014 at 9:51 AM, Randy Smith  wrote:
> That image is the root file system for a linux ldap server.
>
> --
> Randall Smith
> Adams State University
> www.adams.edu
> 719-587-7741
>
> On Jul 12, 2014 10:34 AM, "Samuel Just"  wrote:
>>
>> Here's a diff of the two files.  One of the two files appears to
>> contain ceph leveldb keys?  Randy, do you have an idea of what this
>> rbd image is being used for (rb.0.b0ce3.238e1f29, that is)?
>> -Sam
>>
>> On Fri, Jul 11, 2014 at 7:25 PM, Randy Smith  wrote:
>> > Greetings,
>> >
>> > Well it happened again with two pgs this time, still in the same rbd
>> > image.
>> > They are at http://people.adams.edu/~rbsmith/osd.tar. I think I grabbed
>> > the files correctly. If not, let me know and I'll try again on the next
>> > failure. It certainly is happening often enough.
>> >
>> >
>> > On Fri, Jul 11, 2014 at 3:39 PM, Samuel Just 
>> > wrote:
>> >>
>> >> And grab the xattrs as well.
>> >> -Sam
>> >>
>> >> On Fri, Jul 11, 2014 at 2:39 PM, Samuel Just 
>> >> wrote:
>> >> > Right.
>> >> > -Sam
>> >> >
>> >> > On Fri, Jul 11, 2014 at 2:05 PM, Randy Smith 
>> >> > wrote:
>> >> >> Greetings,
>> >> >>
>> >> >> I'm using xfs.
>> >> >>
>> >> >> Also, when you asked in a previous email if I could send the object,
>> >> >> do you mean the files from each server named something like this:
>> >> >>
>> >> >>
>> >> >> ./3.c6_head/DIR_6/DIR_C/DIR_5/rb.0.b0ce3.238e1f29.000b__head_34DC35C6__3
>> >> >> ?
>> >> >>
>> >> >>
>> >> >> On Fri, Jul 11, 2014 at 2:00 PM, Samuel Just 
>> >> >> wrote:
>> >> >>>
>> >> >>> Also, what filesystem are you using?
>> >> >>> -Sam
>> >> >>>
>> >> >>> On Fri, Jul 11, 2014 at 10:37 AM, Sage Weil 
>> >> >>> wrote:
>> >> >>> > One other thing we might also try is catching this earlier (on
>> >> >>> > first read of corrupt data) instead of waiting for scrub.  If you
>> >> >>> > are not super performance sensitive, you can add
>> >> >>> >
>> >> >>> >  filestore sloppy crc = true
>> >> >>> >  filestore sloppy crc block size = 524288
>> >> >>> >
>> >> >>> > That will track and verify CRCs on any large (>512k) writes.
>> >> >>> > Smaller block sizes will give more precision and more checks, but
>> >> >>> > will generate larger xattrs and have a bigger impact on
>> >> >>> > performance...
>> >> >>> >
>> >> >>> > sage
>> >> >>> >
>> >> >>> >
>> >> >>> > On Fri, 11 Jul 2014, Samuel Just wrote:
>> >> >>> >
>> >> >>> >> When you get the next inconsistency, can you copy the actual
>> >> >>> >> objects from the osd store trees and get them to us?  That might
>> >> >>> >> provide a clue.
>> >> >>> >> -Sam
>> >> >>> >>
>> >> >>> >> On Fri, Jul 11, 2014 at 6:52 AM, Randy Smith 
>> >> >>> >> wrote:
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> > On Thu, Jul 10, 2014 at 4:40 PM, Samuel Just
>> >> >>> >> > 
>> >> >>> >> > wrote:
>> >> >>> >> >>
>> >> >>> >> >> It could be an indication of a problem on osd 5, but the
>> >> >>> >> >> timing is worrying.  Can you attach your ceph.conf?
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> > Attached.
>> >> >>> >> >
>> >> >>> >> >>
>> >> >>> >> >> Have there been any osds
>> >> >>> >> >> going down, new osds added, anything to cause recovery?
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> > I upgraded to firefly last week. As part of the upgrade I,
>> >> >>> >> > obviously, had to restart every osd. Also, I attempted to
>> >> >>> >> > switch to the optimal tunables but doing so degraded 27% of my
>> >> >>> >> > cluster and made most of my VMs unresponsive. I switched back
>> >> >>> >> > to the legacy tunables and everything was happy again. Both of
>> >> >>> >> > those operations, of course, caused recoveries. I have made no
>> >> >>> >> > changes since then.
>> >> >>> >> >
>> >> >>> >> >>
>> >> >>> >> >>  Anything in
>> >> >>> >> >> dmesg to indicate an fs problem?
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> > Nothing. The system went inconsistent again this morning,
>> >> >>> >> > again on the same rbd but different osds this time.
>> >> >>> >> >
>> >> >>> >> > 2014-07-11 05:48:12.857657 osd.1 192.168.253.77:6801/12608 904 : [ERR] 3.76 shard 1: soid 1280076/rb.0.b0ce3.238e1f29.025c/head//3 digest 2198242284 != known digest 3879754377
>> >> >>> >> > 2014-07-11 05:49:29.020024 osd.1 192.168.253.77:6801/12608 905 : [ERR] 3.76 deep-scrub 0 missing, 1 inconsistent objects
>> >> >>> >> > 2014-07-11 05:49:29.020029 osd.1 192.168.253.77:6801/12608 906 : [ERR] 3.76 deep-scr
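
For reference, a sketch of where the sloppy-CRC options Sage mentions above
would live, assuming a stock ceph.conf layout; each OSD needs a restart
before the setting takes effect:

  [osd]
      # verify CRCs on reads instead of waiting for a deep-scrub
      filestore sloppy crc = true
      # 512 KB granularity: smaller blocks = more checks, larger xattrs
      filestore sloppy crc block size = 524288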

Re: [ceph-users] logrotate

2014-07-12 Thread Uwe Grohnwaldt
Hi,

we are observing the same problem. After logrotate the new logfile is empty. 
The old logfiles are marked as deleted in lsof. At the moment we are 
restarting osds on a regular basis.

Uwe

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> James Eckersall
> Sent: Friday, 11 July 2014 17:06
> To: Sage Weil
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] logrotate
>
> Hi Sage,
>
> Many thanks for the info.
> I have inherited this cluster, but I believe it may have been created with
> mkcephfs rather than ceph-deploy.
>
> I'll touch the done files and see what happens.  Looking at the logic in the
> logrotate script I'm sure this will resolve the problem.
>
> Thanks
>
> J
>
>
> On 11 July 2014 15:04, Sage Weil wrote:
>
>
>   On Fri, 11 Jul 2014, James Eckersall wrote:
>   > Upon further investigation, it looks like this part of the ceph
>   > logrotate script is causing me the problem:
>   >
>   > if [ -e "/var/lib/ceph/$daemon/$f/done" ] && [ -e
>   > "/var/lib/ceph/$daemon/$f/upstart" ] && [ ! -e
>   > "/var/lib/ceph/$daemon/$f/sysvinit" ]; then
>   >
>   > I don't have a "done" file in the mounted directory for any of my
>   > osd's.  My mon's all have the done file and logrotate is working
>   > fine for those.
>
>
>   Was this cluster created a while ago with mkcephfs?
>
>
>   > So my question is, what is the purpose of the "done" file and
>   > should I just create one for each of my osd's?
>
>
>   It's used by the newer ceph-disk stuff to indicate whether the OSD
>   directory is properly 'prepared' and whether the startup stuff should
>   pay attention.
>
>   If these are active OSDs, yeah, just touch 'done'.  (Don't touch
>   sysvinit, though, if you are enumerating the daemons in ceph.conf
>   with host = foo lines.)
>
>   sage
>
>
>
>   >
>   >
>   >
>   > On 10 July 2014 11:10, James Eckersall wrote:
>   >   Hi,
>   > I've just upgraded a ceph cluster from Ubuntu 12.04 with 0.73.1 to
>   > Ubuntu 14.04 with 0.80.1.
>   >
>   > I've noticed that the log rotation doesn't appear to work correctly.
>   > The OSD's are just not logging to the current ceph-osd-X.log file.
>   > If I restart the OSD's, they start logging, but then overnight, they
>   > stop logging when the logs are rotated.
>   >
>   > Has anyone else noticed a problem with this?
>   >
>   >
>   >
>   >
>
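
A minimal sketch of the fix Sage describes above, assuming the OSD data
directories sit at the default /var/lib/ceph/osd/$cluster-$id locations:

  # create the 'done' marker that the logrotate test checks for
  for d in /var/lib/ceph/osd/ceph-*; do
      [ -e "$d/done" ] || touch "$d/done"
  done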





Re: [ceph-users] logrotate

2014-07-12 Thread Sage Weil
A simple reload should be sufficient, or kill -HUP.

I'm not sure where this should be documented. We need to look back at where
the logrotate config changed to check for the done marker.

sage
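
A sketch of the manual workaround until then, relying on Sage's note that
SIGHUP makes the daemons reopen their logs:

  # ask every running OSD to reopen its log file after rotation
  killall -q -1 ceph-osd    # -1 = SIGHUP; -q keeps it quiet if none run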

On July 12, 2014 6:40:10 PM PDT, Uwe Grohnwaldt  wrote:
>Hi,
>
>we are observing the same problem. After logrotate the new logfile is
>empty. 
>The old logfiles are marked as deleted in lsof. At the moment we are 
>restarting osds on a regular basis.
>
>Uwe
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> James Eckersall
>> Sent: Friday, 11 July 2014 17:06
>> To: Sage Weil
>> Cc: ceph-us...@ceph.com
>> Subject: Re: [ceph-users] logrotate
>>
>> Hi Sage,
>>
>> Many thanks for the info.
>> I have inherited this cluster, but I believe it may have been created
>> with mkcephfs rather than ceph-deploy.
>>
>> I'll touch the done files and see what happens.  Looking at the logic
>> in the logrotate script I'm sure this will resolve the problem.
>>
>> Thanks
>>
>> J
>>
>>
>> On 11 July 2014 15:04, Sage Weil wrote:
>>
>>  On Fri, 11 Jul 2014, James Eckersall wrote:
>>  > Upon further investigation, it looks like this part of the ceph
>>  > logrotate script is causing me the problem:
>>  >
>>  > if [ -e "/var/lib/ceph/$daemon/$f/done" ] && [ -e
>>  > "/var/lib/ceph/$daemon/$f/upstart" ] && [ ! -e
>>  > "/var/lib/ceph/$daemon/$f/sysvinit" ]; then
>>  >
>>  > I don't have a "done" file in the mounted directory for any of my
>>  > osd's.  My mon's all have the done file and logrotate is working
>>  > fine for those.
>>
>>
>>  Was this cluster created a while ago with mkcephfs?
>>
>>
>>  > So my question is, what is the purpose of the "done" file and
>>  > should I just create one for each of my osd's?
>>
>>
>>  It's used by the newer ceph-disk stuff to indicate whether the OSD
>>  directory is properly 'prepared' and whether the startup stuff should
>>  pay attention.
>>
>>  If these are active OSDs, yeah, just touch 'done'.  (Don't touch
>>  sysvinit, though, if you are enumerating the daemons in ceph.conf
>>  with host = foo lines.)
>>
>>  sage
>>
>>
>>
>>  >
>>  >
>>  >
>>  > On 10 July 2014 11:10, James Eckersall wrote:
>>  >   Hi,
>>  > I've just upgraded a ceph cluster from Ubuntu 12.04 with 0.73.1 to
>>  > Ubuntu 14.04 with 0.80.1.
>>  >
>>  > I've noticed that the log rotation doesn't appear to work correctly.
>>  > The OSD's are just not logging to the current ceph-osd-X.log file.
>>  > If I restart the OSD's, they start logging, but then overnight, they
>>  > stop logging when the logs are rotated.
>>  >
>>  > Has anyone else noticed a problem with this?
>>  >
>>  >
>>  >
>>  >
>>
>

-- 
Sent from Kaiten Mail. Please excuse my brevity.


[ceph-users] [URGENT]. Can't connect to CEPH after upgrade from 0.72 to 0.80

2014-07-12 Thread Andrija Panic
Hi,

Sorry to bother, but I have an urgent situation: I upgraded CEPH from 0.72 to
0.80 (CentOS 6.5), and now none of my CloudStack HOSTS can connect.

I did a basic "yum update ceph" on the first MON leader, and all CEPH
services on that HOST were restarted - I did the same on the other CEPH
nodes (I have 1 MON + 2 OSDs per physical host). Then I set the tunables to
optimal with "ceph osd crush tunables optimal", and after some rebalancing
ceph shows HEALTH_OK.

Also, I can create new images with qemu-img -f rbd rbd:/cloudstack

Libvirt 1.2.3 was compiled while ceph was 0.72, but I got instructions from
Wido that I don't need to REcompile now with ceph 0.80...

Libvirt logs:

libvirt: Storage Driver error : Storage pool not found: no storage pool
with matching uuid ‡Îhyš

Note there are some strange characters in the "uuid" - not sure what is
happening?

Did I forget to do something after the CEPH upgrade?


Re: [ceph-users] [URGENT]. Can't connect to CEPH after upgrade from 0.72 to 0.80

2014-07-12 Thread Mark Kirkwood

On 13/07/14 17:07, Andrija Panic wrote:

> Hi,
>
> Sorry to bother, but I have an urgent situation: I upgraded CEPH from
> 0.72 to 0.80 (CentOS 6.5), and now none of my CloudStack HOSTS can
> connect.
>
> I did a basic "yum update ceph" on the first MON leader, and all CEPH
> services on that HOST were restarted - I did the same on the other CEPH
> nodes (I have 1 MON + 2 OSDs per physical host). Then I set the tunables
> to optimal with "ceph osd crush tunables optimal", and after some
> rebalancing ceph shows HEALTH_OK.
>
> Also, I can create new images with qemu-img -f rbd rbd:/cloudstack
>
> Libvirt 1.2.3 was compiled while ceph was 0.72, but I got instructions
> from Wido that I don't need to REcompile now with ceph 0.80...
>
> Libvirt logs:
>
> libvirt: Storage Driver error : Storage pool not found: no storage pool
> with matching uuid ‡Îhyš
>
> Note there are some strange characters in the "uuid" - not sure what is
> happening?
>
> Did I forget to do something after the CEPH upgrade?

Have you got any ceph logs to examine on the host running libvirt? When 
I try to connect a v0.72 client to a v0.81 cluster I get:


2014-07-13 18:21:23.860898 7fc3bd2ca700  0 -- 192.168.122.41:0/1002012 >> 192.168.122.21:6789/0 pipe(0x7fc3c00241f0 sd=3 :49451 s=1 pgs=0 cs=0 l=1 c=0x7fc3c0024450).connect protocol feature mismatch, my f < peer 5f missing 50


Regards

Mark
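
For what it's worth, a feature mismatch like the one above right after a
firefly upgrade often points at the CRUSH tunables: "ceph osd crush tunables
optimal" on 0.80 sets feature bits that pre-firefly clients do not speak. A
sketch of how to check and, if necessary, roll back (the rollback triggers
some rebalancing):

  ceph osd crush show-tunables     # profile currently in effect
  ceph osd crush tunables legacy   # revert to values old clients understand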



Re: [ceph-users] [URGENT]. Can't connect to CEPH after upgrade from 0.72 to 0.80

2014-07-12 Thread Andrija Panic
Hi Mark,
actually, CEPH is running fine, and I have deployed a NEW host (libvirt
freshly compiled against the ceph 0.80 devel packages, plus a newer
kernel) - and it works... so I am migrating some VMs to this new host...

I have 3 physical hosts, each running 1 MON and 2 OSDs; cloudstack/libvirt
fails on all 3...

Any suggestion on whether I need to recompile libvirt? I got info from
Wido that libvirt does NOT need to be recompiled.


Best
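
One quick check that might settle the recompile question, a sketch assuming
libvirtd links librbd/librados dynamically (so the upgraded packages would
be picked up on a simple restart):

  # confirm which rbd/rados libraries the libvirt binary resolves to
  ldd "$(which libvirtd)" | grep -E 'librbd|librados'
  ceph --version    # client-side ceph version on that host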


On 13 July 2014 08:35, Mark Kirkwood  wrote:

> On 13/07/14 17:07, Andrija Panic wrote:
>
>> Hi,
>>
>> Sorry to bother, but I have an urgent situation: I upgraded CEPH from
>> 0.72 to 0.80 (CentOS 6.5), and now none of my CloudStack HOSTS can
>> connect.
>>
>> I did a basic "yum update ceph" on the first MON leader, and all CEPH
>> services on that HOST were restarted - I did the same on the other CEPH
>> nodes (I have 1 MON + 2 OSDs per physical host). Then I set the tunables
>> to optimal with "ceph osd crush tunables optimal", and after some
>> rebalancing ceph shows HEALTH_OK.
>>
>> Also, I can create new images with qemu-img -f rbd rbd:/cloudstack
>>
>> Libvirt 1.2.3 was compiled while ceph was 0.72, but I got instructions
>> from Wido that I don't need to REcompile now with ceph 0.80...
>>
>> Libvirt logs:
>>
>> libvirt: Storage Driver error : Storage pool not found: no storage pool
>> with matching uuid ‡Îhyš
>>
>> Note there are some strange characters in the "uuid" - not sure what is
>> happening?
>>
>> Did I forget to do something after the CEPH upgrade?
>>
>
> Have you got any ceph logs to examine on the host running libvirt? When I
> try to connect a v0.72 client to a v0.81 cluster I get:
>
> 2014-07-13 18:21:23.860898 7fc3bd2ca700  0 -- 192.168.122.41:0/1002012 >> 192.168.122.21:6789/0 pipe(0x7fc3c00241f0 sd=3 :49451 s=1 pgs=0 cs=0 l=1 c=0x7fc3c0024450).connect protocol feature mismatch, my f < peer 5f missing 50
>
> Regards
>
> Mark
>
>


-- 

Andrija Panić
--
  http://admintweets.com
--