Re: [ceph-users] leveldb compaction error

2015-10-07 Thread Selcuk TUNC
Hi Narendra,

we upgraded from Firefly (0.80.9) to Hammer.

On Thu, Oct 8, 2015 at 2:49 AM, Narendra Trivedi (natrived) <
natri...@cisco.com> wrote:

> Hi Selcuk,
>
>
>
> Which version of ceph did you upgrade from to Hammer (0.94)?
>
>
>
> --Narendra
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Selcuk TUNC
> *Sent:* Thursday, September 17, 2015 12:41 AM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] leveldb compaction error
>
>
>
> hello,
>
>
>
> we have noticed leveldb compaction on mount causes a segmentation fault in
> the hammer release (0.94).
>
> It seems related to this pull request (github.com/ceph/ceph/pull/4372).
> Are you planning to backport
>
> this fix to the next hammer release?
>
>
>
> --
>
> st
>



-- 
st
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] proxmox 4.0 release : lxc with krbd support and qemu librbd improvements

2015-10-07 Thread Irek Fasikhov
Hi, Alexandre.

Very Very Good!
Thank you for your work! :)

Best regards, Irek Fasikhov
Mobile: +79229045757

2015-10-07 7:25 GMT+03:00 Alexandre DERUMIER :

> Hi,
>
> proxmox 4.0 has been released:
>
> http://forum.proxmox.com/threads/23780-Proxmox-VE-4-0-released!
>
>
> Some ceph improvements :
>
> - lxc containers with krbd support (multiple disks + snapshots)
> - qemu with jemalloc support (improve librbd performance)
> - qemu iothread option by disk (improve scaling rbd  with multiple disk)
> - librbd hammer version
>
> Regards,
>
> Alexandre
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "corruption" -- Nulled bytes

2015-10-07 Thread Sage Weil
On Wed, 7 Oct 2015, Adam Tygart wrote:
> Does this patch fix files that have been corrupted in this manner?

Nope, it'll only prevent it from happening to new files (that haven't yet 
been migrated between the cache and base tier).

> If not, or I guess even if it does, is there a way to walk the
> metadata and data pools and find objects that are affected?

Hmm, this may actually do the trick: find a file that appears to be 
zeroed, and truncate it up and then down again.  For example, if foo is 
100 bytes, do

 truncate --size 101 foo
 truncate --size 100 foo

then unmount and remount the client and see if the content reappears.

Assuming that works (it did in my simple test) it'd be pretty easy to 
write something that walks the tree and does the truncate trick for any 
file whose first however many bytes are 0 (though it will mess up 
mtime...).
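
A hedged sketch of what that walker could look like (untested; the 4 KB 
zero-check length and the GNU stat/truncate/cmp tools are assumptions, and as 
noted above it bumps mtime on every file it touches):

  #!/bin/bash
  # Run against a CephFS client mount; cycle every apparently-zeroed file
  # through a grow-then-shrink truncate, per the trick described above.
  root=${1:?usage: $0 /path/to/cephfs/tree}

  find "$root" -type f -size +0c -print0 | while IFS= read -r -d '' f; do
      size=$(stat -c %s "$f")
      n=$(( size < 4096 ? size : 4096 ))
      # Skip files whose leading bytes contain any real (non-zero) data.
      if ! cmp -s -n "$n" "$f" /dev/zero; then
          continue
      fi
      truncate --size $(( size + 1 )) "$f"   # grow by one byte...
      truncate --size "$size" "$f"           # ...then shrink back to the original size
      echo "truncate-cycled: $f"
  done

As with the manual version, the client would still need to be unmounted and 
remounted before re-reading the files.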

> Is that '_' xattr in hammer? If so, how can I access it? Doing a
> listxattr on the inode just lists 'parent', and doing the same on the
> parent directory's inode simply lists 'parent'.

This is the file in /var/lib/ceph/osd/ceph-NNN/current.  For example,

$ attr -l ./3.0_head/100.__head_F0B56F30__3
Attribute "cephos.spill_out" has a 2 byte value for 
./3.0_head/100.__head_F0B56F30__3
Attribute "cephos.seq" has a 23 byte value for 
./3.0_head/100.__head_F0B56F30__3
Attribute "ceph._" has a 250 byte value for 
./3.0_head/100.__head_F0B56F30__3
Attribute "ceph._@1" has a 5 byte value for 
./3.0_head/100.__head_F0B56F30__3
Attribute "ceph.snapset" has a 31 byte value for 
./3.0_head/100.__head_F0B56F30__3

...but hopefully you won't need to touch any of that ;)
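
If it does come to that, a hedged sketch of pulling that xattr and feeding it 
to ceph-dencoder, per the steps quoted further down; the object path is just 
the example above, and the temp file name is made up:

  # On the OSD host, against the filestore object shown above.
  obj=./3.0_head/100.__head_F0B56F30__3
  out=/tmp/oi.bin

  attr -q -g ceph._ "$obj" > "$out"
  # Append any spill-over fragments (ceph._@1, ceph._@2, ...) if present:
  attr -q -g ceph._@1 "$obj" >> "$out" 2>/dev/null || true
  # (If attr mangles the binary value, getfattr --only-values -n user.ceph._
  # is an alternative way to extract it.)

  ceph-dencoder type object_info_t import "$out" decode dump_json | grep truncate_seq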

sage


> 
> Thanks for your time.
> 
> --
> Adam
> 
> 
> On Mon, Oct 5, 2015 at 9:36 AM, Sage Weil  wrote:
> > On Mon, 5 Oct 2015, Adam Tygart wrote:
> >> Okay, this has happened several more times. Always seems to be a small
> >> file that should be read-only (perhaps simultaneously) on many
> >> different clients. It is just through the cephfs interface that the
> >> files are corrupted, the objects in the cachepool and erasure coded
> >> pool are still correct. I am beginning to doubt these files are
> >> getting a truncation request.
> >
> > This is still consistent with the #12551 bug.  The object data is correct,
> > but the cephfs truncation metadata on the object is wrong, causing it to
> > be implicitly zeroed out on read.  It's easily triggered by writers who
> > use O_TRUNC on open...
> >
> >> Twice now it has been different perl files, once someone's .bashrc,
> >> once was an input file for another application, timestamps on the
> >> files indicate that the files haven't been modified in weeks.
> >>
> >> Any other possibilities? Or any way to figure out what happened?
> >
> > You can confirm by extracting the '_' xattr on the object (append any @1
> > etc fragments) and feeding it to ceph-dencoder with
> >
> >  ceph-dencoder type object_info_t import  decode 
> > dump_json
> >
> > and confirming that truncate_seq is 0, and verifying that the truncate_seq
> > on the read request is non-zero.. you'd need to turn up the osd logs with
> > debug ms = 1 and look for the osd_op that looks like "read 0~$length
> > [$truncate_seq@$truncate_size]" (with real values in there).
> >
> > ...but it really sounds like you're hitting the bug.  Unfortunately
> > the fix is not backported to hammer just yet.  You can follow
> > http://tracker.ceph.com/issues/13034
> >
> > sage
> >
> >
> >
> >>
> >> --
> >> Adam
> >>
> >> On Sun, Sep 27, 2015 at 10:44 PM, Adam Tygart  wrote:
> >> > I've done some digging into cp and mv's semantics (from coreutils). If
> >> > the inode is existing, the file will get truncated, then data will get
> >> > copied in. This is definitely within the scope of the bug above.
> >> >
> >> > --
> >> > Adam
> >> >
> >> > On Fri, Sep 25, 2015 at 8:08 PM, Adam Tygart  wrote:
> >> >> It may have been. Although the timestamp on the file was almost a
> >> >> month ago. The typical workflow for this particular file is to copy an
> >> >> updated version overtop of it.
> >> >>
> >> >> i.e. 'cp qss kstat'
> >> >>
> >> >> I'm not sure if cp semantics would keep the same inode and simply
> >> >> truncate/overwrite the contents, or if it would do an unlink and then
> >> >> create a new file.
> >> >> --
> >> >> Adam
> >> >>
> >> >> On Fri, Sep 25, 2015 at 8:00 PM, Ivo Jimenez  wrote:
> >> >>> Looks like you might be experiencing this bug:
> >> >>>
> >> >>>   http://tracker.ceph.com/issues/12551
> >> >>>
> >> >>> Fix has been merged to master and I believe it'll be part of 
> >> >>> infernalis. The
> >> >>> original reproducer involved truncating/overwriting files. In your 
> >> >>> example,
> >> >>> do you know if 'kstat' has been truncated/overwritten prior to 
> >> >>> generating
> >> >>> the md5sums?
> >> >>>
> >> >>> On Fri, Sep 25, 2015 at 2:11 PM Adam Tygart  wrote:
> >> 
> >>  Hell

Re: [ceph-users] pgs stuck inactive and unclean, too feww PGs per OSD

2015-10-07 Thread Christian Balzer

Hello,

On Thu, 8 Oct 2015 12:21:40 +0800 (CST) wikison wrote:

> Here, like this :
> esta@monitorOne:~$ sudo ceph osd tree
> ID WEIGHT  TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
> -3 4.39996 root defualt
 
That's your problem. It should be "default".

You manually edited the crush map, right?

> -2 1.0 host storageTwo
>  0 0.0 osd.0 up  1.0  1.0
>  1 1.0 osd.1 up  1.0  1.0
> -4 1.0 host storageFour
>  2 0.0 osd.2 up  1.0  1.0
>  3 1.0 osd.3 up  1.0  1.0
> -5 1.0 host storageLast
>  4 0.0 osd.4 up  1.0  1.0
>  5 1.0 osd.5 up  1.0  1.0
> -6 1.0 host storageOne
>  6 0.0 osd.6 up  1.0  1.0
>  7 1.0 osd.7 up  1.0  1.0
> -1   0 root default
>
Nothing under the default root, so the default rule to allocate PGs can't
find anything.
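
A hedged sketch of one way to repair that, using the bucket names from the
tree above (untested against this cluster):

  # Move the hosts under the real "default" root, then drop the empty
  # misspelled root:
  ceph osd crush move storageOne  root=default
  ceph osd crush move storageTwo  root=default
  ceph osd crush move storageFour root=default
  ceph osd crush move storageLast root=default
  ceph osd crush remove defualt
  # Alternatively, decompile, edit and re-inject the crush map, renaming
  # "defualt" to "default" there:
  #   ceph osd getcrushmap -o cm && crushtool -d cm -o cm.txt
  #   (edit cm.txt) && crushtool -c cm.txt -o cm.new && ceph osd setcrushmap -i cm.new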
 
Christian

> I have four storage nodes. Each of them has two independent drives
> to store data. One is a 120GB SSD, and the other is a 1TB HDD. I set the
> weight of the SSD to 0.1 and the weight of the HDD to 1.0.
> 
> 
> 
> 
> 
> --
> 
> Zhen Wang
> Shanghai Jiao Tong University
> 
> 
> 
> At 2015-10-08 11:32:52, "Christian Balzer"  wrote:
> >
> >Hello,
> >
> >On Thu, 8 Oct 2015 11:27:46 +0800 (CST) wikison wrote:
> >
> >> Hi,
> >> I've removed the rbd pool and created it again. It picked up
> >> my default settings but there are still some problems. After running
> >> "sudo ceph -s", the output is as follow: 
> >> cluster 0b9b05db-98fe-49e6-b12b-1cce0645c015
> >>  health HEALTH_WARN
> >> 512 pgs stuck inactive
> >> 512 pgs stuck unclean
> >>  monmap e1: 1 mons at {monitorOne=192.168.1.153:6789/0}
> >> election epoch 1, quorum 0 monitorOne
> >>  osdmap e62: 8 osds: 8 up, 8 in
> >>   pgmap v219: 512 pgs, 1 pools, 0 bytes data, 0 objects
> >> 8460 MB used, 4162 GB / 4171 GB avail
> >>  512 creating
> >> 
> >Output of "ceph osd tree" please.
> >
> >The only reason I can think of is if your OSDs are up, but have no
> >weight.
> >
> >Christian
> >
> >> Ceph stucks in creating the pgs forever. Those pgs are stuck in
> >> inactive and unclean. And the Ceph pg query hangs forever. I googled
> >> this problem and didn't get a clue. Is there anything I missed?
> >> Any idea to help me?
> >> 
> >> 
> >> --
> >> 
> >> Zhen Wang
> >> 
> >> 
> >> 
> >> At 2015-10-07 13:05:51, "Christian Balzer"  wrote:
> >> >
> >> >Hello,
> >> >On Wed, 7 Oct 2015 12:57:58 +0800 (CST) wikison wrote:
> >> >
> >> >This is a very old bug, misfeature. 
> >> >And creeps up every week or so here, google is your friend.
> >> >
> >> >> Hi, 
> >> >> I have a cluster of one monitor and eight OSDs. These OSDs are
> >> >> running on four hosts(each host has two OSDs). When I set up
> >> >> everything and started Ceph, I got this: esta@monitorOne:~$ sudo
> >> >> ceph -s [sudo] password for esta: cluster
> >> >> 0b9b05db-98fe-49e6-b12b-1cce0645c015 health HEALTH_WARN
> >> >> 64 pgs stuck inactive
> >> >> 64 pgs stuck unclean
> >> >> too few PGs per OSD (8 < min 30)
> >> >
> >> >Those 3 lines tell you pretty much all there is wrong.
> >> >You did (correctly) set the default pg and pgp nums to something
> >> >sensible (512) in your ceph.conf.
> >> >Unfortunately when creating the initial pool (rbd) it still ignores
> >> >those settings.
> >> >
> >> >You could try to increase those for your pool, which may or may not
> >> >work.
> >> >
> >> >The easier and faster way is to remove the rbd pool and create it
> >> >again. This should pick up your default settings.
> >> >
> >> >Christian
> >> >
> >> >>  monmap e1: 1 mons at {monitorOne=192.168.1.153:6789/0}
> >> >> election epoch 1, quorum 0 monitorOne
> >> >>  osdmap e58: 8 osds: 8 up, 8 in
> >> >>   pgmap v191: 64 pgs, 1 pools, 0 bytes data, 0 objects
> >> >> 8460 MB used, 4162 GB / 4171 GB avail
> >> >>   64 creating
> >> >> 
> >> >> 
> >> >> How to deal with this HEALTH_WARN status?
> >> >> This is my ceph.conf:
> >> >> [global]
> >> >> 
> >> >> 
> >> >> fsid=
> >> >> 0b9b05db-98fe-49e6-b12b-1cce0645c015
> >> >> 
> >> >> 
> >> >> mon initial members = monitorOne
> >> >> mon host= 192.168.1.153
> >> >> filestore_xattr_use_omap= true
> >> >> 
> >> >> 
> >> >> public network  = 192.168.1.0/24
> >> >> cluster network = 10.0.0.0/24
> >> >> pid file= /var/run/ceph/$name.pid
> >> >> 
> >> >> 
> >> >> auth cluster required  = cephx
> >> >> auth service required  = cephx
> >> >> auth client 

Re: [ceph-users] pgs stuck inactive and unclean, too feww PGs per OSD

2015-10-07 Thread wikison
Here, like this :
esta@monitorOne:~$ sudo ceph osd tree
ID WEIGHT  TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
-3 4.39996 root defualt
-2 1.0 host storageTwo
 0 0.0 osd.0 up  1.0  1.0
 1 1.0 osd.1 up  1.0  1.0
-4 1.0 host storageFour
 2 0.0 osd.2 up  1.0  1.0
 3 1.0 osd.3 up  1.0  1.0
-5 1.0 host storageLast
 4 0.0 osd.4 up  1.0  1.0
 5 1.0 osd.5 up  1.0  1.0
-6 1.0 host storageOne
 6 0.0 osd.6 up  1.0  1.0
 7 1.0 osd.7 up  1.0  1.0
-1   0 root default

I have four storage nodes. Each of them has two independent drives to store 
data. One is a 120GB SSD, and the other is a 1TB HDD. I set the weight of the 
SSD to 0.1 and the weight of the HDD to 1.0.





--

Zhen Wang
Shanghai Jiao Tong University



At 2015-10-08 11:32:52, "Christian Balzer"  wrote:
>
>Hello,
>
>On Thu, 8 Oct 2015 11:27:46 +0800 (CST) wikison wrote:
>
>> Hi,
>> I've removed the rbd pool and created it again. It picked up my
>> default settings but there are still some problems. After running "sudo
>> ceph -s", the output is as follow: 
>> cluster 0b9b05db-98fe-49e6-b12b-1cce0645c015
>>  health HEALTH_WARN
>> 512 pgs stuck inactive
>> 512 pgs stuck unclean
>>  monmap e1: 1 mons at {monitorOne=192.168.1.153:6789/0}
>> election epoch 1, quorum 0 monitorOne
>>  osdmap e62: 8 osds: 8 up, 8 in
>>   pgmap v219: 512 pgs, 1 pools, 0 bytes data, 0 objects
>> 8460 MB used, 4162 GB / 4171 GB avail
>>  512 creating
>> 
>Output of "ceph osd tree" please.
>
>The only reason I can think of is if your OSDs are up, but have no weight.
>
>Christian
>
>> Ceph stucks in creating the pgs forever. Those pgs are stuck in inactive
>> and unclean. And the Ceph pg query hangs forever. I googled this problem
>> and didn't get a clue. Is there anything I missed?
>> Any idea to help me?
>> 
>> 
>> --
>> 
>> Zhen Wang
>> 
>> 
>> 
>> At 2015-10-07 13:05:51, "Christian Balzer"  wrote:
>> >
>> >Hello,
>> >On Wed, 7 Oct 2015 12:57:58 +0800 (CST) wikison wrote:
>> >
>> >This is a very old bug, misfeature. 
>> >And creeps up every week or so here, google is your friend.
>> >
>> >> Hi, 
>> >> I have a cluster of one monitor and eight OSDs. These OSDs are running
>> >> on four hosts(each host has two OSDs). When I set up everything and
>> >> started Ceph, I got this: esta@monitorOne:~$ sudo ceph -s [sudo]
>> >> password for esta: cluster 0b9b05db-98fe-49e6-b12b-1cce0645c015
>> >>  health HEALTH_WARN
>> >> 64 pgs stuck inactive
>> >> 64 pgs stuck unclean
>> >> too few PGs per OSD (8 < min 30)
>> >
>> >Those 3 lines tell you pretty much all there is wrong.
> >> >You did (correctly) set the default pg and pgp nums to something sensible
>> >(512) in your ceph.conf.
>> >Unfortunately when creating the initial pool (rbd) it still ignores
>> >those settings.
>> >
>> >You could try to increase those for your pool, which may or may not
>> >work.
>> >
>> >The easier and faster way is to remove the rbd pool and create it again.
>> >This should pick up your default settings.
>> >
>> >Christian
>> >
>> >>  monmap e1: 1 mons at {monitorOne=192.168.1.153:6789/0}
>> >> election epoch 1, quorum 0 monitorOne
>> >>  osdmap e58: 8 osds: 8 up, 8 in
>> >>   pgmap v191: 64 pgs, 1 pools, 0 bytes data, 0 objects
>> >> 8460 MB used, 4162 GB / 4171 GB avail
>> >>   64 creating
>> >> 
>> >> 
>> >> How to deal with this HEALTH_WARN status?
>> >> This is my ceph.conf:
>> >> [global]
>> >> 
>> >> 
>> >> fsid= 0b9b05db-98fe-49e6-b12b-1cce0645c015
>> >> 
>> >> 
>> >> mon initial members = monitorOne
>> >> mon host= 192.168.1.153
>> >> filestore_xattr_use_omap= true
>> >> 
>> >> 
>> >> public network  = 192.168.1.0/24
>> >> cluster network = 10.0.0.0/24
>> >> pid file= /var/run/ceph/$name.pid
>> >> 
>> >> 
>> >> auth cluster required  = cephx
>> >> auth service required  = cephx
>> >> auth client required   = cephx
>> >> 
>> >> 
>> >> osd pool default size   = 3
>> >> osd pool default min size   = 2
>> >> osd pool default pg num = 512
>> >> osd pool default pgp num= 512
>> >> osd crush chooseleaf type   = 1
>> >> osd journal size= 1024
>> >> 
>> >> 
>> >> [mon]
>> >> 
>> >> 
>> >> [mon.0]
>> >> host = monitorOne
>> >> mon addr = 192.168.1.153:6789
>> >> 
>> >> 
>> >> [osd]
>> >> 
>> >> 
>> >> [osd.0]
>> >> host = storageOne
>> >> 
>> >> 
>> >> [osd.1]
>> >> host = storage

Re: [ceph-users] pgs stuck inactive and unclean, too feww PGs per OSD

2015-10-07 Thread Chris Jones
One possibility is that the crush map is not being populated. Look at your
/etc/ceph/ceph.conf file and see if you have something under the OSD
section (it could actually be in global too) that looks like the following:

osd crush update on start = false

If that line is there and you're not modifying the crush map manually or
via automation, then remove it or comment it out. It stops OSDs from
automatically placing themselves in the crush map on startup. Another thing
may be the following:

osd crush chooseleaf type = 1

I believe the above line is the default, but I have seen some set it to 3 for
rack etc., and it causes issues unless you have modified the crush map
accordingly.
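
A hedged way to double-check what a running OSD actually has for these options
(assumes admin-socket access on the OSD host and that osd.0 lives there):

  ceph daemon osd.0 config get osd_crush_update_on_start
  ceph daemon osd.0 config get osd_crush_chooseleaf_type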

-Chris

On Wed, Oct 7, 2015 at 11:27 PM, wikison  wrote:

> Hi,
> I've removed the rbd pool and created it again. It picked up my
> default settings but there are still some problems.
> After running "sudo ceph -s", the output is as follow:
>
> cluster 0b9b05db-98fe-49e6-b12b-1cce0645c015
>  health HEALTH_WARN
> 512 pgs stuck inactive
> 512 pgs stuck unclean
>  monmap e1: 1 mons at {monitorOne=192.168.1.153:6789/0}
> election epoch 1, quorum 0 monitorOne
>  osdmap e62: 8 osds: 8 up, 8 in
>   pgmap v219: 512 pgs, 1 pools, 0 bytes data, 0 objects
> 8460 MB used, 4162 GB / 4171 GB avail
>  512 creating
>
> Ceph stucks in creating the pgs forever. Those pgs are stuck in inactive
> and unclean. And the Ceph pg query hangs forever.
> I googled this problem and didn't get a clue.
> Is there anything I missed?
> Any idea to help me?
>
> --
> Zhen Wang
>
>
> At 2015-10-07 13:05:51, "Christian Balzer"  wrote:
> >
> >Hello,
> >On Wed, 7 Oct 2015 12:57:58 +0800 (CST) wikison wrote:
> >
> >This is a very old bug, misfeature.
> >And creeps up every week or so here, google is your friend.
> >
> >> Hi,
> >> I have a cluster of one monitor and eight OSDs. These OSDs are running
> >> on four hosts(each host has two OSDs). When I set up everything and
> >> started Ceph, I got this: esta@monitorOne:~$ sudo ceph -s [sudo]
> >> password for esta: cluster 0b9b05db-98fe-49e6-b12b-1cce0645c015
> >>  health HEALTH_WARN
> >> 64 pgs stuck inactive
> >> 64 pgs stuck unclean
> >> too few PGs per OSD (8 < min 30)
> >
> >Those 3 lines tell you pretty much all there is wrong.
> >You did (correctly) set the default pg and pgp nums to something sensible
> >(512) in your ceph.conf.
> >Unfortunately when creating the initial pool (rbd) it still ignores those
> >settings.
> >
> >You could try to increase those for your pool, which may or may not work.
> >
> >The easier and faster way is to remove the rbd pool and create it again.
> >This should pick up your default settings.
> >
> >Christian
> >
> >>  monmap e1: 1 mons at {monitorOne=192.168.1.153:6789/0}
> >> election epoch 1, quorum 0 monitorOne
> >>  osdmap e58: 8 osds: 8 up, 8 in
> >>   pgmap v191: 64 pgs, 1 pools, 0 bytes data, 0 objects
> >> 8460 MB used, 4162 GB / 4171 GB avail
> >>   64 creating
> >>
> >>
> >> How to deal with this HEALTH_WARN status?
> >> This is my ceph.conf:
> >> [global]
> >>
> >>
> >> fsid= 0b9b05db-98fe-49e6-b12b-1cce0645c015
> >>
> >>
> >> mon initial members = monitorOne
> >> mon host= 192.168.1.153
> >> filestore_xattr_use_omap= true
> >>
> >>
> >> public network  = 192.168.1.0/24
> >> cluster network = 10.0.0.0/24
> >> pid file= /var/run/ceph/$name.pid
> >>
> >>
> >> auth cluster required  = cephx
> >> auth service required  = cephx
> >> auth client required   = cephx
> >>
> >>
> >> osd pool default size   = 3
> >> osd pool default min size   = 2
> >> osd pool default pg num = 512
> >> osd pool default pgp num= 512
> >> osd crush chooseleaf type   = 1
> >> osd journal size= 1024
> >>
> >>
> >> [mon]
> >>
> >>
> >> [mon.0]
> >> host = monitorOne
> >> mon addr = 192.168.1.153:6789
> >>
> >>
> >> [osd]
> >>
> >>
> >> [osd.0]
> >> host = storageOne
> >>
> >>
> >> [osd.1]
> >> host = storageTwo
> >>
> >>
> >> [osd.2]
> >> host = storageFour
> >>
> >>
> >> [osd.3]
> >> host = storageLast
> >>
> >>
> >> Could anybody help me?
> >>
> >> best regards,
> >>
> >> --
> >>
> >> Zhen Wang
> >
> >--
> >Christian BalzerNetwork/Systems Engineer
> >ch...@gol.comGlobal OnLine Japan/Fusion Communications
> >http://www.gol.com/
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Best Regards,
Chris Jones

cjo...@cloudm2.com
(p) 770.655.0770
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] CephFS "corruption" -- Nulled bytes

2015-10-07 Thread Adam Tygart
Does this patch fix files that have been corrupted in this manner?

If not, or I guess even if it does, is there a way to walk the
metadata and data pools and find objects that are affected?

Is that '_' xattr in hammer? If so, how can I access it? Doing a
listxattr on the inode just lists 'parent', and doing the same on the
parent directory's inode simply lists 'parent'.

Thanks for your time.

--
Adam


On Mon, Oct 5, 2015 at 9:36 AM, Sage Weil  wrote:
> On Mon, 5 Oct 2015, Adam Tygart wrote:
>> Okay, this has happened several more times. Always seems to be a small
>> file that should be read-only (perhaps simultaneously) on many
>> different clients. It is just through the cephfs interface that the
>> files are corrupted, the objects in the cachepool and erasure coded
>> pool are still correct. I am beginning to doubt these files are
>> getting a truncation request.
>
> This is still consistent with the #12551 bug.  The object data is correct,
> but the cephfs truncation metadata on the object is wrong, causing it to
> be implicitly zeroed out on read.  It's easily triggered by writers who
> use O_TRUNC on open...
>
>> Twice now it has been different perl files, once someone's .bashrc,
>> once was an input file for another application, timestamps on the
>> files indicate that the files haven't been modified in weeks.
>>
>> Any other possibilities? Or any way to figure out what happened?
>
> You can confirm by extracting the '_' xattr on the object (append any @1
> etc fragments) and feeding it to ceph-dencoder with
>
>  ceph-dencoder type object_info_t import  decode 
> dump_json
>
> and confirming that truncate_seq is 0, and verifying that the truncate_seq
> on the read request is non-zero.. you'd need to turn up the osd logs with
> debug ms = 1 and look for the osd_op that looks like "read 0~$length
> [$truncate_seq@$truncate_size]" (with real values in there).
>
> ...but it really sounds like you're hitting the bug.  Unfortunately
> the fix is not backported to hammer just yet.  You can follow
> http://tracker.ceph.com/issues/13034
>
> sage
>
>
>
>>
>> --
>> Adam
>>
>> On Sun, Sep 27, 2015 at 10:44 PM, Adam Tygart  wrote:
>> > I've done some digging into cp and mv's semantics (from coreutils). If
>> > the inode is existing, the file will get truncated, then data will get
>> > copied in. This is definitely within the scope of the bug above.
>> >
>> > --
>> > Adam
>> >
>> > On Fri, Sep 25, 2015 at 8:08 PM, Adam Tygart  wrote:
>> >> It may have been. Although the timestamp on the file was almost a
>> >> month ago. The typical workflow for this particular file is to copy an
>> >> updated version overtop of it.
>> >>
>> >> i.e. 'cp qss kstat'
>> >>
>> >> I'm not sure if cp semantics would keep the same inode and simply
>> >> truncate/overwrite the contents, or if it would do an unlink and then
>> >> create a new file.
>> >> --
>> >> Adam
>> >>
>> >> On Fri, Sep 25, 2015 at 8:00 PM, Ivo Jimenez  wrote:
>> >>> Looks like you might be experiencing this bug:
>> >>>
>> >>>   http://tracker.ceph.com/issues/12551
>> >>>
>> >>> Fix has been merged to master and I believe it'll be part of infernalis. 
>> >>> The
>> >>> original reproducer involved truncating/overwriting files. In your 
>> >>> example,
>> >>> do you know if 'kstat' has been truncated/overwritten prior to generating
>> >>> the md5sums?
>> >>>
>> >>> On Fri, Sep 25, 2015 at 2:11 PM Adam Tygart  wrote:
>> 
>>  Hello all,
>> 
>>  I've run into some sort of bug with CephFS. Client reads of a
>>  particular file return nothing but 40KB of Null bytes. Doing a rados
>>  level get of the inode returns the whole file, correctly.
>> 
>>  Tested via Linux 4.1, 4.2 kernel clients, and the 0.94.3 fuse client.
>> 
>>  Attached is a dynamic printk debug of the ceph module from the linux
>>  4.2 client while cat'ing the file.
>> 
>>  My current thought is that there has to be a cache of the object
>>  *somewhere* that a 'rados get' bypasses.
>> 
>>  Even on hosts that have *never* read the file before, it is returning
>>  Null bytes from the kernel and fuse mounts.
>> 
>>  Background:
>> 
>>  24x CentOS 7.1 hosts serving up RBD and CephFS with Ceph 0.94.3.
>>  CephFS is a EC k=8, m=4 pool with a size 3 writeback cache in front of 
>>  it.
>> 
>>  # rados -p cachepool get 10004096b95. /tmp/kstat-cache
>>  # rados -p ec84pool get 10004096b95. /tmp/kstat-ec
>>  # md5sum /tmp/kstat*
>>  ddfbe886420a2cb860b46dc70f4f9a0d  /tmp/kstat-cache
>>  ddfbe886420a2cb860b46dc70f4f9a0d  /tmp/kstat-ec
>>  # file /tmp/kstat*
>>  /tmp/kstat-cache: Perl script, ASCII text executable
>>  /tmp/kstat-ec:Perl script, ASCII text executable
>> 
>>  # md5sum ~daveturner/bin/kstat
>>  1914e941c2ad5245a23e3e1d27cf8fde  /homes/daveturner/bin/kstat
>>  # file ~daveturner/bin/kstat
>>  /homes/daveturner/bin/kstat: data
>> >>

Re: [ceph-users] pgs stuck inactive and unclean, too feww PGs per OSD

2015-10-07 Thread Christian Balzer

Hello,

On Thu, 8 Oct 2015 11:27:46 +0800 (CST) wikison wrote:

> Hi,
> I've removed the rbd pool and created it again. It picked up my
> default settings but there are still some problems. After running "sudo
> ceph -s", the output is as follow: 
> cluster 0b9b05db-98fe-49e6-b12b-1cce0645c015
>  health HEALTH_WARN
> 512 pgs stuck inactive
> 512 pgs stuck unclean
>  monmap e1: 1 mons at {monitorOne=192.168.1.153:6789/0}
> election epoch 1, quorum 0 monitorOne
>  osdmap e62: 8 osds: 8 up, 8 in
>   pgmap v219: 512 pgs, 1 pools, 0 bytes data, 0 objects
> 8460 MB used, 4162 GB / 4171 GB avail
>  512 creating
> 
Output of "ceph osd tree" please.

The only reason I can think of is if your OSDs are up, but have no weight.
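
For reference, a hedged sketch of checking for and fixing a zero CRUSH weight;
the OSD id and weight below are placeholders, not values from this cluster:

  ceph osd tree                        # an OSD with CRUSH weight 0 never gets PGs
  ceph osd crush reweight osd.0 0.1    # give it a non-zero CRUSH weight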

Christian

> Ceph stucks in creating the pgs forever. Those pgs are stuck in inactive
> and unclean. And the Ceph pg query hangs forever. I googled this problem
> and didn't get a clue. Is there anything I missed?
> Any idea to help me?
> 
> 
> --
> 
> Zhen Wang
> 
> 
> 
> At 2015-10-07 13:05:51, "Christian Balzer"  wrote:
> >
> >Hello,
> >On Wed, 7 Oct 2015 12:57:58 +0800 (CST) wikison wrote:
> >
> >This is a very old bug, misfeature. 
> >And creeps up every week or so here, google is your friend.
> >
> >> Hi, 
> >> I have a cluster of one monitor and eight OSDs. These OSDs are running
> >> on four hosts(each host has two OSDs). When I set up everything and
> >> started Ceph, I got this: esta@monitorOne:~$ sudo ceph -s [sudo]
> >> password for esta: cluster 0b9b05db-98fe-49e6-b12b-1cce0645c015
> >>  health HEALTH_WARN
> >> 64 pgs stuck inactive
> >> 64 pgs stuck unclean
> >> too few PGs per OSD (8 < min 30)
> >
> >Those 3 lines tell you pretty much all there is wrong.
> >You did (correctly) set the default pg and pgp nums to something sensible
> >(512) in your ceph.conf.
> >Unfortunately when creating the initial pool (rbd) it still ignores
> >those settings.
> >
> >You could try to increase those for your pool, which may or may not
> >work.
> >
> >The easier and faster way is to remove the rbd pool and create it again.
> >This should pick up your default settings.
> >
> >Christian
> >
> >>  monmap e1: 1 mons at {monitorOne=192.168.1.153:6789/0}
> >> election epoch 1, quorum 0 monitorOne
> >>  osdmap e58: 8 osds: 8 up, 8 in
> >>   pgmap v191: 64 pgs, 1 pools, 0 bytes data, 0 objects
> >> 8460 MB used, 4162 GB / 4171 GB avail
> >>   64 creating
> >> 
> >> 
> >> How to deal with this HEALTH_WARN status?
> >> This is my ceph.conf:
> >> [global]
> >> 
> >> 
> >> fsid= 0b9b05db-98fe-49e6-b12b-1cce0645c015
> >> 
> >> 
> >> mon initial members = monitorOne
> >> mon host= 192.168.1.153
> >> filestore_xattr_use_omap= true
> >> 
> >> 
> >> public network  = 192.168.1.0/24
> >> cluster network = 10.0.0.0/24
> >> pid file= /var/run/ceph/$name.pid
> >> 
> >> 
> >> auth cluster required  = cephx
> >> auth service required  = cephx
> >> auth client required   = cephx
> >> 
> >> 
> >> osd pool default size   = 3
> >> osd pool default min size   = 2
> >> osd pool default pg num = 512
> >> osd pool default pgp num= 512
> >> osd crush chooseleaf type   = 1
> >> osd journal size= 1024
> >> 
> >> 
> >> [mon]
> >> 
> >> 
> >> [mon.0]
> >> host = monitorOne
> >> mon addr = 192.168.1.153:6789
> >> 
> >> 
> >> [osd]
> >> 
> >> 
> >> [osd.0]
> >> host = storageOne
> >> 
> >> 
> >> [osd.1]
> >> host = storageTwo
> >> 
> >> 
> >> [osd.2]
> >> host = storageFour
> >> 
> >> 
> >> [osd.3]
> >> host = storageLast
> >> 
> >> 
> >> Could anybody help me?
> >> 
> >> best regards,
> >> 
> >> --
> >> 
> >> Zhen Wang
> >
> >-- 
> >Christian BalzerNetwork/Systems Engineer
> >ch...@gol.comGlobal OnLine Japan/Fusion Communications
> >http://www.gol.com/


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck inactive and unclean, too feww PGs per OSD

2015-10-07 Thread wikison
Hi,
I've removed the rbd pool and created it again. It picked up my default 
settings but there are still some problems.
After running "sudo ceph -s", the output is as follow:
 
cluster 0b9b05db-98fe-49e6-b12b-1cce0645c015
 health HEALTH_WARN
512 pgs stuck inactive
512 pgs stuck unclean
 monmap e1: 1 mons at {monitorOne=192.168.1.153:6789/0}
election epoch 1, quorum 0 monitorOne
 osdmap e62: 8 osds: 8 up, 8 in
  pgmap v219: 512 pgs, 1 pools, 0 bytes data, 0 objects
8460 MB used, 4162 GB / 4171 GB avail
 512 creating

Ceph is stuck creating the PGs forever. Those PGs are stuck inactive and 
unclean, and a Ceph pg query hangs forever.
I googled this problem and didn't get a clue.
Is there anything I missed?
Any idea to help me?
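
For reference, a hedged sketch of the remove/re-create step described above,
plus a way to watch the stuck PGs; the pool name and PG counts are taken from
the ceph.conf quoted further down:

  ceph osd pool delete rbd rbd --yes-i-really-really-mean-it
  ceph osd pool create rbd 512 512
  # then watch whether the new PGs ever leave the "creating" state:
  ceph -s
  ceph pg dump_stuck inactive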


--

Zhen Wang



At 2015-10-07 13:05:51, "Christian Balzer"  wrote:
>
>Hello,
>On Wed, 7 Oct 2015 12:57:58 +0800 (CST) wikison wrote:
>
>This is a very old bug, misfeature. 
>And creeps up every week or so here, google is your friend.
>
>> Hi, 
>> I have a cluster of one monitor and eight OSDs. These OSDs are running
>> on four hosts(each host has two OSDs). When I set up everything and
>> started Ceph, I got this: esta@monitorOne:~$ sudo ceph -s [sudo]
>> password for esta: cluster 0b9b05db-98fe-49e6-b12b-1cce0645c015
>>  health HEALTH_WARN
>> 64 pgs stuck inactive
>> 64 pgs stuck unclean
>> too few PGs per OSD (8 < min 30)
>
>Those 3 lines tell you pretty much all there is wrong.
>You did (correctly) set the default pg and pgp nums to something sensible
>(512) in your ceph.conf.
>Unfortunately when creating the initial pool (rbd) it still ignores those
>settings.
>
>You could try to increase those for your pool, which may or may not work.
>
>The easier and faster way is to remove the rbd pool and create it again.
>This should pick up your default settings.
>
>Christian
>
>>  monmap e1: 1 mons at {monitorOne=192.168.1.153:6789/0}
>> election epoch 1, quorum 0 monitorOne
>>  osdmap e58: 8 osds: 8 up, 8 in
>>   pgmap v191: 64 pgs, 1 pools, 0 bytes data, 0 objects
>> 8460 MB used, 4162 GB / 4171 GB avail
>>   64 creating
>> 
>> 
>> How to deal with this HEALTH_WARN status?
>> This is my ceph.conf:
>> [global]
>> 
>> 
>> fsid= 0b9b05db-98fe-49e6-b12b-1cce0645c015
>> 
>> 
>> mon initial members = monitorOne
>> mon host= 192.168.1.153
>> filestore_xattr_use_omap= true
>> 
>> 
>> public network  = 192.168.1.0/24
>> cluster network = 10.0.0.0/24
>> pid file= /var/run/ceph/$name.pid
>> 
>> 
>> auth cluster required  = cephx
>> auth service required  = cephx
>> auth client required   = cephx
>> 
>> 
>> osd pool default size   = 3
>> osd pool default min size   = 2
>> osd pool default pg num = 512
>> osd pool default pgp num= 512
>> osd crush chooseleaf type   = 1
>> osd journal size= 1024
>> 
>> 
>> [mon]
>> 
>> 
>> [mon.0]
>> host = monitorOne
>> mon addr = 192.168.1.153:6789
>> 
>> 
>> [osd]
>> 
>> 
>> [osd.0]
>> host = storageOne
>> 
>> 
>> [osd.1]
>> host = storageTwo
>> 
>> 
>> [osd.2]
>> host = storageFour
>> 
>> 
>> [osd.3]
>> host = storageLast
>> 
>> 
>> Could anybody help me?
>> 
>> best regards,
>> 
>> --
>> 
>> Zhen Wang
>
>-- 
>Christian BalzerNetwork/Systems Engineer
>ch...@gol.com  Global OnLine Japan/Fusion Communications
>http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph osd start failed

2015-10-07 Thread Fulin Sun
The failing messages look like the following:

What would be the root cause?

=== osd.0 === 
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 
--keyring=/var/lib/ceph/osd/ceph-0/keyring osd crush create-or-move -- 0 1.74 
host=certstor-18 root=default'
=== osd.1 === 
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.1 
--keyring=/var/lib/ceph/osd/ceph-1/keyring osd crush create-or-move -- 1 1.74 
host=certstor-18 root=default'
=== osd.2 === 
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.2 
--keyring=/var/lib/ceph/osd/ceph-2/keyring osd crush create-or-move -- 2 1.74 
host=certstor-18 root=default'
=== osd.3 === 
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.3 
--keyring=/var/lib/ceph/osd/ceph-3/keyring osd crush create-or-move -- 3 1.74 
host=certstor-18 root=default'
=== osd.4 === 
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.4 
--keyring=/var/lib/ceph/osd/ceph-4/keyring osd crush create-or-move -- 4 1.74 
host=certstor-18 root=default'
=== osd.5 === 
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.5 
--keyring=/var/lib/ceph/osd/ceph-5/keyring osd crush create-or-move -- 5 1.74 
host=certstor-18 root=default'
=== osd.6 === 
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.6 
--keyring=/var/lib/ceph/osd/ceph-6/keyring osd crush create-or-move -- 6 1.74 
host=certstor-18 root=default'
=== osd.7 === 
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.7 
--keyring=/var/lib/ceph/osd/ceph-7/keyring osd crush create-or-move -- 7 1.74 
host=certstor-18 root=default'
=== osd.8 === 
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.8 
--keyring=/var/lib/ceph/osd/ceph-8/keyring osd crush create-or-move -- 8 1.74 
host=certstor-18 root=default'
=== osd.9 === 
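
One hedged way to dig further (not something from the thread): re-run one of
the failing commands by hand without the 30-second timeout wrapper so the real
error is visible, and check that the monitors and the OSD keys are reachable
at all:

  /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 \
      --keyring=/var/lib/ceph/osd/ceph-0/keyring \
      osd crush create-or-move -- 0 1.74 host=certstor-18 root=default
  # If that hangs or errors out, check basic cluster reachability and the OSD's auth:
  ceph -s
  ceph auth get osd.0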






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] leveldb compaction error

2015-10-07 Thread Narendra Trivedi (natrived)
Hi Selcuk,

Which version of ceph did you upgrade from to Hammer (0.94)?

--Narendra

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Selcuk 
TUNC
Sent: Thursday, September 17, 2015 12:41 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] leveldb compaction error

hello,

we have noticed leveldb compaction on mount causes a segmentation fault in 
the hammer release (0.94).
It seems related to this pull request 
(github.com/ceph/ceph/pull/4372). Are 
you planning to backport
this fix to the next hammer release?

--
st
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] O_DIRECT on deep-scrub read

2015-10-07 Thread Sage Weil
On Wed, 7 Oct 2015, David Zafman wrote:
> 
> There would be a benefit to doing fadvise POSIX_FADV_DONTNEED after 
> deep-scrub reads for objects not recently accessed by clients.

Yeah, it's the 'except for stuff already in cache' part that we don't do 
(and the kernel doesn't give us a good interface for).  IIRC there was a 
patch that guessed based on whether the obc was already in cache, which 
seems like a pretty decent heuristic, but I forget if that was in the 
final version.

> I see the NewStore objectstore sometimes using the O_DIRECT  flag for writes.
> This concerns me because the open(2) man pages says:
> 
> "Applications should avoid mixing O_DIRECT and normal I/O to the same file,
> and especially to overlapping byte regions in the same file.  Even when the
> filesystem correctly handles the coherency issues in this situation, overall
> I/O throughput is likely to be slower than using either mode alone."

Yeah: an O_DIRECT write will do a cache flush on the write range, so if 
there was already dirty data in cache you'll write twice.  There's 
similarly an invalidate on read.  I need to go back through the newstore 
code and see how the modes are being mixed and how it can be avoided...

sage


> 
> David
> 
> On 10/7/15 7:50 AM, Sage Weil wrote:
> > It's not, but it would not be hard to do this.  There are fadvise-style
> > hints being passed down that could trigger O_DIRECT reads in this case.
> > That may not be the best choice, though--it won't use data that happens
> > to be in cache and it'll also throw it out..
> > 
> > On Wed, 7 Oct 2015, Paweł Sadowski wrote:
> > 
> > > Hi,
> > > 
> > > Can anyone tell if deep scrub is done using O_DIRECT flag or not? I'm
> > > not able to verify that in source code.
> > > 
> > > If not would it be possible to add such feature (maybe config option) to
> > > help keeping Linux page cache in better shape?
> > > 
> > > Thanks,
> > > 
> > > -- 
> > > PS
> > > 
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] O_DIRECT on deep-scrub read

2015-10-07 Thread David Zafman


There would be a benefit to doing fadvise POSIX_FADV_DONTNEED after 
deep-scrub reads for objects not recently accessed by clients.


I see the NewStore objectstore sometimes using the O_DIRECT  flag for 
writes.  This concerns me because the open(2) man pages says:


"Applications should avoid mixing O_DIRECT and normal I/O to the same 
file, and especially to overlapping byte regions in the same file.  Even 
when the filesystem correctly handles the coherency issues in this 
situation, overall I/O throughput is likely to be slower than using 
either mode alone."


David

On 10/7/15 7:50 AM, Sage Weil wrote:

It's not, but it would not be hard to do this.  There are fadvise-style
hints being passed down that could trigger O_DIRECT reads in this case.
That may not be the best choice, though--it won't use data that happens
to be in cache and it'll also throw it out..

On Wed, 7 Oct 2015, Paweł Sadowski wrote:


Hi,

Can anyone tell if deep scrub is done using O_DIRECT flag or not? I'm
not able to verify that in source code.

If not would it be possible to add such feature (maybe config option) to
help keeping Linux page cache in better shape?

Thanks,

--
PS

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Potential OSD deadlock?

2015-10-07 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We forgot to upload the ceph.log yesterday. It is there now.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
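
A hedged sketch of raising and then reverting OSD debug levels on the fly,
along the lines of the debug discussion quoted below; the specific levels here
are assumptions, not values agreed in the thread:

  ceph tell osd.* injectargs '--debug_ms 1 --debug_osd 20'
  # ... reproduce the blocked-I/O window, then revert:
  ceph tell osd.* injectargs '--debug_ms 0/5 --debug_osd 0/5'
  # slow request warnings land in the cluster log and the per-OSD logs:
  grep -i 'slow request' /var/log/ceph/ceph.log /var/log/ceph/ceph-osd.*.log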


On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I upped the debug on about everything and ran the test for about 40
> minutes. I took OSD.19 on ceph1 down and then brought it back in.
> There was at least one op on osd.19 that was blocked for over 1,000
> seconds. Hopefully this will have something that will cast a light on
> what is going on.
>
> We are going to upgrade this cluster to Infernalis tomorrow and rerun
> the test to verify the results from the dev cluster. This cluster
> matches the hardware of our production cluster but is not yet in
> production so we can safely wipe it to downgrade back to Hammer.
>
> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>
> Let me know what else we can do to help.
>
> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
> EDrG
> =BZVw
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> On my second test (a much longer one), it took nearly an hour, but a
>> few messages have popped up over a 20 window. Still far less than I
>> have been seeing.
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> I'll capture another set of logs. Is there any other debugging you
>>> want turned up? I've seen the same thing where I see the message
>>> dispatched to the secondary OSD, but the message just doesn't show up
>>> for 30+ seconds in the secondary OSD logs.
>>> - 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
 On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I can't think of anything. In my dev cluster the only thing that has
> changed is the Ceph versions (no reboot). What I like is even though
> the disks are 100% utilized, it is performing as I expect now. Client
> I/O is slightly degraded during the recovery, but no blocked I/O when
> the OSD boots or during the recovery period. This is with
> max_backfills set to 20, one backfill max in our production cluster is
> painful on OSD boot/recovery. I was able to reproduce this issue on
> our dev cluster very easily and very quickly with these settings. So
> far two tests and an hour later, only the blocked I/O when the OSD is
> marked out. We would love to see that go away too, but this is far
 (me too!)
> better than what we have now. This dev cluster also has
> osd_client_message_cap set to default (100).
>
> We need to stay on the Hammer version of Ceph and I'm willing to take
> the time to bisect this. If this is not a problem in Firefly/Giant,
> would you prefer a bisect to find the introduction of the problem
> (Firefly/Giant -> Hammer) or the introduction of the resolution
> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
> commit that prevents a clean build as that is my most limiting factor?

 Nothing comes to mind.  I think the best way to find this is still to see
 it happen in the logs with hammer.  The frustrating thing with that log
 dump you sent is that although I see plenty of slow request warnings in
 the osd logs, I don't see the requests arriving.  Maybe the logs weren't
 turned up for long enough?

 sage



> Thanks,
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Oct 6, 2015

Re: [ceph-users] avoid 3-mds fs laggy on 1 rejoin?

2015-10-07 Thread Dzianis Kahanovich

John Spray writes:

[...]

Here is part of the log for the restarted mds at debug 7 (without 
standby-replay, but IMHO that doesn't matter):


(PS: How [un]safe are multiple MDSes in current hammer? For now I'm temporarily 
working with "set_max_mds 3", but shutting one mds down is still too laggy for 
the related clients. Also, the 4.2.3 kernel driver is buggy and at least needs 
heavy tuning to work normally under a really busy load...)


2015-10-06 23:38:34.420499 7f256905a780  0 ceph version 0.94.3-242-g79385a8 
(79385a85beea9bccd82c99b6bda653f0224c4fcd), process ceph-mds, pid 17337

2015-10-06 23:38:34.590909 7f256905a780 -1 mds.-1.0 log_to_monitors 
{default=true}
2015-10-06 23:38:34.608101 7f2561355700  7 mds.-1.server handle_osd_map: full = 
0 epoch = 159091
2015-10-06 23:38:34.608457 7f2561355700  5 mds.-1.0 handle_mds_map epoch 7241 
from mon.2
2015-10-06 23:38:34.608558 7f2561355700  7 mds.-1.server handle_osd_map: full = 
0 epoch = 159091
2015-10-06 23:38:34.608614 7f2561355700  5 mds.-1.-1 handle_mds_map epoch 7241 
from mon.2
2015-10-06 23:38:34.608618 7f2561355700  5 mds.-1.-1  old map epoch 7241 <= 
7241, discarding
2015-10-06 23:38:35.047873 7f2561355700  7 mds.-1.server handle_osd_map: full = 
0 epoch = 159092
2015-10-06 23:38:35.390461 7f2561355700  5 mds.-1.-1 handle_mds_map epoch 7242 
from mon.2

2015-10-06 23:38:35.390529 7f2561355700  1 mds.-1.0 handle_mds_map standby
2015-10-06 23:38:35.607255 7f2561355700  5 mds.-1.0 handle_mds_map epoch 7243 
from mon.2
2015-10-06 23:38:35.607292 7f2561355700  1 mds.0.688 handle_mds_map i am now 
mds.0.688
2015-10-06 23:38:35.607310 7f2561355700  1 mds.0.688 handle_mds_map state change 
up:standby --> up:replay

2015-10-06 23:38:35.607313 7f2561355700  1 mds.0.688 replay_start
2015-10-06 23:38:35.607316 7f2561355700  7 mds.0.cache set_recovery_set
2015-10-06 23:38:35.607318 7f2561355700  1 mds.0.688  recovery set is
2015-10-06 23:38:35.607321 7f2561355700  2 mds.0.688 boot_start 0: opening 
inotable
2015-10-06 23:38:35.648096 7f2561355700  2 mds.0.688 boot_start 0: opening 
sessionmap

2015-10-06 23:38:35.648227 7f2561355700  2 mds.0.688 boot_start 0: opening mds 
log
2015-10-06 23:38:35.648230 7f2561355700  5 mds.0.log open discovering log bounds
2015-10-06 23:38:35.648281 7f2561355700  2 mds.0.688 boot_start 0: opening snap 
table
2015-10-06 23:38:35.695211 7f255bf47700  4 mds.0.log Waiting for journal 300 to 
recover...

2015-10-06 23:38:35.699398 7f255bf47700  4 mds.0.log Journal 300 recovered.
2015-10-06 23:38:35.699408 7f255bf47700  4 mds.0.log Recovered journal 300 in 
format 1
2015-10-06 23:38:35.699413 7f255bf47700  2 mds.0.688 boot_start 1: 
loading/discovering base inodes
2015-10-06 23:38:35.699501 7f255bf47700  0 mds.0.cache creating system inode 
with ino:100
2015-10-06 23:38:35.716269 7f255bf47700  0 mds.0.cache creating system inode 
with ino:1

2015-10-06 23:38:35.851761 7f255eb50700  2 mds.0.688 boot_start 2: replaying 
mds log
2015-10-06 23:38:36.272808 7f255bf47700  7 mds.0.cache adjust_subtree_auth -1,-2 
-> -2,-2 on [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() 
hs=0+0,ss=0+0 0x5cde000]
2015-10-06 23:38:36.272837 7f255bf47700  7 mds.0.cache  current root is [dir 1 / 
[2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 
0x5cde000]
2015-10-06 23:38:36.272849 7f255bf47700  7 mds.0.cache adjust_subtree_auth -1,-2 
-> -2,-2 on [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() 
hs=0+0,ss=0+0 0x5cde3c0]
2015-10-06 23:38:36.272855 7f255bf47700  7 mds.0.cache  current root is [dir 100 
~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | 
subtree=1 0x5cde3c0]
2015-10-06 23:38:36.282049 7f255bf47700  7 mds.0.cache 
adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth v=24837995 
cv=0/0 dir_auth=-2 state=1073741824 f(v175 m2015-10-05 14:34:45.120292 73=45+28) 
n(v2043504 rc2015-10-06 23:16:58.837591 b4465915097499 
1125727=964508+161219)/n(v2043504 rc2015-10-06 23:16:46.667354 b4465915097417 
1125726=964507+161219) hs=0+0,ss=0+0 | subtree=1 0x5cde000] bound_dfs []
2015-10-06 23:38:36.282086 7f255bf47700  7 mds.0.cache 
adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth v=24837995 
cv=0/0 dir_auth=-2 state=1073741824 f(v175 m2015-10-05 14:34:45.120292 73=45+28) 
n(v2043504 rc2015-10-06 23:16:58.837591 b4465915097499 
1125727=964508+161219)/n(v2043504 rc2015-10-06 23:16:46.667354 b4465915097417 
1125726=964507+161219) hs=0+0,ss=0+0 | subtree=1 0x5cde000] bounds
2015-10-06 23:38:36.282100 7f255bf47700  7 mds.0.cache  current root is [dir 1 / 
[2,head] auth v=24837995 cv=0/0 dir_auth=-2 state=1073741824 f(v175 m2015-10-05 
14:34:45.120292 73=45+28) n(v2043504 rc2015-10-06 23:16:58.837591 b4465915097499 
1125727=964508+161219)/n(v2043504 rc2015-10-06 23:16:46.667354 b4465915097417 
1125726=964507+161219) hs=0+0,ss=0+0 | subtree=1 0x5cde000]
2015-10-06 23:38:36.282147 7f255bf47700  7 mds.0.cache 
adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 100 ~mds0/ [2,head] auth 
v=28417783 cv=0

Re: [ceph-users] O_DIRECT on deep-scrub read

2015-10-07 Thread Milosz Tanski
On Wed, Oct 7, 2015 at 10:50 AM, Sage Weil  wrote:
> It's not, but it would not be hard to do this.  There are fadvise-style
> hints being passed down that could trigger O_DIRECT reads in this case.
> That may not be the best choice, though--it won't use data that happens
> to be in cache and it'll also throw it out..
>
> On Wed, 7 Oct 2015, Paweł Sadowski wrote:
>
>> Hi,
>>
>> Can anyone tell if deep scrub is done using O_DIRECT flag or not? I'm
>> not able to verify that in source code.
>>
>> If not would it be possible to add such feature (maybe config option) to
>> help keeping Linux page cache in better shape?
>>
>> Thanks,

When I was working on preadv2 somebody brought up a per operation
O_DIRECT flag. There wasn't a clear use case at the time (outside of
saying Linus would "love that").

>>
>> --
>> PS
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: mil...@adfin.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier experiences (for ample sized caches ^o^)

2015-10-07 Thread Christian Balzer

Hello Udo,

On Wed, 07 Oct 2015 11:40:11 +0200 Udo Lembke wrote:

> Hi Christian,
> 
> On 07.10.2015 09:04, Christian Balzer wrote:
> > 
> > ...
> > 
> > My main suspect for the excessive slowness are actually the Toshiba DT
> > type drives used. 
> > We only found out after deployment that these can go into a zombie mode
> > (20% of their usual performance for ~8 hours if not permanently until
> > power cycled) after a week of uptime.
> > Again, the HW cache is likely masking this for the steady state, but
> > asking a sick DT drive to seek (for reads) is just asking for trouble.
> > 
> > ...
> does this mean, you can reboot your OSD-Nodes one after the other and
> then your cluster should be fast enough for app. one week to bring the
> additional node in?
> 
Actually a full shutdown (power cycle); a reboot won't "fix" that state, as the
power to the backplane stays on.

And even if the drives were at full speed, at this point in time
(2x over planned capacity) I'm not sure that would be enough.

Six months and 140 VMs earlier I might have just tried that; now I'm looking
for something that is going to work 100%, no ifs or whens.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] O_DIRECT on deep-scrub read

2015-10-07 Thread Sage Weil
It's not, but it would not be hard to do this.  There are fadvise-style 
hints being passed down that could trigger O_DIRECT reads in this case.  
That may not be the best choice, though--it won't use data that happens 
to be in cache and it'll also throw it out..

On Wed, 7 Oct 2015, Paweł Sadowski wrote:

> Hi,
> 
> Can anyone tell if deep scrub is done using O_DIRECT flag or not? I'm
> not able to verify that in source code.
> 
> If not would it be possible to add such feature (maybe config option) to
> help keeping Linux page cache in better shape?
> 
> Thanks,
> 
> -- 
> PS
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Q on the hererogenity

2015-10-07 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We just transitioned a pre-production cluster from CentOS7 to Debian
Jessie without any issues. I rebooted half of the nodes (6 nodes) on one
day, let it sit for 24 hours and then rebooted the other half the next
day. This was running 0.94.3. I'll be transitioning this cluster back
today.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Oct 7, 2015 at 3:43 AM, Andrey Shevel  wrote:
> Hello everybody,
>
> we are discussing experimental installation ceph cluster under
> Scientific Linux 7.x (or CentOS 7.x).
>
> I wonder if it is possible that in one ceph cluster are used different
> operating platforms (e.g. CentOS, Ubuntu) ?
>
> Does anybody have information/experience on the topic ?
>
> Many thanks in advance.
>
>
> --
> Andrey Y Shevel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWFSgfCRDmVDuy+mK58QAAe4YP/1rj8FQE3C9LFTt2/vb+
W87lUMgh7aLugWcdELkn5tutUaVochfJhNazQwryK2NnrMXt6R5r9B2Be67e
FNpfSA/gLtAvU4o3S5OMOi/0b1hGsGEE3gB7V4+Ita6BGd8HctJDkuK37+Uz
n8Jd/yx5nlX15gzikgmkvqoKRa3O+uTKUuyecyxdtSsAYccB9buaTiH1YRho
7rCa4vWwaUTEQbPHU0r4vGyFVulvNzoHrbG5vdSPZGfjoHERF38xn/A1s/Pv
8V6kpAbJa+cswYqwsGnu8x7DBvOyOieMW9RqpanZj5eJxbSWg6bDhppgPdTZ
H8qt99qgblIzEnZASMC2sI5G0S35N4E0qIfKyuYYTJhdlji19BfBcN2x1gXq
fXvVpn94q3LS1a5eKebxNtnK2rTE4UGJuMFh3XdHmL7dWWbDs9VuC+GgUE3v
EiscMiU5QZUE1otpHF4tWIW5aiRxsxpc5vDLaOmR5c8+xKZ0b0SG4I7nM5Mg
OQyGJyqNTw5HLqZINQMBcN6fZUat/Agdc/3d4WWyEk5HJABXXNE0rzVz37fA
7DNYpkyvhZ4b4CgcKAjDzvD/G52OvcCcIhK+tvu6o2P4Pcvg4p35gsZoEi4h
gYsvdIZ03P9k865lPFjrSu48MWx91ZG5cE+gNqZQcVYTFvlgUV2urrDX1Pa+
Yi9M
=ADkq
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] O_DIRECT on deep-scrub read

2015-10-07 Thread Paweł Sadowski
Hi,

Can anyone tell if deep scrub is done using O_DIRECT flag or not? I'm
not able to verify that in source code.

If not would it be possible to add such feature (maybe config option) to
help keeping Linux page cache in better shape?

Thanks,

-- 
PS

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Q on the hererogenity

2015-10-07 Thread Jan Schermer
We recently mixed CentOS/Ubuntu OSDs and ran into some issues, but I don't 
think those have anything to do with the distros; more likely they were due to 
the fact that we ran a -grsec kernel there.
YMMV. I don't think there's a reason it shouldn't work. It will be much harder 
to debug and tune, though, if you plan to run it mixed in the long run.

Jan

> On 07 Oct 2015, at 11:43, Andrey Shevel  wrote:
> 
> Hello everybody,
> 
> we are discussing experimental installation ceph cluster under
> Scientific Linux 7.x (or CentOS 7.x).
> 
> I wonder if it is possible that in one ceph cluster are used different
> operating platforms (e.g. CentOS, Ubuntu) ?
> 
> Does anybody have information/experience on the topic ?
> 
> Many thanks in advance.
> 
> 
> -- 
> Andrey Y Shevel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Q on the hererogenity

2015-10-07 Thread John Spray
Pretty unusual but not necessarily a problem -- if anything didn't
work I think we'd consider it a bug.  Discussions about getting the
uids/gids to match on OSD filesystems when running as non-root were
the last time I heard people chatting about this (and even then, it only
matters if you move drives between machines).

John

On Wed, Oct 7, 2015 at 10:43 AM, Andrey Shevel  wrote:
> Hello everybody,
>
> we are discussing experimental installation ceph cluster under
> Scientific Linux 7.x (or CentOS 7.x).
>
> I wonder if it is possible that in one ceph cluster are used different
> operating platforms (e.g. CentOS, Ubuntu) ?
>
> Does anybody have information/experience on the topic ?
>
> Many thanks in advance.
>
>
> --
> Andrey Y Shevel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Q on the hererogenity

2015-10-07 Thread Andrey Shevel
Hello everybody,

we are discussing an experimental installation of a ceph cluster under
Scientific Linux 7.x (or CentOS 7.x).

I wonder if it is possible to use different operating platforms
(e.g. CentOS, Ubuntu) within one ceph cluster?

Does anybody have information/experience on the topic?

Many thanks in advance.


-- 
Andrey Y Shevel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier experiences (for ample sized caches ^o^)

2015-10-07 Thread Udo Lembke
Hi Christian,

On 07.10.2015 09:04, Christian Balzer wrote:
> 
> ...
> 
> My main suspect for the excessive slowness are actually the Toshiba DT
> type drives used. 
> We only found out after deployment that these can go into a zombie mode
> (20% of their usual performance for ~8 hours if not permanently until power
> cycled) after a week of uptime.
> Again, the HW cache is likely masking this for the steady state, but
> asking a sick DT drive to seek (for reads) is just asking for trouble.
> 
> ...
does this mean you can reboot your OSD nodes one after the other, and then your 
cluster should be fast enough for approximately
one week to bring the additional node in?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier experiences (for ample sized caches ^o^)

2015-10-07 Thread Christian Balzer

Hello,

On Wed, 07 Oct 2015 07:34:16 +0200 Loic Dachary wrote:

> Hi Christian,
> 
> Interesting use case :-) How many OSDs / hosts do you have ? And how are
> they connected together ?
>
If you look far back in the archives you'd find that design.

And of course there will be a lot of "I told you so" comments, but it
worked just as planned while being within the design specifications. 

For example one of the first things I did was to have 64 VMs install
themselves automatically from a virtual CD-ROM in parallel. 
This Ceph cluster handled that w/o any slow requests and in decent time. 

To answer your question, just 2 nodes with 2 OSDs (RAID6 with a 4GB cache
Areca controller) each, replication of 2 obviously. 
Initially 3, now 6 compute nodes.
All interconnected via redundant 40Gb/s Infiniband (IPoIB), 2 ports per
server and 2 switches. 

While the low number of OSDs is obviously part of the problem here, this is
masked by the journal SSDs and the large HW cache for the steady state. 
My revised design is 6 RAID10 OSDs per node; the change to RAID10 is
mostly to accommodate the type of VMs this cluster wasn't designed for in
the first place.

My main suspect for the excessive slowness is actually the Toshiba DT
type drives used. 
We only found out after deployment that these can go into a zombie mode
(20% of their usual performance for ~8 hours if not permanently until power
cycled) after a week of uptime.
Again, the HW cache is likely masking this for the steady state, but
asking a sick DT drive to seek (for reads) is just asking for trouble.

To illustrate this:
---
DSK |  sdd | busy 86% | read   0 | write 99 | avio 43.6 ms |
DSK |  sda | busy 12% | read   0 | write151 | avio 4.13 ms |
DSK |  sdc | busy  8% | read   0 | write139 | avio 2.82 ms |
DSK |  sdb | busy  7% | read   0 | write132 | avio 2.70 ms |
---
The above is a snippet from atop on another machine here; the 4 disks are
in a RAID 10.
I'm sure you can guess which one is the DT01ACA200 drive; sdb and sdc are
Hitachi HDS723020BLA642 and sda is a Toshiba MG03ACA200.

I have another production cluster that originally had just 3 nodes
with 8 OSDs each. 
It performed much better using MG drives.

So the new node I'm trying to phase in has these MG HDDs and the older ones
will be replaced eventually.

Christian

[snip]

-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com