Re: [ceph-users] Tracking the system calls for OSD write

2014-08-14 Thread Shu, Xinxin
The system call is invoked in FileStore::_do_transaction().
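To make the path down to the syscalls concrete, here is a tiny self-contained toy model (my own sketch, not Ceph source; the struct and path are made up for illustration) of what the filestore ends up doing for an OP_WRITE entry: the object is resolved to a file (FileStore::lfn_open in the real code), the file is opened via the open() syscall, and the payload hits the file through an ordinary pwrite():

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct WriteOp {                  // stand-in for one decoded transaction op
  std::string object_path;        // where the object maps to on the backing fs
  uint64_t offset;
  std::vector<char> data;
};

static int apply_write(const WriteOp& op) {
  // open() is the first syscall you asked about
  int fd = ::open(op.object_path.c_str(), O_WRONLY | O_CREAT, 0644);
  if (fd < 0)
    return -1;
  // pwrite() is where the payload actually reaches the file
  ssize_t r = ::pwrite(fd, op.data.data(), op.data.size(), op.offset);
  ::close(fd);
  return r < 0 ? -1 : 0;
}

int main() {
  WriteOp op{"/tmp/osd-object-demo", 0, {'h', 'i', '\n'}};
  if (apply_write(op) == 0)
    std::puts("wrote object payload");
  return 0;
}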

Cheers,
xinxin

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Sudarsan, Rajesh
Sent: Thursday, August 14, 2014 3:01 PM
To: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: [ceph-users] Tracking the system calls for OSD write

Hi,

I am trying to track the actual open and write system calls on an OSD when a new 
file is created and written. So far, my tracking is as follows:

Using the debug log messages, I located the first write call in the do_osd_ops 
function (case CEPH_OSD_OP_WRITE) in osd/ReplicatedPG.cc (line 3727):

t->write(soid, op.extent.offset, op.extent.length, osd_op.indata);

where t is a transaction object.

This function is defined in os/ObjectStore.h (line 652)

void write(coll_t cid, const ghobject_t& oid, uint64_t off, uint64_t len,
   const bufferlist& data) {
  __u32 op = OP_WRITE;
  ::encode(op, tbl);
  ::encode(cid, tbl);
  ::encode(oid, tbl);
  ::encode(off, tbl);
  ::encode(len, tbl);
  assert(len == data.length());
  if (data.length() > largest_data_len) {
largest_data_len = data.length();
largest_data_off = off;
largest_data_off_in_tbl = tbl.length() + sizeof(__u32);  // we are about to
  }
  ::encode(data, tbl);
  ops++;
}

The encode functions are defined as templates in include/encoding.h (line 61), 
which eventually call bufferlist::append in src/common/buffer.cc (line 1272) 
to append the buffers from one list to the other.

  void buffer::list::append(const list& bl)
  {
_len += bl._len;
    for (std::list<ptr>::const_iterator p = bl._buffers.begin();
 p != bl._buffers.end();
 ++p)
  _buffers.push_back(*p);
  }

Since the buffers are pushed back, I figured there must be a corresponding 
pop_front somewhere in ReplicatedPG.cc, but there is no pop_front associated with 
any write. This is where I am stuck.

At this point I have two questions:

1.   When does the actual file open happen?

2.   Where is the system call to physically write the file to the disk?

Any help is appreciated.

Rajesh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to create multiple OSD's per host?

2014-08-14 Thread Bruce McFarland
I’ll try the prepare/activate commands again. I spent the least amount of time 
with them since activate _always_ failed for me. I’ll go back and check my 
logs, but it probably failed because I was attempting to activate the same location I 
used in ‘prepare’ instead of partition 1 as you suggest (which is 
exactly how it is shown in the documentation example).
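For reference, the invocation I plan to retry is roughly the following, reusing the devices from the log below (worth double-checking against the ceph-deploy docs):

ceph-deploy osd prepare ceph0:/dev/sdl:/dev/md0p17
ceph-deploy osd activate ceph0:/dev/sdl1:/dev/md0p17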

I seemed to get the closest to a working cluster using the ‘manual’ commands 
below. I could try changing the XFS mount point to be on a partition of the hdd 
I’m using for the osd.

mkdir /var/lib/ceph/osd/ceph-$OSD
mkfs -t xfs -f /dev/sd$i
mount -t xfs  /dev/sd$i /var/lib/ceph/osd/ceph-$OSD
ceph-osd -i $OSD --mkfs --mkkey --osd-journal /dev/md0p$PART
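From what I can tell from the docs (so the exact caps and crush syntax may need checking), with the manual route the osd also has to be registered with the cluster and started before the monitor will mark it up, something like:

ceph auth add osd.$OSD osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-$OSD/keyring
ceph osd crush add osd.$OSD 1.0 host=ceph0
service ceph start osd.$OSD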

What I find most confusing about using ceph-deploy with multiple osds on the same 
host is that when ‘ceph-deploy osd create [data] [journal]’ completes there is 
no osd directory for each osd under:

[root@ceph0 ceph]# ll /var/lib/ceph/osd/
total 0
[root@ceph0 ceph]#


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jason 
King
Sent: Thursday, August 14, 2014 8:13 PM
To: ceph-us...@ceph.com
Subject: Re: [ceph-users] How to create multiple OSD's per host?


2014-08-15 7:56 GMT+08:00 Bruce McFarland <bruce.mcfarl...@taec.toshiba.com>:
This is an example of the output from ‘ceph-deploy osd create [data] [journal’
I’ve noticed that all of the ‘ceph-conf’ commands use the same parameter of 
‘–name=osd.’  Everytime ceph-deploy is called. I end up with 30 osd’s – 29 in 
the prepared and 1 active according to the ‘ceph-disk list’ output and only 1 
osd that has a xfs mount point. I’ve tried both with all data/journal devices 
on the same ceph-deploy command line and issuing 1 ceph-deploy cmd for each OSD 
data/journal pair (easier to script).


+ ceph-deploy osd create ceph0:/dev/sdl:/dev/md0p17
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.10): /usr/bin/ceph-deploy osd create 
ceph0:/dev/sdl:/dev/md0p17
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks 
ceph0:/dev/sdl:/dev/md0p17
[ceph0][DEBUG ] connected to host: ceph0
[ceph0][DEBUG ] detect platform information from remote host
[ceph0][DEBUG ] detect machine type
[ceph_deploy.osd][INFO  ] Distro info: CentOS 6.5 Final
[ceph_deploy.osd][DEBUG ] Deploying osd to ceph0
[ceph0][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph0][INFO  ] Running command: udevadm trigger --subsystem-match=block 
--action=add
[ceph_deploy.osd][DEBUG ] Preparing host ceph0 disk /dev/sdl journal 
/dev/md0p17 activate True
[ceph0][INFO  ] Running command: ceph-disk -v prepare --fs-type xfs --cluster 
ceph -- /dev/sdl /dev/md0p17
[ceph0][DEBUG ] Information: Moved requested sector from 34 to 2048 in
[ceph0][DEBUG ] order to align on 2048-sector boundaries.
[ceph0][DEBUG ] The operation has completed successfully.
[ceph0][DEBUG ] meta-data=/dev/sdl1  isize=2048   agcount=4, 
agsize=244188597 blks
[ceph0][DEBUG ]  =   sectsz=512   attr=2, 
projid32bit=0
[ceph0][DEBUG ] data =   bsize=4096   blocks=976754385, 
imaxpct=5
[ceph0][DEBUG ]  =   sunit=0  swidth=0 blks
[ceph0][DEBUG ] naming   =version 2  bsize=4096   ascii-ci=0
[ceph0][DEBUG ] log  =internal log   bsize=4096   blocks=476930, 
version=2
[ceph0][DEBUG ]  =   sectsz=512   sunit=0 blks, 
lazy-count=1
[ceph0][DEBUG ] realtime =none   extsz=4096   blocks=0, 
rtextents=0
[ceph0][DEBUG ] The operation has completed successfully.
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=fsid
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=osd_journal_size
[ceph0][WARNIN] DEBUG:ceph-disk:Journal /dev/md0p17 is a partition
[ceph0][WARNIN] WARNING:ceph-disk:OSD will not be hot-swappable if journal is 
not the same device as the osd data
[ceph0][WARNIN] DEBUG:ceph-disk:Creating osd partition on /dev/sdl
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/sbin/sgdisk 
--largest-new=1 --change-name=1:ceph data 
--partition-guid=1:a96b4af4-11f4-4257-9476-64a6e4c93c28 
--typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be -- /dev/sdl
[ceph0][WARNIN] INFO:ceph-disk:Running command: /sbin/partprobe /dev

Re: [ceph-users] ceph cluster inconsistency?

2014-08-14 Thread Haomai Wang
Hi Kenneth,

I didn't find any valuable info in your logs; they lack the necessary
debug output from the code path that crashes.

But I scanned the encode/decode implementation in GenericObjectMap and
found something bad.

For example, two oids have the same hash and their names are:
A: "rb.data.123"
B: "rb-123"

At the ghobject_t comparison level, A < B. But GenericObjectMap encodes "." as
"%e", so the keys in the DB are:
A: _GHOBJTOSEQ_:blah!51615000!!none!!rb%edata%e123!head
B: _GHOBJTOSEQ_:blah!51615000!!none!!rb-123!head

A > B

It seems the escape function is useless and should be disabled.
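A quick toy illustration of the ordering problem (plain byte-wise string comparison, not the real ghobject_t comparator, and the helper is made up just for this example): because '%' (0x25) sorts before '-' (0x2d) while '.' (0x2e) sorts after it, escaping "." as "%e" flips the relative order of the two names.

#include <iostream>
#include <string>

// escape '.' the way the key encoding does (simplified)
static std::string escape_dots(const std::string& in) {
  std::string out;
  for (char c : in)
    out += (c == '.') ? std::string("%e") : std::string(1, c);
  return out;
}

int main() {
  std::string a = "rb.data.123", b = "rb-123";
  std::cout << "raw:     " << (a < b ? "a < b" : "a > b") << "\n";                          // prints a > b
  std::cout << "escaped: " << (escape_dots(a) < escape_dots(b) ? "a < b" : "a > b") << "\n"; // prints a < b
  return 0;
}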

I'm not sure whether Kenneth's problem is hitting this bug, because this
situation only occurs when the object set is large enough that two
objects end up with the same hash value.

Kenneth, could you find time to run "ceph-kvstore-tool [path-to-osd] list
_GHOBJTOSEQ_ | grep 6adb1100 -A 100"? ceph-kvstore-tool is a debug tool
which can be compiled from source: clone the ceph repo and run
"./autogen.sh; ./configure; cd src; make ceph-kvstore-tool".
"path-to-osd" should be "/var/lib/ceph/osd/ceph-[id]/current/". "6adb1100"
is from your verbose log, and the next 100 rows should contain the
necessary info.

Hi Sage, do you think we need to provide an upgrade function to fix it?


On Thu, Aug 14, 2014 at 7:36 PM, Kenneth Waegeman
 wrote:

>
> - Message from Haomai Wang  -
>Date: Thu, 14 Aug 2014 19:11:55 +0800
>
>From: Haomai Wang 
> Subject: Re: [ceph-users] ceph cluster inconsistency?
>  To: Kenneth Waegeman 
>
>
>> Could you add config "debug_keyvaluestore = 20/20" to the crashed osd
>> and replay the command causing crash?
>>
>> I would like to get more debug infos! Thanks.
>
>
> I included the log in attachment!
> Thanks!
>
>>
>> On Thu, Aug 14, 2014 at 4:41 PM, Kenneth Waegeman
>>  wrote:
>>>
>>>
>>> I have:
>>> osd_objectstore = keyvaluestore-dev
>>>
>>> in the global section of my ceph.conf
>>>
>>>
>>> [root@ceph002 ~]# ceph osd erasure-code-profile get profile11
>>> directory=/usr/lib64/ceph/erasure-code
>>> k=8
>>> m=3
>>> plugin=jerasure
>>> ruleset-failure-domain=osd
>>> technique=reed_sol_van
>>>
>>> the ecdata pool has this as profile
>>>
>>> pool 3 'ecdata' erasure size 11 min_size 8 crush_ruleset 2 object_hash
>>> rjenkins pg_num 128 pgp_num 128 last_change 161 flags hashpspool
>>> stripe_width 4096
>>>
>>> ECrule in crushmap
>>>
>>> rule ecdata {
>>> ruleset 2
>>> type erasure
>>> min_size 3
>>> max_size 20
>>> step set_chooseleaf_tries 5
>>> step take default-ec
>>> step choose indep 0 type osd
>>> step emit
>>> }
>>> root default-ec {
>>> id -8   # do not change unnecessarily
>>> # weight 140.616
>>> alg straw
>>> hash 0  # rjenkins1
>>> item ceph001-ec weight 46.872
>>> item ceph002-ec weight 46.872
>>> item ceph003-ec weight 46.872
>>> ...
>>>
>>> Cheers!
>>> Kenneth
>>>
>>> - Message from Haomai Wang  -
>>>Date: Thu, 14 Aug 2014 10:07:50 +0800
>>>From: Haomai Wang 
>>> Subject: Re: [ceph-users] ceph cluster inconsistency?
>>>  To: Kenneth Waegeman 
>>>  Cc: ceph-users 
>>>
>>>
>>>
 Hi Kenneth,

 Could you give your configuration related to EC and KeyValueStore?
 Not sure whether it's bug on KeyValueStore

 On Thu, Aug 14, 2014 at 12:06 AM, Kenneth Waegeman
  wrote:
>
>
> Hi,
>
> I was doing some tests with rados bench on a Erasure Coded pool (using
> keyvaluestore-dev objectstore) on 0.83, and I see some strangs things:
>
>
> [root@ceph001 ~]# ceph status
> cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>  health HEALTH_WARN too few pgs per osd (4 < min 20)
>  monmap e1: 3 mons at
>
>
> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
> election epoch 6, quorum 0,1,2 ceph001,ceph002,ceph003
>  mdsmap e116: 1/1/1 up {0=ceph001.cubone.os=up:active}, 2
> up:standby
>  osdmap e292: 78 osds: 78 up, 78 in
>   pgmap v48873: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
> 1381 GB used, 129 TB / 131 TB avail
>  320 active+clean
>
> There is around 15T of data, but only 1.3 T usage.
>
> This is also visible in rados:
>
> [root@ceph001 ~]# rados df
> pool name   category KB  objects   clones
> degraded  unfound   rdrd KB   wrwr
> KB
> data-  000
> 0   00000
> ecdata  -16113451009  39339590
> 0   011  3935632  16116850711
> metadata-  2   200
> 0   0   33   36   218
> rb

Re: [ceph-users] How to create multiple OSD's per host?

2014-08-14 Thread Jason King
2014-08-15 7:56 GMT+08:00 Bruce McFarland:

>  This is an example of the output from ‘ceph-deploy osd create [data]
> [journal’
>
> I’ve noticed that all of the ‘ceph-conf’ commands use the same parameter
> of ‘–name=osd.’  Everytime ceph-deploy is called. I end up with 30 osd’s –
> 29 in the prepared and 1 active according to the ‘ceph-disk list’ output
> and only 1 osd that has a xfs mount point. I’ve tried both with all
> data/journal devices on the same ceph-deploy command line and issuing 1
> ceph-deploy cmd for each OSD data/journal pair (easier to script).
>
>
>
>
>
> + ceph-deploy osd create ceph0:/dev/sdl:/dev/md0p17
>
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /root/.cephdeploy.conf
>
> [ceph_deploy.cli][INFO  ] Invoked (1.5.10): /usr/bin/ceph-deploy osd
> create ceph0:/dev/sdl:/dev/md0p17
>
> [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks
> ceph0:/dev/sdl:/dev/md0p17
>
> [ceph0][DEBUG ] connected to host: ceph0
>
> [ceph0][DEBUG ] detect platform information from remote host
>
> [ceph0][DEBUG ] detect machine type
>
> [ceph_deploy.osd][INFO  ] Distro info: CentOS 6.5 Final
>
> [ceph_deploy.osd][DEBUG ] Deploying osd to ceph0
>
> [ceph0][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
>
> [ceph0][INFO  ] Running command: udevadm trigger --subsystem-match=block
> --action=add
>
> [ceph_deploy.osd][DEBUG ] Preparing host ceph0 disk /dev/sdl journal
> /dev/md0p17 activate True
>
> [ceph0][INFO  ] Running command: ceph-disk -v prepare --fs-type xfs
> --cluster ceph -- /dev/sdl /dev/md0p17
>
> [ceph0][DEBUG ] Information: Moved requested sector from 34 to 2048 in
>
> [ceph0][DEBUG ] order to align on 2048-sector boundaries.
>
> [ceph0][DEBUG ] The operation has completed successfully.
>
> [ceph0][DEBUG ] meta-data=/dev/sdl1  isize=2048   agcount=4,
> agsize=244188597 blks
>
> [ceph0][DEBUG ]  =   sectsz=512   attr=2,
> projid32bit=0
>
> [ceph0][DEBUG ] data =   bsize=4096
> blocks=976754385, imaxpct=5
>
> [ceph0][DEBUG ]  =   sunit=0  swidth=0 blks
>
> [ceph0][DEBUG ] naming   =version 2  bsize=4096   ascii-ci=0
>
> [ceph0][DEBUG ] log  =internal log   bsize=4096
> blocks=476930, version=2
>
> [ceph0][DEBUG ]  =   sectsz=512   sunit=0
> blks, lazy-count=1
>
> [ceph0][DEBUG ] realtime =none   extsz=4096   blocks=0,
> rtextents=0
>
> [ceph0][DEBUG ] The operation has completed successfully.
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --cluster=ceph --show-config-value=fsid
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --cluster=ceph --show-config-value=osd_journal_size
>
> [ceph0][WARNIN] DEBUG:ceph-disk:Journal /dev/md0p17 is a partition
>
> [ceph0][WARNIN] WARNING:ceph-disk:OSD will not be hot-swappable if journal
> is not the same device as the osd data
>
> [ceph0][WARNIN] DEBUG:ceph-disk:Creating osd partition on /dev/sdl
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/sbin/sgdisk
> --largest-new=1 --change-name=1:ceph data
> --partition-guid=1:a96b4af4-11f4-4257-9476-64a6e4c93c28
> --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be -- /dev/sdl
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /sbin/partprobe /dev/sdl
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /sbin/udevadm settle
>
> [ceph0][WARNIN] DEBUG:ceph-disk:Creating xfs fs on /dev/sdl1
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /sbin/mkfs -t xfs -f -i
> size=2048 -- /dev/sdl1
>
> [ceph0][WARNIN] DEBUG:ceph-disk:Mounting /dev/sdl1 on
> /var/lib/ceph/tmp/mnt.8xAu31 with options noatime
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /bin/mount -t xfs -o
> noatime -- /dev/sdl1 /var/lib/ceph/tmp/mnt.8xAu31
>
> [ceph0][WARNIN] DEBUG:ceph-disk:Preparing osd data dir
> /var/lib/ceph/tmp/mnt.8xAu31
>
> [ceph0][WARNIN] DEBUG:ceph-disk:Creating symlink
> /var/lib/ceph/tmp/mnt.8xAu31/journal -> /dev/md0p17
>
> [ceph0][WARNIN] DEBUG:ceph-disk:Unmounting /var/lib/ceph/tmp/mnt.8xAu31
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /bin/umount --
> /var/lib/ceph/tmp/mnt.8xAu31
>
> [ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/sbin/sgdisk
> --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- /dev/sdl
>
> [ceph0][WARNIN] INFO:ceph-disk:calling partx on prepared device /dev/sdl
>
> [ceph0][WARNIN] INFO:ceph-disk:re-reading known partitions w

[ceph-users] help to confirm if journal includes everything a OP has

2014-08-14 Thread yuelongguang
hi,all
 
By reading the code, I notice that everything in an OP is encoded into a Transaction, 
which is written into the journal later.
Does the journal record everything (metadata, xattrs, file data...) of an OP?
If so, everything is written to disk twice, and the journal always fills up, 
right?
 
 
thanks ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can osd start up if journal is lost and it has not been replayed?

2014-08-14 Thread yuelongguang
hi

Could you tell me the reason why 'the journal is lost, the OSD is lost'? If the 
journal is lost, actually only the part that had not yet been replayed is lost.
Take a similar case as an example: an OSD is down for some time and its journal 
is out of date (it is missing part of the journal), but it can still catch up with the other osds. Why?
That example suggests that either an outdated osd can get all of the journal from the others, 
or 'catching up' works on a different principle than the journal.
Could you explain?
 
 
 
thanks








At 2014-08-14 05:21:20, "Craig Lewis"  wrote:

If the journal is lost, the OSD is lost.  This can be a problem if you use 1 
SSD for journals for many OSDs.


There has been some discussion about making the OSDs able to recover from a 
lost journal, but I haven't heard anything else about it.  I haven't been 
paying much attention to the developer mailing list though.




For your second question, I'd start by looking at the source code in 
src/osd/ReplicatedPG.cc (for standard replication), or src/osd/ECBackend.cc 
(for Erasure Coding).  I'm not a Ceph developer though, so that might not be 
the right place to start.





On Tue, Aug 12, 2014 at 7:08 PM, yuelongguang  wrote:

hi,all
 
1.
can osd start up  if journal is lost and it has not been replayed?
 
2.
how it catchs up latest epoch?  take osd as example,  where is the code? it 
better you consider journal is lost or not.
in my mind journal only includes meta/R/W operations, does not include 
data(file data).
 
 
thanks



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to create multiple OSD's per host?

2014-08-14 Thread Bruce McFarland
This is an example of the output from 'ceph-deploy osd create [data] [journal]'.
I've noticed that all of the 'ceph-conf' commands use the same parameter, 
'--name=osd.', every time ceph-deploy is called. I end up with 30 OSDs: 29 in 
the 'prepared' state and 1 'active' according to the 'ceph-disk list' output, and only 1 
OSD that has an xfs mount point. I've tried both with all data/journal devices 
on the same ceph-deploy command line and issuing one ceph-deploy command for each OSD 
data/journal pair (easier to script).


+ ceph-deploy osd create ceph0:/dev/sdl:/dev/md0p17
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.10): /usr/bin/ceph-deploy osd create 
ceph0:/dev/sdl:/dev/md0p17
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks 
ceph0:/dev/sdl:/dev/md0p17
[ceph0][DEBUG ] connected to host: ceph0
[ceph0][DEBUG ] detect platform information from remote host
[ceph0][DEBUG ] detect machine type
[ceph_deploy.osd][INFO  ] Distro info: CentOS 6.5 Final
[ceph_deploy.osd][DEBUG ] Deploying osd to ceph0
[ceph0][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph0][INFO  ] Running command: udevadm trigger --subsystem-match=block 
--action=add
[ceph_deploy.osd][DEBUG ] Preparing host ceph0 disk /dev/sdl journal 
/dev/md0p17 activate True
[ceph0][INFO  ] Running command: ceph-disk -v prepare --fs-type xfs --cluster 
ceph -- /dev/sdl /dev/md0p17
[ceph0][DEBUG ] Information: Moved requested sector from 34 to 2048 in
[ceph0][DEBUG ] order to align on 2048-sector boundaries.
[ceph0][DEBUG ] The operation has completed successfully.
[ceph0][DEBUG ] meta-data=/dev/sdl1  isize=2048   agcount=4, 
agsize=244188597 blks
[ceph0][DEBUG ]  =   sectsz=512   attr=2, 
projid32bit=0
[ceph0][DEBUG ] data =   bsize=4096   blocks=976754385, 
imaxpct=5
[ceph0][DEBUG ]  =   sunit=0  swidth=0 blks
[ceph0][DEBUG ] naming   =version 2  bsize=4096   ascii-ci=0
[ceph0][DEBUG ] log  =internal log   bsize=4096   blocks=476930, 
version=2
[ceph0][DEBUG ]  =   sectsz=512   sunit=0 blks, 
lazy-count=1
[ceph0][DEBUG ] realtime =none   extsz=4096   blocks=0, 
rtextents=0
[ceph0][DEBUG ] The operation has completed successfully.
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=fsid
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=osd_journal_size
[ceph0][WARNIN] DEBUG:ceph-disk:Journal /dev/md0p17 is a partition
[ceph0][WARNIN] WARNING:ceph-disk:OSD will not be hot-swappable if journal is 
not the same device as the osd data
[ceph0][WARNIN] DEBUG:ceph-disk:Creating osd partition on /dev/sdl
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/sbin/sgdisk 
--largest-new=1 --change-name=1:ceph data 
--partition-guid=1:a96b4af4-11f4-4257-9476-64a6e4c93c28 
--typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be -- /dev/sdl
[ceph0][WARNIN] INFO:ceph-disk:Running command: /sbin/partprobe /dev/sdl
[ceph0][WARNIN] INFO:ceph-disk:Running command: /sbin/udevadm settle
[ceph0][WARNIN] DEBUG:ceph-disk:Creating xfs fs on /dev/sdl1
[ceph0][WARNIN] INFO:ceph-disk:Running command: /sbin/mkfs -t xfs -f -i 
size=2048 -- /dev/sdl1
[ceph0][WARNIN] DEBUG:ceph-disk:Mounting /dev/sdl1 on 
/var/lib/ceph/tmp/mnt.8xAu31 with options noatime
[ceph0][WARNIN] INFO:ceph-disk:Running command: /bin/mount -t xfs -o noatime -- 
/dev/sdl1 /var/lib/ceph/tmp/mnt.8xAu31
[ceph0][WARNIN] DEBUG:ceph-disk:Preparing osd data dir 
/var/lib/ceph/tmp/mnt.8xAu31
[ceph0][WARNIN] DEBUG:ceph-disk:Creating symlink 
/var/lib/ceph/tmp/mnt.8xAu31/journal -> /dev/md0p17
[ceph0][WARNIN] DEBUG:ceph-disk:Unmounting /var/lib/ceph/tmp/mnt.8xAu31
[ceph0][WARNIN] INFO:ceph-disk:Running command: /bin/umount -- 
/var/lib/ceph/tmp/mnt.8xAu31
[ceph0][WARNIN] INFO:ceph-disk:Running command: /usr/sbin/sgdisk 
--typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- /dev/sdl
[ceph0][WARNIN] INFO:ceph-disk:calling partx on prepared device /dev/sdl
[ceph0][WARNIN] INFO:ceph-disk:re-reading known partitions will display errors
[ceph0][WARNIN] INFO:ceph-disk:Running command: /sbin/partx -a /dev/sdl
[ceph0][WARNIN] BLKPG: Device or resource busy
[ceph0][WARNIN] error adding partition 1
[ceph0][INFO  ] Running command: udevadm trigger --subsystem-match=block 
--action=add
[ceph0][INFO  ] ch

Re: [ceph-users] Cache tiering and target_max_bytes

2014-08-14 Thread Sage Weil
On Thu, 14 Aug 2014, Paweł Sadowski wrote:
> On 14.08.2014 17:20, Sage Weil wrote:
> > On Thu, 14 Aug 2014, Paweł Sadowski wrote:
> >> Hello,
> >>
> >> I've a cluster of 35 OSD (30 HDD, 5 SSD) with cache tiering configured.
> >> During tests it looks like ceph is not respecting target_max_bytes
> >> settings. Steps to reproduce:
> >>  - configure cache tiering
> >>  - set target_max_bytes to 32G (on hot pool)
> >>  - write more than 32G of data
> >>  - nothing happens



The reason the agent isn't doing any work is that you don't have 
hit_set_* configured for the cache pool, which means the cluster isn't 
tracking what objects get read to inform the flush/evict 
decisions.  Configuring that will fix this.  Try

 ceph osd pool set cache hit_set_type bloom
 ceph osd pool set cache hit_set_count 8
 ceph osd pool set cache hit_set_period 3600

or similar.
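
You can double-check what the pool ended up with (assuming your version lets you query these keys):

 ceph osd pool get cache hit_set_type
 ceph osd pool get cache hit_set_count
 ceph osd pool get cache hit_set_period
 ceph osd pool get cache target_max_bytes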

The agent could still run in a brain-dead mode without it, but it suffers 
from the bug you found.  That was fixed after 0.80.5 and will be in 
0.80.6.

Thanks!
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tiering and target_max_bytes

2014-08-14 Thread Paweł Sadowski
On 14.08.2014 17:20, Sage Weil wrote:
> On Thu, 14 Aug 2014, Paweł Sadowski wrote:
>> Hello,
>>
>> I've a cluster of 35 OSD (30 HDD, 5 SSD) with cache tiering configured.
>> During tests it looks like ceph is not respecting target_max_bytes
>> settings. Steps to reproduce:
>>  - configure cache tiering
>>  - set target_max_bytes to 32G (on hot pool)
>>  - write more than 32G of data
>>  - nothing happens
> Can you 'ceph pg dump pools -f json-pretty' at this point?  And pick a 
> random PG in the cache pool and capture the output of 'ceph pg  
> query'.
[root@host227 ~] ceph df detail
GLOBAL:
    SIZE   AVAIL  RAW USED %RAW USED OBJECTS
    54833G 53657G 1176G    2.14      91468
POOLS:
    NAME     ID CATEGORY USED   %USED OBJECTS DIRTY READ WRITE
    data     0  -        0      0     0       0     0    0
    metadata 1  -        0      0     0       0     0    0
    rbd      2  -        57700M 0.10  14425   14425 0    14425
    cache    3  -        118G   0.22  30247   12742 0    86180
    volumes  4  -        0      0     0       0     0    0
    images   5  -        0      0     0       0     0    0
    backups  6  -        0      0     0       0     0    0
    vms      7  -        182G   0.33  46796   46796 2    46799

*** set target_max_bytes to 128G

[root@host227 ~] ceph osd pool set cache target_max_bytes $[128 * 1024 *
1024 * 1024]
set pool 3 target_max_bytes to 137438953472

[root@host227 ~] ceph -s
cluster 9fd17ded-7fb2-4993-b56c-bfa9ba3d0c1e
 health HEALTH_OK
 monmap e1: 1 mons at {host227=a.b.c.18:6789/0}, election epoch 2,
quorum 0 host227
 osdmap e83: 35 osds: 35 up, 35 in
  pgmap v2494: 1472 pgs, 8 pools, 338 GB data, 86713 objects
1176 GB used, 53657 GB / 54833 GB avail
1472 active+clean


*** data writing ...


[root@host227 ~] ceph df
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
    54833G 53429G 1403G    2.56
POOLS:
NAME ID USED   %USED OBJECTS
data 0  0  0 0  
metadata 1  0  0 0  
rbd  2  57700M 0.10  14425  
cache3  140G   0.26  36054  
volumes  4  0  0 0  
images   5  0  0 0  
backups  6  0  0 0  
vms  7  182G   0.33  46796  

*** ceph didn't move any objects from hot pool



[root@host227 ~] ceph pg dump pools -f json-pretty
dumped pools in format json-pretty

[
{ "poolid": 0,
  "stat_sum": { "num_bytes": 0,
  "num_objects": 0,
  "num_object_clones": 0,
  "num_object_copies": 0,
  "num_objects_missing_on_primary": 0,
  "num_objects_degraded": 0,
  "num_objects_unfound": 0,
  "num_objects_dirty": 0,
  "num_whiteouts": 0,
  "num_read": 0,
  "num_read_kb": 0,
  "num_write": 0,
  "num_write_kb": 0,
  "num_scrub_errors": 0,
  "num_shallow_scrub_errors": 0,
  "num_deep_scrub_errors": 0,
  "num_objects_recovered": 0,
  "num_bytes_recovered": 0,
  "num_keys_recovered": 0,
  "num_objects_omap": 0,
  "num_objects_hit_set_archive": 0},
  "stat_cat_sum": {},
  "log_size": 0,
  "ondisk_log_size": 0},
{ "poolid": 1,
  "stat_sum": { "num_bytes": 0,
  "num_objects": 0,
  "num_object_clones": 0,
  "num_object_copies": 0,
  "num_objects_missing_on_primary": 0,
  "num_objects_degraded": 0,
  "num_objects_unfound": 0,
  "num_objects_dirty": 0,
  "num_whiteouts": 0,
  "num_read": 0,
  "num_read_kb": 0,
  "num_write": 0,
  "num_write_kb": 0,
  "num_scrub_errors": 0,
  "num_shallow_scrub_errors": 0,
  "num_deep_scrub_errors": 0,
  "num_objects_recovered": 0,
  "num_bytes_recovered": 0,
  "num_keys_recovered": 0,
  "num_objects_omap": 0,
  "num_objects_hit_set_archive": 0},
  "stat_cat_sum": {},
  "log_size": 0,
  "ondisk_log_size": 0},
{ "poolid": 2,
  "stat_sum": { "num_bytes": 60502835200,
  "num_objects": 14425,
  "num_object_clones": 0,
  "num_object_copies": 43275,
  "num_objects_missing_on_primary": 0,
  "num_objects_degraded": 0,
  "num_objects_unfound": 0,
  "num_objects_dirty": 14425,
  "num_whiteouts": 0,
  "num_read": 0,
  "num_read_kb": 0,
  "num_write": 14425,
  "num_w

Re: [ceph-users] ceph --status Missing keyring

2014-08-14 Thread John Wilkins
Dan,

Do you have /etc/ceph/ceph.client.admin.keyring, or is that in a local
directory?

Ceph will be looking for it in the /etc/ceph directory by default.

See if adding read permissions works, e.g., sudo chmod +r. You can also try
sudo when executing ceph.
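
For example (assuming the keyring is the one sitting in your home directory; adjust the filename if yours differs):

sudo cp ceph.client.admin.keyring /etc/ceph/
sudo chmod +r /etc/ceph/ceph.client.admin.keyring
sudo ceph --status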




On Wed, Aug 6, 2014 at 6:55 AM, O'Reilly, Dan 
wrote:

> Any idea what may be the issue here?
>
>
>
> [ceph@tm1cldcphal01 ~]$ ceph --status
>
> 2014-08-06 07:53:21.767255 7fe31fd1e700 -1 monclient(hunting): ERROR:
> missing keyring, cannot use cephx for authentication
>
> 2014-08-06 07:53:21.767263 7fe31fd1e700  0 librados: client.admin
> initialization error (2) No such file or directory
>
> Error connecting to cluster: ObjectNotFound
>
> [ceph@tm1cldcphal01 ~]$ ll
>
> total 372
>
> -rw--- 1 ceph ceph 71 Aug  5 21:07 ceph.bootstrap-mds.keyring
>
> -rw--- 1 ceph ceph 71 Aug  5 21:07 ceph.bootstrap-osd.keyring
>
> -rw--- 1 ceph ceph 63 Aug  5 21:07 ceph.client.admin.keyring
>
> -rw--- 1 ceph ceph289 Aug  5 21:01 ceph.conf
>
> -rw--- 1 ceph ceph 355468 Aug  6 07:53 ceph.log
>
> -rw--- 1 ceph ceph 73 Aug  5 21:01 ceph.mon.keyring
>
> [ceph@tm1cldcphal01 ~]$ cat ceph.conf
>
> [global]
>
> auth_service_required = cephx
>
> filestore_xattr_use_omap = true
>
> auth_client_required = cephx
>
> auth_cluster_required = cephx
>
> mon_host = 10.18.201.110,10.18.201.76,10.18.201.77
>
> mon_initial_members = tm1cldmonl01, tm1cldmonl02, tm1cldmonl03
>
> fsid = 474a8905-7537-42a6-8edc-1ab9fd2ca5e4
>
>
>
> [ceph@tm1cldcphal01 ~]$
>
>
>
> Dan O'Reilly
>
> UNIX Systems Administration
>
>
> 9601 S. Meridian Blvd.
>
> Englewood, CO 80112
>
> 720-514-6293
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
John Wilkins
Senior Technical Writer
Inktank
john.wilk...@inktank.com
(415) 425-9599
http://inktank.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librados: client.admin authentication error

2014-08-14 Thread John Wilkins
Can you provide some background?

I've just reworked the cephx authentication sections. They are still in a
wip branch, and as you ask the question, it occurs to me that we do not
have a troubleshooting section for authentication issues.

It could be any number of things:

1. you don't have the client.admin key on the client where you are
executing ceph --status
2. you have a key mismatch
3. the key permissions aren't set for your user (e.g., try sudo).

The updated sections are:

http://ceph.com/docs/wip-doc-authentication/rados/configuration/auth-config-ref/
http://ceph.com/docs/wip-doc-authentication/rados/operations/user-management/

I've put the "how it works" theory into the architecture doc:

http://ceph.com/docs/wip-doc-authentication/architecture/#high-availability-authentication

It does strike me that we could use a bit of troubleshooting for
authentication issues.








On Wed, Aug 6, 2014 at 7:56 AM, O'Reilly, Dan 
wrote:

> Anybody know why this error occurs, and a solution?
>
>
>
> [ceph@tm1cldcphal01 ~]$ ceph --version
>
> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>
> [ceph@tm1cldcphal01 ~]$ ceph --status
>
> 2014-08-06 08:55:13.168770 7f5527929700  0 librados: client.admin
> authentication error (95) Operation not supported
>
> Error connecting to cluster: Error
>
>
>
> Dan O'Reilly
>
> UNIX Systems Administration
>
>
> 9601 S. Meridian Blvd.
>
> Englewood, CO 80112
>
> 720-514-6293
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
John Wilkins
Senior Technical Writer
Inktank
john.wilk...@inktank.com
(415) 425-9599
http://inktank.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Musings

2014-08-14 Thread Robert LeBlanc
We are looking to deploy Ceph in our environment and I have some musings
that I would like some feedback on. There are concerns about scaling a
single Ceph instance to the PBs of size we would use, so the idea is to
start small, say one Ceph cluster per rack or two. Then, as we feel more
comfortable with it, expand/combine clusters into larger systems. I'm
not sure that it is possible to combine discrete Ceph clusters. It also
seems to make sense to build a CRUSH map that defines regions, data
centers, sections, rows, racks, and hosts now so that there is less data
migration later, but I'm not sure how a merge would work.

I've also been toying with the idea of an SSD journal per node versus an SSD
cache tier pool versus lots of RAM for cache. Based on the performance
webinar today, it seems that cache misses in the cache pool cause a lot of
writing to the cache pool and severely degrade performance. I certainly
like the idea of a heat map, so that a single read of an entire VM (backup,
rsync) won't kill the cache pool.

I've also been bouncing around the idea of data locality by configuring the
CRUSH map to keep two of the three replicas within the same row and the
third replica just somewhere in the data center. Based on a conversation on
the IRC a couple of days ago, it seems that this could work very well if
min_size is 2. But the documentation and the objective of Ceph seem to
indicate that min_size only applies in degraded situations. During normal
operation a write would have to be acknowledged by all three replicas
before being returned to the client, otherwise it would be eventually
consistent and not strongly consistent (I do like the idea of eventual
consistency for replication as long as we can be strongly consistent in some
form at the same time, like 2 out of 3). A strawman CRUSH rule for the
row-locality idea is sketched below.
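
As a strawman (untested, from memory of the CRUSH rule syntax, so treat it as a sketch rather than a working rule), something like this should put two replicas under a single row and take the third from anywhere under the root; one caveat is that nothing here forces the third pick out of that row, or even off a host already chosen, so it probably needs refinement:

rule two_in_row_one_anywhere {
        ruleset 3
        type replicated
        min_size 3
        max_size 3
        step take default
        step choose firstn 1 type row
        step chooseleaf firstn 2 type host
        step emit
        step take default
        step chooseleaf firstn 1 type host
        step emit
}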

I've read through the online manual, so now I'm looking for personal
perspectives that you may have.

Thanks,
Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH map advice

2014-08-14 Thread Craig Lewis
On Thu, Aug 14, 2014 at 12:47 AM, Christian Balzer  wrote:
>
> Hello,
>
> On Tue, 12 Aug 2014 10:53:21 -0700 Craig Lewis wrote:
>
>> That's a low probability, given the number of disks you have.  I would've
>> taken that bet (with backups).  As the number of OSDs goes up, the
>> probability of multiple simultaneous failures goes up, and slowly
>> becomes a bad bet.
>>
>
> I must be very unlucky then. ^o^
> As in, I've had dual disk failures in a set of 8 disks 3 times now
> (within the last 6 years).
> And twice that lead to data loss, once with RAID5 (no surprise there) and
> once with RAID10 (unlucky failure of neighboring disks).
> Granted, that was with consumer HDDs and the last one with rather well
> aged ones, too. But there you go.

Yeah, I'd say you're unlucky, unless you're running a pretty large cluster.
 I usually run my 8 disk arrays in RAID-Z2 / RAID6 though; 5 disks is my
limit for RAID-Z1 / RAID5.

I've been lucky so far.  No double failures in my RAID-Z1 / RAID5 arrays,
and no triple failures in my RAID-Z2 / RAID6 arrays.  After 15 years and
hundreds of arrays, I should've had at least one.  I have had several
double failures in RAID1, but none of those were important.


If this isn't a big cluster, I would suspect that you have a vibration or
power issue.  Both are known to cause premature death in HDDs.  Of course,
rebuilding a degraded RAID is also a well known cause of premature HDD
death.



> As for backups, those are for when somebody does something stupid and
> deletes stuff they shouldn't have.
> A storage system should be a) up all the time and b) not loose data.


I completely agree, but never trust it.

Over the years, I've used backups to recover when:

   - I do something stupid
   - My developers do something stupid
   - Hardware does something stupid
   - Manufacturer firmware does something stupid
   - Manufacturer Tech support tells me to do something stupid
   - My datacenter does something stupid
   - My power companies do something stupid

I've lost data from a software RAID0, all the way up to a
quadruply-redundant multi-million dollar hardware storage array.
 Regardless of the promises printed on the box, it's the contingency plans
that keep the paychecks coming.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance really drops from 700MB/s to 10MB/s

2014-08-14 Thread German Anders
I use nmon on each OSD server; it is a really good tool for finding out 
what is going on with CPU, memory, disks, and networking.




German Anders


















--- Original message ---
Asunto: Re: [ceph-users] Performance really drops from 700MB/s to 
10MB/s

De: Craig Lewis 
Para: Mariusz Gronczewski 
Cc: German Anders , Ceph Users 


Fecha: Thursday, 14/08/2014 15:42

I find graphs really help here.  One screen that has all the disk I/O
and latency for all OSDs makes it easy to pin point the bottleneck.

If you don't have that, I'd go low tech: Watch the blinky lights. It's
really easy to see which disk is the hotspot.



On Thu, Aug 14, 2014 at 6:56 AM, Mariusz Gronczewski
 wrote:


Actual OSD (/var/log/ceph/ceph-osd.$id) logs would be more useful.

Few ideas:

* do 'ceph health detail' to get detail of which OSD is stalling
* 'ceph osd perf' to see latency of each osd
* 'ceph --admin-daemon /var/run/ceph/ceph-osd.$id.asok 
dump_historic_ops' shows "recent slow" ops


I actually have a very similar problem: the cluster goes full speed 
(sometimes even for hours) and suddenly everything stops for a minute 
or 5, no disk IO, no IO wait (so disks are fine), no IO errors in 
kernel log, and OSDs only complain that another OSD's subop is slow (but 
on that OSD everything looks fine too)


On Wed, 13 Aug 2014 16:04:30 -0400, German Anders
 wrote:



Also, even a "ls -ltr" could be done inside the /mnt of the RBD that
it freeze the prompt. Any ideas? I've attach some syslogs from one of
the OSD servers and also from the client. Both are running Ubuntu
14.04LTS with Kernel  3.15.8.
The cluster is not usable at this point, since I can't run a "ls" on
the rbd.

Thanks in advance,

Best regards,


German Anders


















--- Original message ---
Asunto: Re: [ceph-users] Performance really drops from 700MB/s to
10MB/s
De: German Anders 
Para: Mark Nelson 
Cc: 
Fecha: Wednesday, 13/08/2014 11:09


Actually is very strange, since if i run the fio test on the client,
and also un parallel run a iostat on all the OSD servers, i don't see
any workload going on over the disks, I mean... nothing! 0.00and
also the fio script on the client is reacting very rare too:


$ sudo fio --filename=/dev/rbd1 --direct=1 --rw=write --bs=4m
--size=10G --iodepth=16 --ioengine=libaio --runtime=60
--group_reporting --name=file99
file99: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio,
iodepth=16
fio-2.1.3
Starting 1 process
Jobs: 1 (f=1): [W] [2.1% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
01h:26m:43s]

It's seems like is doing nothing..



German Anders




















--- Original message ---
Asunto: Re: [ceph-users] Performance really drops from 700MB/s to
10MB/s
De: Mark Nelson 
Para: 
Fecha: Wednesday, 13/08/2014 11:00

On 08/13/2014 08:19 AM, German Anders wrote:



Hi to all,

   I'm having a particular behavior on a 
new Ceph cluster.

I've map
a RBD to a client and issue some performance tests with fio, at this
point everything goes just fine (also the results :) ), but then I try
to run another new test on a new RBD on the same client, and suddenly
the performance goes below 10MB/s and it took almost 10 minutes to
complete a 10G file test, if I issue a *ceph -w* I don't see anything
suspicious, any idea what can be happening here?


When things are going fast, are your disks actually writing data out
as
fast as your client IO would indicate? (don't forgot to count
replication!)  It may be that the great speed is just writing data
into
the tmpfs journals (if the test is only 10GB and spread across 36
OSDs,
it could finish pretty quickly writing to tmpfs!).  FWIW, tmpfs
journals
aren't very safe.  It's not something you want to use outside of
testing
except in unusual circumstances.

In your tests, when things are bad: it's generally worth checking to
see
if any one disk/osd is backed up relative to the others.  There are a
couple of ways to accomplish this.  the Ceph admin socket can tell you
information about each OSD ie how many outstanding IOs and a history
of
slow ops.  You can also look at per-disk statistics with something
like
iostat or collectl.

Hope this helps!





   The cluster is made of:

3 x MON Servers
4 x OSD Servers (3TB SAS 6G disks for OSD daemons & tmpfs for Journal
->
there's one tmpfs of 36GB that is share by 9 OSD daemons, on each
server)
2 x Network SW (Cluster and Public)
10GbE speed on both networks

   The ceph.conf file is the following:

[global]
fsid = 56e56e4c-ea59-4157-8b98-acae109bebe1
mon_initial_members = cephmon01, cephmon02, cephmon03
mon_host = 10.97.10.1,10.97.10.2,10.97.10.3
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
filestore_xattr_use_omap = true
public_network = 10.97.0.0/16
cluster_network = 192.168.10.0/24
osd_pool_default_size = 2
glance_api_version = 2

[mon]
debug_optracker = 0

[mon.cephmon01]
host = cephmon01
mon_addr = 10.97.10.1:678

[ceph-users] How to create multiple OSD's per host?

2014-08-14 Thread Bruce McFarland
I've tried using ceph-deploy, but it wants to assign the same id for each osd 
and I end up with a bunch of "prepared" ceph-disks and only 1 "active". If I 
use the manual "short form" method, the activate step fails and there are no xfs 
mount points on the ceph-disks. If I use the manual "long form", it seems like 
I'm the closest to getting active ceph-disks/osds, but the monitor always shows 
the osds as "down/in" and the ceph-disks don't persist over a boot cycle.

Is there a document anywhere that anyone knows of that explains a step-by-step 
process for bringing up multiple osds per host, i.e. 1 hdd with an ssd journal 
partition per osd?
Thanks,
Bruce
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance really drops from 700MB/s to 10MB/s

2014-08-14 Thread Craig Lewis
I find graphs really help here.  One screen that has all the disk I/O
and latency for all OSDs makes it easy to pinpoint the bottleneck.

If you don't have that, I'd go low tech: Watch the blinky lights. It's
really easy to see which disk is the hotspot.



On Thu, Aug 14, 2014 at 6:56 AM, Mariusz Gronczewski
 wrote:
> Actual OSD (/var/log/ceph/ceph-osd.$id) logs would be more useful.
>
> Few ideas:
>
> * do 'ceph health detail' to get detail of which OSD is stalling
> * 'ceph osd perf' to see latency of each osd
> * 'ceph --admin-daemon /var/run/ceph/ceph-osd.$id.asok dump_historic_ops' 
> shows "recent slow" ops
>
> I actually have very similiar problem, cluster goes full speed (sometimes 
> even for hours) and suddenly everything stops for a minute or 5, no disk IO, 
> no IO wait (so disks are fine), no IO errors in kernel log, and OSDs only 
> complain that other OSD subop is slow (but on that OSD everything looks fine 
> too)
>
> On Wed, 13 Aug 2014 16:04:30 -0400, German Anders
>  wrote:
>
>> Also, even a "ls -ltr" could be done inside the /mnt of the RBD that
>> it freeze the prompt. Any ideas? I've attach some syslogs from one of
>> the OSD servers and also from the client. Both are running Ubuntu
>> 14.04LTS with Kernel  3.15.8.
>> The cluster is not usable at this point, since I can't run a "ls" on
>> the rbd.
>>
>> Thanks in advance,
>>
>> Best regards,
>>
>>
>> German Anders
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> > --- Original message ---
>> > Asunto: Re: [ceph-users] Performance really drops from 700MB/s to
>> > 10MB/s
>> > De: German Anders 
>> > Para: Mark Nelson 
>> > Cc: 
>> > Fecha: Wednesday, 13/08/2014 11:09
>> >
>> >
>> > Actually is very strange, since if i run the fio test on the client,
>> > and also un parallel run a iostat on all the OSD servers, i don't see
>> > any workload going on over the disks, I mean... nothing! 0.00and
>> > also the fio script on the client is reacting very rare too:
>> >
>> >
>> > $ sudo fio --filename=/dev/rbd1 --direct=1 --rw=write --bs=4m
>> > --size=10G --iodepth=16 --ioengine=libaio --runtime=60
>> > --group_reporting --name=file99
>> > file99: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio,
>> > iodepth=16
>> > fio-2.1.3
>> > Starting 1 process
>> > Jobs: 1 (f=1): [W] [2.1% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>> > 01h:26m:43s]
>> >
>> > It's seems like is doing nothing..
>> >
>> >
>> >
>> > German Anders
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >> --- Original message ---
>> >> Asunto: Re: [ceph-users] Performance really drops from 700MB/s to
>> >> 10MB/s
>> >> De: Mark Nelson 
>> >> Para: 
>> >> Fecha: Wednesday, 13/08/2014 11:00
>> >>
>> >> On 08/13/2014 08:19 AM, German Anders wrote:
>> >>>
>> >>> Hi to all,
>> >>>
>> >>>I'm having a particular behavior on a new Ceph cluster.
>> >>> I've map
>> >>> a RBD to a client and issue some performance tests with fio, at this
>> >>> point everything goes just fine (also the results :) ), but then I try
>> >>> to run another new test on a new RBD on the same client, and suddenly
>> >>> the performance goes below 10MB/s and it took almost 10 minutes to
>> >>> complete a 10G file test, if I issue a *ceph -w* I don't see anything
>> >>> suspicious, any idea what can be happening here?
>> >>
>> >> When things are going fast, are your disks actually writing data out
>> >> as
>> >> fast as your client IO would indicate? (don't forgot to count
>> >> replication!)  It may be that the great speed is just writing data
>> >> into
>> >> the tmpfs journals (if the test is only 10GB and spread across 36
>> >> OSDs,
>> >> it could finish pretty quickly writing to tmpfs!).  FWIW, tmpfs
>> >> journals
>> >> aren't very safe.  It's not something you want to use outside of
>> >> testing
>> >> except in unusual circumstances.
>> >>
>> >> In your tests, when things are bad: it's generally worth checking to
>> >> see
>> >> if any one disk/osd is backed up relative to the others.  There are a
>> >> couple of ways to accomplish this.  the Ceph admin socket can tell you
>> >> information about each OSD ie how many outstanding IOs and a history
>> >> of
>> >> slow ops.  You can also look at per-disk statistics with something
>> >> like
>> >> iostat or collectl.
>> >>
>> >> Hope this helps!
>> >>
>> >>>
>> >>>
>> >>>The cluster is made of:
>> >>>
>> >>> 3 x MON Servers
>> >>> 4 x OSD Servers (3TB SAS 6G disks for OSD daemons & tmpfs for Journal
>> >>> ->
>> >>> there's one tmpfs of 36GB that is share by 9 OSD daemons, on each
>> >>> server)
>> >>> 2 x Network SW (Cluster and Public)
>> >>> 10GbE speed on both networks
>> >>>
>> >>>The ceph.conf file is the following:
>> >>>
>> >>> [global]
>> >>> fsid = 56e56e4c-ea59-4157-8b98-acae109bebe1
>> >>> mon_initial_members = cephmon01, cephmon02, cephmon03
>> >>> mon_host = 10.97.10.1,10.97.10.2,10.97.10.3
>> >>> auth_client_required = cephx
>> >>> auth_cluster_

[ceph-users] Translating a RadosGW object name into a filename on disk

2014-08-14 Thread Craig Lewis
In my effort to learn more of the details of Ceph, I'm trying to
figure out how to get from an object name in RadosGW, through the
layers, down to the files on disk.

clewis@clewis-mac ~ $ s3cmd ls s3://cpltest/
2014-08-13 23:0214M  28dde9db15fdcb5a342493bc81f91151
s3://cpltest/vmware-freebsd-tools.tar.gz

Looking at the .rgw pool's contents tells me that the cpltest bucket
is default.73886.55:
root@dev-ceph0:/var/lib/ceph/osd/ceph-0/current# rados -p .rgw ls | grep cpltest
cpltest
.bucket.meta.cpltest:default.73886.55

The rados objects that belong to that bucket are:
root@dev-ceph0:~# rados -p .rgw.buckets ls | grep default.73886.55
default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_1
default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_3
default.73886.55_vmware-freebsd-tools.tar.gz
default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_2
default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_4


I know those shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_ files are the
rest of vmware-freebsd-tools.tar.gz.  I can infer that because this
bucket only has a single file (and the sum of the sizes matches).
With many files, I can't infer the link anymore.

How do I look up that link?

I tried reading the src/rgw/rgw_rados.cc, but I'm getting lost.
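
One way that might expose that link (a sketch; the attribute name is an assumption based on how RGW keeps its head-object metadata, and the manifest itself is a binary blob):

rados -p .rgw.buckets listxattr default.73886.55_vmware-freebsd-tools.tar.gz
rados -p .rgw.buckets getxattr default.73886.55_vmware-freebsd-tools.tar.gz user.rgw.manifest | strings | grep shadow

The tail/shadow object names are usually readable inside that manifest blob.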



My real goal is the reverse.  I recently repaired an inconsistent PG.
The primary replica had the bad data, so I want to verify that the
repaired object is correct.  I have a database that stores the SHA256
of every object.  If I can get from the filename on disk back to an S3
object, I can verify the file.  If it's bad, I can restore from the
replicated zone.
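
For the layer below that, a minimal sketch assuming the default FileStore layout (the PG id and OSD numbers are illustrative):

ceph osd map .rgw.buckets default.73886.55_vmware-freebsd-tools.tar.gz
# prints the pg (e.g. 24.7a) and the acting OSDs (e.g. [3,0])
ls /var/lib/ceph/osd/ceph-3/current/24.7a_head/ | grep vmware-freebsd

Since the on-disk filename embeds an escaped copy of the rados object name (plus its hash and pool id), going from a file on disk back to the rados object, and from there to the bucket marker and S3 key, is mostly a matter of un-escaping that name.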


Aside from today's task, I think it's really handy to understand these
low level details.  I know it's been handy in the past, when I had
disk corruption under my PostgreSQL database.  Knowing (and
practicing) ahead of time really saved me a lot of downtime then.


Thanks for any pointers.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-14 Thread Craig Lewis
It sounds like you need to throttle recovery.  I have this in my ceph.conf:
[osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1
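
The same values can also be injected into a running cluster (a sketch; injected settings do not survive an OSD restart):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'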


Those configs, plus SSD journals, really helped the stability of my cluster
during recovery.  Before I made those changes, I would see OSDs get voted
down by other OSDs for not responding to heartbeats quickly.  Messages in
ceph.log like:
osd.# IP:PORT 420 : [WRN] map e41738 wrongly marked me down

are an indication that OSDs are so overloaded that they're getting kicked
out.



I also ran into problems when OSDs were getting kicked repeatedly.  It
caused those really large sections in pg query's
[recovery_state][past_intervals]
that you also have. I would restart an OSD, it would peer, and then suicide
timeout 300 seconds after starting the peering process.  When I first saw
it, it was only affecting a few OSDs.  If you're seeing repeated suicide
timeouts in the OSD's logs, there's a manual process to catch them up.



On Thu, Aug 14, 2014 at 12:25 AM, Riederer, Michael 
wrote:

>  Hi Craig,
>
> Yes we have stability problems. The cluster is definitely not suitable for
> a production environment. I will not describe the details here. I want to get
> to know ceph and this is possible with the Test-cluster. Some osds are
> very slow, less than 15 MB / sec writable. Also increases the load on the
> ceph nodes to over 30 when a osd is removed and a reorganistation of the
> data is necessary. If the load is very high (over 30) I have seen exactly
> what you describe. osds go down and out and come back up and in.
>
> OK. I'll try the slow osd to remove and then to scrub, deep-scrub the pgs.
>
> Many thanks for your help.
>
> Regards,
> Mike
>
>  --
> *Von:* Craig Lewis [cle...@centraldesktop.com]
> *Gesendet:* Mittwoch, 13. August 2014 19:48
>
> *An:* Riederer, Michael
> *Cc:* Karan Singh; ceph-users@lists.ceph.com
> *Betreff:* Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck
> inactive; 4 pgs stuck unclean
>
>   Yes, ceph pg <pgid> query, not dump.  Sorry about that.
>
>  Are you having problems with OSD stability?  There's a lot of history in
> the [recovery_state][past_intervals]. That's normal when OSDs go down,
> and out, and come back up and in. You have a lot of history there. You
> might even be getting into the point that you have so much failover
> history, the OSDs can't process it all before they hit the suicide timeout.
>
>  [recovery_state][probing_osds] lists a lot of OSDs that have recently
> owned these PGs. If the OSDs are crashing frequently, you need to get that
> under control before proceeding.
>
>  Once the OSDs are stable, I think Ceph just needs to scrub and
> deep-scrub those PGs.
>
>
>  Until Ceph clears out the [recovery_state][probing_osds] section in the
> pg query, it's not going to do anything.  ceph osd lost hears you, but
> doesn't trust you.  Ceph won't do anything until it's actually checked
> those OSDs itself.  Scrubbing and Deep scrubbing should convince it.
>
>  Once that [recovery_state][probing_osds] section is gone, you should see
> the [recovery_state][past_intervals] section shrink or disappear. I don't
> have either section in my pg query. Once that happens, your ceph pg repair
> or ceph pg force_create_pg should finally have some effect.  You may or
> may not need to re-issue those commands.
>
>
>
>
> On Tue, Aug 12, 2014 at 9:32 PM, Riederer, Michael  > wrote:
>
>>  Hi Craig,
>>
>> # ceph pg 2.587 query
>> # ceph pg 2.c1 query
>> # ceph pg 2.92 query
>> # ceph pg 2.e3 query
>>
>> Please download the output form here:
>> http://server.riederer.org/ceph-user/
>>
>> #
>>
>>
>> It is not possible to map a rbd:
>>
>> # rbd map testshareone --pool rbd --name client.admin
>> rbd: add failed: (5) Input/output error
>>
>> I found that:
>> http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/11405
>>  # ceph osd getcrushmap -o crushmap.bin
>>  got crush map from osdmap epoch 3741
>> # crushtool -i crushmap.bin --set-chooseleaf_vary_r 0 -o crushmap-new.bin
>> # ceph osd setcrushmap -i crushmap-new.bin
>> set crush map
>>
>> The Cluster had to do some. Now it looks a bit different.
>>
>> It is still not possible to map a rbd.
>>
>>  root@ceph-admin-storage:~# ceph -s
>> cluster 6b481875-8be5-4508-b075-e1f660fd7b33
>>  health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs
>> stuck unclean
>>  monmap e2: 3 mons at {ceph-1-storage=
>> 10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
>> election epoch 5010, quorum 0,1,2
>> ceph-1-storage,ceph-2-storage,ceph-3-storage
>>   osdmap e34206: 55 osds: 55 up, 55 in
>>   pgmap v10838368: 6144 pgs, 3 pools, 11002 GB data, 2762 kobjects
>> 22078 GB used, 79932 GB / 102010 GB avail
>> 6140 active+clean
>>4 incomplete
>>
>> root@ceph-admin-storage:~# ceph health detail
>> HEALTH

Re: [ceph-users] cache pools on hypervisor servers

2014-08-14 Thread Sage Weil
On Thu, 14 Aug 2014, Andrei Mikhailovsky wrote:
> Hi guys,
> 
> Could someone from the ceph team please comment on running osd cache pool on
> the hypervisors? Is this a good idea, or will it create a lot of performance
> issues?

It doesn't sound like an especially good idea.  In general you want the 
cache pool to be significantly faster than the base pool (think PCI 
attached flash).  And there won't be any particular affinity to the host 
where the VM consuming the storage happens to be, so I don't think there 
is a reason to put the flash in the hypervisor nodes unless there simply 
isn't anywhere else to put them.

Probably what you're after is a client-side write-thru cache?  There is 
some ongoing work to build this into qemu and possibly librbd, but nothing 
is ready yet that I know of.

sage


> 
> Anyone in the ceph community that has done this? Any results to share?
> 
> Many thanks
> 
> Andrei
> 
> 
>   From: "Robert van Leeuwen" 
>   To: "Andrei Mikhailovsky" 
>   Cc: ceph-users@lists.ceph.com
>   Sent: Thursday, 14 August, 2014 9:31:24 AM
>   Subject: RE: cache pools on hypervisor servers
> 
> > Personally I am not worried too much about the hypervisor -
> hypervisor traffic as I am using a dedicated infiniband network for
> storage.
> > It is not used for the guest to guest or the internet traffic or
> anything else. I would like to decrease or at least smooth out the
> traffic peaks between the hypervisors and the SAS/SATA osd storage
> servers.
> > I guess the ssd cache pool would enable me to do that as the
> eviction rate should be more structured compared to the random io
> writes that guest vms generate. Sounds reasonable
> 
> >>I'm very interested in the effect of caching pools in combination
> with running VMs on them so I'd be happy to hear what you find ;)
> > I will give it a try and share back the results when we get the ssd
> kit.
> Excellent, looking forward to it.
> 
> 
> >> As a side note: Running OSDs on hypervisors would not be my
> preferred choice since hypervisor load might impact Ceph performance.
> > Do you think it is not a good idea even if you have a lot of cores
> on the hypervisors?
> > Like 24 or 32 per host server?
> > According to my monitoring, our osd servers are not that stressed
> and generally have over 50% of free cpu power.
> 
> The number of cores do not really matter if they are all busy ;)
> I honestly do not know how Ceph behaves when it is CPU starved but I
> guess it might not be pretty.
> Since your whole environment will be crumbling down if your storage
> becomes unavailable it is not a risk I would take lightly.
> 
> Cheers,
> Robert van Leeuwen
> 
> 
> 
> 
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD disk replacement best practise

2014-08-14 Thread Smart Weblications GmbH - Florian Wiessner
Am 14.08.2014 13:29, schrieb Guang Yang:
> Hi cephers,
> Most recently I am drafting the run books for OSD disk replacement, I think 
> the rule of thumb is to reduce data migration (recover/backfill), and I 
> thought the following procedure should achieve the purpose:
>   1. ceph osd out osd.XXX (mark it out to trigger data migration)
>   2. ceph osd rm osd.XXX
>   3. ceph auth rm osd.XXX
>   4. provision a new OSD which will take XXX as the OSD id and migrate data 
> back.
> 
> With the above procedure, the crush weight of the host never changed so that 
> we can limit the data migration only for those which are neccesary.
> 
> Does it make sense?
> 

Looks sane to me, and i remember that i did it that way a few times.


-- 

Mit freundlichen Grüßen,

Florian Wiessner

Smart Weblications GmbH
Martinsberger Str. 1
D-95119 Naila

fon.: +49 9282 9638 200
fax.: +49 9282 9638 205
24/7: +49 900 144 000 00 - 0,99 EUR/Min*
http://www.smart-weblications.de

--
Sitz der Gesellschaft: Naila
Geschäftsführer: Florian Wiessner
HRB-Nr.: HRB 3840 Amtsgericht Hof
*aus dem dt. Festnetz, ggf. abweichende Preise aus dem Mobilfunknetz
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cache pools on hypervisor servers

2014-08-14 Thread Andrei Mikhailovsky
Hi guys, 

Could someone from the ceph team please comment on running osd cache pool on 
the hypervisors? Is this a good idea, or will it create a lot of performance 
issues? 

Anyone in the ceph community that has done this? Any results to share? 

Many thanks 

Andrei 

- Original Message -

> From: "Robert van Leeuwen" 
> To: "Andrei Mikhailovsky" 
> Cc: ceph-users@lists.ceph.com
> Sent: Thursday, 14 August, 2014 9:31:24 AM
> Subject: RE: cache pools on hypervisor servers

> > Personally I am not worried too much about the hypervisor - hypervisor
> > traffic as I am using a dedicated infiniband network for storage.
> > It is not used for the guest to guest or the internet traffic or anything
> > else. I would like to decrease or at least smooth out the traffic peaks
> > between the hypervisors and the SAS/SATA osd storage servers.
> > I guess the ssd cache pool would enable me to do that as the eviction rate
> > should be more structured compared to the random io writes that guest vms
> > generate.
> Sounds reasonable

> >>I'm very interested in the effect of caching pools in combination with
> >>running VMs on them so I'd be happy to hear what you find ;)
> > I will give it a try and share back the results when we get the ssd kit.
> Excellent, looking forward to it.

> >> As a side note: Running OSDs on hypervisors would not be my preferred
> >> choice since hypervisor load might impact Ceph performance.
> > Do you think it is not a good idea even if you have a lot of cores on the
> > hypervisors?
> > Like 24 or 32 per host server?
> > According to my monitoring, our osd servers are not that stressed and
> > generally have over 50% of free cpu power.

> The number of cores do not really matter if they are all busy ;)
> I honestly do not know how Ceph behaves when it is CPU starved but I guess it
> might not be pretty.
> Since your whole environment will be crumbling down if your storage becomes
> unavailable it is not a risk I would take lightly.

> Cheers,
> Robert van Leeuwen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fixed all active+remapped PGs stuck forever (but I have no clue why)

2014-08-14 Thread David Moreau Simard
Ah, I was afraid it would be related to the amount of replicas versus the 
amount of host buckets.

Makes sense. I was unable to reproduce the issue with three hosts and one OSD 
on each host.

Thanks.
--
David Moreau Simard

On Aug 14, 2014, at 12:36 AM, Christian Balzer 
mailto:ch...@gol.com>> wrote:


Hello,

On Thu, 14 Aug 2014 03:38:11 + David Moreau Simard wrote:

Hi,

Trying to update my continuous integration environment.. same deployment
method with the following specs:
- Ubuntu Precise, Kernel 3.2, Emperor (0.72.2) - Yields a successful,
healthy cluster.
- Ubuntu Trusty, Kernel 3.13, Firefly (0.80.5) - I have stuck placement
groups.

Here’s some relevant bits from the Trusty/Firefly setup before I move on
to what I’ve done/tried: http://pastebin.com/eqQTHcxU <— This was about
halfway through PG healing.

So, the setup is three monitors, two other hosts on which there are 9
OSDs each. At the beginning, all my placement groups were stuck unclean.

And there's your reason why the firefly install "failed".
The default replication is 3 and you have just 2 storage nodes, combined
with the default CRUSH rules that's exactly what will happen.
To avoid this from the start either use 3 nodes or set
---
osd_pool_default_size = 2
osd_pool_default_min_size = 1
---
in your ceph.conf very early on, before creating anything, especially
OSDs.

Setting the replication for all your pools to 2 with "ceph osd pool set 
<pool> size 2" as the first step after your install should have worked, too.

But with all the things you tried, I can't really tell you why things
behaved they way they did for you.

Christian

I tried the easy things first:
- set crush tunables to optimal
- run repairs/scrub on OSDs
- restart OSDs

Nothing happened. All ~12000 PGs remained stuck unclean since forever
active+remapped. Next, I played with the crush map. I deleted the
default replicated_ruleset rule and created a (basic) rule for each pool
for the time being. I set the pools to use their respective rule and
also reduced their size to 2 and min_size to 1.

Still nothing, all PGs stuck.
I’m not sure why but I tried setting the crush tunables to legacy - I
guess in a trial and error attempt.

Half my PGs healed almost immediately. 6082 PGs remained in
active+remapped. I try running scrubs/repairs - it won’t heal the other
half. I set the tunables back to optimal, still nothing.

I set tunables to legacy again and most of them end up healing with only
1335 left in active+remapped.

The remainder of the PGs healed when I restarted the OSDs.

Does anyone have a clue why this happened ?
It looks like switching back and forth between tunables fixed the stuck
PGs ?

I can easily reproduce this if anyone wants more info.

Let me know !
--
David Moreau Simard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Christian BalzerNetwork/Systems Engineer
ch...@gol.comGlobal OnLine Japan/Fusion Communications
http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tiering and target_max_bytes

2014-08-14 Thread Sage Weil
On Thu, 14 Aug 2014, Pawe? Sadowski wrote:
> Hello,
> 
> I've a cluster of 35 OSD (30 HDD, 5 SSD) with cache tiering configured.
> During tests it looks like ceph is not respecting target_max_bytes
> settings. Steps to reproduce:
>  - configure cache tiering
>  - set target_max_bytes to 32G (on hot pool)
>  - write more than 32G of data
>  - nothing happens

Can you 'ceph pg dump pools -f json-pretty' at this point?  And pick a 
random PG in the cache pool and capture the output of 'ceph pg <pgid> 
query'.

Then 'ceph tell osd.* injectargs '--debug-ms 1 --debug-osd 20'.

> If I set target_max_bytes again (to the same value or any other option,
> for example cache_min_evict_age) ceph will start to move data from hot
> to base pool.

Once it starts going, capture an OSD log (/var/log/ceph/ceph-osd.NNN.log) 
for an OSD that is now moving data.

Thanks!
sage

> 
> I'm using ceph in version 0.80.4 (with cherry-picked patch from bug
> http://tracker.ceph.com/issues/8982.
> 
> Is there away to make it work as expected?
> 
> -- 
> PS
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados bench no clean cleanup

2014-08-14 Thread Kenneth Waegeman


- Message from zhu qiang  -
   Date: Fri, 8 Aug 2014 10:00:18 +0800
   From: zhu qiang 
Subject: RE: [ceph-users] rados bench no clean cleanup
 To: 'Kenneth Waegeman' , 'ceph-users'  





Not "cleanup", it is "--no-cleanup"


I wanted to delete the benchdata, not to keep it:)


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On  
Behalf Of Kenneth Waegeman

Sent: Wednesday, August 06, 2014 4:49 PM
To: ceph-users
Subject: [ceph-users] rados bench no clean cleanup

Hi,

I did a test with 'rados -p ecdata bench 10 write' on an ECpool  
with a cache replicated pool over it (ceph 0.83).
The benchmark wrote about 12TB of data. After the 10 second  
run, rados started to delete its benchmark files.
But only about 2,5TB got deleted, then rados returned. I tried to do  
it with the cleanup function 'rados -p ecdata cleanup --prefix bench'

and after a lot of time, it returns:

  Warning: using slow linear search
  Removed 2322000 objects

But rados df showed the same statistics as before.
I ran it again, and it again showed 'Removed 2322000 objects',  
without any change in the rados df statistics.
It is probably the 'lazy deletion', because if I try to do a 'rados  
get' on it, there is 'No such file or directory'. But I still see  
the objects when I do 'rados -p ecdata ls'.


Is this indeed because of the lazy deletion?  Is there a way to see  
how many not-yet-deleted objects are in the pool? And is there then a  
reason why rados did remove the first 2,5TB? Or is this just a rados  
bench issue?:)


Thanks again!

Kenneth

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



- End message from zhu qiang  -

--

Met vriendelijke groeten,
Kenneth Waegeman


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance really drops from 700MB/s to 10MB/s

2014-08-14 Thread German Anders


Hi Mariusz,

 Thanks a lot for the ideas, I've rebooted the client server, mapped 
the rbd again and launched the fio test again, and this time it worked... very 
strange. While running the test I also ran:


ceph@cephmon01:~$ ceph osd perf
osdid fs_commit_latency(ms) fs_apply_latency(ms)
   0   506   22
   1   465   26
   2   4903
   3   623   13
   4   548   68
   5   484   16
   6   4482
   7   523   27
   8   489   30
   9   498   52
  10   472   12
  11   4077
  12   3150
  13   540   17
  14   599   18
  15   420   14
  16   5157
  17   3953
  18   565   14
  19   557   59
  20   5157
  21   689   56
  22   474   10
  23   1421
  24   3647
  25   3906
  26   507  107
  27   573   20
  28   1581
  29   490   25
  30   3010
  31   381   15
  32   440   27
  33   482   16
  34   3239
  35   414   21

I don't see anything suspicious here. The fio command was:


$ sudo fio --filename=/dev/rbd0 --direct=1 --rw=write --bs=4m 
--size=10G --iodepth=16 --ioengine=libaio --runtime=60 
--group_reporting --name=fileB
fileB: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=16
fio-2.1.3
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0KB/748.0MB/0KB /s] [0/187/0 iops] [eta 00m:00s]

fileB: (groupid=0, jobs=1): err= 0: pid=2172: Thu Aug 14 10:21:13 2014
  write: io=10240MB, bw=741672KB/s, iops=181, runt= 14138msec
    slat (usec): min=569, max=2747, avg=1741.44, stdev=507.08
    clat (msec): min=19, max=465, avg=86.55, stdev=35.16
     lat (msec): min=20, max=466, avg=88.30, stdev=34.92
    clat percentiles (msec):
     |  1.00th=[   39],  5.00th=[   54], 10.00th=[   60], 20.00th=[   64],
     | 30.00th=[   69], 40.00th=[   75], 50.00th=[   81], 60.00th=[   85],
     | 70.00th=[   92], 80.00th=[  102], 90.00th=[  124], 95.00th=[  147],
     | 99.00th=[  217], 99.50th=[  258], 99.90th=[  424], 99.95th=[  441],
     | 99.99th=[  465]
    bw (KB  /s): min=686754, max=783298, per=99.81%, avg=740262.96, stdev=19845.43
    lat (msec) : 20=0.04%, 50=3.36%, 100=75.51%, 250=20.51%, 500=0.59%
  cpu          : usr=6.18%, sys=12.97%, ctx=11554, majf=0, minf=2225
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.3%, 16=99.4%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=2560/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=10240MB, aggrb=741672KB/s, minb=741672KB/s, maxb=741672KB/s, mint=14138msec, maxt=14138msec

Disk stats (read/write):
  rbd0: ios=182/20459, merge=0/0, ticks=92/1213748, in_queue=1214796, util=99.80%

ceph@mail02-old:~$





German Anders



















--- Original message ---
Asunto: Re: [ceph-users] Performance really drops from 700MB/s to 
10MB/s

De: Mariusz Gronczewski 
Para: German Anders 
Cc: 
Fecha: Thursday, 14/08/2014 10:56

Actual OSD (/var/log/ceph/ceph-osd.$id) logs would be more useful.

Few ideas:

* do 'ceph health detail' to get detail of which OSD is stalling
* 'ceph osd perf' to see latency of each osd
* 'ceph --admin-daemon /var/run/ceph/ceph-osd.$id.asok 
dump_historic_ops' shows "recent slow" ops


I actually have very similiar problem, cluster goes full speed 
(sometimes even for hours) and suddenly everything stops for a minute 
or 5, no disk IO, no IO wait (so disks are fine), no IO errors in 
kernel log, and OSDs only complain that other OSD subop is slow (but 
on that OSD everything looks fine too)


On Wed, 13 Aug 2014 16:04:30 -0400, German Anders
 wrote:



Also, even a "ls -ltr" could be done inside the /mnt of the RBD that
it freeze the prompt. Any ideas? I've attach some syslogs from one of
the OSD servers and also from the client. Both are running Ubuntu
14.04LTS with Kernel  3.15.8.
The cluster is not usable at

[ceph-users] Cache tiering and target_max_bytes

2014-08-14 Thread Paweł Sadowski
Hello,

I've a cluster of 35 OSD (30 HDD, 5 SSD) with cache tiering configured.
During tests it looks like ceph is not respecting target_max_bytes
settings. Steps to reproduce:
 - configure cache tiering
 - set target_max_bytes to 32G (on hot pool)
 - write more than 32G of data
 - nothing happens

If I set target_max_bytes again (to the same value or any other option,
for example cache_min_evict_age) ceph will start to move data from hot
to base pool.

I'm using ceph in version 0.80.4 (with cherry-picked patch from bug
http://tracker.ceph.com/issues/8982).

Is there a way to make it work as expected?
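
For reference, a sketch of the knobs that normally drive flushing/eviction on the hot pool (pool name and values are examples only; the dirty/full ratios matter as well as target_max_bytes):

ceph osd pool set hot-pool target_max_bytes 34359738368      # 32G
ceph osd pool set hot-pool cache_target_dirty_ratio 0.4      # start flushing at 40% of the target
ceph osd pool set hot-pool cache_target_full_ratio 0.8       # start evicting at 80% of the target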

-- 
PS


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance really drops from 700MB/s to 10MB/s

2014-08-14 Thread Mariusz Gronczewski
Actual OSD (/var/log/ceph/ceph-osd.$id) logs would be more useful.

Few ideas:

* do 'ceph health detail' to get detail of which OSD is stalling
* 'ceph osd perf' to see latency of each osd
* 'ceph --admin-daemon /var/run/ceph/ceph-osd.$id.asok dump_historic_ops' shows 
"recent slow" ops

I actually have a very similar problem: the cluster goes full speed (sometimes even 
for hours) and then suddenly everything stops for a minute or 5, no disk IO, no IO 
wait (so the disks are fine), no IO errors in the kernel log, and OSDs only complain 
that another OSD's subop is slow (but on that OSD everything looks fine too)

On Wed, 13 Aug 2014 16:04:30 -0400, German Anders
 wrote:

> Also, even a "ls -ltr" could be done inside the /mnt of the RBD that 
> it freeze the prompt. Any ideas? I've attach some syslogs from one of 
> the OSD servers and also from the client. Both are running Ubuntu 
> 14.04LTS with Kernel  3.15.8.
> The cluster is not usable at this point, since I can't run a "ls" on 
> the rbd.
> 
> Thanks in advance,
> 
> Best regards,
> 
> 
> German Anders
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> > --- Original message ---
> > Asunto: Re: [ceph-users] Performance really drops from 700MB/s to 
> > 10MB/s
> > De: German Anders 
> > Para: Mark Nelson 
> > Cc: 
> > Fecha: Wednesday, 13/08/2014 11:09
> >
> >
> > Actually is very strange, since if i run the fio test on the client, 
> > and also un parallel run a iostat on all the OSD servers, i don't see 
> > any workload going on over the disks, I mean... nothing! 0.00and 
> > also the fio script on the client is reacting very rare too:
> >
> >
> > $ sudo fio --filename=/dev/rbd1 --direct=1 --rw=write --bs=4m 
> > --size=10G --iodepth=16 --ioengine=libaio --runtime=60 
> > --group_reporting --name=file99
> > file99: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, 
> > iodepth=16
> > fio-2.1.3
> > Starting 1 process
> > Jobs: 1 (f=1): [W] [2.1% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 
> > 01h:26m:43s]
> >
> > It's seems like is doing nothing..
> >
> >
> >
> > German Anders
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >> --- Original message ---
> >> Asunto: Re: [ceph-users] Performance really drops from 700MB/s to 
> >> 10MB/s
> >> De: Mark Nelson 
> >> Para: 
> >> Fecha: Wednesday, 13/08/2014 11:00
> >>
> >> On 08/13/2014 08:19 AM, German Anders wrote:
> >>>
> >>> Hi to all,
> >>>
> >>>I'm having a particular behavior on a new Ceph cluster. 
> >>> I've map
> >>> a RBD to a client and issue some performance tests with fio, at this
> >>> point everything goes just fine (also the results :) ), but then I try
> >>> to run another new test on a new RBD on the same client, and suddenly
> >>> the performance goes below 10MB/s and it took almost 10 minutes to
> >>> complete a 10G file test, if I issue a *ceph -w* I don't see anything
> >>> suspicious, any idea what can be happening here?
> >>
> >> When things are going fast, are your disks actually writing data out 
> >> as
> >> fast as your client IO would indicate? (don't forgot to count
> >> replication!)  It may be that the great speed is just writing data 
> >> into
> >> the tmpfs journals (if the test is only 10GB and spread across 36 
> >> OSDs,
> >> it could finish pretty quickly writing to tmpfs!).  FWIW, tmpfs 
> >> journals
> >> aren't very safe.  It's not something you want to use outside of 
> >> testing
> >> except in unusual circumstances.
> >>
> >> In your tests, when things are bad: it's generally worth checking to 
> >> see
> >> if any one disk/osd is backed up relative to the others.  There are a
> >> couple of ways to accomplish this.  the Ceph admin socket can tell you
> >> information about each OSD ie how many outstanding IOs and a history 
> >> of
> >> slow ops.  You can also look at per-disk statistics with something 
> >> like
> >> iostat or collectl.
> >>
> >> Hope this helps!
> >>
> >>>
> >>>
> >>>The cluster is made of:
> >>>
> >>> 3 x MON Servers
> >>> 4 x OSD Servers (3TB SAS 6G disks for OSD daemons & tmpfs for Journal 
> >>> ->
> >>> there's one tmpfs of 36GB that is share by 9 OSD daemons, on each 
> >>> server)
> >>> 2 x Network SW (Cluster and Public)
> >>> 10GbE speed on both networks
> >>>
> >>>The ceph.conf file is the following:
> >>>
> >>> [global]
> >>> fsid = 56e56e4c-ea59-4157-8b98-acae109bebe1
> >>> mon_initial_members = cephmon01, cephmon02, cephmon03
> >>> mon_host = 10.97.10.1,10.97.10.2,10.97.10.3
> >>> auth_client_required = cephx
> >>> auth_cluster_required = cephx
> >>> auth_service_required = cephx
> >>> filestore_xattr_use_omap = true
> >>> public_network = 10.97.0.0/16
> >>> cluster_network = 192.168.10.0/24
> >>> osd_pool_default_size = 2
> >>> glance_api_version = 2
> >>>
> >>> [mon]
> >>> debug_optracker = 0
> >>>
> >>> [mon.cephmon01]
> >>> host = cephmon01
> >>> mon_addr = 10.97.10.1:6789
> >>>
> >>> [mon.cephmon02]
> >>> host = cephmon02
> >>> mon_addr = 10.97.10.2:6789
> >>>
> >>

[ceph-users] osd pool stats

2014-08-14 Thread Luis Periquito
Hi,

I've just added a few more OSDs to my cluster. As it was expected the
system started rebalancing all the PGs to the new nodes.

pool .rgw.buckets id 24
  -221/-182 objects degraded (121.429%)
  recovery io 27213 kB/s, 53 objects/s
  client io 27434 B/s rd, 0 B/s wr, 66 op/s

the status outputs:
988801/13249309 objects degraded (7.463%)
  10 active+remapped+wait_backfill
  13 active+remapped+backfilling
 457 active+clean

I'm running ceph 0.80.5.

-- 

Luis Periquito

Unix Engineer

Ocado.com 

Head Office, Titan Court, 3 Bishop Square, Hatfield Business Park,
Hatfield, Herts AL10 9NE

-- 


Notice:  This email is confidential and may contain copyright material of 
members of the Ocado Group. Opinions and views expressed in this message 
may not necessarily reflect the opinions and views of the members of the 
Ocado Group.

If you are not the intended recipient, please notify us immediately and 
delete all copies of this message. Please note that it is your 
responsibility to scan this message for viruses.  

References to the “Ocado Group” are to Ocado Group plc (registered in 
England and Wales with number 7098618) and its subsidiary undertakings (as 
that expression is defined in the Companies Act 2006) from time to time.  
The registered office of Ocado Group plc is Titan Court, 3 Bishops Square, 
Hatfield Business Park, Hatfield, Herts. AL10 9NE.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] running Firefly client (0.80.1) against older version (dumpling 0.67.10) cluster?

2014-08-14 Thread Sage Weil
On Thu, 14 Aug 2014, Nigel Williams wrote:
> Anyone know if this is safe in the short term? we're rebuilding our
> nova-compute nodes and can make sure the Dumpling versions are pinned
> as part of the process in the future.

It's safe, with the possible exception of radosgw, which generally needs 
the server-side cls_rgw (part of the ceph package) updated before it is.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD disk replacement best practise

2014-08-14 Thread Guang Yang
Hi cephers,
Most recently I am drafting the run books for OSD disk replacement, I think the 
rule of thumb is to reduce data migration (recover/backfill), and I thought the 
following procedure should achieve the purpose:
  1. ceph osd out osd.XXX (mark it out to trigger data migration)
  2. ceph osd rm osd.XXX
  3. ceph auth rm osd.XXX
  4. provision a new OSD which will take XXX as the OSD id and migrate data 
back.

With the above procedure, the crush weight of the host never changes, so that we 
can limit the data migration to only what is necessary.

Does it make sense?
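
Roughly, for a failed osd.12 that would look like the following sketch (service commands shown in Ubuntu/upstart form; mkfs and mounting of the new disk under /var/lib/ceph/osd/ceph-12 are omitted for brevity):

ceph osd out osd.12            # trigger data migration off the old disk
stop ceph-osd id=12
ceph osd rm 12
ceph auth del osd.12
# physically replace the disk, mkfs and mount it, then:
ceph osd create                # returns the lowest free id, i.e. 12 again
ceph-osd -i 12 --mkfs --mkkey
ceph auth add osd.12 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-12/keyring
start ceph-osd id=12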

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster inconsistency?

2014-08-14 Thread Kenneth Waegeman


I have:
osd_objectstore = keyvaluestore-dev

in the global section of my ceph.conf


[root@ceph002 ~]# ceph osd erasure-code-profile get profile11
directory=/usr/lib64/ceph/erasure-code
k=8
m=3
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van

the ecdata pool has this as profile

pool 3 'ecdata' erasure size 11 min_size 8 crush_ruleset 2 object_hash  
rjenkins pg_num 128 pgp_num 128 last_change 161 flags hashpspool  
stripe_width 4096


ECrule in crushmap

rule ecdata {
ruleset 2
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step take default-ec
step choose indep 0 type osd
step emit
}
root default-ec {
id -8   # do not change unnecessarily
# weight 140.616
alg straw
hash 0  # rjenkins1
item ceph001-ec weight 46.872
item ceph002-ec weight 46.872
item ceph003-ec weight 46.872
...
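
For reference, a profile and pool like the above would typically have been created along these lines (a sketch, not necessarily the exact commands used here):

ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd
ceph osd pool create ecdata 128 128 erasure profile11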

Cheers!
Kenneth

- Message from Haomai Wang  -
   Date: Thu, 14 Aug 2014 10:07:50 +0800
   From: Haomai Wang 
Subject: Re: [ceph-users] ceph cluster inconsistency?
 To: Kenneth Waegeman 
 Cc: ceph-users 



Hi Kenneth,

Could you give your configuration related to EC and KeyValueStore?
Not sure whether it's bug on KeyValueStore

On Thu, Aug 14, 2014 at 12:06 AM, Kenneth Waegeman
 wrote:

Hi,

I was doing some tests with rados bench on an Erasure Coded pool (using
keyvaluestore-dev objectstore) on 0.83, and I see some strange things:


[root@ceph001 ~]# ceph status
cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
 health HEALTH_WARN too few pgs per osd (4 < min 20)
 monmap e1: 3 mons at
{ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
election epoch 6, quorum 0,1,2 ceph001,ceph002,ceph003
 mdsmap e116: 1/1/1 up {0=ceph001.cubone.os=up:active}, 2 up:standby
 osdmap e292: 78 osds: 78 up, 78 in
  pgmap v48873: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
1381 GB used, 129 TB / 131 TB avail
 320 active+clean

There is around 15T of data, but only 1.3 T usage.

This is also visible in rados:

[root@ceph001 ~]# rados df
pool name   category KB  objects   clones
degraded  unfound   rdrd KB   wrwr KB
data-  000
0   00000
ecdata  -16113451009  39339590
0   011  3935632  16116850711
metadata-  2   200
0   0   33   36   218
rbd -  000
0   00000
  total used  1448266016  3933979
  total avail   139400181016
  total space   140848447032


Another (related?) thing: if I do rados -p ecdata ls, I trigger osd
shutdowns (each time):
I get a list followed by an error:

...
benchmark_data_ceph001.cubone.os_8961_object243839
benchmark_data_ceph001.cubone.os_5560_object801983
benchmark_data_ceph001.cubone.os_31461_object856489
benchmark_data_ceph001.cubone.os_8961_object202232
benchmark_data_ceph001.cubone.os_4919_object33199
benchmark_data_ceph001.cubone.os_5560_object807797
benchmark_data_ceph001.cubone.os_4919_object74729
benchmark_data_ceph001.cubone.os_31461_object1264121
benchmark_data_ceph001.cubone.os_5560_object1318513
benchmark_data_ceph001.cubone.os_5560_object1202111
benchmark_data_ceph001.cubone.os_31461_object939107
benchmark_data_ceph001.cubone.os_31461_object729682
benchmark_data_ceph001.cubone.os_5560_object122915
benchmark_data_ceph001.cubone.os_5560_object76521
benchmark_data_ceph001.cubone.os_5560_object113261
benchmark_data_ceph001.cubone.os_31461_object575079
benchmark_data_ceph001.cubone.os_5560_object671042
benchmark_data_ceph001.cubone.os_5560_object381146
2014-08-13 17:57:48.736150 7f65047b5700  0 -- 10.141.8.180:0/1023295 >>
10.141.8.182:6839/4471 pipe(0x7f64fc019b20 sd=5 :0 s=1 pgs=0 cs=0 l=1
c=0x7f64fc019db0).fault

And I can see this in the log files:

   -25> 2014-08-13 17:52:56.323908 7f8a97fa4700  1 --
10.143.8.182:6827/64670 <== osd.57 10.141.8.182:0/15796 51 
osd_ping(ping e220 stamp 2014-08-13 17:52:56.323092) v2  47+0+0
(3227325175 0 0) 0xf475940 con 0xee89fa0
   -24> 2014-08-13 17:52:56.323938 7f8a97fa4700  1 --
10.143.8.182:6827/64670 --> 10.141.8.182:0/15796 -- osd_ping(ping_reply e220
stamp 2014-08-13 17:52:56.323092) v2 -- ?+0 0xf815b00 con 0xee89fa0
   -23> 2014-08-13 17:52:56.324078 7f8a997a7700  1 --
10.141.8.182:6840/64670 <== osd.57 10.141.8.182:0/15796 51 
osd_ping(ping e220 stamp 2014-08-13 17:52:56.323092) v2  47+0+0
(3227325175 0 0) 0xf132bc0 con 0xee8a680
   -22> 2014-08-13 17:52:56.324111 7f8a997a7700  1 --
10.141.8.182:6840/64670 --> 10.141.8.182:0/15

Re: [ceph-users] cache pools on hypervisor servers

2014-08-14 Thread Robert van Leeuwen
> Personally I am not worried too much about the hypervisor - hypervisor 
> traffic as I am using a dedicated infiniband network for storage.
> It is not used for the guest to guest or the internet traffic or anything 
> else. I would like to decrease or at least smooth out the traffic peaks 
> between the hypervisors and the SAS/SATA osd storage servers.
> I guess the ssd cache pool would enable me to do that as the eviction rate 
> should be more structured compared to the random io writes that guest vms 
> generate.
Sounds reasonable

>>I'm very interested in the effect of caching pools in combination with 
>>running VMs on them so I'd be happy to hear what you find ;)
> I will give it a try and share back the results when we get the ssd kit.
Excellent, looking forward to it.


>> As a side note: Running OSDs on hypervisors would not be my preferred choice 
>> since hypervisor load might impact Ceph performance.
> Do you think it is not a good idea even if you have a lot of cores on the 
> hypervisors?
> Like 24 or 32 per host server?
> According to my monitoring, our osd servers are not that stressed and 
> generally have over 50% of free cpu power.

The number of cores does not really matter if they are all busy ;)
I honestly do not know how Ceph behaves when it is CPU starved but I guess it 
might not be pretty.
Since your whole environment will be crumbling down if your storage becomes 
unavailable it is not a risk I would take lightly.

Cheers,
Robert van Leeuwen




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH map advice

2014-08-14 Thread Christian Balzer

Hello,

On Tue, 12 Aug 2014 10:53:21 -0700 Craig Lewis wrote:

> On Mon, Aug 11, 2014 at 11:26 PM, John Morris  wrote:
> 
> > On 08/11/2014 08:26 PM, Craig Lewis wrote:
> >
> >> Your MON nodes are separate hardware from the OSD nodes, right?
> >>
> >
> > Two nodes are OSD + MON, plus a separate MON node.
> >
> >
> >  If so,
> >> with replication=2, you should be able to shut down one of the two OSD
> >> nodes, and everything will continue working.
> >>
> >
> > IIUC, the third MON node is sufficient for a quorum if one of the OSD +
> > MON nodes shuts down, is that right?
> >
> 
> So yeah, if you lose any one node, you'll be fine.
> 
> 
> >
> > Replication=2 is a little worrisome, since we've already seen two disks
> > simultaneously fail just in the year the cluster has been running.
> > That statistically unlikely situation is the first and probably last
> > time I'll see that, but they say lightning can strike twice
> 
> 
> That's a low probability, given the number of disks you have.  I would've
> taken that bet (with backups).  As the number of OSDs goes up, the
> probability of multiple simultaneous failures goes up, and slowly
> becomes a bad bet.
> 

I must be very unlucky then. ^o^ 
As in, I've had dual disk failures in a set of 8 disks 3 times now
(within the last 6 years). 
And twice that led to data loss, once with RAID5 (no surprise there) and
once with RAID10 (unlucky failure of neighboring disks).
Granted, that was with consumer HDDs and the last one with rather well
aged ones, too. But there you go.

As for backups, those are for when somebody does something stupid and
deletes stuff they shouldn't have. 
A storage system should be a) up all the time and b) not lose data.

> 
> 
> >
> >
> >  Since it's for
> >> experimentation, I wouldn't deal with the extra hassle of
> >> replication=4 and custom CRUSH rules to make it work.  If you have
> >> your heart set on that, it should be possible.  I'm no CRUSH expert
> >> though, so I can't say for certain until I've actually done it.
> >>
> >> I'm a bit confused why your performance is horrible though.  I'm
> >> assuming your HDDs are 7200 RPM.  With the SSD journals and
> >> replication=3, you won't have a ton of IO, but you shouldn't have any
> >> problem doing > 100 MB/s with 4 MB blocks.  Unless your SSDs are very
> >> low quality, the HDDs should be your bottleneck.
> >>
> >
> > The below setup is tomorrow's plan; today's reality is 3 OSDs on one
> > node and 2 OSDs on another, crappy SSDs, 1Gb networks, pgs stuck
> > unclean and no monitoring to pinpoint bottlenecks.  My work is cut out
> > for me.  :)
> >
> > Thanks for the helpful reply.  I wish we could just add a third OSD
> > node and have these issues just go away, but it's not in the budget
> > ATM.
> >
That's really unfortunate, because it would solve a lot of your problems
and potential data loss issues.

If you can add HDDs (budget and space wise), consider running RAID1 for
OSDs for the time being and sleep easier with a replication of 2 until
you can add more nodes. 
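
A rough sketch of what that looks like per OSD (device names are examples only):

mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
mkfs.xfs /dev/md10             # then use md10 as the OSD data device instead of a bare disk
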

Christian

> >
> Ah, yeah, that explains the performance problems.  Although, crappy SSD
> journals are still better than no SSD journals.  When I added SSD
> journals to my existing cluster, I saw my write bandwidth go from 10
> MBps/disk to 50MBps/disk.  Average latency dropped a bit, and the
> variance in latency dropped a lot.
> 
> Just adding more disks to your existing nodes would help performance,
> assuming you have room to add them.


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fixed all active+remapped PGs stuck forever (but I have no clue why)

2014-08-14 Thread Christian Balzer

Hello,

On Thu, 14 Aug 2014 01:38:05 -0500 John Morris wrote:

> 
> On 08/13/2014 11:36 PM, Christian Balzer wrote:
> >
> > Hello,
> >
> > On Thu, 14 Aug 2014 03:38:11 + David Moreau Simard wrote:
> >
> >> Hi,
> >>
> >> Trying to update my continuous integration environment.. same
> >> deployment method with the following specs:
> >> - Ubuntu Precise, Kernel 3.2, Emperor (0.72.2) - Yields a successful,
> >> healthy cluster.
> >> - Ubuntu Trusty, Kernel 3.13, Firefly (0.80.5) - I have stuck
> >> placement groups.
> >>
> >> Here’s some relevant bits from the Trusty/Firefly setup before I move
> >> on to what I’ve done/tried: http://pastebin.com/eqQTHcxU <— This was
> >> about halfway through PG healing.
> >>
> >> So, the setup is three monitors, two other hosts on which there are 9
> >> OSDs each. At the beginning, all my placement groups were stuck
> >> unclean.
> >>
> > And there's your reason why the firefly install "failed".
> > The default replication is 3 and you have just 2 storage nodes,
> > combined with the default CRUSH rules that's exactly what will happen.
> > To avoid this from the start either use 3 nodes or set
> > ---
> > osd_pool_default_size = 2
> > osd_pool_default_min_size = 1
> > ---
> > in your ceph.conf very early on, before creating anything, especially
> > OSDs.
> >
> > Setting the replication for all your pools to 2 with "ceph osd pool
> >  set size 2" as the first step after your install should have
> > worked, too.
> 
> Did something change between Emperor and Firefly that the OP would 
> experience this problem only after upgrading and no other configuration 
> changes?
> 
No, not really.
Well aside from the happy warning that you're running legacy tunables and
people (including me, but that was a non-production cluster) taking that
to set tunables optimal and getting a nice workout of their hardware.

I took the OP's statement not to be an upgrade per se, but a fresh install
with either emperor or firefly on his test cluster.

> Your explanation updates my understanding of how the CRUSH algorithm 
> works.  Take this osd tree for example:
> 
> rack rack0
>   host host0
>   osd.0
>   osd.1
>   host host1
>   osd.2
>   osd.3
> 
> I had thought that with size=3, CRUSH would do its best at any 
> particular level of buckets to distribute replicas across failure 
> domains as best as possible, and otherwise try to keep balance.
> 
> Instead, you seem to say at the 'host' bucket level of the CRUSH map, 
> distribution MUST be across size=3 failure domains.  

The default (firefly, but previous ones are functionally identical) crush
map has:
---
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
---

The type host states that there will be no more than one replica per host
(node), so with size=3 you will need at least 3 hosts to choose from.
If you were to change this to type osd, all 3 replicas could wind up on
the same host, which is not really a good idea.
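
For illustration only, the osd-level variant would differ just in the chooseleaf step (a sketch, and not recommended for the reason above):

rule replicated_osd {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type osd
step emit
}

The same pattern applies to any other bucket type (rack, row, ...) that a rule should spread replicas across.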

>In the above osd 
> tree, why does the 'rack' level with the single 'rack0' failure domain 
> not cause the OP's stuck PG problem, even with size=2?  Is that level 
> treated specially for some reason?
> 
Aside from the little detail that the OP has only hosts and osds defined
in his crush map (default after all), the rack and all other bucket types
are ONLY taken into consideration when an actual rule calls on them or if
they are part of the subtree of such a bucket (as in the case of osd being
below host). 

> What if the osd tree looked like this:
> 
> rack rack0
>   host host0
>   osd.0
>   osd.1
>   host host1
>   osd.2
>   osd.3
> rack rack1
>   host host2
>   osd.4
>   osd.5
> 
> Here, I would expect size=2 to always put one replica on each rack. 
Nope, see above. 
CRUSH is not clairvoyant, it needs to be told what you want to do with
bucket types. 

> With size=3 in my previous understanding, I would have hoped for one 
> replica on each host.  
Yes, with default rules.

> With the changes in firefly (or the difference in 
> my understanding vs. reality), would size=3 instead result in stuck PGs, 
> since at the rack level there are only two failure domains, mirroring 
> the OP's problem but at the next higher level?
> 

Only if you had the racks included in the rules.

> If not, would it be a solution for the OP be to artificially split the 
> OSDs on each node into another level of buckets, such as this 
> (disgusting) scheme:
> 
> rack rack0
>   host host0
>   bogus 0
>   osd.0
>   bogus 1
>   osd.1
>   host host1
>   bogus 2
>   osd.2
>   bogus 3
>   osd.3
> 

You might finagle something like 

Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-14 Thread Riederer, Michael
Hi Craig,

Yes we have stability problems. The cluster is definitely not suitable for a 
production environment. I will not describe the details here. I want to get to 
know ceph and this is possible with the test cluster. Some osds are very slow, 
less than 15 MB/sec writable. The load on the ceph nodes also rises to 
over 30 when an osd is removed and a reorganisation of the data is necessary. 
If the load is very high (over 30) I have seen exactly what you describe. osds 
go down and out and come back up and in.

OK. I'll try removing the slow osds and then scrubbing and deep-scrubbing the pgs.

Many thanks for your help.

Regards,
Mike


Von: Craig Lewis [cle...@centraldesktop.com]
Gesendet: Mittwoch, 13. August 2014 19:48
An: Riederer, Michael
Cc: Karan Singh; ceph-users@lists.ceph.com
Betreff: Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 
pgs stuck unclean

Yes, ceph pg <pgid> query, not dump.  Sorry about that.

Are you having problems with OSD stability?  There's a lot of history in the 
[recovery_state][past_intervals]. That's normal when OSDs go down, and out, and 
come back up and in. You have a lot of history there. You might even be getting 
into the point that you have so much failover history, the OSDs can't process 
it all before they hit the suicide timeout.

[recovery_state][probing_osds] lists a lot of OSDs that have recently owned 
these PGs. If the OSDs are crashing frequently, you need to get that under 
control before proceeding.

Once the OSDs are stable, I think Ceph just needs to scrub and deep-scrub those 
PGs.
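
A hedged example for the four PGs in question:

ceph pg scrub 2.587
ceph pg deep-scrub 2.587
# and likewise for 2.92, 2.c1 and 2.e3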


Until Ceph clears out the [recovery_state][probing_osds] section in the pg 
query, it's not going to do anything.  ceph osd lost hears you, but doesn't 
trust you.  Ceph won't do anything until it's actually checked those OSDs 
itself.  Scrubbing and Deep scrubbing should convince it.

Once that [recovery_state][probing_osds] section is gone, you should see the 
[recovery_state][past_intervals] section shrink or disappear. I don't have 
either section in my pg query. Once that happens, your ceph pg repair or ceph 
pg force_create_pg should finally have some effect.  You may or may not need to 
re-issue those commands.




On Tue, Aug 12, 2014 at 9:32 PM, Riederer, Michael 
mailto:michael.riede...@br.de>> wrote:
Hi Craig,

# ceph pg 2.587 query
# ceph pg 2.c1 query
# ceph pg 2.92 query
# ceph pg 2.e3 query

Please download the output form here:
http://server.riederer.org/ceph-user/

#


It is not possible to map a rbd:

# rbd map testshareone --pool rbd --name client.admin
rbd: add failed: (5) Input/output error

I found that: http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/11405
# ceph osd getcrushmap -o crushmap.bin
got crush map from osdmap epoch 3741
# crushtool -i crushmap.bin --set-chooseleaf_vary_r 0 -o crushmap-new.bin
# ceph osd setcrushmap -i crushmap-new.bin
set crush map

The Cluster had to do some. Now it looks a bit different.

It is still not possible to map a rbd.

root@ceph-admin-storage:~# ceph -s
cluster 6b481875-8be5-4508-b075-e1f660fd7b33
 health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck 
unclean
 monmap e2: 3 mons at 
{ceph-1-storage=10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
 election epoch 5010, quorum 0,1,2 ceph-1-storage,ceph-2-storage,ceph-3-storage
 osdmap e34206: 55 osds: 55 up, 55 in
  pgmap v10838368: 6144 pgs, 3 pools, 11002 GB data, 2762 kobjects
22078 GB used, 79932 GB / 102010 GB avail
6140 active+clean
   4 incomplete

root@ceph-admin-storage:~# ceph health detail
HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
pg 2.92 is stuck inactive since forever, current state incomplete, last acting 
[8,13]
pg 2.c1 is stuck inactive since forever, current state incomplete, last acting 
[13,8]
pg 2.e3 is stuck inactive since forever, current state incomplete, last acting 
[20,8]
pg 2.587 is stuck inactive since forever, current state incomplete, last acting 
[13,8]

pg 2.92 is stuck unclean since forever, current state incomplete, last acting 
[8,13]
pg 2.c1 is stuck unclean since forever, current state incomplete, last acting 
[13,8]
pg 2.e3 is stuck unclean since forever, current state incomplete, last acting 
[20,8]
pg 2.587 is stuck unclean since forever, current state incomplete, last acting 
[13,8]
pg 2.587 is incomplete, acting [13,8]
pg 2.e3 is incomplete, acting [20,8]
pg 2.c1 is incomplete, acting [13,8]

pg 2.92 is incomplete, acting [8,13]

###

After updating to firefly, I did the following:

# ceph health detail
HEALTH_WARN crush map has legacy tunables crush map has legacy tunables; see 
http://ceph.com/docs/mas

[ceph-users] Tracking the system calls for OSD write

2014-08-14 Thread Sudarsan, Rajesh
Hi,

I am trying to track the actual system open and write call on a OSD when a new 
file is created and written. So far, my tracking is as follows:

Using the debug log messages, I located the first write call in do_osd_ops 
function (case CEPH_OSD_OP_WRITE) in os/ReplicatedPG.cc (line 3727)

t->write(soid, op.extent.offset, op.extent.length, osd_op.indata);

where t is a transaction object.

This function is defined in os/ObjectStore.h (line 652)

void write(coll_t cid, const ghobject_t& oid, uint64_t off, uint64_t len,
   const bufferlist& data) {
  __u32 op = OP_WRITE;
  ::encode(op, tbl);
  ::encode(cid, tbl);
  ::encode(oid, tbl);
  ::encode(off, tbl);
  ::encode(len, tbl);
  assert(len == data.length());
  if (data.length() > largest_data_len) {
largest_data_len = data.length();
largest_data_off = off;
largest_data_off_in_tbl = tbl.length() + sizeof(__u32);  // we are 
about to
  }
  ::encode(data, tbl);
  ops++;
}

The encode functions are defined as a template in include/encoding.h (line 61) 
which eventually calls bufferlist.append in src/common/buffers.cc (line 1272) 
to insert the buffers from one list to another.

  void buffer::list::append(const list& bl)
  {
_len += bl._len;
for (std::list<ptr>::const_iterator p = bl._buffers.begin();
 p != bl._buffers.end();
 ++p)
  _buffers.push_back(*p);
  }

Since the buffers are "push_back" I figured that there must be a call to 
pop_front in ReplicatedPG.cc . But there is not pop_front associated with any 
write. This is where I am stuck.

At this point I have two questions:

1.   When does the actual file open happen?

2.   Where is the system call to physically write the file to the disk?

Any help is appreciated.

Rajesh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com