cephfs (hammer) flips directory access bits

2016-01-07 Thread CSa
Hi,

we are using cephfs on a ceph cluster (V0.94.5, 3x MON, 1x MDS, ~50x OSD).
Recently, we observed a spontaneous (and unwanted) change in the access 
rights of newly created directories:

$ umask
0077
$ mkdir test 
$ ls -ld test
drwx------ 1 me me 0 Jan  6 14:59 test
$ touch test/foo
$ ls -ld test
drwxrwxrwx 1 me me 0 Jan  6 14:59 test
$

I would kindly like to ask for help in tracking down this issue.

ciao
Christian



The OSD process locked up when I tested CephFS through filebench

2016-01-07 Thread wangsongbo

Hi all,
When I tested randomrw on my cluster through filebench (running ceph 0.94.5),
one of the OSDs was marked down, but I could still see the process with the
ps command.

So I checked the log file and found the following messages:
2016-01-07 02:41:02.104124 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.5 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
2016-01-07 02:41:02.104156 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.6 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
2016-01-07 02:41:02.104168 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.7 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
2016-01-07 02:41:02.104182 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.8 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
2016-01-07 02:41:02.104194 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.12 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
2016-01-07 02:41:02.104208 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.15 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
2016-01-07 02:41:02.104226 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.16 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
2016-01-07 02:41:02.104253 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.17 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
2016-01-07 02:41:03.104394 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.3 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
2016-01-07 02:41:03.104441 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.4 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
2016-01-07 02:41:03.104451 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.5 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
2016-01-07 02:41:03.104459 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.6 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
2016-01-07 02:41:03.104467 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.7 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
2016-01-07 02:41:03.104495 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.8 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
2016-01-07 02:41:03.104503 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.12 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
2016-01-07 02:41:03.104512 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.15 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
2016-01-07 02:41:03.104526 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.16 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
2016-01-07 02:41:03.104541 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.17 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)


2016-01-07 02:56:17.340268 7fa98b99d700  0 -- 10.0.19.68:6816/10289 submit_message osd_op_reply(201270 105e069.046e [write 0~4194304] v1679'6462 uv6462 ondisk = 0) v6 remote, 10.0.3.68:0/49739, failed lossy con, dropping message 0x30f84fc0
2016-01-07 02:56:17.886032 7fa9ae4cb700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 9.802397 secs
2016-01-07 02:56:17.886195 7fa9ae4cb700  0 log_channel(cluster) log [WRN] : slow request 9.802397 seconds old, received at 2016-01-07 02:56:08.083416: osd_op(client.501311.0:201273 105e069.0471 [write 0~4194304] 7.ea64f958 RETRY=1 snapc 1=[] ondisk+retry+write+known_if_redirected e1679) currently waiting for subops from 3,6
2016-01-07 02:56:18.886521 7fa9ae4cb700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 10.802942 secs
2016-01-07 02:56:18.886626 7fa9ae4cb700  0 log_channel(cluster) log [WRN] : slow request 10.802942 seconds old, received at 2016-01-07 02:56:08.083416: osd_op(client.501311.0:201273

Custom STL allocator

2016-01-07 Thread Evgeniy Firsov
I would like your opinion, guys, on two features implemented in an attempt to
greatly reduce the number of memory allocations without major surgery in the
code.

The features are:
1. A custom STL allocator, which allocates the first N items from within the
STL container itself. This is a semi-transparent replacement for the standard
allocator: you just need to replace std::map with ceph_map, for example (a toy
sketch of the idea follows below).
Limitations: a) It breaks move semantics. b) No deallocation is implemented, so
it is not good for big, long-lived containers.
2. A placement allocator, which allows chained allocation of shorter-lived
objects from longer-lived ones. An example would be allocating finish contexts
from an aio completion context.
Limitations: a) It may require some code rearrangement in order to avoid
concurrent deallocations; otherwise the deallocation code uses synchronization,
which limits performance. b) Same as 1b above.
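
For readers who do not want to open the commits right away, here is a toy
sketch of the inline-storage idea behind feature 1 (this is not Evgeniy's
actual patch; the inline_allocator/small_map names are made up for
illustration, and the linked commits below are the authoritative version):

// Allocator that serves the first N objects from a buffer embedded in the
// allocator itself (and therefore in the container), falling back to the
// heap afterwards.  Inline slots are never reused, mirroring limitation 1b.
#include <cstddef>
#include <map>
#include <new>
#include <string>

template <typename T, std::size_t N>
class inline_allocator {
public:
  using value_type = T;

  inline_allocator() noexcept : used_(0) {}
  // A copied or rebound allocator starts with its own empty inline area.
  inline_allocator(const inline_allocator&) noexcept : used_(0) {}
  template <typename U>
  inline_allocator(const inline_allocator<U, N>&) noexcept : used_(0) {}

  template <typename U> struct rebind { using other = inline_allocator<U, N>; };

  T* allocate(std::size_t n) {
    if (used_ + n <= N) {                       // still room in the inline area
      T* p = reinterpret_cast<T*>(buf_) + used_;
      used_ += n;
      return p;
    }
    return static_cast<T*>(::operator new(n * sizeof(T)));
  }

  void deallocate(T* p, std::size_t) {
    const char* c = reinterpret_cast<const char*>(p);
    if (c >= buf_ && c < buf_ + sizeof(buf_))
      return;                                   // inline slots are never freed
    ::operator delete(p);
  }

  bool operator==(const inline_allocator& o) const noexcept { return this == &o; }
  bool operator!=(const inline_allocator& o) const noexcept { return this != &o; }

private:
  alignas(alignof(T)) char buf_[N * sizeof(T)];
  std::size_t used_;
};

// A "ceph_map"-style alias: the first N nodes live inside the map object.
template <typename K, typename V, std::size_t N = 8>
using small_map = std::map<K, V, std::less<K>,
                           inline_allocator<std::pair<const K, V>, N>>;

int main() {
  small_map<int, std::string> m;  // no heap traffic for the first few inserts
  for (int i = 0; i < 16; ++i)
    m[i] = "x";
  return 0;
}

The placement allocator (feature 2) follows the same spirit, except that the
backing storage belongs to a longer-lived object rather than to the container.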

Performance results for 32 threads in a synthetic test, std allocator time to
custom allocator time ratio:

          stl alloc                      stl + placement alloc
block   jemalloc  tcmalloc  ptmalloc   jemalloc  tcmalloc  ptmalloc
1M      1298.01   650.66    137.64     735.49    824.45    9.62
64K     514.84    2.82      304.62     570.74    4.85      12.21
32K     838.89    2.17      5.03       1600.5    7.43      8.28
4K      2.76      1.99      4.98       4.36      5.3       8.23
32B     2.67      5.09      3.69       4.41      8.48      6.4
(100M test iterations for 32B and 4K, 2M for 32K and 64K, 200K for 1M)

I didn't see any performance improvement in a 100% write fio test, but it may
still shine in other workloads or once the proper classes are converted.
Let me know whether it is worth opening PRs for them.

STL allocator:
https://github.com/efirs/ceph/commit/4eed0d63dbcbd00ee3aa325355bfbe56acbb7b05
STL allocator usage example:
https://github.com/efirs/ceph/commit/362c5c4e10563785cc89370d28511e0493f1b211
https://github.com/efirs/ceph/commit/e2df67f7570c68e53775bc55cda12c6253e66d2f
Placement allocator:
https://github.com/efirs/ceph/commit/8df5cd7d753fd09e79a24f2fc781cf3af02e6d3e
Placement allocator usage example:
https://github.com/efirs/ceph/commit/70db18d9c1b39190bde68548b57c2aa7a9e455e0

--
Evgeniy



Re: Is BlueFS an alternative of BlueStore?

2016-01-07 Thread Sage Weil
On Thu, 7 Jan 2016, Javen Wu wrote:
> Hi Sage,
> 
> Sorry to bother you. I am not sure if it is appropriate to send email to you
> directly, but I cannot find any useful information to address my confusion
> on the Internet. I hope you can help me.
> 
> I happened to hear that you are going to start BlueFS to eliminate the
> redundancy between the XFS journal and the RocksDB WAL. I am a little confused.
> Is BlueFS only there to host RocksDB for BlueStore, or is it an
> alternative to BlueStore?
> 
> I am a newcomer to CEPH, and I am not sure my understanding of BlueStore
> is correct. BlueStore in my mind is as below.
> 
>   BlueStore
>   =========
>     RocksDB
> +-----------+  +-----------+
> |   onode   |  |           |
> |    WAL    |  |           |
> |   omap    |  |           |
> +-----------+  |   bdev    |
> |           |  |           |
> |    XFS    |  |           |
> |           |  |           |
> +-----------+  +-----------+

This is the picture before BlueFS enters.

> I am curious how BlueFS is able to host RocksDB, since it is already a
> "filesystem" which has to maintain blockmap-style metadata on its own
> WITHOUT the help of RocksDB.

Right.  BlueFS is a really simple "file system" that is *just* complicated 
enough to implement the rocksdb::Env interface, which is what rocksdb 
needs to store its log and sst files.  The after picture looks like

 +------------+
 | bluestore  |
 +----------+ |
 | rocksdb  | |
 +----------+ |
 |  bluefs  | |
 +----------+-+
 |block device|
 +------------+

> The reason we care about the intention and the design target of BlueFS is
> that I had a discussion with my partner Peng.Hse about an idea to introduce a
> new ObjectStore using the ZFS library. I know CEPH supports ZFS as a FileStore
> backend already, but we had a different, immature idea: use libzpool to
> implement a new ObjectStore for CEPH entirely in userspace, without the SPL
> and ZOL kernel modules, so that we can align the CEPH transaction and the ZFS
> transaction in order to avoid the double write for the CEPH journal.
> The ZFS core part, libzpool (DMU, metaslab etc.), offers a dnode object store
> and is platform-independent (kernel/user). Another benefit of the idea is that
> we can extend our metadata without bothering any DBStore.
> 
> Frankly, we are not sure if our idea is realistic so far, but when I heard of
> BlueFS, I thought we needed to know the BlueFS design goal.

I think it makes a lot of sense, but there are a few challenges.  One 
reason we use rocksdb (or a similar kv store) is that we need in-order 
enumeration of objects in order to do collection listing (needed for 
backfill, scrub, and omap).  You'll need something similar on top of zfs.  
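
To make the enumeration point concrete, here is a toy sketch (not Ceph's
actual key scheme or code) of what collection listing amounts to on top of an
ordered key-value store: a bounded, in-order prefix scan that backfill and
scrub can resume from a cursor.  Whatever sits on top of zfs would have to
offer the same kind of ordered, resumable walk.

#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// List up to `max` object keys in collection `coll`, starting after `cursor`.
// An empty result means the collection is exhausted.
std::vector<std::string> list_collection(
    const std::map<std::string, std::string>& kv,  // ordered keys -> onode data
    const std::string& coll,
    const std::string& cursor,
    std::size_t max) {
  std::vector<std::string> out;
  const std::string prefix = coll + "/";
  // Start strictly after the cursor (or at the first key of the collection).
  auto it = kv.upper_bound(cursor.empty() ? prefix : cursor);
  for (; it != kv.end() && out.size() < max &&
         it->first.compare(0, prefix.size(), prefix) == 0; ++it)
    out.push_back(it->first);
  return out;
}

int main() {
  std::map<std::string, std::string> kv = {
      {"1.2/obj-a", ""}, {"1.2/obj-b", ""}, {"1.2/obj-c", ""}, {"1.3/obj-a", ""}};
  // Walk collection "1.2" two objects at a time, the way backfill batches do.
  std::string cursor;
  for (;;) {
    auto batch = list_collection(kv, "1.2", cursor, 2);
    if (batch.empty()) break;
    for (const auto& k : batch) std::cout << k << "\n";
    cursor = batch.back();
  }
  return 0;
}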

I suspect the simplest path would be to also implement the rocksdb::Env 
interface on top of the zfs libraries.  See BlueRocksEnv.{cc,h} to see the 
interface that has to be implemented...

sage


Re: Is BlueFS an alternative of BlueStore?

2016-01-07 Thread Javen Wu

Thanks Sage for your reply.

I am not sure I understand the challenges you mentioned about backfill/scrub.

I will investigate from the code and let you know if we can conquer the
challenge by easy means.
Our rough ideas for ZFSStore are:
1. encapsulate the dnode object as an onode and add onode attributes.
2. use a ZAP object as a collection. (A ZFS directory uses a ZAP object.)
3. enumerate entries in the ZAP object to list the objects in a collection.
4. create a new metaslab class to store the CEPH journal.
5. align the CEPH journal and the ZFS transaction.

Actually we've talked about the possibility of building RocksDB::Env on top
of the zfs libraries. It would have to align the ZIL (ZFS intent log) and the
RocksDB WAL; otherwise there is still the same problem as with XFS and RocksDB.

ZFS is a tree-style, log-structure-like file system: once a leaf block is
updated, the modification is propagated from the leaf to the root of the tree.
To batch writes and reduce the number of disk writes, ZFS persists
modifications to disk in 5-second transactions. Only when an fsync/sync write
arrives in the middle of the 5 seconds does ZFS persist the journal to the ZIL.
I remember that RocksDB does a sync after adding a log record, so if we cannot
align the ZIL and the WAL, the log write would first go to the ZIL, then the
ZIL would be applied to the log file, and finally RocksDB updates the sst
files. That is almost the same problem as with XFS, if my understanding is
correct.

In my mind, aligning the ZIL and the WAL needs more modifications in RocksDB.

Thanks
Javen



Re: Is BlueFS an alternative of BlueStore?

2016-01-07 Thread peng.hse

Hi Sage,

Thanks for your quick response. Javen and I, who were once ZFS developers, are
currently focusing on how to leverage some of the ZFS ideas to improve the Ceph
backend performance in userspace.

Based on your encouraging reply, we have come up with two schemes to continue
our future work:

1. Scheme one: use an entirely new FS to replace rocksdb+bluefs. The FS itself
handles the mapping of oid -> fs-object (a kind of zfs dnode) and the
corresponding attrs used by Ceph.
   Despite the implementation challenges you mentioned about the in-order
enumeration of objects during backfill, scrub, etc. (the same situation we
also confronted in ZFS, where the ZAP features helped us a lot), from a
performance and architecture point of view it looks cleaner. Would you suggest
we give it a try?

2. Scheme two: as you suspected at the end of your mail, temporarily implement
a simple version of the FS which leverages libzpool ideas to plug in underneath
rocksdb, as your bluefs did.

We would appreciate your insightful reply.

Thanks





two tarballs for ceph 10.0.1

2016-01-07 Thread Ken Dreyer
In http://download.ceph.com/tarballs/ , there's two tarballs:
"ceph_10.0.1.orig.tar.gz" and "ceph_10.0.1.orig.tar.gz.1"

Which one is correct? Can we delete one?

- Ken


Re: FreeBSD Building and Testing

2016-01-06 Thread Willem Jan Withagen

On 6-1-2016 08:51, Mykola Golub wrote:

On Mon, Dec 28, 2015 at 05:53:04PM +0100, Willem Jan Withagen wrote:

Hi,

Can somebody try to help me and explain why

in test: Func: test/mon/osd-crash
Func: TEST_crush_reject_empty started

Fails with a python error which sort of startles me:
test/mon/osd-crush.sh:227: TEST_crush_reject_empty:  local
empty_map=testdir/osd-crush/empty_map
test/mon/osd-crush.sh:228: TEST_crush_reject_empty:  :
test/mon/osd-crush.sh:229: TEST_crush_reject_empty:  ./crushtool -c
testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.map
test/mon/osd-crush.sh:230: TEST_crush_reject_empty:  expect_failure
testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1171: expect_failure:  local
dir=testdir/osd-crush
../qa/workunits/ceph-helpers.sh:1172: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1173: expect_failure:  local 'expected=Error
EINVAL'
../qa/workunits/ceph-helpers.sh:1174: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1175: expect_failure:  local success
../qa/workunits/ceph-helpers.sh:1176: expect_failure:  pwd
../qa/workunits/ceph-helpers.sh:1177: expect_failure:  printenv
../qa/workunits/ceph-helpers.sh:1178: expect_failure:  echo ./ceph osd
setcrushmap -i testdir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1180: expect_failure:  ./ceph osd
setcrushmap -i testdir/osd-crush/empty_map.map
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
Traceback (most recent call last):
   File "./ceph", line 936, in 
 retval = main()
   File "./ceph", line 874, in main
 sigdict, inbuf, verbose)
   File "./ceph", line 457, in new_style_command
 inbuf=inbuf)
   File "/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py",
line 1208, in json_command
 raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd
setcrushmap"}']": exception 'utf8' codec can't decode b
yte 0x86 in position 56: invalid start byte

Which is certainly not the type of error expected.
But it is hard to detect any 0x86 in the arguments.


Are you able to reproduce this problem manually? I.e. in src dir, start the
cluster using vstart.sh:

./vstart.sh -n

Check it is running:

./ceph -s

Repeat the test:

truncate -s 0 empty_map.txt
./crushtool -c empty_map.txt -o empty_map.map
./ceph osd setcrushmap -i empty_map.map

Expected output:

  "Error EINVAL: Failed crushmap test: ./crushtool: exit status: 1"



Hi all,

I've spent the Xmas days trying to learn more about Python
(and catching up with old friends :) ).

My background is from the days of assembler, shell scripts, C, Perl and the
like, so this pony had to learn a few new tricks (aka a new language).
I'm now trying to get the Python nose tests to actually work.

In the meantime I also found that FreeBSD has patches for Googletest
that make most of the DEATH tests work.

I think this Python stream parse error got resolved by rebuilding everything,
including the complete package environment, and upgrading the kernel and
tools... :) which I think cleaned out the Python environment, which was a bit
mixed up between different versions.

Now test/mon/osd-crush.sh returns OK, so I guess the setup of the environment
is relatively critical.

I also noted that some of the tests get more of their subtests done IF I run
them under root privileges.

The last test run resulted in:
============================================
   ceph 10.0.1: src/test-suite.log
============================================

# TOTAL: 120
# PASS:  110
# SKIP:  0
# XFAIL: 0
# FAIL:  10
# XPASS: 0
# ERROR: 0

FAIL ceph-detect-init/run-tox.sh (exit status: 1)
FAIL test/run-rbd-unit-tests.sh (exit status: 138)
FAIL test/ceph_objectstore_tool.py (exit status: 1)
FAIL test/cephtool-test-mon.sh (exit status: 1)
FAIL test/cephtool-test-rados.sh (exit status: 1)
FAIL test/libradosstriper/rados-striper.sh (exit status: 1)
FAIL test/test_objectstore_memstore.sh (exit status: 127)
FAIL test/ceph-disk.sh (exit status: 1)
FAIL test/pybind/test_ceph_argparse.py (exit status: 127)
FAIL test/pybind/test_ceph_daemon.py (exit status: 127)

where the first one and the last two actually don't work because of Python
things that are not working on FreeBSD and that I still have to sort out:
ceph_detect_init.exc.UnsupportedPlatform: Platform is not supported.:
../test-driver: ./test/pybind/test_ceph_argparse.py: not found
FAIL test/pybind/test_ceph_argparse.py (exit status: 127)

I also have:
./test/test_objectstore_memstore.sh: ./ceph_test_objectstore: not found
FAIL test/test_objectstore_memstore.sh (exit status: 127)

which is a weird one that needs some TLC.

So I'm slowly getting there...

--WjW


Re: FreeBSD Building and Testing

2016-01-06 Thread Willem Jan Withagen

On 5-1-2016 19:23, Gregory Farnum wrote:

On Mon, Dec 28, 2015 at 8:53 AM, Willem Jan Withagen  wrote:

Hi,

Can somebody try to help me and explain why

in test: Func: test/mon/osd-crash
Func: TEST_crush_reject_empty started

Fails with a python error which sort of startles me:
test/mon/osd-crush.sh:227: TEST_crush_reject_empty:  local
empty_map=testdir/osd-crush/empty_map
test/mon/osd-crush.sh:228: TEST_crush_reject_empty:  :
test/mon/osd-crush.sh:229: TEST_crush_reject_empty:  ./crushtool -c
testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.map
test/mon/osd-crush.sh:230: TEST_crush_reject_empty:  expect_failure
testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1171: expect_failure:  local
dir=testdir/osd-crush
../qa/workunits/ceph-helpers.sh:1172: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1173: expect_failure:  local 'expected=Error
EINVAL'
../qa/workunits/ceph-helpers.sh:1174: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1175: expect_failure:  local success
../qa/workunits/ceph-helpers.sh:1176: expect_failure:  pwd
../qa/workunits/ceph-helpers.sh:1177: expect_failure:  printenv
../qa/workunits/ceph-helpers.sh:1178: expect_failure:  echo ./ceph osd
setcrushmap -i testdir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1180: expect_failure:  ./ceph osd
setcrushmap -i testdir/osd-crush/empty_map.map
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
Traceback (most recent call last):
   File "./ceph", line 936, in 
 retval = main()
   File "./ceph", line 874, in main
 sigdict, inbuf, verbose)
   File "./ceph", line 457, in new_style_command
 inbuf=inbuf)
   File "/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py",
line 1208, in json_command
 raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd
setcrushmap"}']": exception 'utf8' codec can't decode b
yte 0x86 in position 56: invalid start byte

Which is certainly not the type of error expected.
But it is hard to detect any 0x86 in the arguments.

And yes python is right, there are no UTF8 sequences that start with 0x86.
Question is:
 Why does it want to parse with UTF8?
 And how do I switch it off?
 Or how to I fix this error?


I've not handled this myself but we've seen this a few times. The
latest example in a quick email search was
http://tracker.ceph.com/issues/9405, and it was apparently having a
string which wasn't null-terminated.



Looks like in my case it was due to too big a mess in the Python environment.

But I'll keep this in mind, in case it comes back to haunt me again.

Thanx,
--WjW


Stable releases preparation temporarily stalled

2016-01-06 Thread Loic Dachary
Hi,

The stable releases (hammer, infernalis) did not make progress in the past few 
weeks because we can't run tests.

Before xmas the following happened:

* the sepia lab was migrated and we discovered the OpenStack teuthology backend 
can't run without it (that was a problem during a few days only)
* there are OpenStack-specific failures in each teuthology suite and it is
non-trivial to separate them from genuine backport errors
* the make check bot went down (it was partially running on my private hardware)

If we just wait, I'm not sure when we will be able to resume our work because:

* the sepia lab is back but has less horsepower than it did
* not all of us have access to the sepia lab
* the make check bot is being worked on by the infrastructure team but it is 
low priority and it may take weeks before it's back online
* the ceph-qa-suite errors that are OpenStack-specific are low priority and
may never be fixed

I think we should rely on the sepia lab for testing for the foreseeable future 
and wait for the make check bot to be back. Tests will take a long time to run, 
but we've been able to work with a one week delay before so it's not a blocker.

Although fixing OpenStack specific errors would allow us to use the teuthology 
OpenStack backend (I will fix the last error left in the rados suite), it is 
unrealistic to set that as a requirement to run tests: we have neither the
workforce nor the skills to do that. Hopefully, some time in the future, Ceph
developers will  use ceph-qa-suite on OpenStack as part of the development 
workflow. But right now running ceph-qa-suite on OpenStack suites is outside of 
the development workflow and in a state of continuous regression which is 
inconvenient for us because we need something stable to compare the runs from 
the integration branch.

Fixing the make check bot is a two-part problem. Each failed run must be looked
at to chase false negatives (continuous integration with false negatives is a
plague), which I did in the past year on a daily basis and am happy to keep
doing. Before the xmas break the bot running at jenkins.ceph.com sent over 90%
false negatives, primarily because it was trying to run on unsupported
operating systems, and it was stopped until this is fixed. It also appears that
the machine running the bot is not re-imaged after each test, meaning a bogus
run may taint all future tests and create a continuous flow of false negatives.
Addressing these two issues requires knowing or learning about the Ceph jenkins
setup and slave provisioning. This is probably a few days of work, which is why
the infrastructure team can't resolve it immediately.

If you have alternative creative ideas on how to improve the current situation, 
please speak up :-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



Re: FreeBSD Building and Testing

2016-01-06 Thread Willem Jan Withagen
On 6-1-2016 08:51, Mykola Golub wrote:
> 
> Are you able to reproduce this problem manually? I.e. in src dir, start the
> cluster using vstart.sh:
> 
> ./vstart.sh -n
> 
> Check it is running:
> 
> ./ceph -s
> 
> Repeat the test:
> 
> truncate -s 0 empty_map.txt
> ./crushtool -c empty_map.txt -o empty_map.map
> ./ceph osd setcrushmap -i empty_map.map
> 
> Expected output:
> 
>  "Error EINVAL: Failed crushmap test: ./crushtool: exit status: 1"
> 

Oke thanx

Nice to have some of these examples...

--WjW



01/06/2016 Weekly Ceph Performance Meeting IS ON!

2016-01-06 Thread Mark Nelson
8AM PST as usual (ie in 18 minutes)! Discussion topics today include 
bluestore testing results and a potential performance regression in 
CentOS/RHEL 7.1 kernels.  Please feel free to add your own topics!


Here's the links:

Etherpad URL:
http://pad.ceph.com/p/performance_weekly

To join the Meeting:
https://bluejeans.com/268261044

To join via Browser:
https://bluejeans.com/268261044/browser

To join with Lync:
https://bluejeans.com/268261044/lync


To join via Room System:
Video Conferencing System: bjn.vc -or- 199.48.152.152
Meeting ID: 268261044

To join via Phone:
1) Dial:
  +1 408 740 7256
  +1 888 240 2560(US Toll Free)
  +1 408 317 9253(Alternate Number)
  (see all numbers - http://bluejeans.com/numbers)
2) Enter Conference ID: 268261044

Mark


Is BlueFS an alternative of BlueStore?

2016-01-06 Thread Javen Wu

Hi Sage,

Sorry to bother you. I am not sure if it is appropriate to send email to you
directly, but I cannot find any useful information to address my confusion
on the Internet. I hope you can help me.

I happened to hear that you are going to start BlueFS to eliminate the
redundancy between the XFS journal and the RocksDB WAL. I am a little confused.
Is BlueFS only there to host RocksDB for BlueStore, or is it an
alternative to BlueStore?

I am a newcomer to CEPH, and I am not sure my understanding of BlueStore
is correct. BlueStore in my mind is as below.

  BlueStore
  =========
    RocksDB
 +-----------+  +-----------+
 |   onode   |  |           |
 |    WAL    |  |           |
 |   omap    |  |           |
 +-----------+  |   bdev    |
 |           |  |           |
 |    XFS    |  |           |
 |           |  |           |
 +-----------+  +-----------+

I am curious how BlueFS is able to host RocksDB, since it is already a
"filesystem" which has to maintain blockmap-style metadata on its own
WITHOUT the help of RocksDB. When BlueFS is introduced into the picture,
why is RocksDB still needed? So I guess BlueFS is an alternative to BlueStore,
a new ObjectStore that does not leverage RocksDB.

Is my understanding correct?

The reason we care about the intention and the design target of BlueFS is
that I had a discussion with my partner Peng.Hse about an idea to introduce a
new ObjectStore using the ZFS library. I know CEPH supports ZFS as a FileStore
backend already, but we had a different, immature idea: use libzpool to
implement a new ObjectStore for CEPH entirely in userspace, without the SPL
and ZOL kernel modules, so that we can align the CEPH transaction and the ZFS
transaction in order to avoid the double write for the CEPH journal.
The ZFS core part, libzpool (DMU, metaslab etc.), offers a dnode object store
and is platform-independent (kernel/user). Another benefit of the idea is that
we can extend our metadata without bothering any DBStore.

Frankly, we are not sure if our idea is realistic so far, but when I heard of
BlueFS, I thought we needed to know the BlueFS design goal.

Thanks
Javen


Re: 01/06/2016 Weekly Ceph Performance Meeting IS ON!

2016-01-06 Thread Robert LeBlanc

The last recording I'm seeing is for 10/07/15. Can we get the newer ones?

Thanks,
--
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jan 6, 2016 at 8:43 AM, Mark Nelson  wrote:
> 8AM PST as usual (ie in 18 minutes)! Discussion topics today include
> bluestore testing results and a potential performance regression in
> CentOS/RHEL 7.1 kernels.  Please feel free to add your own topics!
>
> Here's the links:
>
> Etherpad URL:
> http://pad.ceph.com/p/performance_weekly
>
> To join the Meeting:
> https://bluejeans.com/268261044
>
> To join via Browser:
> https://bluejeans.com/268261044/browser
>
> To join with Lync:
> https://bluejeans.com/268261044/lync
>
>
> To join via Room System:
> Video Conferencing System: bjn.vc -or- 199.48.152.152
> Meeting ID: 268261044
>
> To join via Phone:
> 1) Dial:
>   +1 408 740 7256
>   +1 888 240 2560(US Toll Free)
>   +1 408 317 9253(Alternate Number)
>   (see all numbers - http://bluejeans.com/numbers)
> 2) Enter Conference ID: 268261044
>
> Mark



Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy

2016-01-05 Thread Alfredo Deza
This is odd. We are signing all packages before publishing them on the
repository. These ceph-deploy releases are following a new release process,
so I will have to investigate where the disconnect is.

Thanks for letting us know.

On Tue, Jan 5, 2016 at 10:31 AM, Derek Yarnell  wrote:
> It looks like the ceph-deploy > 1.5.28 packages in the
> http://download.ceph.com/rpm-hammer/el6 and
> http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP
> signed.  What happened?  This is causing our yum updates to fail but may
> be a sign of something much more nefarious?
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.28-0.noarch.rpm
> 89021c040001020006050255fae0d5000a0910e84ac2c0460f3994203610009e284c0c6749f9d1ccd54aca8668e5f4148eb60f0ade762a5cb316926060d73a82490c41b8a5e9a5ebb8a7136a5ce294565cf8548dce160f7a577b623f12fb841b1656fba0b139404b4a074c076abf8c38f176bbecfc551567d22826d6c3ac2a67d8c8f4db67e3a2566272f492f3a1461b2c80bfc56f0c29e3a0c0e03fe50ee877d2d2b99963ea876914f5d85ae6fcf60c7c372040fcc82591552af21e152a37ab4103c3116ccd3a5f10992dc9ec483922212ef8ad8c37abbb6a751f6da2cc79567ed45e7bcb83d92aecc2a61d7584699183622714376bf3766e8781c7675834cce7d3e6c349bee6992872248fe7dd9f00248806e0c99f1a7010a8e77d13fefffeb142c1ee4ee8e55e53043fb89b7127a1c2282f4ab0fa3d19eccaa38194aa42310860bdd7746de8512b106d7923e9da9d1ad84b4ba1f8a3175b808d08f99ca5b737d4a7cba1f165b815187bec9ff1e0b5627e435ed869ae0bb16419e928e1a64413bb4dd62a6b1b049faa02eaa14bd6636b5f835bfef16acfd2daad82c1fed57a5e635971281367d2fe99c3b2b542490559d9b9b3f4295c86185aa3c4b4014da55c1b0ff68bc42c869729fee29472c413c911ea9bc5d58957bfb670ddc54d28fd8f30444969b790e53f9d34a1b2df9b
>
>  e2afe9d26
> d5be57b9fcd659c4880fad613ba5f175e4e3466dba4919a4656ffd228688a9c81d865e6df870ba33bbfc000
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.29-0.noarch.rpm
> (none)
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.30-0.noarch.rpm
> (none)
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.28-0.noarch.rpm
> 89021c040001020006050255fadf42000a0910e84ac2c0460f39943b131000cb7f253c91019b2f5993fd232c4369003d521538aa19f996717d2eee780fe2d7ed4e969418ce92d6ad4be69b3c5421b80d2241a9d6e72e758ba86f0360e24aadd63d89165b47a566bcd8bed39d7b37e809d7afdf6b38e5e014f98caca6df7da6278822e2457c627cdba505febc23edb32447e11c2878e79bf5f5690def708ed7d79d261a839d5808b177cb3d6a8bc62317441f3e1b5cf986aeb5cde98fc986c42af2761418e7e83309df9b8703648a8e6eefe83f9d3cbcfe371bc336320657f86343ab25df8bd578203b6f312746ebbe0da195adeb1087487d12d530281b5328731c54240b0c5c01f1648c8802231876a33a0835a553e1b84e6d8a15acdd5db6b6bf9c6dee84b22ae0e70dc0cf2acdd5779e510a248844bba0af87ae8d5a874502ec0e48b235926222cf3386c44e30e3af14dea6134a5873784013297fa19a09f439bc8a2b73f563fc6e5cfa60767629a37f3cd24762f7b14e5f7ce08adeed82da3effc59298359a9f7f0efab0e4e808a33ceb07431530e0c279462da043bbece02d3fdf6a96e5a813eea0bf0f73e84b7fac6e28449e1bf15ddc2fa692f641ce8d4d9ed4261ba2824adee47dad90993ebc46d6ee083e92c8f76aaf8428e274e48cb1a91d0a2eb15e8779289b3771ef71
>
>  1cd9cc7f2
> 8f7a3cde708e4577b0aad546024ee98646f4f543ee1e33d8c96a93cff9b48deefa5b3996f659b16786ff016
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.29-0.noarch.rpm
> (none)
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.30-0.noarch.rpm
> (none)
>
> --
> Derek T. Yarnell
> University of Maryland
> Institute for Advanced Computer Studies


Re: FreeBSD Building and Testing

2016-01-05 Thread Gregory Farnum
On Mon, Dec 28, 2015 at 8:53 AM, Willem Jan Withagen  wrote:
> Hi,
>
> Can somebody try to help me and explain why
>
> in test: Func: test/mon/osd-crash
> Func: TEST_crush_reject_empty started
>
> Fails with a python error which sort of startles me:
> test/mon/osd-crush.sh:227: TEST_crush_reject_empty:  local
> empty_map=testdir/osd-crush/empty_map
> test/mon/osd-crush.sh:228: TEST_crush_reject_empty:  :
> test/mon/osd-crush.sh:229: TEST_crush_reject_empty:  ./crushtool -c
> testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.map
> test/mon/osd-crush.sh:230: TEST_crush_reject_empty:  expect_failure
> testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map
> ../qa/workunits/ceph-helpers.sh:1171: expect_failure:  local
> dir=testdir/osd-crush
> ../qa/workunits/ceph-helpers.sh:1172: expect_failure:  shift
> ../qa/workunits/ceph-helpers.sh:1173: expect_failure:  local 'expected=Error
> EINVAL'
> ../qa/workunits/ceph-helpers.sh:1174: expect_failure:  shift
> ../qa/workunits/ceph-helpers.sh:1175: expect_failure:  local success
> ../qa/workunits/ceph-helpers.sh:1176: expect_failure:  pwd
> ../qa/workunits/ceph-helpers.sh:1177: expect_failure:  printenv
> ../qa/workunits/ceph-helpers.sh:1178: expect_failure:  echo ./ceph osd
> setcrushmap -i testdir/osd-crush/empty_map.map
> ../qa/workunits/ceph-helpers.sh:1180: expect_failure:  ./ceph osd
> setcrushmap -i testdir/osd-crush/empty_map.map
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> Traceback (most recent call last):
>   File "./ceph", line 936, in <module>
> retval = main()
>   File "./ceph", line 874, in main
> sigdict, inbuf, verbose)
>   File "./ceph", line 457, in new_style_command
> inbuf=inbuf)
>   File "/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py",
> line 1208, in json_command
> raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
> RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd
> setcrushmap"}']": exception 'utf8' codec can't decode byte 0x86 in position 56: invalid start byte
>
> Which is certainly not the type of error expected.
> But it is hard to detect any 0x86 in the arguments.
>
> And yes python is right, there are no UTF8 sequences that start with 0x86.
> Question is:
> Why does it want to parse with UTF8?
> And how do I switch it off?
> Or how to I fix this error?

I've not handled this myself but we've seen this a few times. The
latest example in a quick email search was
http://tracker.ceph.com/issues/9405, and it was apparently having a
string which wasn't null-terminated.
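
For anyone who has not hit this class of bug before, here is a toy
illustration (nothing to do with the actual ceph_argparse code path) of how a
missing NUL terminator drags stale buffer bytes, such as 0x86, into a string
that a later utf-8 decode then chokes on:

#include <cstring>
#include <iostream>
#include <string>

int main() {
  char buf[32];
  std::memset(buf, 0x86, sizeof(buf));       // stale, non-utf8 garbage
  buf[sizeof(buf) - 1] = '\0';               // the only NUL is far past the payload
  const char payload[3] = {'o', 's', 'd'};   // intended content, no trailing '\0'
  std::memcpy(buf, payload, sizeof(payload));

  std::string wrong(buf);                    // strlen-based: 31 bytes, including
                                             // 28 bytes of 0x86 garbage
  std::string right(buf, sizeof(payload));   // explicit length: exactly "osd"

  std::cout << "wrong: " << wrong.size() << " bytes, right: "
            << right.size() << " bytes\n";
  return 0;
}

Passing the real length (or making sure the terminator is actually written)
keeps the garbage out of whatever eventually reaches the Python side.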
-Greg


CBT on an existing cluster

2016-01-05 Thread Deneau, Tom
Having trouble getting a reply from c...@cbt.com so trying ceph-devel list...

To get familiar with CBT, I first wanted to use it on an existing cluster.
(i.e., not have CBT do any cluster setup).

Is there a .yaml example that illustrates how to use cbt to run, for example,
its radosbench benchmark on an existing cluster?

-- Tom Deneau, AMD



Re: CBT on an existing cluster

2016-01-05 Thread Gregory Farnum
On Tue, Jan 5, 2016 at 9:56 AM, Deneau, Tom  wrote:
> Having trouble getting a reply from c...@cbt.com so trying ceph-devel list...
>
> To get familiar with CBT, I first wanted to use it on an existing cluster.
> (i.e., not have CBT do any cluster setup).
>
> Is there a .yaml example that illustrates how to use cbt to run for example, 
> its radosbench benchmark on an existing cluster?

I dunno anything about CBT, but I don't see any emails from you on
that list and the correct address is c...@lists.ceph.com (rather than
the other way around), so let's try that. :)
-Greg

PS: next reply drop ceph-devel, please!


Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy

2016-01-05 Thread Alfredo Deza
It looks like this was only for ceph-deploy in Hammer. I verified that
this wasn't the case in e.g. Infernalis

I have ensured that the ceph-deploy packages in hammer are in fact
signed and coming from our builds.

Thanks again for reporting this!

On Tue, Jan 5, 2016 at 12:27 PM, Alfredo Deza  wrote:
> This is odd. We are signing all packages before publishing them on the
> repository. These ceph-deploy releases are following a new release
> process so I will
> have to investigate where is the disconnect.
>
> Thanks for letting us know.
>
> On Tue, Jan 5, 2016 at 10:31 AM, Derek Yarnell  wrote:
>> It looks like the ceph-deploy > 1.5.28 packages in the
>> http://download.ceph.com/rpm-hammer/el6 and
>> http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP
>> signed.  What happened?  This is causing our yum updates to fail but may
>> be a sign of something much more nefarious?
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.28-0.noarch.rpm
>> 89021c040001020006050255fae0d5000a0910e84ac2c0460f3994203610009e284c0c6749f9d1ccd54aca8668e5f4148eb60f0ade762a5cb316926060d73a82490c41b8a5e9a5ebb8a7136a5ce294565cf8548dce160f7a577b623f12fb841b1656fba0b139404b4a074c076abf8c38f176bbecfc551567d22826d6c3ac2a67d8c8f4db67e3a2566272f492f3a1461b2c80bfc56f0c29e3a0c0e03fe50ee877d2d2b99963ea876914f5d85ae6fcf60c7c372040fcc82591552af21e152a37ab4103c3116ccd3a5f10992dc9ec483922212ef8ad8c37abbb6a751f6da2cc79567ed45e7bcb83d92aecc2a61d7584699183622714376bf3766e8781c7675834cce7d3e6c349bee6992872248fe7dd9f00248806e0c99f1a7010a8e77d13fefffeb142c1ee4ee8e55e53043fb89b7127a1c2282f4ab0fa3d19eccaa38194aa42310860bdd7746de8512b106d7923e9da9d1ad84b4ba1f8a3175b808d08f99ca5b737d4a7cba1f165b815187bec9ff1e0b5627e435ed869ae0bb16419e928e1a64413bb4dd62a6b1b049faa02eaa14bd6636b5f835bfef16acfd2daad82c1fed57a5e635971281367d2fe99c3b2b542490559d9b9b3f4295c86185aa3c4b4014da55c1b0ff68bc42c869729fee29472c413c911ea9bc5d58957bfb670ddc54d28fd8f30444969b790e53f9d34a1b2df9b
>>
>>  e2afe9d26
>> d5be57b9fcd659c4880fad613ba5f175e4e3466dba4919a4656ffd228688a9c81d865e6df870ba33bbfc000
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.29-0.noarch.rpm
>> (none)
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.30-0.noarch.rpm
>> (none)
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.28-0.noarch.rpm
>> 89021c040001020006050255fadf42000a0910e84ac2c0460f39943b131000cb7f253c91019b2f5993fd232c4369003d521538aa19f996717d2eee780fe2d7ed4e969418ce92d6ad4be69b3c5421b80d2241a9d6e72e758ba86f0360e24aadd63d89165b47a566bcd8bed39d7b37e809d7afdf6b38e5e014f98caca6df7da6278822e2457c627cdba505febc23edb32447e11c2878e79bf5f5690def708ed7d79d261a839d5808b177cb3d6a8bc62317441f3e1b5cf986aeb5cde98fc986c42af2761418e7e83309df9b8703648a8e6eefe83f9d3cbcfe371bc336320657f86343ab25df8bd578203b6f312746ebbe0da195adeb1087487d12d530281b5328731c54240b0c5c01f1648c8802231876a33a0835a553e1b84e6d8a15acdd5db6b6bf9c6dee84b22ae0e70dc0cf2acdd5779e510a248844bba0af87ae8d5a874502ec0e48b235926222cf3386c44e30e3af14dea6134a5873784013297fa19a09f439bc8a2b73f563fc6e5cfa60767629a37f3cd24762f7b14e5f7ce08adeed82da3effc59298359a9f7f0efab0e4e808a33ceb07431530e0c279462da043bbece02d3fdf6a96e5a813eea0bf0f73e84b7fac6e28449e1bf15ddc2fa692f641ce8d4d9ed4261ba2824adee47dad90993ebc46d6ee083e92c8f76aaf8428e274e48cb1a91d0a2eb15e8779289b3771ef71
>>
>>  1cd9cc7f2
>> 8f7a3cde708e4577b0aad546024ee98646f4f543ee1e33d8c96a93cff9b48deefa5b3996f659b16786ff016
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.29-0.noarch.rpm
>> (none)
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.30-0.noarch.rpm
>> (none)
>>
>> --
>> Derek T. Yarnell
>> University of Maryland
>> Institute for Advanced Computer Studies


Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy

2016-01-05 Thread Derek Yarnell
Hi Alfredo,

I am still having a bit of trouble though with what looks like the
1.5.31 release.  With a `yum update ceph-deploy` I get the following
even after a full `yum clean all`.

http://ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.31-0.noarch.rpm:
[Errno -1] Package does not match intended download. Suggestion: run yum
--enablerepo=Ceph-noarch clean metadata

Thanks,
derek

On 1/5/16 1:25 PM, Alfredo Deza wrote:
> It looks like this was only for ceph-deploy in Hammer. I verified that
> this wasn't the case in e.g. Infernalis
> 
> I have ensured that the ceph-deploy packages in hammer are in fact
> signed and coming from our builds.
> 
> Thanks again for reporting this!
> 
> On Tue, Jan 5, 2016 at 12:27 PM, Alfredo Deza  wrote:
>> This is odd. We are signing all packages before publishing them on the
>> repository. These ceph-deploy releases are following a new release
>> process so I will
>> have to investigate where is the disconnect.
>>
>> Thanks for letting us know.
>>
>> On Tue, Jan 5, 2016 at 10:31 AM, Derek Yarnell  wrote:
>>> It looks like the ceph-deploy > 1.5.28 packages in the
>>> http://download.ceph.com/rpm-hammer/el6 and
>>> http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP
>>> signed.  What happened?  This is causing our yum updates to fail but may
>>> be a sign of something much more nefarious?
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.28-0.noarch.rpm
>>> 89021c040001020006050255fae0d5000a0910e84ac2c0460f3994203610009e284c0c6749f9d1ccd54aca8668e5f4148eb60f0ade762a5cb316926060d73a82490c41b8a5e9a5ebb8a7136a5ce294565cf8548dce160f7a577b623f12fb841b1656fba0b139404b4a074c076abf8c38f176bbecfc551567d22826d6c3ac2a67d8c8f4db67e3a2566272f492f3a1461b2c80bfc56f0c29e3a0c0e03fe50ee877d2d2b99963ea876914f5d85ae6fcf60c7c372040fcc82591552af21e152a37ab4103c3116ccd3a5f10992dc9ec483922212ef8ad8c37abbb6a751f6da2cc79567ed45e7bcb83d92aecc2a61d7584699183622714376bf3766e8781c7675834cce7d3e6c349bee6992872248fe7dd9f00248806e0c99f1a7010a8e77d13fefffeb142c1ee4ee8e55e53043fb89b7127a1c2282f4ab0fa3d19eccaa38194aa42310860bdd7746de8512b106d7923e9da9d1ad84b4ba1f8a3175b808d08f99ca5b737d4a7cba1f165b815187bec9ff1e0b5627e435ed869ae0bb16419e928e1a64413bb4dd62a6b1b049faa02eaa14bd6636b5f835bfef16acfd2daad82c1fed57a5e635971281367d2fe99c3b2b542490559d9b9b3f4295c86185aa3c4b4014da55c1b0ff68bc42c869729fee29472c413c911ea9bc5d58957bfb670ddc54d28fd8f30444969b790e53f9d34a1b2
 
 df9b
>>>
>>>  e2afe9d26
>>> d5be57b9fcd659c4880fad613ba5f175e4e3466dba4919a4656ffd228688a9c81d865e6df870ba33bbfc000
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.29-0.noarch.rpm
>>> (none)
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.30-0.noarch.rpm
>>> (none)
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.28-0.noarch.rpm
>>> 89021c040001020006050255fadf42000a0910e84ac2c0460f39943b131000cb7f253c91019b2f5993fd232c4369003d521538aa19f996717d2eee780fe2d7ed4e969418ce92d6ad4be69b3c5421b80d2241a9d6e72e758ba86f0360e24aadd63d89165b47a566bcd8bed39d7b37e809d7afdf6b38e5e014f98caca6df7da6278822e2457c627cdba505febc23edb32447e11c2878e79bf5f5690def708ed7d79d261a839d5808b177cb3d6a8bc62317441f3e1b5cf986aeb5cde98fc986c42af2761418e7e83309df9b8703648a8e6eefe83f9d3cbcfe371bc336320657f86343ab25df8bd578203b6f312746ebbe0da195adeb1087487d12d530281b5328731c54240b0c5c01f1648c8802231876a33a0835a553e1b84e6d8a15acdd5db6b6bf9c6dee84b22ae0e70dc0cf2acdd5779e510a248844bba0af87ae8d5a874502ec0e48b235926222cf3386c44e30e3af14dea6134a5873784013297fa19a09f439bc8a2b73f563fc6e5cfa60767629a37f3cd24762f7b14e5f7ce08adeed82da3effc59298359a9f7f0efab0e4e808a33ceb07431530e0c279462da043bbece02d3fdf6a96e5a813eea0bf0f73e84b7fac6e28449e1bf15ddc2fa692f641ce8d4d9ed4261ba2824adee47dad90993ebc46d6ee083e92c8f76aaf8428e274e48cb1a91d0a2eb15e8779289b3771
 
 ef71
>>>
>>>  1cd9cc7f2
>>> 8f7a3cde708e4577b0aad546024ee98646f4f543ee1e33d8c96a93cff9b48deefa5b3996f659b16786ff016
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.29-0.noarch.rpm
>>> (none)
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.30-0.noarch.rpm
>>> (none)
>>>
>>> --
>>> Derek T. Yarnell
>>> University of Maryland
>>> Institute for Advanced Computer Studies

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies


PGP signatures for RHEL hammer RPMs for ceph-deploy

2016-01-05 Thread Derek Yarnell
It looks like the ceph-deploy > 1.5.28 packages in the
http://download.ceph.com/rpm-hammer/el6 and
http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP
signed.  What happened?  This is causing our yum updates to fail but may
be a sign of something much more nefarious?

# rpm -qp --queryformat %{SIGPGP}
http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.28-0.noarch.rpm
89021c040001020006050255fae0d5000a0910e84ac2c0460f3994203610009e284c0c6749f9d1ccd54aca8668e5f4148eb60f0ade762a5cb316926060d73a82490c41b8a5e9a5ebb8a7136a5ce294565cf8548dce160f7a577b623f12fb841b1656fba0b139404b4a074c076abf8c38f176bbecfc551567d22826d6c3ac2a67d8c8f4db67e3a2566272f492f3a1461b2c80bfc56f0c29e3a0c0e03fe50ee877d2d2b99963ea876914f5d85ae6fcf60c7c372040fcc82591552af21e152a37ab4103c3116ccd3a5f10992dc9ec483922212ef8ad8c37abbb6a751f6da2cc79567ed45e7bcb83d92aecc2a61d7584699183622714376bf3766e8781c7675834cce7d3e6c349bee6992872248fe7dd9f00248806e0c99f1a7010a8e77d13fefffeb142c1ee4ee8e55e53043fb89b7127a1c2282f4ab0fa3d19eccaa38194aa42310860bdd7746de8512b106d7923e9da9d1ad84b4ba1f8a3175b808d08f99ca5b737d4a7cba1f165b815187bec9ff1e0b5627e435ed869ae0bb16419e928e1a64413bb4dd62a6b1b049faa02eaa14bd6636b5f835bfef16acfd2daad82c1fed57a5e635971281367d2fe99c3b2b542490559d9b9b3f4295c86185aa3c4b4014da55c1b0ff68bc42c869729fee29472c413c911ea9bc5d58957bfb670ddc54d28fd8f30444969b790e53f9d34a1b2df9b
 
 e2afe9d26
d5be57b9fcd659c4880fad613ba5f175e4e3466dba4919a4656ffd228688a9c81d865e6df870ba33bbfc000

# rpm -qp --queryformat %{SIGPGP}
http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.29-0.noarch.rpm
(none)

# rpm -qp --queryformat %{SIGPGP}
http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.30-0.noarch.rpm
(none)

# rpm -qp --queryformat %{SIGPGP}
http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.28-0.noarch.rpm
89021c040001020006050255fadf42000a0910e84ac2c0460f39943b131000cb7f253c91019b2f5993fd232c4369003d521538aa19f996717d2eee780fe2d7ed4e969418ce92d6ad4be69b3c5421b80d2241a9d6e72e758ba86f0360e24aadd63d89165b47a566bcd8bed39d7b37e809d7afdf6b38e5e014f98caca6df7da6278822e2457c627cdba505febc23edb32447e11c2878e79bf5f5690def708ed7d79d261a839d5808b177cb3d6a8bc62317441f3e1b5cf986aeb5cde98fc986c42af2761418e7e83309df9b8703648a8e6eefe83f9d3cbcfe371bc336320657f86343ab25df8bd578203b6f312746ebbe0da195adeb1087487d12d530281b5328731c54240b0c5c01f1648c8802231876a33a0835a553e1b84e6d8a15acdd5db6b6bf9c6dee84b22ae0e70dc0cf2acdd5779e510a248844bba0af87ae8d5a874502ec0e48b235926222cf3386c44e30e3af14dea6134a5873784013297fa19a09f439bc8a2b73f563fc6e5cfa60767629a37f3cd24762f7b14e5f7ce08adeed82da3effc59298359a9f7f0efab0e4e808a33ceb07431530e0c279462da043bbece02d3fdf6a96e5a813eea0bf0f73e84b7fac6e28449e1bf15ddc2fa692f641ce8d4d9ed4261ba2824adee47dad90993ebc46d6ee083e92c8f76aaf8428e274e48cb1a91d0a2eb15e8779289b3771ef71
 
 1cd9cc7f2
8f7a3cde708e4577b0aad546024ee98646f4f543ee1e33d8c96a93cff9b48deefa5b3996f659b16786ff016

# rpm -qp --queryformat %{SIGPGP}
http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.29-0.noarch.rpm
(none)

# rpm -qp --queryformat %{SIGPGP}
http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.30-0.noarch.rpm
(none)

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


deprecation and build warnings

2016-01-05 Thread Gregory Farnum
I was annoyed again at our gitbuilders being all yellow because of
compile warnings so I went to check out how many of them are real and
how many of them are self-inflicted warnings. I just spot-checked
http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-tarball-trusty-amd64-basic/log.cgi?log=2694e1171f23166e8a11c57c7b284621498decd8,
but much to my pleasant surprise there are only two errors:

1) we have 16 uses of rados_ioctx_pool_required_alignment, which is deprecated.
2) we have two places where libec_isa.so, a loadable module, is linked against.

Both of these are contained entirely in our unit tests. I don't know
exactly what's going on with the second one, but I imagine it's not a
difficult fix? For the first one, can we just stop testing it? Or in
some way suppress the warning for those callers? I'd love to have some
green show up on the dashboard again, so that it's not too hard to
notice when we introduce actual build regressions. ;)
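
If suppressing the warning for those callers is the route taken, a minimal
sketch of what that could look like at a test call site (assuming GCC/Clang
pragmas and the librados C API; illustrative only, not a patch):

#include <rados/librados.h>

// Hedged sketch: keep exercising the deprecated call in the unit tests,
// but silence -Wdeprecated-declarations only around the legacy call site.
static uint64_t test_required_alignment(rados_ioctx_t ioctx)
{
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
  uint64_t alignment = rados_ioctx_pool_required_alignment(ioctx);
#pragma GCC diagnostic pop
  return alignment;
}

That would keep coverage of the deprecated API while letting the builders go
green again.
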
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Docs now building again

2016-01-05 Thread Dan Mick
https://github.com/ceph/ceph/pull/7119 fixed an issue preventing docs
from building.  Master is fixed; merge that into your branches if you
want working docs again.

-- 
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FreeBSD Building and Testing

2016-01-05 Thread Mykola Golub
On Mon, Dec 28, 2015 at 05:53:04PM +0100, Willem Jan Withagen wrote:
> Hi,
> 
> Can somebody try to help me and explain why
> 
> in test: Func: test/mon/osd-crash
> Func: TEST_crush_reject_empty started
> 
> Fails with a python error which sort of startles me:
> test/mon/osd-crush.sh:227: TEST_crush_reject_empty:  local
> empty_map=testdir/osd-crush/empty_map
> test/mon/osd-crush.sh:228: TEST_crush_reject_empty:  :
> test/mon/osd-crush.sh:229: TEST_crush_reject_empty:  ./crushtool -c
> testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.m
> ap
> test/mon/osd-crush.sh:230: TEST_crush_reject_empty:  expect_failure
> testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testd
> ir/osd-crush/empty_map.map
> ../qa/workunits/ceph-helpers.sh:1171: expect_failure:  local
> dir=testdir/osd-crush
> ../qa/workunits/ceph-helpers.sh:1172: expect_failure:  shift
> ../qa/workunits/ceph-helpers.sh:1173: expect_failure:  local 'expected=Error
> EINVAL'
> ../qa/workunits/ceph-helpers.sh:1174: expect_failure:  shift
> ../qa/workunits/ceph-helpers.sh:1175: expect_failure:  local success
> ../qa/workunits/ceph-helpers.sh:1176: expect_failure:  pwd
> ../qa/workunits/ceph-helpers.sh:1177: expect_failure:  printenv
> ../qa/workunits/ceph-helpers.sh:1178: expect_failure:  echo ./ceph osd
> setcrushmap -i testdir/osd-crush/empty_map.map
> ../qa/workunits/ceph-helpers.sh:1180: expect_failure:  ./ceph osd
> setcrushmap -i testdir/osd-crush/empty_map.map
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> Traceback (most recent call last):
>   File "./ceph", line 936, in 
> retval = main()
>   File "./ceph", line 874, in main
> sigdict, inbuf, verbose)
>   File "./ceph", line 457, in new_style_command
> inbuf=inbuf)
>   File "/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py",
> line 1208, in json_command
> raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
> RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd
> setcrushmap"}']": exception 'utf8' codec can't decode b
> yte 0x86 in position 56: invalid start byte
> 
> Which is certainly not the type of error expected.
> But it is hard to detect any 0x86 in the arguments.

Are you able to reproduce this problem manually? I.e. in src dir, start the
cluster using vstart.sh:

./vstart.sh -n

Check it is running:

./ceph -s

Repeat the test:

truncate -s 0 empty_map.txt
./crushtool -c empty_map.txt -o empty_map.map
./ceph osd setcrushmap -i empty_map.map

Expected output:

 "Error EINVAL: Failed crushmap test: ./crushtool: exit status: 1"

-- 
Mykola Golub
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy

2016-01-05 Thread Alfredo Deza
It seems that the metadata didn't get updated.

I just tried it out and got the right version with no issues. Hopefully
*this* time it works for you.

Sorry for all the troubles

On Tue, Jan 5, 2016 at 3:21 PM, Derek Yarnell  wrote:
> Hi Alfredo,
>
> I am still having a bit of trouble though with what looks like the
> 1.5.31 release.  With a `yum update ceph-deploy` I get the following
> even after a full `yum clean all`.
>
> http://ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.31-0.noarch.rpm:
> [Errno -1] Package does not match intended download. Suggestion: run yum
> --enablerepo=Ceph-noarch clean metadata
>
> Thanks,
> derek
>
> On 1/5/16 1:25 PM, Alfredo Deza wrote:
>> It looks like this was only for ceph-deploy in Hammer. I verified that
>> this wasn't the case in e.g. Infernalis
>>
>> I have ensured that the ceph-deploy packages in hammer are in fact
>> signed and coming from our builds.
>>
>> Thanks again for reporting this!
>>
>> On Tue, Jan 5, 2016 at 12:27 PM, Alfredo Deza  wrote:
>>> This is odd. We are signing all packages before publishing them on the
>>> repository. These ceph-deploy releases are following a new release
>>> process so I will
>>> have to investigate where the disconnect is.
>>>
>>> Thanks for letting us know.
>>>
>>> On Tue, Jan 5, 2016 at 10:31 AM, Derek Yarnell  wrote:
 It looks like the ceph-deploy > 1.5.28 packages in the
 http://download.ceph.com/rpm-hammer/el6 and
 http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP
 signed.  What happened?  This is causing our yum updates to fail but may
 be a sign of something much more nefarious?

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.28-0.noarch.rpm
 89021c040001020006050255fae0d5000a0910e84ac2c0460f3994203610009e284c0c6749f9d1ccd54aca8668e5f4148eb60f0ade762a5cb316926060d73a82490c41b8a5e9a5ebb8a7136a5ce294565cf8548dce160f7a577b623f12fb841b1656fba0b139404b4a074c076abf8c38f176bbecfc551567d22826d6c3ac2a67d8c8f4db67e3a2566272f492f3a1461b2c80bfc56f0c29e3a0c0e03fe50ee877d2d2b99963ea876914f5d85ae6fcf60c7c372040fcc82591552af21e152a37ab4103c3116ccd3a5f10992dc9ec483922212ef8ad8c37abbb6a751f6da2cc79567ed45e7bcb83d92aecc2a61d7584699183622714376bf3766e8781c7675834cce7d3e6c349bee6992872248fe7dd9f00248806e0c99f1a7010a8e77d13fefffeb142c1ee4ee8e55e53043fb89b7127a1c2282f4ab0fa3d19eccaa38194aa42310860bdd7746de8512b106d7923e9da9d1ad84b4ba1f8a3175b808d08f99ca5b737d4a7cba1f165b815187bec9ff1e0b5627e435ed869ae0bb16419e928e1a64413bb4dd62a6b1b049faa02eaa14bd6636b5f835bfef16acfd2daad82c1fed57a5e635971281367d2fe99c3b2b542490559d9b9b3f4295c86185aa3c4b4014da55c1b0ff68bc42c869729fee29472c413c911ea9bc5d58957bfb670ddc54d28fd8f30444969b790e53f9d34a1b2
>
>  df9b

  e2afe9d26
 d5be57b9fcd659c4880fad613ba5f175e4e3466dba4919a4656ffd228688a9c81d865e6df870ba33bbfc000

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.29-0.noarch.rpm
 (none)

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.30-0.noarch.rpm
 (none)

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.28-0.noarch.rpm
 89021c040001020006050255fadf42000a0910e84ac2c0460f39943b131000cb7f253c91019b2f5993fd232c4369003d521538aa19f996717d2eee780fe2d7ed4e969418ce92d6ad4be69b3c5421b80d2241a9d6e72e758ba86f0360e24aadd63d89165b47a566bcd8bed39d7b37e809d7afdf6b38e5e014f98caca6df7da6278822e2457c627cdba505febc23edb32447e11c2878e79bf5f5690def708ed7d79d261a839d5808b177cb3d6a8bc62317441f3e1b5cf986aeb5cde98fc986c42af2761418e7e83309df9b8703648a8e6eefe83f9d3cbcfe371bc336320657f86343ab25df8bd578203b6f312746ebbe0da195adeb1087487d12d530281b5328731c54240b0c5c01f1648c8802231876a33a0835a553e1b84e6d8a15acdd5db6b6bf9c6dee84b22ae0e70dc0cf2acdd5779e510a248844bba0af87ae8d5a874502ec0e48b235926222cf3386c44e30e3af14dea6134a5873784013297fa19a09f439bc8a2b73f563fc6e5cfa60767629a37f3cd24762f7b14e5f7ce08adeed82da3effc59298359a9f7f0efab0e4e808a33ceb07431530e0c279462da043bbece02d3fdf6a96e5a813eea0bf0f73e84b7fac6e28449e1bf15ddc2fa692f641ce8d4d9ed4261ba2824adee47dad90993ebc46d6ee083e92c8f76aaf8428e274e48cb1a91d0a2eb15e8779289b3771
>
>  ef71

  1cd9cc7f2
 8f7a3cde708e4577b0aad546024ee98646f4f543ee1e33d8c96a93cff9b48deefa5b3996f659b16786ff016

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.29-0.noarch.rpm
 (none)

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.30-0.noarch.rpm
 (none)

 --
 Derek T. Yarnell
 University of Maryland
 Institute for Advanced Computer Studies
 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 

Re: Long peering - throttle at FileStore::queue_transactions

2016-01-05 Thread Guang Yang
On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil  wrote:
> On Mon, 4 Jan 2016, Guang Yang wrote:
>> Hi Cephers,
>> Happy New Year! I got question regards to the long PG peering..
>>
>> Over the last several days I have been looking into the *long peering*
>> problem when we start a OSD / OSD host, what I observed was that the
>> two peering working threads were throttled (stuck) when trying to
>> queue new transactions (writing pg log), thus the peering process are
>> dramatically slow down.
>>
>> The first question came to me was, what were the transactions in the
>> queue? The major ones, as I saw, included:
>>
>> - The osd_map and incremental osd_map, this happens if the OSD had
>> been down for a while (in a large cluster), or when the cluster got
>> upgrade, which made the osd_map epoch the down OSD had, was far behind
>> the latest osd_map epoch. During the OSD booting, it would need to
>> persist all those osd_maps and generate lots of filestore transactions
>> (linear with the epoch gap).
>> > As the PG was not involved in most of those epochs, could we only take and 
>> > persist those osd_maps which matter to the PGs on the OSD?
>
> This part should happen before the OSD sends the MOSDBoot message, before
> anyone knows it exists.  There is a tunable threshold that controls how
> recent the map has to be before the OSD tries to boot.  If you're
> seeing this in the real world, we probably just need to adjust that value
> way down to something small(er).
It queues the transactions and then sends out MOSDBoot, so there is
still a chance of contention with the peering ops (especially on large
clusters, where lots of activity generates many osdmap epochs). Any
chance we can change *queue_transactions* to *apply_transactions*, so
that we block there waiting for the osdmap to be persisted? At least we
may be able to do that during OSD boot. The concern is that if the OSD
is active, apply_transaction would take longer while holding the
osd_lock. Also, I can't find such a tunable; could you elaborate? Thanks!
>
> sage
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


hammer mon failure

2016-01-05 Thread Samuel Just
http://tracker.ceph.com/issues/14236

New hammer mon failure in the nightlies (missing a map apparently?),
can you take a look?
-Sam
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hammer mon failure

2016-01-05 Thread Joao Eduardo Luis
On 01/05/2016 07:55 PM, Samuel Just wrote:
> http://tracker.ceph.com/issues/14236
> 
> New hammer mon failure in the nightlies (missing a map apparently?),
> can you take a look?
> -Sam

Will do.

  -Joao
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Is rbd map/unmap op. configured like an event?

2016-01-04 Thread Wukongming
Hi All,

Is the rbd map/unmap operation configured as an event in the /etc/init
directory, so that we can use the init system (upstart) to manage it
automatically?

-
wukongming ID: 12019
Tel:0571-86760239
Dept:2014 UIS2 ONEStor


-
This e-mail and its attachments contain confidential information from H3C, 
which is
intended only for the person or entity whose address is listed above. Any use 
of the
information contained herein in any way (including, but not limited to, total 
or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify 
the sender
by phone or email immediately and delete it!


Long peering - throttle at FileStore::queue_transactions

2016-01-04 Thread Guang Yang
Hi Cephers,
Happy New Year! I have a question regarding the long PG peering.

Over the last several days I have been looking into the *long peering*
problem when we start an OSD / OSD host. What I observed was that the
two peering worker threads were throttled (stuck) when trying to
queue new transactions (writing the pg log), so the peering process
slows down dramatically.

The first question came to me was, what were the transactions in the
queue? The major ones, as I saw, included:

- The osd_map and incremental osd_map. This happens if the OSD had
been down for a while (in a large cluster), or when the cluster got
upgraded, which left the osd_map epoch the down OSD had far behind
the latest osd_map epoch. During OSD boot, it would need to
persist all those osd_maps and generate lots of filestore transactions
(linear with the epoch gap).
> As the PG was not involved in most of those epochs, could we only take and
> persist those osd_maps which matter to the PGs on the OSD?

- There are lots of deletion transactions: as a PG boots, it
needs to merge the PG log from its peers, and for each deletion PG log
entry it would need to queue the deletion transaction immediately.
> Could we delay queueing those transactions until all PGs on the host are
> peered?

Thanks,
Guang
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD data file are OSD logs

2016-01-04 Thread Samuel Just
IIRC, you are running giant.  I think that's the log rotate dangling
fd bug (not fixed in giant since giant is eol).  Fixed upstream
8778ab3a1ced7fab07662248af0c773df759653d, firefly backport is
b8e3f6e190809febf80af66415862e7c7e415214.
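
(For anyone hitting this cold, a toy illustration of the dangling-fd
mechanism, not the actual OSD or logging code: once the rotated log fd is
closed, the kernel hands the same number to the next open(), so a writer
still holding the stale number writes log text into the new file.)

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main() {
  int log_fd = open("osd.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
  close(log_fd);                        // rotation closes the log fd ...
  int data_fd = open("object_data", O_CREAT | O_WRONLY, 0644);
  (void)data_fd;                        // ... and open() reuses the lowest
                                        // free number, i.e. the old log_fd
  const char* line = "2016-01-03 07:30:01.600119 filestore getattrs ...\n";
  write(log_fd, line, strlen(line));    // stale fd now points at object_data
  return 0;
}
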
-Sam

On Mon, Jan 4, 2016 at 3:37 PM, Guang Yang  wrote:
> Hi Cephers,
> Before I open a tracker, I would like check if it is a known issue or not..
>
> One one of our clusters, there was OSD crash during repairing,  the
> crash happened after we issued a PG repair for inconsistent PGs, which
> failed because the recorded file size (within xattr) mismatched with
> the actual file size.
>
> The mismatch was caused by the fact that the content of the data file
> are OSD logs, following is from osd.354 on c003:
>
> -rw-r--r-- 1 yahoo root  75168 Jan  3 07:30
> default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7
> -bash-4.1$ head
> "default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7"
> 2016-01-03 07:30:01.600119 7f7fe2096700 15
> filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs
> 3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7
> 2016-01-03 07:30:01.604967 7f7fe2096700 10
> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, len is 494
> 2016-01-03 07:30:01.604984 7f7fe2096700 10
> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, got 247
> 2016-01-03 07:30:01.604986 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> '_user.rgw.idtag'
> 2016-01-03 07:30:01.604996 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_'
> 2016-01-03 07:30:01.605007 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> 'snapset'
> 2016-01-03 07:30:01.605013 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> '_user.rgw.manifest'
> 2016-01-03 07:30:01.605026 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> 'hinfo_key'
> 2016-01-03 07:30:01.605042 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> '_user.rgw.x-amz-meta-origin'
> 2016-01-03 07:30:01.605049 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> '_user.rgw.acl'
>
>
> This only happens on the clusters we turned on the verbose log
> (debug_osd/filestore=20). And we are running ceph v0.87.
>
> Thanks,
> Guang
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD data file are OSD logs

2016-01-04 Thread Guang Yang
Thanks Sam for the confirmation.

Thanks,
Guang

On Mon, Jan 4, 2016 at 3:59 PM, Samuel Just  wrote:
> IIRC, you are running giant.  I think that's the log rotate dangling
> fd bug (not fixed in giant since giant is eol).  Fixed upstream
> 8778ab3a1ced7fab07662248af0c773df759653d, firefly backport is
> b8e3f6e190809febf80af66415862e7c7e415214.
> -Sam
>
> On Mon, Jan 4, 2016 at 3:37 PM, Guang Yang  wrote:
>> Hi Cephers,
>> Before I open a tracker, I would like check if it is a known issue or not..
>>
>> One one of our clusters, there was OSD crash during repairing,  the
>> crash happened after we issued a PG repair for inconsistent PGs, which
>> failed because the recorded file size (within xattr) mismatched with
>> the actual file size.
>>
>> The mismatch was caused by the fact that the content of the data file
>> are OSD logs, following is from osd.354 on c003:
>>
>> -rw-r--r-- 1 yahoo root  75168 Jan  3 07:30
>> default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7
>> -bash-4.1$ head
>> "default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7"
>> 2016-01-03 07:30:01.600119 7f7fe2096700 15
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs
>> 3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7
>> 2016-01-03 07:30:01.604967 7f7fe2096700 10
>> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, len is 494
>> 2016-01-03 07:30:01.604984 7f7fe2096700 10
>> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, got 247
>> 2016-01-03 07:30:01.604986 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.idtag'
>> 2016-01-03 07:30:01.604996 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_'
>> 2016-01-03 07:30:01.605007 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> 'snapset'
>> 2016-01-03 07:30:01.605013 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.manifest'
>> 2016-01-03 07:30:01.605026 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> 'hinfo_key'
>> 2016-01-03 07:30:01.605042 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.x-amz-meta-origin'
>> 2016-01-03 07:30:01.605049 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.acl'
>>
>>
>> This only happens on the clusters we turned on the verbose log
>> (debug_osd/filestore=20). And we are running ceph v0.87.
>>
>> Thanks,
>> Guang
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


OSD data file are OSD logs

2016-01-04 Thread Guang Yang
Hi Cephers,
Before I open a tracker, I would like to check whether this is a known issue or not.

On one of our clusters, there was an OSD crash during repair; the
crash happened after we issued a PG repair for inconsistent PGs, which
failed because the recorded file size (within the xattr) mismatched
the actual file size.

The mismatch was caused by the fact that the content of the data file
is OSD logs; the following is from osd.354 on c003:

-rw-r--r-- 1 yahoo root  75168 Jan  3 07:30
default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7
-bash-4.1$ head
"default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7"
2016-01-03 07:30:01.600119 7f7fe2096700 15
filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs
3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7
2016-01-03 07:30:01.604967 7f7fe2096700 10
filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, len is 494
2016-01-03 07:30:01.604984 7f7fe2096700 10
filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, got 247
2016-01-03 07:30:01.604986 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.idtag'
2016-01-03 07:30:01.604996 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_'
2016-01-03 07:30:01.605007 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'snapset'
2016-01-03 07:30:01.605013 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.manifest'
2016-01-03 07:30:01.605026 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'hinfo_key'
2016-01-03 07:30:01.605042 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.x-amz-meta-origin'
2016-01-03 07:30:01.605049 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.acl'


This only happens on the clusters we turned on the verbose log
(debug_osd/filestore=20). And we are running ceph v0.87.

Thanks,
Guang
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Long peering - throttle at FileStore::queue_transactions

2016-01-04 Thread Samuel Just
We need every OSDMap persisted before persisting later ones because we
rely on there being no holes for a bunch of reasons.

The deletion transactions are more interesting.  It's not part of the
boot process, these are deletions resulting from merging in a log from
a peer which logically removed an object.  It's more noticeable on
boot because all PGs will see these operations at once (if there are a
bunch of deletes happening).  We need to process these transactions
before we can serve reads (before we activate) currently since we use
the on disk state (modulo the objectcontext locks) as authoritative.
That transaction iirc also contains the updated PGLog.  We can't avoid
writing down the PGLog prior to activation, but we *can* delay the
deletes (and even batch/throttle them) if we do some work:
1) During activation, we need to maintain a set of to-be-deleted
objects.  For each of these objects, we need to populate the
objectcontext cache with an exists=false objectcontext so that we
don't erroneously read the deleted data.  Each of the entries in the
to-be-deleted object set would have a reference to the context to keep
it alive until the deletion is processed.
2) Any write operation which references one of these objects needs to
be preceded by a delete if one has not yet been queued (and the
to-be-deleted set updated appropriately).  The tricky part is that the
primary and replicas may have different objects in this set...  The
replica would have to insert deletes ahead of any subop (or the ec
equivalent) it gets from the primary.  For that to work, it needs to
have something like the obc cache.  I have a wip-replica-read branch
which refactors object locking to allow the replica to maintain locks
(to avoid replica-reads conflicting with writes).  That machinery
would probably be the right place to put it.
3) We need to make sure that if a node restarts anywhere in this
process that it correctly repopulates the set of to be deleted
entries.  We might consider a deleted-to version in the log?  Not sure
about this one since it would be different on the replica and the
primary.

Anyway, it's actually more complicated than you'd expect and will
require more design (and probably depends on wip-replica-read
landing).
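
To make (1) and (2) a bit more concrete, a toy model of the bookkeeping
(standalone C++ with made-up names, not the actual OSD types):

#include <memory>
#include <string>
#include <unordered_map>

// Toy model only: a deferred-delete set that pins an exists=false context
// in the cache so reads cannot see the logically deleted object, plus the
// check a write path would do before queueing the real delete.
struct ObjectContext { bool exists = true; };
using ObjectContextRef = std::shared_ptr<ObjectContext>;
using ObcCache = std::unordered_map<std::string, ObjectContextRef>;

struct ToBeDeleted {
  std::unordered_map<std::string, ObjectContextRef> pending;  // keeps refs alive

  void defer(const std::string& oid, ObcCache& cache) {
    auto obc = std::make_shared<ObjectContext>();
    obc->exists = false;            // reads now see "no such object"
    cache[oid] = obc;
    pending[oid] = obc;
  }

  // Returns true if a real delete must be queued before this write.
  bool take_for_write(const std::string& oid) {
    return pending.erase(oid) > 0;
  }
};
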
-Sam

On Mon, Jan 4, 2016 at 3:32 PM, Guang Yang  wrote:
> Hi Cephers,
> Happy New Year! I got question regards to the long PG peering..
>
> Over the last several days I have been looking into the *long peering*
> problem when we start a OSD / OSD host, what I observed was that the
> two peering working threads were throttled (stuck) when trying to
> queue new transactions (writing pg log), thus the peering process are
> dramatically slow down.
>
> The first question came to me was, what were the transactions in the
> queue? The major ones, as I saw, included:
>
> - The osd_map and incremental osd_map, this happens if the OSD had
> been down for a while (in a large cluster), or when the cluster got
> upgrade, which made the osd_map epoch the down OSD had, was far behind
> the latest osd_map epoch. During the OSD booting, it would need to
> persist all those osd_maps and generate lots of filestore transactions
> (linear with the epoch gap).
>> As the PG was not involved in most of those epochs, could we only take and 
>> persist those osd_maps which matter to the PGs on the OSD?
>
> - There are lots of deletion transactions, and as the PG booting, it
> needs to merge the PG log from its peers, and for the deletion PG
> entry, it would need to queue the deletion transaction immediately.
>> Could we delay the queue of the transactions until all PGs on the host are 
>> peered?
>
> Thanks,
> Guang
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Long peering - throttle at FileStore::queue_transactions

2016-01-04 Thread Sage Weil
On Mon, 4 Jan 2016, Guang Yang wrote:
> Hi Cephers,
> Happy New Year! I got question regards to the long PG peering..
> 
> Over the last several days I have been looking into the *long peering*
> problem when we start a OSD / OSD host, what I observed was that the
> two peering working threads were throttled (stuck) when trying to
> queue new transactions (writing pg log), thus the peering process are
> dramatically slow down.
> 
> The first question came to me was, what were the transactions in the
> queue? The major ones, as I saw, included:
> 
> - The osd_map and incremental osd_map, this happens if the OSD had
> been down for a while (in a large cluster), or when the cluster got
> upgrade, which made the osd_map epoch the down OSD had, was far behind
> the latest osd_map epoch. During the OSD booting, it would need to
> persist all those osd_maps and generate lots of filestore transactions
> (linear with the epoch gap).
> > As the PG was not involved in most of those epochs, could we only take and 
> > persist those osd_maps which matter to the PGs on the OSD?

This part should happen before the OSD sends the MOSDBoot message, before 
anyone knows it exists.  There is a tunable threshold that controls how 
recent the map has to be before the OSD tries to boot.  If you're 
seeing this in the real world, we probably just need to adjust that value
way down to something small(er).

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Benachrichtigung (Notification)

2016-01-04 Thread EMAIL LOTTERIE



Dear email user!

Your email address has won €1,200,000.00 (ONE MILLION TWO HUNDRED
THOUSAND EURO) with the lucky numbers 9-3-8-26-28-4-64 in the EURO
MILLIONS EMAIL LOTTERY. The sum comes from a total prize payout of
€22,800,000.00 (TWENTY-TWO MILLION EIGHT HUNDRED THOUSAND), which was
shared among 19 winners in the same category. To claim your prize,
please contact the responsible case worker, Ms Christiane Hamann, by
email: christiane_hama...@aol.com
PLEASE FILL IN YOUR DETAILS BELOW.
Lucky numbers:___
NAME: ___SURNAME:_
ADDRESS:__
CITY: POSTAL CODE: COUNTRY: ___
DATE OF BIRTH: __OCCUPATION:
LANDLINE TEL. NO: 
MOBILE PHONE NO: ___FAX: ___
EMAIL:___ DATE
SIGNATURE:_

Please fill out the attached form completely and return it by email!
Yours faithfully,
Inmaculada Garcia Martinez
Coordinator.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Charity/Donation

2016-01-04 Thread Skoll, Jeff
Hi,
My name is Jeffrey Skoll, a philanthropist and the founder of one of the 
largest private foundations in the world. I believe strongly in ‘giving while 
living.’ I had one idea that never changed in my mind — that you should use 
your wealth to help people and I have decided to secretly give USD2.498 Million 
to a randomly selected individual. On receipt of this email, you should count 
yourself as the individual. Kindly get back to me at your earliest convenience, 
so I know your email address is valid.

Visit the web page to know more about me: 
http://www.theglobeandmail.com/news/national/meet-the-canadian-billionaire-whos-giving-it-all-away/article4209888/
 or you can read an article of me on Wikipedia.

Regards,
Jeffrey Skoll.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Speeding up rbd_stat() in libvirt

2016-01-04 Thread Jason Dillaman
Short term, assuming there wouldn't be an objection from the libvirt community,
I think spawning a thread pool and executing several rbd_stat calls
concurrently would be the easiest and cleanest solution.  I wouldn't
suggest trying to roll your own solution for retrieving image sizes for format
1 and 2 RBD images directly within libvirt.
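
As a rough illustration of that short-term approach (standalone C++ against
the librbd/librados C++ API, not the libvirt C backend; the 8-thread pool
size and the read-only opens are assumptions):

#include <rados/librados.hpp>
#include <rbd/librbd.hpp>
#include <atomic>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Sketch: stat all images from a small worker pool instead of serially.
// Assumes sharing one IoCtx across the workers is acceptable here.
std::map<std::string, librbd::image_info_t>
stat_all_images(librados::IoCtx& ioctx, unsigned nthreads = 8)
{
  librbd::RBD rbd;
  std::vector<std::string> names;
  rbd.list(ioctx, names);                     // same listing step as today

  std::map<std::string, librbd::image_info_t> out;
  std::mutex out_lock;
  std::atomic<size_t> next(0);

  auto worker = [&]() {
    for (size_t i = next++; i < names.size(); i = next++) {
      librbd::Image image;
      if (rbd.open_read_only(ioctx, image, names[i].c_str(), nullptr) < 0)
        continue;                             // skip images we cannot open
      librbd::image_info_t info;
      if (image.stat(info, sizeof(info)) == 0) {
        std::lock_guard<std::mutex> g(out_lock);
        out[names[i]] = info;
      }
      image.close();
    }
  };

  std::vector<std::thread> pool;
  for (unsigned i = 0; i < nthreads && i < names.size(); ++i)
    pool.emplace_back(worker);
  for (auto& t : pool)
    t.join();
  return out;
}

Whether libvirt would accept spawning threads there is, of course, the open
question above.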

Longer term, given this use case, perhaps it would make sense to add an async 
version of rbd_open.  The rbd_stat call itself just reads the data from memory 
initialized by rbd_open.  On the Jewel branch, librbd has had some major rework 
and image loading is asynchronous under the hood already.

-- 

Jason Dillaman 


- Original Message -
> From: "Wido den Hollander" 
> To: ceph-devel@vger.kernel.org
> Sent: Monday, December 28, 2015 8:48:40 AM
> Subject: Speeding up rbd_stat() in libvirt
> 
> Hi,
> 
> The storage pools of libvirt know a mechanism called 'refresh' which
> will scan a storage pool to refresh the contents.
> 
> The current implementation does:
> * List all images via rbd_list()
> * Call rbd_stat() on each image
> 
> Source:
> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=cdbfdee98505492407669130712046783223c3cf;hb=master#l329
> 
> This works, but a RBD pool with 10k images takes a couple of minutes to
> scan.
> 
> Now, Ceph is distributed, so this could be done in parallel, but before
> I start on this I was wondering if somebody had a good idea to fix this?
> 
> I don't know if it is allowed in libvirt to spawn multiple threads and
> have workers do this, but it was something which came to mind.
> 
> libvirt only wants to know the size of a image and this is now stored in
> the rbd_directory object, so the rbd_stat() is required.
> 
> Suggestions or ideas? I would like to have this process to be as fast as
> possible.
> 
> Wido
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Speeding up rbd_stat() in libvirt

2016-01-04 Thread Wido den Hollander


On 04-01-16 16:38, Jason Dillaman wrote:
> Short term, assuming there wouldn't be an objection from the libvirt
> community, I think spawning a thread pool and executing several
> rbd_stat calls concurrently would be the easiest and cleanest solution.  I
> wouldn't suggest trying to roll your own solution for retrieving image sizes
> for format 1 and 2 RBD images directly within libvirt.
> 

I'll ask in the libvirt community if they allow such a thing.

> Longer term, given this use case, perhaps it would make sense to add an async 
> version of rbd_open.  The rbd_stat call itself just reads the data from 
> memory initialized by rbd_open.  On the Jewel branch, librbd has had some 
> major rework and image loading is asynchronous under the hood already.
> 

Hmm, that would be nice. In the callback I could call rbd_stat() and
populate the volume list within libvirt.

I would very much like to go that route since it saves me a lot of code
inside libvirt ;)

Wido

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Create one million empty files with cephfs

2016-01-04 Thread Gregory Farnum
On Tue, Dec 29, 2015 at 4:55 AM, Fengguang Gong  wrote:
> hi,
> We create one million empty files through filebench, here is the test env:
> MDS: one MDS
> MON: one MON
> OSD: two OSD, each with one Inter P3700; data on OSD with 2x replica
> Network: all nodes are connected through 10 gigabit network
>
> We use more than one client to create files, to test the scalability of
> MDS. Here are the results:
> IOPS under one client: 850
> IOPS under two client: 1150
> IOPS under four client: 1180
>
> As we can see, the IOPS almost maintains unchanged when the number of
> client increase from 2 to 4.
>
> Cephfs may have a low scalability under one MDS, and we think its the big
> lock in
> MDSDamon::ms_dispatch()::Mutex::locker(every request acquires this lock),
> who limits the
> scalability of MDS.
>
> We think this big lock could be removed through the following steps:
> 1. separate the process of ClientRequest with other requests, so we can
> parallel the process
> of ClientRequest
> 2. use some small granularity locks instead of big lock to ensure
> consistency
>
> Wondering this idea is reasonable?

Parallelizing the MDS is probably a very big job; it's on our radar
but not for a while yet.

If one were to do it, yes, breaking down the big MDS lock would be the
way forward. I'm not sure entirely what that involves — you'd need to
significantly chunk up the locking on our more critical data
structures, most especially the MDCache. Luckily there is *some* help
there in terms of the file cap locking structures we already have in
place, but it's a *huge* project and not one to be undertaken lightly.
A special processing mechanism for ClientRequests versus other
requests is not an assumption I'd start with.

I think you'll find that file creates are just about the least
scalable thing you can do on CephFS right now, though, so there is
some easier ground. One obvious approach is to extend the current
inode preallocation — it already allocates inodes per-client and has a
fast path inside of the MDS for handing them back. It'd be great if
clients were aware of that preallocation and could create files
without waiting for the MDS to talk back to them! The issue with this
is two-fold:
1) need to update the cap flushing protocol to deal with files newly
created by the client
2) need to handle all the backtrace stuff normally performed by the
MDS on file create (which still needs to happen, on either the client
or the server)
There's also clean up in case of a client failure, but we've already
got a model for that in how we figure out real file sizes and things
based on max size.

I think there's a ticket about this somewhere, but I can't find it off-hand...
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 答复: Reboot blocked when undoing unmap op.

2016-01-04 Thread Ilya Dryomov
On Mon, Jan 4, 2016 at 10:51 AM, Wukongming  wrote:
> Hi, Ilya,
>
> It is an old problem.
> When you say "when you issue a reboot, daemons get killed and the kernel 
> client ends up waiting for the them to come back, because of outstanding 
> writes issued by umount called by systemd (or whatever)."
>
> Do you mean if umount rbd successfully, the process of kernel client will 
> stop waiting? What kind of Communication mechanism between libceph and 
> daemons(or ceph userspace)?

If you umount the filesystem on top of rbd and unmap rbd image, there
won't be anything to wait for.  In fact, if there aren't any other rbd
images mapped, libceph will clean up after itself and exit.

If you umount the filesystem on top of rbd but don't unmap the image,
libceph will remain there, along with some amount of communication
(keepalive messages, watch requests, etc).  However, all of that is
internal and is unlikely to block reboot.

If you don't umount the filesystem, your init system will try to umount
it, issuing FS requests to the rbd device.  We don't want to drop those
requests, so, if daemons are gone by then, libceph ends up blocking.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to configure if there are two network cards in Client

2015-12-31 Thread Linux Chips
It would certainly help those with less knowledge about networking in
Linux, though I do not know how many people using Ceph are in this
category. Sage and the others here may have a better idea about its
feasibility.
I usually use the rule-* and route-* files (on CentOS); they work with
NetworkManager and are very easy to configure. On Ubuntu you can put the
same settings in the interfaces file, and they are just as easy. If such a
tool is made, I think it should understand the ceph.conf file, but I doubt
it can figure out the routes correctly without you putting them in.


On 12/29/2015 03:58 PM, 蔡毅 wrote:

   Thanks for your replies.
   So would it be reasonable to write something like a shell script, shipped
as one of Ceph's tools, that binds a process to a specific IP and modifies
the routing tables and rules? That would make it convenient for users who
want to change which NIC connects to the OSDs.



At 2015-12-29 18:21:21, "Linux Chips"  wrote:

On 12/28/2015 07:47 PM, Sage Weil wrote:

On Fri, 25 Dec 2015, 蔡毅 wrote:

Hi all,
  When we read the code, we haven't found a function that lets the client
bind a specific IP. In Ceph's configuration, we could only find the
parameter 'public network', but it seems to act on the OSD and not the
client.
  There is a scenario where the client has two network cards, NIC1 and
NIC2. NIC1 is responsible for communicating with the cluster (monitors and
RADOS) and NIC2 carries other services besides the Ceph client. So we need
the client to bind a specific IP in order to separate the IP communicating
with the cluster from the IP serving other applications. We want to know
whether there is any configuration in Ceph to achieve this. If there is,
how could we configure the IP? If not, could we add this function to Ceph?
Thank you so much.

you can use routing tables plus routing rules. otherwise linux will just
use the default gateway.
or you can put the second interface on the same public net of ceph.
though that would break if you have multiple external nets.

Right.  There isn't a configurable to do this now--we've always just let
the kernel network layer sort it out. Is this just a matter of calling
bind on the socket before connecting? I've never done this before..

linux will send all packets to the default gateway event if an
application binds to an ip on different interface, the packet will go
out with the source address as the binded one but through your router.
the only solution, even if the bind function exists is to use the
routing tables and rules.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: how io works when backfill

2015-12-29 Thread Sage Weil
On Tue, 29 Dec 2015, Dong Wu wrote:
> if add in osd.7 and 7 becomes the primary: pg1.0 [1, 2, 3]  --> pg1.0
> [7, 2, 3],  is it similar with the example above?
> still install a pg_temp entry mapping the PG back to [1, 2, 3], then
> backfill happens to 7, normal io write to [1, 2, 3], if io to the
> portion of the PG that has already been backfilled will also be sent
> to osd.7?

Yes (although I forget how it picks the ordering of the osds in the temp 
mapping).  See PG::choose_acting() for the details.

> how about these examples about removing an osd:
> - pg1.0 [1, 2, 3]
> - osd.3 down and be removed
> - mapping changes to [1, 2, 5], but osd.5 has no data, then install a
> pg_temp mapping the PG back to [1, 2], then backfill happens to 5,
> - normal io write to [1, 2], if io hits object which has been
> backfilled to osd.5, io will also send to osd.5
> - when backfill completes, remove the pg_temp and mapping changes back
> to [1, 2, 5]

Yes

> another example:
> - pg1.0 [1, 2, 3]
> - osd.3 down and be removed
> - mapping changes to [5, 1, 2], but osd.5 has no data of the pg, then
> install a pg_temp mapping the PG back to [1, 2] which osd.1
> temporarily becomes the primary, then backfill happens to 5,
> - normal io write to [1, 2], if io hits object which has been
> backfilled to osd.5, io will also send to osd.5
> - when backfill completes, remove the pg_temp and mapping changes back
> to [5, 1, 2]
> 
> is my ananysis right?

Yep!

sage
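
(To make the rule in the quoted explanation below concrete, here is a toy
sketch; the types are hypothetical and string comparison stands in for
hobject ordering, so this is not the actual OSD code.)

#include <string>
#include <vector>

struct PGState {
  std::vector<int> acting;        // temporary acting set, e.g. [1, 2, 3]
  int backfill_target = -1;       // e.g. 7, or -1 if no backfill in progress
  std::string last_backfill;      // everything <= this has been copied
};

// Writes go to the acting set, plus the backfill target for objects in the
// already-backfilled portion, so the copied part stays up to date.
std::vector<int> write_targets(const PGState& pg, const std::string& oid) {
  std::vector<int> targets = pg.acting;
  if (pg.backfill_target >= 0 && oid <= pg.last_backfill)
    targets.push_back(pg.backfill_target);
  return targets;
}
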

> 
> 2015-12-29 1:30 GMT+08:00 Sage Weil :
> > On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
> >> 2015-12-27 20:48 GMT+08:00 Dong Wu :
> >> > Hi,
> >> > When add osd or remove osd, ceph will backfill to rebalance data.
> >> > eg:
> >> > - pg1.0[1, 2, 3]
> >> > - add an osd(eg. osd.7)
> >> > - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7]
> >> > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now
> >> > object a is backfilling
> >> > - when a write io hits object a, then the io needs to wait for its
> >> > complete, then goes on.
> >> > - but if io hits object b which has not been backfilled, io reaches
> >> > osd.1, then osd.1 send the io to osd.2  and osd.7, but osd.7 does not
> >> > have object b, so osd.7 needs to wait for object b to backfilled, then
> >> > write. Is it right? Or osd.1 only send the io to osd.2, not both?
> >>
> >> I think in this case, when the write of object b reaches osd.1, it
> >> holds the client write, raises the priority of the recovery of object
> >> b, and kick off the recovery of it. When the recovery of object b is
> >> done, it requeue the client write, and then everything goes like
> >> usual.
> >
> > It's more complicated than that.  In a normal (log-based) recovery
> > situation, it is something like the above: if the acting set is [1,2,3]
> > but 3 is missing the latest copy of A, a write to A will block on the
> > primary while the primary initiates recovery of A immediately.  Once that
> > completes the IO will continue.
> >
> > For backfill, it's different.  In your example, you start with [1,2,3]
> > then add in osd.7.  The OSD will see that 7 has no data for the PG and
> > install a pg_temp entry mapping the PG back to [1,2,3] temporarily.  Then
> > things will proceed normally while backfill happens to 7.  Backfill won't
> > interfere with normal IO at all, except that IO to the portion of the PG
> > that has already been backfilled will also be sent to the backfill target
> > (7) so that it stays up to date.  Once it completes, the pg_temp entry is
> > removed and the mapping changes back to [1,2,7].  Then osd.3 is allowed to
> > remove its copy of the PG.
> >
> > sage
> >
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Create one million empty files with cephfs

2015-12-29 Thread Fengguang Gong
Hi,
We create one million empty files through filebench; here is the test env:
MDS: one MDS
MON: one MON
OSD: two OSDs, each with one Intel P3700; data on OSDs with 2x replica
Network: all nodes are connected through a 10 gigabit network

We use more than one client to create files, to test the scalability of
MDS. Here are the results:
IOPS under one client: 850
IOPS under two clients: 1150
IOPS under four clients: 1180

As we can see, the IOPS stays almost unchanged when the number of
clients increases from 2 to 4.

CephFS may have low scalability under one MDS, and we think it is the big
lock in MDSDaemon::ms_dispatch()::Mutex::Locker (every request acquires
this lock) that limits the scalability of the MDS.

We think this big lock could be removed through the following steps:
1. separate the processing of ClientRequests from other requests, so we
can parallelize the processing of ClientRequests
2. use finer-grained locks instead of the big lock to ensure consistency

Is this idea reasonable?
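
As a generic illustration of what finer-grained locking means (purely a
sketch with made-up names, not a claim about how the MDS should structure
its locks):

#include <cstddef>
#include <functional>
#include <mutex>
#include <string>

// Shard one big mutex by key (here the parent directory path) so that
// unrelated creates do not serialize on a single lock.
class ShardedLock {
  static constexpr std::size_t kShards = 64;
  std::mutex shards_[kShards];
public:
  std::mutex& for_key(const std::string& key) {
    return shards_[std::hash<std::string>{}(key) % kShards];
  }
};

// Usage: guard only the shard covering the directory being modified, e.g.
//   ShardedLock locks;
//   std::lock_guard<std::mutex> guard(locks.for_key(parent_dir));
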

thanks
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re:Re: How to configure if there are two network cards in Client

2015-12-29 Thread 蔡毅

  Thanks for your replies.
  So would it be reasonable to write something like a shell script, shipped
as one of Ceph's tools, that binds a process to a specific IP and modifies
the routing tables and rules? That would make it convenient for users who
want to change which NIC connects to the OSDs.



At 2015-12-29 18:21:21, "Linux Chips"  wrote:
>On 12/28/2015 07:47 PM, Sage Weil wrote:
>> On Fri, 25 Dec 2015, 蔡毅 wrote:
>>> Hi all,
>>>  When we read the code, we haven't found a function that lets the client
>>> bind a specific IP. In Ceph's configuration, we could only find the
>>> parameter 'public network', but it seems to act on the OSD and not the
>>> client.
>>>  There is a scenario where the client has two network cards, NIC1 and
>>> NIC2. NIC1 is responsible for communicating with the cluster (monitors
>>> and RADOS) and NIC2 carries other services besides the Ceph client. So we
>>> need the client to bind a specific IP in order to separate the IP
>>> communicating with the cluster from the IP serving other applications. We
>>> want to know whether there is any configuration in Ceph to achieve this.
>>> If there is, how could we configure the IP? If not, could we add this
>>> function to Ceph? Thank you so much.
>you can use routing tables plus routing rules. otherwise linux will just 
>use the default gateway.
>or you can put the second interface on the same public net of ceph. 
>though that would break if you have multiple external nets.
>> Right.  There isn't a configurable to do this now--we've always just let
>> the kernel network layer sort it out. Is this just a matter of calling
>> bind on the socket before connecting? I've never done this before..
>linux will send all packets to the default gateway event if an 
>application binds to an ip on different interface, the packet will go 
>out with the source address as the binded one but through your router. 
>the only solution, even if the bind function exists is to use the 
>routing tables and rules.
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


Re: How to configure if there are two network cards in Client

2015-12-29 Thread Linux Chips

On 12/28/2015 07:47 PM, Sage Weil wrote:

On Fri, 25 Dec 2015, 蔡毅 wrote:

Hi all,
 When we read the code, we haven't found a function that lets the client
bind a specific IP. In Ceph's configuration, we could only find the
parameter 'public network', but it seems to act on the OSD and not the
client.
 There is a scenario where the client has two network cards, NIC1 and
NIC2. NIC1 is responsible for communicating with the cluster (monitors and
RADOS) and NIC2 carries other services besides the Ceph client. So we need
the client to bind a specific IP in order to separate the IP communicating
with the cluster from the IP serving other applications. We want to know
whether there is any configuration in Ceph to achieve this. If there is,
how could we configure the IP? If not, could we add this function to Ceph?
Thank you so much.
You can use routing tables plus routing rules; otherwise Linux will just
use the default gateway.
Or you can put the second interface on the same public net as Ceph,
though that would break if you have multiple external nets.

Right.  There isn't a configurable to do this now--we've always just let
the kernel network layer sort it out. Is this just a matter of calling
bind on the socket before connecting? I've never done this before..
Linux will send the packets to the default gateway even if an
application binds to an IP on a different interface; the packet will go
out with the bound source address but through your router. The only
real solution, even if the bind call is made, is to use the routing
tables and rules.
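
(For reference, "calling bind on the socket before connecting", as asked
above, looks roughly like the sketch below; it only selects the source
address, and as noted the routing tables and rules still decide the egress
path. The function name, addresses and port here are made up for
illustration.)

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Sketch only: pick the source IP (e.g. NIC1 = 192.0.2.10) by binding
// before connect. Routing still chooses the actual egress interface.
int connect_from(const char* local_ip, const char* mon_ip, int mon_port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;

  sockaddr_in local{};
  local.sin_family = AF_INET;
  local.sin_port = 0;                          // any local port
  inet_pton(AF_INET, local_ip, &local.sin_addr);
  if (bind(fd, (sockaddr*)&local, sizeof(local)) < 0) { close(fd); return -1; }

  sockaddr_in mon{};
  mon.sin_family = AF_INET;
  mon.sin_port = htons(mon_port);
  inet_pton(AF_INET, mon_ip, &mon.sin_addr);
  if (connect(fd, (sockaddr*)&mon, sizeof(mon)) < 0) { close(fd); return -1; }
  return fd;
}
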


sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: how io works when backfill

2015-12-28 Thread Dong Wu
If we add in osd.7 and 7 becomes the primary, pg1.0 [1, 2, 3] --> pg1.0
[7, 2, 3], is it similar to the example above?
Do we still install a pg_temp entry mapping the PG back to [1, 2, 3], then
backfill happens to 7, normal IO writes go to [1, 2, 3], and IO to the
portion of the PG that has already been backfilled will also be sent
to osd.7?

how about these examples about removing an osd:
- pg1.0 [1, 2, 3]
- osd.3 down and be removed
- mapping changes to [1, 2, 5], but osd.5 has no data, then install a
pg_temp mapping the PG back to [1, 2], then backfill happens to 5,
- normal io write to [1, 2], if io hits object which has been
backfilled to osd.5, io will also send to osd.5
- when backfill completes, remove the pg_temp and mapping changes back
to [1, 2, 5]


another example:
- pg1.0 [1, 2, 3]
- osd.3 down and be removed
- mapping changes to [5, 1, 2], but osd.5 has no data of the pg, then
install a pg_temp mapping the PG back to [1, 2] which osd.1
temporarily becomes the primary, then backfill happens to 5,
- normal io write to [1, 2], if io hits object which has been
backfilled to osd.5, io will also send to osd.5
- when backfill completes, remove the pg_temp and mapping changes back
to [5, 1, 2]

Is my analysis right?

2015-12-29 1:30 GMT+08:00 Sage Weil :
> On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
>> 2015-12-27 20:48 GMT+08:00 Dong Wu :
>> > Hi,
>> > When add osd or remove osd, ceph will backfill to rebalance data.
>> > eg:
>> > - pg1.0[1, 2, 3]
>> > - add an osd(eg. osd.7)
>> > - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7]
>> > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now
>> > object a is backfilling
>> > - when a write io hits object a, then the io needs to wait for its
>> > complete, then goes on.
>> > - but if io hits object b which has not been backfilled, io reaches
>> > osd.1, then osd.1 send the io to osd.2  and osd.7, but osd.7 does not
>> > have object b, so osd.7 needs to wait for object b to backfilled, then
>> > write. Is it right? Or osd.1 only send the io to osd.2, not both?
>>
>> I think in this case, when the write of object b reaches osd.1, it
>> holds the client write, raises the priority of the recovery of object
>> b, and kick off the recovery of it. When the recovery of object b is
>> done, it requeue the client write, and then everything goes like
>> usual.
>
> It's more complicated than that.  In a normal (log-based) recovery
> situation, it is something like the above: if the acting set is [1,2,3]
> but 3 is missing the latest copy of A, a write to A will block on the
> primary while the primary initiates recovery of A immediately.  Once that
> completes the IO will continue.
>
> For backfill, it's different.  In your example, you start with [1,2,3]
> then add in osd.7.  The OSD will see that 7 has no data for the PG and 
> install a pg_temp entry mapping the PG back to [1,2,3] temporarily.  Then
> things will proceed normally while backfill happens to 7.  Backfill won't
> interfere with normal IO at all, except that IO to the portion of the PG
> that has already been backfilled will also be sent to the backfill target
> (7) so that it stays up to date.  Once it completes, the pg_temp entry is 
> removed and the mapping changes back to [1,2,7].  Then osd.3 is allowed to
> remove its copy of the PG.
>
> sage
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fwd: how io works when backfill

2015-12-28 Thread Zhiqiang Wang
2015-12-27 20:48 GMT+08:00 Dong Wu :
> Hi,
> When add osd or remove osd, ceph will backfill to rebalance data.
> eg:
> - pg1.0[1, 2, 3]
> - add an osd(eg. osd.7)
> - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7]
> - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now
> object a is backfilling
> - when a write io hits object a, then the io needs to wait for its
> complete, then goes on.
> - but if io hits object b which has not been backfilled, io reaches
> osd.1, then osd.1 send the io to osd.2  and osd.7, but osd.7 does not
> have object b, so osd.7 needs to wait for object b to backfilled, then
> write. Is it right? Or osd.1 only send the io to osd.2, not both?

I think in this case, when the write of object b reaches osd.1, it
holds the client write, raises the priority of the recovery of object
b, and kicks off its recovery. When the recovery of object b is
done, it requeues the client write, and then everything goes on as
usual.

> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Speeding up rbd_stat() in libvirt

2015-12-28 Thread Wido den Hollander
Hi,

The storage pools of libvirt know a mechanism called 'refresh' which
will scan a storage pool to refresh the contents.

The current implementation does:
* List all images via rbd_list()
* Call rbd_stat() on each image

Source:
http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=cdbfdee98505492407669130712046783223c3cf;hb=master#l329

This works, but an RBD pool with 10k images takes a couple of minutes to
scan.

Now, Ceph is distributed, so this could be done in parallel, but before
I start on this I was wondering if somebody had a good idea to fix this?

I don't know if it is allowed in libvirt to spawn multiple threads and
have workers do this, but it was something which came to mind.
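As a rough illustration of the parallel approach (a sketch against the librbd C API, not the libvirt patch itself; the pool name and worker count are arbitrary, and whether a single ioctx can safely be shared by all workers still needs to be verified):

// Sketch: list the images once, then let a few worker threads open and
// stat them concurrently, printing each image's size.
#include <rados/librados.h>
#include <rbd/librbd.h>
#include <atomic>
#include <cstdio>
#include <cstring>
#include <string>
#include <thread>
#include <vector>

int main() {
  rados_t cluster;
  if (rados_create(&cluster, NULL) < 0) return 1;
  rados_conf_read_file(cluster, NULL);          // default ceph.conf locations
  if (rados_connect(cluster) < 0) return 1;

  rados_ioctx_t io;
  if (rados_ioctx_create(cluster, "rbd", &io) < 0) return 1;   // assumed pool name

  std::vector<char> buf(1 << 20);               // name buffer; a real version
  size_t buf_len = buf.size();                  // would retry on -ERANGE
  int r = rbd_list(io, buf.data(), &buf_len);
  if (r < 0) return 1;

  std::vector<std::string> images;
  for (const char *p = buf.data(); p < buf.data() + r; p += strlen(p) + 1)
    images.emplace_back(p);

  std::atomic<size_t> next{0};
  auto worker = [&]() {
    for (size_t i = next++; i < images.size(); i = next++) {
      rbd_image_t img;
      if (rbd_open(io, images[i].c_str(), &img, NULL) < 0)
        continue;
      rbd_image_info_t info;
      if (rbd_stat(img, &info, sizeof(info)) == 0)
        printf("%s: %llu bytes\n", images[i].c_str(),
               (unsigned long long)info.size);
      rbd_close(img);
    }
  };

  std::vector<std::thread> pool;
  for (int t = 0; t < 8; ++t) pool.emplace_back(worker);   // 8 workers, arbitrary
  for (auto &t : pool) t.join();

  rados_ioctx_destroy(io);
  rados_shutdown(cluster);
  return 0;
}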

libvirt only wants to know the size of an image, and this is not stored in
the rbd_directory object, so the rbd_stat() call is required.

Suggestions or ideas? I would like this process to be as fast as
possible.

Wido
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FreeBSD Building and Testing

2015-12-28 Thread Willem Jan Withagen

Hi,

Can somebody try to help me and explain why

the test: Func: test/mon/osd-crush
Func: TEST_crush_reject_empty started

fails with a Python error which sort of startles me:
test/mon/osd-crush.sh:227: TEST_crush_reject_empty:  local 
empty_map=testdir/osd-crush/empty_map

test/mon/osd-crush.sh:228: TEST_crush_reject_empty:  :
test/mon/osd-crush.sh:229: TEST_crush_reject_empty:  ./crushtool -c 
testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.map
test/mon/osd-crush.sh:230: TEST_crush_reject_empty:  expect_failure 
testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1171: expect_failure:  local 
dir=testdir/osd-crush

../qa/workunits/ceph-helpers.sh:1172: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1173: expect_failure:  local 
'expected=Error EINVAL'

../qa/workunits/ceph-helpers.sh:1174: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1175: expect_failure:  local success
../qa/workunits/ceph-helpers.sh:1176: expect_failure:  pwd
../qa/workunits/ceph-helpers.sh:1177: expect_failure:  printenv
../qa/workunits/ceph-helpers.sh:1178: expect_failure:  echo ./ceph osd 
setcrushmap -i testdir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1180: expect_failure:  ./ceph osd 
setcrushmap -i testdir/osd-crush/empty_map.map

*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
Traceback (most recent call last):
  File "./ceph", line 936, in 
retval = main()
  File "./ceph", line 874, in main
sigdict, inbuf, verbose)
  File "./ceph", line 457, in new_style_command
inbuf=inbuf)
  File 
"/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py", line 
1208, in json_command

raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": 
"osd setcrushmap"}']": exception 'utf8' codec can't decode b

yte 0x86 in position 56: invalid start byte

Which is certainly not the type of error expected.
But it is hard to detect any 0x86 in the arguments.

And yes python is right, there are no UTF8 sequences that start with 0x86.
Question is:
Why does it want to parse with UTF8?
And how do I switch it off?
Or how do I fix this error?

Thanx,
--WjW
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: how io works when backfill

2015-12-28 Thread Sage Weil
On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
> 2015-12-27 20:48 GMT+08:00 Dong Wu :
> > Hi,
> > When add osd or remove osd, ceph will backfill to rebalance data.
> > eg:
> > - pg1.0[1, 2, 3]
> > - add an osd(eg. osd.7)
> > - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7]
> > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now
> > object a is backfilling
> > - when a write io hits object a, then the io needs to wait for its
> > complete, then goes on.
> > - but if io hits object b which has not been backfilled, io reaches
> > osd.1, then osd.1 send the io to osd.2  and osd.7, but osd.7 does not
> > have object b, so osd.7 needs to wait for object b to backfilled, then
> > write. Is it right? Or osd.1 only send the io to osd.2, not both?
> 
> I think in this case, when the write of object b reaches osd.1, it
> holds the client write, raises the priority of the recovery of object
> b, and kick off the recovery of it. When the recovery of object b is
> done, it requeue the client write, and then everything goes like
> usual.

It's more complicated than that.  In a normal (log-based) recovery 
situation, it is something like the above: if the acting set is [1,2,3] 
but 3 is missing the latest copy of A, a write to A will block on the 
primary while the primary initiates recovery of A immediately.  Once that 
completes the IO will continue.

For backfill, it's different.  In your example, you start with [1,2,3] 
then add in osd.7.  The OSD will see that 7 has no data for the PG and 
install a pg_temp entry mapping the PG back to [1,2,3] temporarily.  Then 
things will proceed normally while backfill happens to 7.  Backfill won't 
interfere with normal IO at all, except that IO to the portion of the PG 
that has already been backfilled will also be sent to the backfill target 
(7) so that it stays up to date.  Once it completes, the pg_temp entry is 
removed and the mapping changes back to [1,2,7].  Then osd.3 is allowed to 
remove its copy of the PG.
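To illustrate the "IO to the already-backfilled portion also goes to the backfill target" part, here is a toy sketch, loosely modelled on the OSD's last_backfill watermark but with made-up names and not the actual OSD code:

// Illustrative pseudocode: objects sorted at or before the watermark have
// already been copied, so writes to them are also sent to the backfill
// target; objects past it skip the target (backfill will reach them later).
#include <cstdio>
#include <string>
#include <vector>

struct BackfillTarget {
  int osd;
  std::string last_backfill;   // highest object already backfilled (made-up ordering)
};

std::vector<int> replicas_for_write(const std::string &object,
                                    const std::vector<int> &acting,   // e.g. {1, 2, 3} via pg_temp
                                    const std::vector<BackfillTarget> &targets) {
  std::vector<int> out = acting;              // the acting replicas always get the write
  for (const auto &t : targets)
    if (object <= t.last_backfill)            // already backfilled: keep that copy up to date
      out.push_back(t.osd);                   // e.g. osd.7 in the example above
  return out;                                 // not yet backfilled: the target is skipped
}

int main() {
  std::vector<BackfillTarget> targets = {{7, "obj_0042"}};
  for (int osd : replicas_for_write("obj_0010", {1, 2, 3}, targets))
    printf("%d ", osd);                       // prints: 1 2 3 7
  printf("\n");
  return 0;
}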

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to configure if there are two network cards in Client

2015-12-28 Thread Sage Weil
On Fri, 25 Dec 2015, 蔡毅 wrote:
> Hi all,
> When we read the code, we haven't found a way for the client to 
> bind a specific IP. In Ceph's configuration, we could only find the parameter 
> "public network", but it seems to act on the OSD but not the client.
> There is a scenario where the client has two network cards named NIC1 and 
> NIC2. NIC1 is responsible for communicating with the cluster (monitor and 
> RADOS) and NIC2 carries other services besides Ceph's client. So we need the 
> client to be able to bind a specific IP in order to separate the IP communicating 
> with the cluster from the IP serving other applications. We want to know whether 
> there is any configuration in Ceph to achieve this. If there is, how 
> could we configure the IP? If not, could we add this function to Ceph? Thank 
> you so much.

Right.  There isn't a configurable to do this now--we've always just let 
the kernel network layer sort it out. Is this just a matter of calling 
bind on the socket before connecting? I've never done this before..

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ceph branch status

2015-12-28 Thread ceph branch robot
-- All Branches --

Abhishek Varshney 
2015-11-23 11:45:29 +0530   infernalis-backports

Adam C. Emerson 
2015-12-21 16:51:39 -0500   wip-cxx11concurrency

Adam Crume 
2014-12-01 20:45:58 -0800   wip-doc-rbd-replay

Alfredo Deza 
2015-03-23 16:39:48 -0400   wip-11212
2015-12-23 11:25:13 -0500   wip-doc-style

Alfredo Deza 
2014-07-08 13:58:35 -0400   wip-8679
2014-09-04 13:58:14 -0400   wip-8366
2014-10-13 11:10:10 -0400   wip-9730

Ali Maredia 
2015-11-25 13:45:29 -0500   wip-10587-split-servers
2015-12-23 12:01:46 -0500   wip-cmake
2015-12-23 16:12:47 -0500   wip-cmake-rocksdb

Barbora Ančincová 
2015-11-04 16:43:45 +0100   wip-doc-RGW

Boris Ranto 
2015-09-04 15:19:11 +0200   wip-bash-completion

Daniel Gryniewicz 
2015-11-11 09:06:00 -0500   wip-rgw-storage-class
2015-12-09 12:56:37 -0500   cmake-dang

Danny Al-Gaaf 
2015-04-23 16:32:00 +0200   wip-da-SCA-20150421
2015-04-23 17:18:57 +0200   wip-nosetests
2015-04-23 18:20:16 +0200   wip-unify-num_objects_degraded
2015-11-03 14:10:47 +0100   wip-da-SCA-20151029
2015-11-03 14:40:44 +0100   wip-da-SCA-20150910

David Zafman 
2014-08-29 10:41:23 -0700   wip-libcommon-rebase
2015-04-24 13:14:23 -0700   wip-cot-giant
2015-09-28 11:33:11 -0700   wip-12983
2015-12-22 16:19:25 -0800   wip-zafman-testing

Dongmao Zhang 
2014-11-14 19:14:34 +0800   thesues-master

Greg Farnum 
2015-04-29 21:44:11 -0700   wip-init-names
2015-07-16 09:28:24 -0700   hammer-12297
2015-10-02 13:00:59 -0700   greg-infernalis-lock-testing
2015-10-02 13:09:05 -0700   greg-infernalis-lock-testing-cacher
2015-10-07 00:45:24 -0700   greg-infernalis-fs
2015-10-21 17:43:07 -0700   client-pagecache-norevoke
2015-10-27 11:32:46 -0700   hammer-pg-replay
2015-11-24 07:17:33 -0800   greg-fs-verify
2015-12-11 00:24:40 -0800   greg-fs-testing

Greg Farnum 
2014-10-23 13:33:44 -0700   wip-forward-scrub

Guang G Yang 
2015-06-26 20:31:44 +   wip-ec-readall
2015-07-23 16:13:19 +   wip-12316

Guang Yang 
2014-09-25 00:47:46 +   wip-9008
2015-10-20 15:30:41 +   wip-13441

Haomai Wang 
2015-10-26 00:02:04 +0800   wip-13521

Haomai Wang 
2014-07-27 13:37:49 +0800   wip-flush-set
2015-04-20 00:47:59 +0800   update-organization
2015-07-21 19:33:56 +0800   fio-objectstore
2015-08-26 09:57:27 +0800   wip-recovery-attr
2015-10-24 23:39:07 +0800   fix-compile-warning

Hector Martin 
2015-12-03 03:07:02 +0900   wip-cython-rbd

Ilya Dryomov 
2014-09-05 16:15:10 +0400   wip-rbd-notify-errors

Ivo Jimenez 
2015-08-24 23:12:45 -0700   hammer-with-new-workunit-for-wip-12551

James Page 
2015-11-04 11:08:42 +   javacruft-wip-ec-modules

Jason Dillaman 
2015-08-31 23:17:53 -0400   wip-12698
2015-11-13 02:00:21 -0500   wip-11287-rebased

Jenkins 
2015-11-04 14:31:13 -0800   rhcs-v0.94.3-ubuntu

Jenkins 
2014-07-29 05:24:39 -0700   wip-nhm-hang
2014-10-14 12:10:38 -0700   wip-2
2015-02-02 10:35:28 -0800   wip-sam-v0.92
2015-08-21 12:46:32 -0700   last
2015-08-21 12:46:32 -0700   loic-v9.0.3
2015-09-15 10:23:18 -0700   rhcs-v0.80.8
2015-09-21 16:48:32 -0700   rhcs-v0.94.1-ubuntu

Joao Eduardo Luis 
2014-09-10 09:39:23 +0100   wip-leveldb-get.dumpling

Joao Eduardo Luis 
2014-07-22 15:41:42 +0100   wip-leveldb-misc

Joao Eduardo Luis 
2014-09-02 17:19:52 +0100   wip-leveldb-get
2014-10-17 16:20:11 +0100   wip-paxos-fix
2014-10-21 21:32:46 +0100   wip-9675.dumpling
2015-07-27 21:56:42 +0100   wip-11470.hammer
2015-09-09 15:45:45 +0100   wip-11786.hammer

Joao Eduardo Luis 
2014-11-17 16:43:53 +   wip-mon-osdmap-cleanup
2014-12-15 16:18:56 +   wip-giant-mon-backports
2014-12-17 17:13:57 +   wip-mon-backports.firefly
2014-12-17 23:15:10 +   wip-mon-sync-fix.dumpling
2015-01-07 23:01:00 +   wip-mon-blackhole-mlog-0.87.7
2015-01-10 02:40:42 +   wip-dho-joao
2015-01-10 02:46:31 +   

Cordial greeting

2015-12-28 Thread Zahra Robert



Cordial greeting message from Fatima, I am seeking for your help,I will be
very glad if you do assist me to relocate a sum of (US$4 Million Dollars)
into your Bank account in your country for the benefit of both of us i
want to use this money for investment. I will give you more details as you
reply Yours Eva Zahra Robert

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CEPH build

2015-12-28 Thread Odintsov Vladislav
Hi,

resending my letter.
Thank you for the attention.


Best regards,

Vladislav Odintsov


From: Sage Weil 
Sent: Monday, December 28, 2015 19:49
To: Odintsov Vladislav
Subject: Re: CEPH build

Can you resend this to ceph-devel, and copy ad...@redhat.com?

On Fri, 25 Dec 2015, Odintsov Vladislav wrote:

>
> Hi, Sage!
>
>
> I'm working at a cloud provider as a system engineer, and now
> I'm trying to build different versions of CEPH (0.94, 9.2, 10.0) with libxio
> enabled, and I've got a problem understanding how the ceph maintainers
> create official tarballs and builds from the git repo.
>
> I saw you as a maintainer of the build-related files in the repo, and thought you
> could help me :) If I'm wrong, please tell me who can.
>
> I've found many information sources with different descriptions of the ceph
> build process:
>
> - https://github.com/ceph/ceph-build
>
> - https://github.com/ceph/autobuild-ceph
>
> - documentation on ceph.docs.
>
>
> But I'm unable to get the same tarball as
> at http://download.ceph.com/tarballs/
>
> for example for version v0.94.5. What else should I read? Or, maybe there is
> some magic...)
>
>
> Actually, I want to understand how the official builds are made (which tools); I'd
> like to go through all the build-related steps myself to understand the
> upstream building process.
>
>
> Thanks a lot for your help!
>
>
> 
> Best regards,
>
> Vladislav Odintsov
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


how io works when backfill

2015-12-27 Thread Dong Wu
Hi,
When add osd or remove osd, ceph will backfill to rebalance data.
eg:
- pg1.0[1, 2, 3]
- add an osd(eg. osd.7)
- ceph start backfill, then pg1.0 osd set changes to [1, 2, 7]
- if [a, b, c, d, e] are objects needing to backfill to osd.7 and now
object a is backfilling
- when a write io hits object a, then the io needs to wait for its
complete, then goes on.
- but if io hits object b which has not been backfilled, io reaches
osd.1, then osd.1 send the io to osd.2  and osd.7, but osd.7 does not
have object b, so osd.7 needs to wait for object b to backfilled, then
write. Is it right? Or osd.1 only send the io to osd.2, not both?
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-25 Thread Dong Wu
Thank you for your reply. I am looking forward to Sage's opinion too @sage.
Also I'll keep up with BlueStore's and Kstore's progress.

Regards

2015-12-25 14:48 GMT+08:00 Ning Yao :
> Hi, Dong Wu,
>
> 1. As I currently work for other things, this proposal is abandon for
> a long time
> 2. This is a complicated task as we need to consider a lots such as
> (not just for writeOp, as well as truncate, delete) and also need to
> consider the different affects for different backends(Replicated, EC).
> 3. I don't think it is good time to redo this patch now, since the
> BlueStore and Kstore  is inprogress, and I'm afraid to bring some
> side-effect.  We may prepare and propose the whole design in next CDS.
> 4. Currently, we already have some tricks to deal with recovery (like
> throttle the max recovery op, set the priority for recovery and so
> on). So this kind of patch may not solve the critical problem but just
> make things better, and I am not quite sure that this will really
> bring a big improvement. Based on my previous test, it works
> excellently on slow disk (say hdd), and also for a short-time
> maintaining. Otherwise, it will trigger the backfill process.  So wait
> for Sage's opinion @sage
>
> If you are interest on this, we may cooperate to do this.
>
> Regards
> Ning Yao
>
>
> 2015-12-25 14:23 GMT+08:00 Dong Wu :
>> Thanks, from this pull request I learned that this issue is not
>> completed, is there any new progress of this issue?
>>
>> 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) :
>>> Yeah, This is good idea for recovery, but not for backfill.
>>> @YaoNing have pull a request about this
>>> https://github.com/ceph/ceph/pull/3837 this year.
>>>
>>> 2015-12-25 11:16 GMT+08:00 Dong Wu :
 Hi,
 I have doubt about pglog, the pglog contains (op,object,version) etc.
 when peering, use pglog to construct missing list,then recover the
 whole object in missing list even if different data among replicas is
 less then a whole object data(eg,4MB).
 why not add (offset,len) to pglog? If so, the missing list can contain
 (object, offset, len), then we can reduce recover data.
 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Xinze Chi
>> ___
>> ceph-users mailing list
>> ceph-us...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


How to configure if there are two network cards in Client

2015-12-25 Thread 蔡毅
Hi all,
When we read the code, we haven't found a way for the client to 
bind a specific IP. In Ceph's configuration, we could only find the parameter 
"public network", but it seems to act on the OSD but not the client.
There is a scenario where the client has two network cards named NIC1 and 
NIC2. NIC1 is responsible for communicating with the cluster (monitor and 
RADOS) and NIC2 carries other services besides Ceph's client. So we need the 
client to be able to bind a specific IP in order to separate the IP communicating with 
the cluster from the IP serving other applications. We want to know whether there is 
any configuration in Ceph to achieve this. If there is, how could we 
configure the IP? If not, could we add this function to Ceph? Thank you so much.
Best regards,
Cai Yi


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-25 Thread Sage Weil
On Fri, 25 Dec 2015, Ning Yao wrote:
> Hi, Dong Wu,
> 
> 1. As I currently work for other things, this proposal is abandon for
> a long time
> 2. This is a complicated task as we need to consider a lots such as
> (not just for writeOp, as well as truncate, delete) and also need to
> consider the different affects for different backends(Replicated, EC).
> 3. I don't think it is good time to redo this patch now, since the
> BlueStore and Kstore  is inprogress, and I'm afraid to bring some
> side-effect.  We may prepare and propose the whole design in next CDS.
> 4. Currently, we already have some tricks to deal with recovery (like
> throttle the max recovery op, set the priority for recovery and so
> on). So this kind of patch may not solve the critical problem but just
> make things better, and I am not quite sure that this will really
> bring a big improvement. Based on my previous test, it works
> excellently on slow disk (say hdd), and also for a short-time
> maintaining. Otherwise, it will trigger the backfill process.  So wait
> for Sage's opinion @sage
> 
> If you are interest on this, we may cooperate to do this.

I think it's a great idea.  We didn't do it before only because it is 
complicated.  The good news is that if we can't conclusively infer exactly 
which parts of the object need to be recovered from the log entry we can 
always just fall back to recovering the whole thing.  Also, the place 
where this is currently most visible is RBD small writes:

 - osd goes down
 - client sends a 4k overwrite and modifies an object
 - osd comes back up
 - client sends another 4k overwrite
 - client io blocks while osd recovers 4mb

So even if we initially ignore truncate and omap and EC and clones and 
anything else complicated I suspect we'll get a nice benefit.

I haven't thought about this too much, but my guess is that the hard part 
is making the primary's missing set representation include a partial delta 
(say, an interval_set<> indicating which ranges of the file have changed) 
in a way that gracefully degrades to recovering the whole object if we're 
not sure.
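To make the idea concrete, here is a toy sketch (deliberately not Ceph's pg_missing_t or interval_set code) of a missing entry that carries (offset, len) ranges and falls back to whole-object recovery when the delta cannot be inferred:

// Illustrative sketch: track dirty ranges per missing object, degrade to
// whole-object recovery for anything we cannot reason about (truncate,
// omap, clones, ...).
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>

struct PartialMissing {
  bool whole_object = false;                   // fallback: recover everything
  std::map<uint64_t, uint64_t> ranges;         // offset -> len (simplified interval set)

  void note_write(uint64_t off, uint64_t len) {
    if (whole_object) return;
    ranges[off] = std::max(ranges[off], len);  // naive; a real interval set coalesces neighbours
  }
  void note_unknown() {                        // e.g. truncate or omap op: give up on the delta
    whole_object = true;
    ranges.clear();
  }
};

int main() {
  PartialMissing m;
  m.note_write(4096, 4096);                    // a 4k overwrite only dirties one small range
  m.note_write(8192, 512);
  if (m.whole_object)
    printf("recover whole object\n");
  else
    for (auto &r : m.ranges)
      printf("recover range %lu+%lu\n", (unsigned long)r.first, (unsigned long)r.second);
  return 0;
}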

In any case, we should definitely have the design conversation!

sage

> 
> Regards
> Ning Yao
> 
> 
> 2015-12-25 14:23 GMT+08:00 Dong Wu :
> > Thanks, from this pull request I learned that this issue is not
> > completed, is there any new progress of this issue?
> >
> > 2015-12-25 12:30 GMT+08:00 Xinze Chi (??) :
> >> Yeah, This is good idea for recovery, but not for backfill.
> >> @YaoNing have pull a request about this
> >> https://github.com/ceph/ceph/pull/3837 this year.
> >>
> >> 2015-12-25 11:16 GMT+08:00 Dong Wu :
> >>> Hi,
> >>> I have doubt about pglog, the pglog contains (op,object,version) etc.
> >>> when peering, use pglog to construct missing list,then recover the
> >>> whole object in missing list even if different data among replicas is
> >>> less then a whole object data(eg,4MB).
> >>> why not add (offset,len) to pglog? If so, the missing list can contain
> >>> (object, offset, len), then we can reduce recover data.
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-us...@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Xinze Chi
> > ___
> > ceph-users mailing list
> > ceph-us...@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-24 Thread Ning Yao
Hi, Dong Wu,

1. As I am currently working on other things, this proposal has been abandoned
for a long time.
2. This is a complicated task, as we need to consider a lot of cases (not just
writeOp, but also truncate, delete) and also need to
consider the different effects on different backends (Replicated, EC).
3. I don't think it is a good time to redo this patch now, since
BlueStore and Kstore are in progress, and I'm afraid of bringing in some
side effects.  We may prepare and propose the whole design at the next CDS.
4. Currently, we already have some tricks to deal with recovery (like
throttling the max recovery ops, setting the priority for recovery and so
on). So this kind of patch may not solve the critical problem but just
make things better, and I am not quite sure that this will really
bring a big improvement. Based on my previous tests, it works
excellently on slow disks (say hdd), and also for short-time
maintenance. Otherwise, it will trigger the backfill process.  So wait
for Sage's opinion @sage

If you are interested in this, we may cooperate to do it.

Regards
Ning Yao


2015-12-25 14:23 GMT+08:00 Dong Wu :
> Thanks, from this pull request I learned that this issue is not
> completed, is there any new progress of this issue?
>
> 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) :
>> Yeah, This is good idea for recovery, but not for backfill.
>> @YaoNing have pull a request about this
>> https://github.com/ceph/ceph/pull/3837 this year.
>>
>> 2015-12-25 11:16 GMT+08:00 Dong Wu :
>>> Hi,
>>> I have doubt about pglog, the pglog contains (op,object,version) etc.
>>> when peering, use pglog to construct missing list,then recover the
>>> whole object in missing list even if different data among replicas is
>>> less then a whole object data(eg,4MB).
>>> why not add (offset,len) to pglog? If so, the missing list can contain
>>> (object, offset, len), then we can reduce recover data.
>>> ___
>>> ceph-users mailing list
>>> ceph-us...@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Regards,
>> Xinze Chi
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-24 Thread Dong Wu
Thanks. From this pull request I learned that this issue is not
completed; is there any new progress on it?

2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) :
> Yeah, This is good idea for recovery, but not for backfill.
> @YaoNing have pull a request about this
> https://github.com/ceph/ceph/pull/3837 this year.
>
> 2015-12-25 11:16 GMT+08:00 Dong Wu :
>> Hi,
>> I have doubt about pglog, the pglog contains (op,object,version) etc.
>> when peering, use pglog to construct missing list,then recover the
>> whole object in missing list even if different data among replicas is
>> less then a whole object data(eg,4MB).
>> why not add (offset,len) to pglog? If so, the missing list can contain
>> (object, offset, len), then we can reduce recover data.
>> ___
>> ceph-users mailing list
>> ceph-us...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Regards,
> Xinze Chi
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


why not add (offset,len) to pglog

2015-12-24 Thread Dong Wu
Hi,
I have a doubt about the pglog: the pglog contains (op, object, version), etc.
When peering, the pglog is used to construct the missing list, and then the
whole object in the missing list is recovered, even if the data that differs
among replicas is less than a whole object (e.g. 4MB).
Why not add (offset, len) to the pglog? If so, the missing list could contain
(object, offset, len), and then we could reduce the recovered data.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-24 Thread 信泽
Yeah, this is a good idea for recovery, but not for backfill.
@YaoNing opened a pull request about this
(https://github.com/ceph/ceph/pull/3837) this year.

2015-12-25 11:16 GMT+08:00 Dong Wu :
> Hi,
> I have doubt about pglog, the pglog contains (op,object,version) etc.
> when peering, use pglog to construct missing list,then recover the
> whole object in missing list even if different data among replicas is
> less then a whole object data(eg,4MB).
> why not add (offset,len) to pglog? If so, the missing list can contain
> (object, offset, len), then we can reduce recover data.
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Regards,
Xinze Chi
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: fixing jenkins builds on pull requests

2015-12-23 Thread Loic Dachary
Hi,

I triaged the jenkins related failures (from #24 to #49):

CentOS 6 not supported:

  https://jenkins.ceph.com/job/ceph-pull-requests/26/console
  https://jenkins.ceph.com/job/ceph-pull-requests/28/console
  https://jenkins.ceph.com/job/ceph-pull-requests/29/console
  https://jenkins.ceph.com/job/ceph-pull-requests/34/console
  https://jenkins.ceph.com/job/ceph-pull-requests/38/console
  https://jenkins.ceph.com/job/ceph-pull-requests/44/console
  https://jenkins.ceph.com/job/ceph-pull-requests/46/console
  https://jenkins.ceph.com/job/ceph-pull-requests/48/console
  https://jenkins.ceph.com/job/ceph-pull-requests/49/console

Ubuntu 12.04 not supported:

  https://jenkins.ceph.com/job/ceph-pull-requests/27/console
  https://jenkins.ceph.com/job/ceph-pull-requests/36/console

Failure to fetch from github

  https://jenkins.ceph.com/job/ceph-pull-requests/35/console

I've not been able to analyze more failures because it looks like only 30 jobs 
are kept. Here is an updated summary:

 * running on unsupported operating systems (CentOS 6, precise and maybe others)
 * leftovers from a previous test (which should be removed when a new slave is 
provisioned for each test)
 * keep the last 300 jobs for forensic analysis (about one week worth)
 * disable reporting to github pull requests until the above are resolved (all 
failures were false negative).

Cheers

On 23/12/2015 10:11, Loic Dachary wrote:
> Hi Alfredo,
> 
> I forgot to mention that the ./run-make-check.sh run currently has no known 
> false negative on CentOS 7. By that I mean that if run on master 100 times, 
> it will succeed 100 times. This is good to debug the jenkins builds on pull 
> requests as we know all problems either come from the infrastructure or the 
> pull request. We do not have to worry about random errors due to race 
> conditions in the tests or things like that.
> 
> I'll keep an eye on the test results and analyse each failure. For now it 
> would be best to disable reporting failures as they are almost entirely false 
> negative and will confuse the contributor. The failures come from:
> 
>  * running on unsupported operating systems (CentOS 6 and maybe others)
>  * leftovers from a previous test (which should be removed when a new slave 
> is provisionned for each test)
> 
> I'll add to this thread when / if I find more.
> 
> Cheers
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


use object size of 32k rather than 4M

2015-12-23 Thread hzwulibin
Hi, cephers, Sage and Haomai

Recently we got stuck on a performance drop problem during recovery. The scenario 
is simple:
1. run fio with random write (bs=4k)
2. stop one osd; sleep 10; start the osd
3. the IOPS drops from 6K to about 200

We now know the SSD that the osd is on is the bottleneck during recovery. After 
reading the code, we find the IO on that 
SSD comes from two sources:
1. normal recovery IO
2. user IO to objects in the missing list, which needs to recover the 4M object first.

So our first step was to limit the recovery IO to reduce the stress on that SSD. 
That helps in some scenarios, but not this one.


We have 36 OSDs with 3 replicas, so when one osd goes down, about 1/12 of the objects will 
be in a degraded state.
When we run fio with 4k randwrite, about 1/12 of the IOs will get stuck and need to 
recover the 4M object first.
That really amplifies the stress on that SSD.

In order to reduce this amplification, we want to change the default size of 
the object from 4M to 32k.
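For per-image testing, the object size is 2^order bytes, so 32k corresponds to order 15 (the default 4M is order 22); below is a minimal sketch using the librbd C API, where the pool and image names are placeholders. If I remember correctly there is also an "rbd default order" client-side option for changing the default, but please double-check that.

// Sketch: create an RBD image whose objects are 32 KiB instead of 4 MiB.
// This only affects the new image, not existing ones.
#include <rados/librados.h>
#include <rbd/librbd.h>
#include <cstdint>
#include <cstdio>

int main() {
  rados_t cluster;
  if (rados_create(&cluster, NULL) < 0) return 1;
  rados_conf_read_file(cluster, NULL);
  if (rados_connect(cluster) < 0) return 1;

  rados_ioctx_t io;
  if (rados_ioctx_create(cluster, "rbd", &io) < 0) return 1;   // assumed pool name

  int order = 15;                                    // 2^15 = 32 KiB objects
  uint64_t size = 10ULL << 30;                       // 10 GiB image
  int r = rbd_create(io, "test-32k", size, &order);  // hypothetical image name
  printf("rbd_create returned %d, order %d\n", r, order);

  rados_ioctx_destroy(io);
  rados_shutdown(cluster);
  return 0;
}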

We know that will increase the number of objects on one OSD and make the removal 
process longer.

Here I want to ask you guys: are there any other potential problems a 
32k size would have? If there is no obvious problem, we could dive into
it and do more testing on it.

Many thanks!

--
hzwulibin
2015-12-23


Re: [ceph-users] use object size of 32k rather than 4M

2015-12-23 Thread hzwulibin
Hi, Robert

Thanks for your quick reply. Yeah, the number of files really will be a 
potential problem. But if it is just a memory problem, we could use more memory in 
our OSD
servers.

Also, I tested it on XFS using mdtest; here is the result:


$ sudo ~/wulb/bin/mdtest -I 1 -z 1 -b 1024 -R -F
--
[[10342,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: 10-180-0-34

Another transport will be used instead, although this may result in
lower performance.
--
-- started at 12/23/2015 18:59:16 --

mdtest-1.8.3 was launched with 1 total task(s) on 1 nodes
Command line used: /home/ceph/wulb/bin/mdtest -I 1 -z 1 -b 1024 -R -F
Path: /home/ceph
FS: 824.5 GiB   Used FS: 4.8%   Inodes: 52.4 Mi   Used Inodes: 0.6%
random seed: 1450868356

1 tasks, 1025 files

SUMMARY: (of 1 iterations)
   Operation          Max          Min          Mean        Std Dev
   ---------          ---          ---          ----        -------
   File creation :     44660.505    44660.505    44660.505   0.000
   File stat     :    693747.783   693747.783   693747.783   0.000
   File read     :    365319.444   365319.444   365319.444   0.000
   File removal  :     62064.560    62064.560    62064.560   0.000
   Tree creation :     69680.729    69680.729    69680.729   0.000
   Tree removal  :       352.905      352.905      352.905   0.000


From what I tested, the speed of File stat and File read does not slow down 
much.  So, could I say that the speed of ops like looking up a file will not 
decrease much, and only the number of files increases?


--   
hzwulibin
2015-12-23

-
发件人:"Van Leeuwen, Robert" 
发送日期:2015-12-23 20:57
收件人:hzwulibin,ceph-devel,ceph-users
抄送:
主题:Re: [ceph-users] use object size of 32k rather than 4M


>In order to reduce the enlarge impact, we want to change the default size of 
>the object from 4M to 32k.
>
>We know that will increase the number of the objects of one OSD and make 
>remove process become longer.
>
>Hmm, here i want to ask your guys is there any other potential problems will 
>32k size have? If no obvious problem, will could dive into
>it and do more test on it.


I assume the objects on the OSD's filesystem will become 32k when you do this.
So if you have 1TB of data on one OSD you will have 31 million files == 31 
million inodes.
This is excluding the directory structure, which also might be significant.
If you have 10 OSDs on a server you will easily hit 310 million inodes.
You will need a LOT of memory to make sure the inodes are cached, but even then 
looking up the inode might add significant latency.

My guess is it will be fast in the beginning but it will grind to a halt when 
the cluster gets fuller due to inodes no longer being in memory.

Also this does not take into account any other bottlenecks you might hit in ceph, which 
other users can probably answer better.


Cheers,
Robert van Leeuwen



Re: Time to move the make check bot to jenkins.ceph.com

2015-12-23 Thread Ken Dreyer
This is really great. Thanks Loic and Alfredo!

- Ken

On Tue, Dec 22, 2015 at 11:23 AM, Loic Dachary  wrote:
> Hi,
>
> The make check bot moved to jenkins.ceph.com today and ran its first 
> successful job. You will no longer see comments from the bot: it will update 
> the github status instead, which is less intrusive.
>
> Cheers
>
> On 21/12/2015 11:13, Loic Dachary wrote:
>> Hi,
>>
>> The make check bot is broken in a way that I can't figure out right now. 
>> Maybe now is the time to move it to jenkins.ceph.com ? It should not be more 
>> difficult than launching the run-make-check.sh script. It does not need 
>> network or root access.
>>
>> Cheers
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] use object size of 32k rather than 4M

2015-12-23 Thread Van Leeuwen, Robert

>In order to reduce the enlarge impact, we want to change the default size of 
>the object from 4M to 32k.
>
>We know that will increase the number of the objects of one OSD and make 
>remove process become longer.
>
>Hmm, here i want to ask your guys is there any other potential problems will 
>32k size have? If no obvious problem, will could dive into
>it and do more test on it.


I assume the objects on the OSD's filesystem will become 32k when you do this.
So if you have 1TB of data on one OSD you will have 31 million files == 31 
million inodes.
This is excluding the directory structure, which also might be significant.
If you have 10 OSDs on a server you will easily hit 310 million inodes.
You will need a LOT of memory to make sure the inodes are cached, but even then 
looking up the inode might add significant latency.

My guess is it will be fast in the beginning but it will grind to a halt when 
the cluster gets fuller due to inodes no longer being in memory.

Also this does not take into account any other bottlenecks you might hit in ceph, which 
other users can probably answer better.


Cheers,
Robert van Leeuwen


Re: [ceph-users] use object size of 32k rather than 4M

2015-12-23 Thread Van Leeuwen, Robert
>Thanks for your quick reply. Yeah, the number of file really will be the 
>potential problem. But if just the memory problem, we could use more memory in 
>our OSD
>servers.

Adding more mem might not be a viable solution:
Ceph does not say how much data is stored in an inode, but the docs say the 
xattr of ext4 is not big enough.
Assuming xfs will use 512 bytes is probably very optimistic.
So for e.g. 300 million inodes you are talking about at least 150GB.

>
>Also, i tested it on XFS use mdtest, here is the result:
>
>
>FS: 824.5 GiB   Used FS: 4.8%   Inodes: 52.4 Mi   Used Inodes: 0.6%

52 million files without extended attributes is probably not a real life 
scenario for a filled up ceph node with multiple OSDs.

Cheers,
Robert van Leeuwen


Re: Let's Not Destroy the World in 2038

2015-12-23 Thread Adam C. Emerson
On 22/12/2015, Gregory Farnum wrote:
[snip]
> So I think we're stuck with creating a new utime_t and incrementing
> the struct_v on everything that contains them. :/
[snip]
> We'll also then need the full feature bit system to make
> sure we send the old encoding to clients which don't understand the
> new one, and to prevent a mid-upgrade cluster from writing data on a
> new node that gets moved to a new node which doesn't understand it.

That is my understanding. I have the impression that network communication
gets feature bits for the other nodes and on-disk structures are explicitly
versioned. If I'm mistaken, please hurl corrections at me.

> Given that utime_t occurs in a lot of places, and really can't change
> *again* after this, we probably shouldn't set up the new version with
> versioned encoding?

You're overly pessimistic. I'm hoping our post-human descendents store
their unfathomably alien, reconstructed minds in some galaxy spanning
descendent of Ceph and need more than a 64-bit second count.

However, I agree that the time value itself should not have an encoded
version tag.

To my intuition, the best way forward would be to:

(1) Add non-defaulted feature parameters on encode/decode of utime_t and
ceph::real_time. This will break everything that uses them.

(2) Add explicit encode_old/encode_new functions. That way, when we KNOW which
one we want at compile time, we don't have to pay for a runtime check.

(3) When we have feature bits, pass them in.

(4) When we have a version, bump it. For new versions, explicitly call
encode_new. When we know we want old, call old.

(5) If there are classes that we encode that have neither feature bits nor
versioning available, see what uses them and act accordingly. Hopefully the
special cases will be few.

Does that seem reasonable?

I thank you.

And all hypothetical post-human Ceph users thank you.

-- 
Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9


signature.asc
Description: PGP signature


rgw: sticky user quota data on bucket removal

2015-12-23 Thread Paul Von-Stamwitz
Hi,

We're testing user quotas on Hammer with civetweb and we're running into an 
issue with user stats.

If the user/admin removes a bucket using -force/-purge-objects options with 
s3cmd/radosgw-admin respectively, the user stats will continue to reflect the 
deleted objects for quota purposes, and there seems to be no way to reset them. 
It appears that user stats need to be sync'ed prior to bucket removal. Setting 
" rgw user quota bucket sync interval = 0" appears to solve the problem.

What is the downside to setting the interval to 0?

I think the right solution is to have an implied sync-stats during bucket 
removal. Other suggestions?

All the best,
Paul
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rgw: sticky user quota data on bucket removal

2015-12-23 Thread Yehuda Sadeh-Weinraub
On Wed, Dec 23, 2015 at 3:53 PM, Paul Von-Stamwitz
 wrote:
> Hi,
>
> We're testing user quotas on Hammer with civetweb and we're running into an 
> issue with user stats.
>
> If the user/admin removes a bucket using -force/-purge-objects options with 
> s3cmd/radosgw-admin respectively, the user stats will continue to reflect the 
> deleted objects for quota purposes, and there seems to be no way to reset 
> them. It appears that user stats need to be sync'ed prior to bucket removal. 
> Setting " rgw user quota bucket sync interval = 0" appears to solve the 
> problem.
>
> What is the downside to setting the interval to 0?

We'll update the buckets that are getting modified continuously,
instead of once every interval.

>
> I think the right solution is to have an implied sync-stats during bucket 
> removal. Other suggestions?
>

No, syncing the bucket stats on removal sounds right.

Yehuda

> All the best,
> Paul
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: rgw: sticky user quota data on bucket removal

2015-12-23 Thread Paul Von-Stamwitz
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Yehuda Sadeh-Weinraub
> Sent: Wednesday, December 23, 2015 5:02 PM
> To: Paul Von-Stamwitz
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: rgw: sticky user quota data on bucket removal
> 
> On Wed, Dec 23, 2015 at 3:53 PM, Paul Von-Stamwitz
>  wrote:
> > Hi,
> >
> > We're testing user quotas on Hammer with civetweb and we're running
> > into an issue with user stats.
> >
> > If the user/admin removes a bucket using -force/-purge-objects options
> > with s3cmd/radosgw-admin respectively, the user stats will continue to
> > reflect the deleted objects for quota purposes, and there seems to be no
> > way to reset them. It appears that user stats need to be sync'ed prior to
> > bucket removal. Setting " rgw user quota bucket sync interval = 0" appears 
> > to
> > solve the problem.
> >
> > What is the downside to setting the interval to 0?
> 
> We'll update the buckets that are getting modified continuously, instead of
> once every interval.
>

So, I presume that this will impact performance on puts and deletes. We'll take 
a look at the impact on this.

> >
> > I think the right solution is to have an implied sync-stats during bucket
> > removal. Other suggestions?
> >
> 
> No, syncing the bucket stats on removal sounds right.
> 

Great. This would alleviate any performance impact on continuous updates.
Thanks!

> Yehuda
> 
> > All the best,
> > Paul
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majord...@vger.kernel.org More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


Re: New "make check" job for Ceph pull requests

2015-12-23 Thread Loic Dachary
Hi,

For the record the pending issues that prevent the "make check" job 
(https://jenkins.ceph.com/job/ceph-pull-requests/) from running can be found at 
http://tracker.ceph.com/issues/14172

Cheers

On 23/12/2015 21:05, Alfredo Deza wrote:
> Hi all,
> 
> As of yesterday (Tuesday Dec 22nd) we have the "make check" job
> running within our CI infrastructure, working very similarly as the
> previous check with a few differences:
> 
> * there are no longer comments added to the pull requests
> * notifications of success (or failure) are done inline in the same
> notification box for "This branch has no conflicts with the base
> branch"
> * All members of the Ceph organization can trigger a job with the
> following comment:
> test this please
> 
> Changes to the job should be done following our new process: anyone can open
> a pull request against the "ceph-pull-requests" job that configures/modifies
> it. This process is fairly minimal:
> 
> 1) *Jobs no longer require to make changes in the Jenkins UI*, they
> are rather plain text YAML files that live in the ceph/ceph-build.git
> repository and have a specific structure. Job changes (including
> scripts) are made directly on that repository via pull requests.
> 
> 2) As soon as a PR is merged the changes are automatically pushed to
> Jenkins. Regardless if this is a new or old job. All one needs for a
> new job to appear is a directory with a working YAML file (see links
> at the end on what this means)
> 
> Below, please find a list to resources on how to make changes to a
> Jenkins Job, and examples on how mostly anyone can provide changes:
> 
> * Format and configuration of YAML files are consumed by JJB (Jenkins
> Job builder), full docs are here:
> http://docs.openstack.org/infra/jenkins-job-builder/definition.html
> * Where does the make-check configuration lives?
> https://github.com/ceph/ceph-build/tree/master/ceph-pull-requests
> * Full documentation on Job structure and configuration:
> https://github.com/ceph/ceph-build#ceph-build
> * Everyone has READ permissions on jenkins.ceph.com (you can 'login'
> with your github account), current admin members (WRITE permissions)
> are: ktdreyer, alfredodeza, gregmeno, dmick, zmc, andrewschoen,
> ceph-jenkins, dachary, ldachary
> 
> If you have any questions, we can help and provide guidance and feedback. We
> highly encourage contributors to take ownership on this new tool and make it
> awesome!
> 
> Thanks,
> 
> 
> Alfredo
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


New "make check" job for Ceph pull requests

2015-12-23 Thread Alfredo Deza
Hi all,

As of yesterday (Tuesday Dec 22nd) we have the "make check" job
running within our CI infrastructure, working very similarly as the
previous check with a few differences:

* there are no longer comments added to the pull requests
* notifications of success (or failure) are done inline in the same
notification box for "This branch has no conflicts with the base
branch"
* All members of the Ceph organization can trigger a job with the
following comment:
test this please

Changes to the job should be done following our new process: anyone can open
a pull request against the "ceph-pull-requests" job that configures/modifies
it. This process is fairly minimal:

1) *Jobs no longer require to make changes in the Jenkins UI*, they
are rather plain text YAML files that live in the ceph/ceph-build.git
repository and have a specific structure. Job changes (including
scripts) are made directly on that repository via pull requests.

2) As soon as a PR is merged the changes are automatically pushed to
Jenkins. Regardless if this is a new or old job. All one needs for a
new job to appear is a directory with a working YAML file (see links
at the end on what this means)

Below, please find a list to resources on how to make changes to a
Jenkins Job, and examples on how mostly anyone can provide changes:

* Format and configuration of YAML files are consumed by JJB (Jenkins
Job builder), full docs are here:
http://docs.openstack.org/infra/jenkins-job-builder/definition.html
* Where does the make-check configuration lives?
https://github.com/ceph/ceph-build/tree/master/ceph-pull-requests
* Full documentation on Job structure and configuration:
https://github.com/ceph/ceph-build#ceph-build
* Everyone has READ permissions on jenkins.ceph.com (you can 'login'
with your github account), current admin members (WRITE permissions)
are: ktdreyer, alfredodeza, gregmeno, dmick, zmc, andrewschoen,
ceph-jenkins, dachary, ldachary

If you have any questions, we can help and provide guidance and feedback. We
highly encourage contributors to take ownership on this new tool and make it
awesome!

Thanks,


Alfredo
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


jenkins on ceph pull requests: clarify which Operating System is used

2015-12-23 Thread Loic Dachary
Hi Alfredo,

I see a make check slave currently runs on jessie and I seem to remember it 
ran on trusty slaves before. It's a good thing operating systems are mixed, but 
there does not seem to be a clear indication of which operating system is 
used. For instance, regarding:

https://jenkins.ceph.com/job/ceph-pull-requests/44/

one has to click on the console and know that it shows in the first few lines 
as:

Building remotely on centos6+158.69.78.199 (x86_64 huge centos6 amd64) in 
workspace 

Side note: as CentOS 6 is no longer a supported platform, trying to build on it 
will fail.

Another problem is that choosing an operating system randomly may lead to 
different test results and the inability for the author of the pull request to 
reproduce the bug, because the operating system on which it happens is not 
selected.

Unless there is a known strategy with jenkins to deal with that kind of problem, 
it probably is best to stick to a single Operating System, and CentOS 7 would be 
my choice.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


fixing jenkins builds on pull requests

2015-12-23 Thread Loic Dachary
Hi Alfredo,

I forgot to mention that the ./run-make-check.sh run currently has no known 
false negative on CentOS 7. By that I mean that if run on master 100 times, it 
will succeed 100 times. This is good to debug the jenkins builds on pull 
requests as we know all problems either come from the infrastructure or the 
pull request. We do not have to worry about random errors due to race 
conditions in the tests or things like that.

I'll keep an eye on the test results and analyse each failure. For now it would 
be best to disable reporting failures as they are almost entirely false 
negative and will confuse the contributor. The failures come from:

 * running on unsupported operating systems (CentOS 6 and maybe others)
 * leftovers from a previous test (which should be removed when a new slave is 
provisioned for each test)

I'll add to this thread when / if I find more.

Cheers
-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


Let's Not Destroy the World in 2038

2015-12-22 Thread Adam C. Emerson
Comrades,

Ceph's victory is assured. It will be the storage system of The Future.
Matt Benjamin has reminded me that if we don't act fast¹ Ceph will be
responsible for destroying the world.

utime_t() uses a 32-bit second count internally. This isn't great, but it's
something we can fix. ceph::real_time currently uses a 64-bit bit count of
nanoseconds, which is better. And we can change it to something else without
having to rewrite much other code.

The problem lies in our encode/decode functions for time (both utime_t
and ceph::real_time, since I didn't want to break compatibility). We
use a 32-bit second count. I would like to change the wire and disk
representation to a 64-bit second count and a 32-bit nanosecond count.

Would there be resistance to a project to do this? I don't know if a
FEATURE bit would help. A FEATURE bit to toggle the width of the second
count would be ideal if it would work. Otherwise it looks like the best
way to do this would be to find all the structures currently ::encoded
that hold time values, bump the version number and have an 'old_utime'
that we use for everything pre-change.
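To make the proposed wire format concrete, here is a standalone sketch (deliberately not using Ceph's ::encode() machinery or feature-bit handling) of packing a time value as a 64-bit second count plus a 32-bit nanosecond count, little-endian:

// Sketch of the proposed representation: 64-bit seconds + 32-bit nanoseconds.
#include <cstdint>
#include <cstdio>
#include <vector>

static void put_le(std::vector<unsigned char> &buf, uint64_t v, int bytes) {
  for (int i = 0; i < bytes; ++i)
    buf.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xff));
}

static uint64_t get_le(const unsigned char *p, int bytes) {
  uint64_t v = 0;
  for (int i = 0; i < bytes; ++i)
    v |= static_cast<uint64_t>(p[i]) << (8 * i);
  return v;
}

struct WireTime {
  uint64_t sec;    // 64-bit second count: good well past 2038
  uint32_t nsec;

  void encode(std::vector<unsigned char> &buf) const {
    put_le(buf, sec, 8);
    put_le(buf, nsec, 4);
  }
  static WireTime decode(const unsigned char *p) {
    return WireTime{get_le(p, 8), static_cast<uint32_t>(get_le(p + 8, 4))};
  }
};

int main() {
  WireTime t{2200000000ULL, 500000000u};   // a timestamp past the 32-bit rollover in January 2038
  std::vector<unsigned char> buf;
  t.encode(buf);                           // 12 bytes instead of the old 8
  WireTime back = WireTime::decode(buf.data());
  printf("sec=%llu nsec=%u (%zu bytes)\n",
         (unsigned long long)back.sec, back.nsec, buf.size());
  return 0;
}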

Thank you!

¹ Within the next twenty-three years. But that's not really a long time in the
  larger scheme of things.

-- 
Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9


signature.asc
Description: PGP signature


Re: RBD performance with many childs and snapshots

2015-12-22 Thread Josh Durgin

On 12/22/2015 01:55 PM, Wido den Hollander wrote:

On 12/21/2015 11:51 PM, Josh Durgin wrote:

On 12/21/2015 11:06 AM, Wido den Hollander wrote:

Hi,

While implementing the buildvolfrom method in libvirt for RBD I'm stuck
at some point.

$ virsh vol-clone --pool myrbdpool image1 image2

This would clone image1 to a new RBD image called 'image2'.

The code I've written now does:

1. Create a snapshot called image1@libvirt-
2. Protect the snapshot
3. Clone the snapshot to 'image2'

wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
rbdpool image1 image2
Vol image2 cloned from image1

wido@wido-desktop:~/repos/libvirt$

root@alpha:~# rbd -p libvirt info image2
rbd image 'image2':
 size 10240 MB in 2560 objects
 order 22 (4096 kB objects)
 block_name_prefix: rbd_data.1976451ead36b
 format: 2
 features: layering, striping
 flags:
 parent: libvirt/image1@libvirt-1450724650
 overlap: 10240 MB
 stripe unit: 4096 kB
 stripe count: 1
root@alpha:~#

But this could potentially lead to a lot of snapshots with children on
'image1'.

image1 itself will probably never change, but I'm wondering about the
negative performance impact this might have on an OSD.


Creating them isn't so bad, more snapshots that don't change don't have
much effect on the osds. Deleting them is what's expensive, since the
osds need to scan the objects to see which ones are part of the
snapshot and can be deleted. If you have too many snapshots created and
deleted, it can affect cluster load, so I'd rather avoid always
creating a snapshot.


I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
into libvirt. There is however no way to pass something like a snapshot
name in libvirt when cloning.

Any bright suggestions? Or is it fine to create so many snapshots?


You could have canonical names for the libvirt snapshots like you
suggest, 'libvirt-', and check via rbd_diff_iterate2()
whether the parent image changed since the last snapshot. That's a bit
slower than plain cloning, but with object map + fast diff it's fast
again, since it doesn't need to scan all the objects anymore.
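
A rough sketch of that check with the librbd C API (error handling omitted;
this is a sketch only, not the eventual libvirt code, and the snapshot name
is whatever naming scheme libvirt ends up using):

#include <rbd/librbd.h>
#include <cstdint>

// rbd_diff_iterate2 callback: any extent reported since 'fromsnapname'
// means the image has changed.
static int diff_cb(uint64_t ofs, size_t len, int exists, void *arg) {
  (void)ofs; (void)len; (void)exists;
  *static_cast<bool *>(arg) = true;
  return 0;
}

// Returns true if 'image' was written since snapshot 'snap_name'.
// Sketch only: every return value should be checked in real code.
bool changed_since_snapshot(rbd_image_t image, const char *snap_name) {
  uint64_t size = 0;
  rbd_get_size(image, &size);
  bool changed = false;
  // whole_object=1 lets object map / fast diff answer per object instead of
  // scanning data; include_parent=0 ignores data inherited from a parent.
  rbd_diff_iterate2(image, snap_name, 0, size, 0, 1, diff_cb, &changed);
  return changed;
}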

I think libvirt would need to expand its api a bit to be able to really
use it effectively to manage rbd. Hiding the snapshots becomes
cumbersome if the application wants to use them too. If libvirt's
current model of clones lets parents be deleted before children,
that may be a hassle to hide too...



I gave it a shot. Callback functions are a bit new to me, but I gave it
a try:
https://github.com/wido/libvirt/commit/756dca8023027616f53c39fa73c52a6d8f86a223

Could you take a look?


Left some comments on the commits. Looks good in general.

Josh

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD performance with many childs and snapshots

2015-12-22 Thread Josh Durgin

On 12/22/2015 05:34 AM, Wido den Hollander wrote:



On 21-12-15 23:51, Josh Durgin wrote:

On 12/21/2015 11:06 AM, Wido den Hollander wrote:

Hi,

While implementing the buildvolfrom method in libvirt for RBD I'm stuck
at some point.

$ virsh vol-clone --pool myrbdpool image1 image2

This would clone image1 to a new RBD image called 'image2'.

The code I've written now does:

1. Create a snapshot called image1@libvirt-
2. Protect the snapshot
3. Clone the snapshot to 'image1'

wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
rbdpool image1 image2
Vol image2 cloned from image1

wido@wido-desktop:~/repos/libvirt$

root@alpha:~# rbd -p libvirt info image2
rbd image 'image2':
 size 10240 MB in 2560 objects
 order 22 (4096 kB objects)
 block_name_prefix: rbd_data.1976451ead36b
 format: 2
 features: layering, striping
 flags:
 parent: libvirt/image1@libvirt-1450724650
 overlap: 10240 MB
 stripe unit: 4096 kB
 stripe count: 1
root@alpha:~#

But this could potentially lead to a lot of snapshots with children on
'image1'.

image1 itself will probably never change, but I'm wondering about the
negative performance impact this might have on a OSD.


Creating them isn't so bad; more snapshots that don't change don't have
much effect on the osds. Deleting them is what's expensive, since the
osds need to scan the objects to see which ones are part of the
snapshot and can be deleted. If you have too many snapshots created and
deleted, it can affect cluster load, so I'd rather avoid always
creating a snapshot.


I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
into libvirt. There is however no way to pass something like a snapshot
name in libvirt when cloning.

Any bright suggestions? Or is it fine to create so many snapshots?


You could have canonical names for the libvirt snapshots like you
suggest, 'libvirt-', and check via rbd_diff_iterate2()
whether the parent image changed since the last snapshot. That's a bit
slower than plain cloning, but with object map + fast diff it's fast
again, since it doesn't need to scan all the objects anymore.



I'll give that a try, seems like a good suggestion!

I'll have to use rbd_diff_iterate() though, since iterate2() is
post-hammer and that will not be available on all systems.


I think libvirt would need to expand its api a bit to be able to really
use it effectively to manage rbd. Hiding the snapshots becomes
cumbersome if the application wants to use them too. If libvirt's
current model of clones lets parents be deleted before children,
that may be a hassle to hide too...



Yes, I would love to see:

- vol-snap-list
- vol-snap-create
- vol-snap-delete
- vol-snap-revert

And then:

- vol-clone --snapshot  --pool  image1 image2

But this would need some more work inside libvirt. Would be very nice
though.


Yeah, those would be nice.


At CloudStack we want to do as much as possible using libvirt, the more
features it has there, the less we have to do in Java code :)


Dan Berrange has talked about using libvirt storage pools for managing
rbd and other storage from openstack nova too, for the same reason. I'm
not sure if there are any current plans for that, but you may want to
ask him about it on the libvirt list.

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to move the make check bot to jenkins.ceph.com

2015-12-22 Thread Loic Dachary
Hi,

The make check bot moved to jenkins.ceph.com today and ran its first 
successful job. You will no longer see comments from the bot: it will update 
the github status instead, which is less intrusive.

Cheers

On 21/12/2015 11:13, Loic Dachary wrote:
> Hi,
> 
> The make check bot is broken in a way that I can't figure out right now. 
> Maybe now is the time to move it to jenkins.ceph.com ? It should not be more 
> difficult than launching the run-make-check.sh script. It does not need 
> network or root access.
> 
> Cheers
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: RFC: tool for applying 'ceph daemon ' command to all OSDs

2015-12-22 Thread Dan Mick
On 12/21/2015 11:29 PM, Gregory Farnum wrote:
> On Mon, Dec 21, 2015 at 9:59 PM, Dan Mick  wrote:
>> I needed something to fetch current config values from all OSDs (sorta
>> the opposite of 'injectargs --key value), so I hacked it, and then
>> spiffed it up a bit.  Does this seem like something that would be useful
>> in this form in the upstream Ceph, or does anyone have any thoughts on
>> its design or structure?
>>
>> It requires a locally-installed ceph CLI and a ceph.conf that points to
>> the cluster and any required keyrings.  You can also provide it with
>> a YAML file mapping host to osds if you want to save time collecting
>> that info for a statically-defined cluster, or if you want just a subset
>> of OSDs.
>>
>> https://github.com/dmick/tools/blob/master/osd_daemon_cmd.py
>>
>> Excerpt from usage:
>>
>> Execute a Ceph osd daemon command on every OSD in a cluster with
>> one connection to each OSD host.
>>
>> Usage:
>> osd_daemon_cmd [-c CONF] [-u USER] [-f FILE] (COMMAND | -k KEY)
>>
>> Options:
>>-c CONF   ceph.conf file to use [default: ./ceph.conf]
>>-u USER   user to connect with ssh
>>-f FILE   get names and osds from yaml
>>COMMAND   command other than "config get" to execute
>>-k KEYconfig key to retrieve with config get 
> 
> I naively like the functionality being available, but if I'm skimming
> this correctly it looks like you're relying on the local node being
> able to passwordless-ssh to all of the nodes, and for that account to
> be able to access the ceph admin sockets. Granted we rely on the ssh
> for ceph-deploy as well, so maybe that's okay, but I'm not sure in
> this case since it implies a lot more network openness.

Yep; it's basically the same model and role assumed as "cluster destroyer".

> Relatedly (perhaps in an opposing direction), maybe we want anything
> exposed over the network to have some sort of explicit permissions
> model?

Well, I've heard that idea floated about the admin socket for years, but
I don't think anyone's hot to add cephx to it :)

> Maybe not and we should just ship the script for trusted users. I
> would have liked it on the long-running cluster I'm sure you built it
> for. ;)

it's like you're clairvoyant.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is rbd_discard enough to wipe an RBD image?

2015-12-22 Thread Wido den Hollander
On 12/21/2015 11:20 PM, Josh Durgin wrote:
> On 12/21/2015 11:00 AM, Wido den Hollander wrote:
>> My discard code now works, but I wanted to verify. If I understand Jason
>> correctly it would be a matter of figuring out the 'order' of a image
>> and call rbd_discard in a loop until you reach the end of the image.
> 
> You'd need to get the order via rbd_stat(), convert it to object size
> (i.e. (1 << order)), and fetch stripe_count with rbd_get_stripe_count().
> 
> Then do the discards in (object size * stripe_count) chunks. This
> ensures you discard entire objects. This is the size you'd want to use
> for import/export as well, ideally.
> 

Thanks! I just implemented this, could you take a look?

https://github.com/wido/libvirt/commit/b07925ad50fdb6683b5b21deefceb0829a7842dc

>> I just want libvirt to be as feature complete as possible when it comes
>> to RBD.
> 
> I see, makes sense.
> 
> Josh
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD performance with many childs and snapshots

2015-12-22 Thread Wido den Hollander
On 12/21/2015 11:51 PM, Josh Durgin wrote:
> On 12/21/2015 11:06 AM, Wido den Hollander wrote:
>> Hi,
>>
>> While implementing the buildvolfrom method in libvirt for RBD I'm stuck
>> at some point.
>>
>> $ virsh vol-clone --pool myrbdpool image1 image2
>>
>> This would clone image1 to a new RBD image called 'image2'.
>>
>> The code I've written now does:
>>
>> 1. Create a snapshot called image1@libvirt-
>> 2. Protect the snapshot
>> 3. Clone the snapshot to 'image1'
>>
>> wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
>> rbdpool image1 image2
>> Vol image2 cloned from image1
>>
>> wido@wido-desktop:~/repos/libvirt$
>>
>> root@alpha:~# rbd -p libvirt info image2
>> rbd image 'image2':
>> size 10240 MB in 2560 objects
>> order 22 (4096 kB objects)
>> block_name_prefix: rbd_data.1976451ead36b
>> format: 2
>> features: layering, striping
>> flags:
>> parent: libvirt/image1@libvirt-1450724650
>> overlap: 10240 MB
>> stripe unit: 4096 kB
>> stripe count: 1
>> root@alpha:~#
>>
>> But this could potentially lead to a lot of snapshots with children on
>> 'image1'.
>>
>> image1 itself will probably never change, but I'm wondering about the
>> negative performance impact this might have on a OSD.
> 
> Creating them isn't so bad, more snapshots that don't change don't have
> much affect on the osds. Deleting them is what's expensive, since the
> osds need to scan the objects to see which ones are part of the
> snapshot and can be deleted. If you have too many snapshots created and
> deleted, it can affect cluster load, so I'd rather avoid always
> creating a snapshot.
> 
>> I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
>> into libvirt. There is however no way to pass something like a snapshot
>> name in libvirt when cloning.
>>
>> Any bright suggestions? Or is it fine to create so many snapshots?
> 
> You could have canonical names for the libvirt snapshots like you
> suggest, 'libvirt-', and check via rbd_diff_iterate2()
> whether the parent image changed since the last snapshot. That's a bit
> slower than plain cloning, but with object map + fast diff it's fast
> again, since it doesn't need to scan all the objects anymore.
> 
> I think libvirt would need to expand its api a bit to be able to really
> use it effectively to manage rbd. Hiding the snapshots becomes
> cumbersome if the application wants to use them too. If libvirt's
> current model of clones lets parents be deleted before children,
> that may be a hassle to hide too...
> 

I gave it a shot. Callback functions are a bit new to me, but I gave it
a try:
https://github.com/wido/libvirt/commit/756dca8023027616f53c39fa73c52a6d8f86a223

Could you take a look?

> Josh
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Let's Not Destroy the World in 2038

2015-12-22 Thread Gregory Farnum
On Tue, Dec 22, 2015 at 12:10 PM, Adam C. Emerson  wrote:
> Comrades,
>
> Ceph's victory is assured. It will be the storage system of The Future.
> Matt Benjamin has reminded me that if we don't act fast¹ Ceph will be
> responsible for destroying the world.
>
> utime_t() uses a 32-bit second count internally. This isn't great, but it's
> something we can fix. ceph::real_time currently uses a 64-bit bit count of
> nanoseconds, which is better. And we can change it to something else without
> having to rewrite much other code.
>
> The problem lies in our encode/deocde functions for time (both utime_t
> and ceph::real_time, since I didn't want to break compatibility.) we
> use a 32-bit second count. I would like to change the wire and disk
> representation to a 64-bit second count and a 32-bit nanosecond count.
>
> Would there be resistance to a project to do this? I don't know if a
> FEATURE bit would help. A FEATURE bit to toggle the width of the second
> count would be ideal if it would work. Otherwise it looks like the best
> way to do this would be to find all the structures currently ::encoded
> that hold time values, bump the version number and have an 'old_utime'
> that we use for everything pre-change.

Unfortunately, we include utimes in structures that are written to
disk. So I think we're stuck with creating a new utime_t and
incrementing the struct_v on everything that contains them. :/

Of course, we'll also then need the full feature bit system to make
sure we send the old encoding to clients which don't understand the
new one, and to prevent a mid-upgrade cluster from writing data on a
new node that gets moved to a new node which doesn't understand it.

Given that utime_t occurs in a lot of places, and really can't change
*again* after this, we probably shouldn't set up the new version with
versioned encoding?
-Greg

>
> Thank you!
>
> ¹ Within the next twenty-three years. But that's not really a long time in the
>   larger scheme of things.
>
> --
> Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
> IRC: Aemerson@{RedHat, OFTC, Freenode}
> 0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: tool for applying 'ceph daemon ' command to all OSDs

2015-12-22 Thread igor.podo...@ts.fujitsu.com
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Dan Mick
> Sent: Tuesday, December 22, 2015 7:00 AM
> To: ceph-devel
> Subject: RFC: tool for applying 'ceph daemon ' command to all OSDs
> 
> I needed something to fetch current config values from all OSDs (sorta the
> opposite of 'injectargs --key value), so I hacked it, and then spiffed it up 
> a bit.
> Does this seem like something that would be useful in this form in the
> upstream Ceph, or does anyone have any thoughts on its design or
> structure?
>

You could do it using socat too:

Node1 has osd.0

Node1:
cd /var/run/ceph
sudo socat TCP-LISTEN:60100,fork unix-connect:ceph-osd.0.asok

Node2:
cd /var/run/ceph
sudo  socat unix-listen:ceph-osd.0.asok,fork TCP:Node1:60100

Node2:
sudo ceph daemon osd.0 help | head
{
"config diff": "dump diff of current config and default config",
"config get": "config get : get the config value",

This is more for development/test setup.

Regards,
Igor.

> It requires a locally-installed ceph CLI and a ceph.conf that points to the
> cluster and any required keyrings.  You can also provide it with a YAML file
> mapping host to osds if you want to save time collecting that info for a
> statically-defined cluster, or if you want just a subset of OSDs.
> 
> https://github.com/dmick/tools/blob/master/osd_daemon_cmd.py
> 
> Excerpt from usage:
> 
> Execute a Ceph osd daemon command on every OSD in a cluster with one
> connection to each OSD host.
> 
> Usage:
> osd_daemon_cmd [-c CONF] [-u USER] [-f FILE] (COMMAND | -k KEY)
> 
> Options:
>-c CONF   ceph.conf file to use [default: ./ceph.conf]
>-u USER   user to connect with ssh
>-f FILE   get names and osds from yaml
>COMMAND   command other than "config get" to execute
>-k KEYconfig key to retrieve with config get 
> 
> --
> Dan Mick
> Red Hat, Inc.
> Ceph docs: http://ceph.com/docs
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


Re: RBD performance with many childs and snapshots

2015-12-22 Thread Wido den Hollander


On 21-12-15 23:51, Josh Durgin wrote:
> On 12/21/2015 11:06 AM, Wido den Hollander wrote:
>> Hi,
>>
>> While implementing the buildvolfrom method in libvirt for RBD I'm stuck
>> at some point.
>>
>> $ virsh vol-clone --pool myrbdpool image1 image2
>>
>> This would clone image1 to a new RBD image called 'image2'.
>>
>> The code I've written now does:
>>
>> 1. Create a snapshot called image1@libvirt-
>> 2. Protect the snapshot
>> 3. Clone the snapshot to 'image1'
>>
>> wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
>> rbdpool image1 image2
>> Vol image2 cloned from image1
>>
>> wido@wido-desktop:~/repos/libvirt$
>>
>> root@alpha:~# rbd -p libvirt info image2
>> rbd image 'image2':
>> size 10240 MB in 2560 objects
>> order 22 (4096 kB objects)
>> block_name_prefix: rbd_data.1976451ead36b
>> format: 2
>> features: layering, striping
>> flags:
>> parent: libvirt/image1@libvirt-1450724650
>> overlap: 10240 MB
>> stripe unit: 4096 kB
>> stripe count: 1
>> root@alpha:~#
>>
>> But this could potentially lead to a lot of snapshots with children on
>> 'image1'.
>>
>> image1 itself will probably never change, but I'm wondering about the
>> negative performance impact this might have on a OSD.
> 
> Creating them isn't so bad, more snapshots that don't change don't have
> much affect on the osds. Deleting them is what's expensive, since the
> osds need to scan the objects to see which ones are part of the
> snapshot and can be deleted. If you have too many snapshots created and
> deleted, it can affect cluster load, so I'd rather avoid always
> creating a snapshot.
> 
>> I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
>> into libvirt. There is however no way to pass something like a snapshot
>> name in libvirt when cloning.
>>
>> Any bright suggestions? Or is it fine to create so many snapshots?
> 
> You could have canonical names for the libvirt snapshots like you
> suggest, 'libvirt-', and check via rbd_diff_iterate2()
> whether the parent image changed since the last snapshot. That's a bit
> slower than plain cloning, but with object map + fast diff it's fast
> again, since it doesn't need to scan all the objects anymore.
> 

I'll give that a try, seems like a good suggestion!

I'll have to use rbd_diff_iterate() though, since iterate2() is
post-hammer and that will not be available on all systems.

> I think libvirt would need to expand its api a bit to be able to really
> use it effectively to manage rbd. Hiding the snapshots becomes
> cumbersome if the application wants to use them too. If libvirt's
> current model of clones lets parents be deleted before children,
> that may be a hassle to hide too...
> 

Yes, I would love to see:

- vol-snap-list
- vol-snap-create
- vol-snap-delete
- vol-snap-revert

And then:

- vol-clone --snapshot  --pool  image1 image2

But this would need some more work inside libvirt. Would be very nice
though.

At CloudStack we want to do as much as possible using libvirt, the more
features it has there, the less we have to do in Java code :)

Wido

> Josh
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Issue with Ceph File System and LIO

2015-12-22 Thread Eric Eastman
On Sun, Dec 20, 2015 at 7:38 PM, Eric Eastman
 wrote:
> On Fri, Dec 18, 2015 at 12:18 AM, Yan, Zheng  wrote:
>> On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman
>>  wrote:
 Hi Yan Zheng, Eric Eastman

 Similar bug was reported in f2fs, btrfs, it does affect 4.4-rc4, the fixing
 patch was merged into 4.4-rc5, dfd01f026058 ("sched/wait: Fix the signal
 handling fix").

 Related report & discussion was here:
 https://lkml.org/lkml/2015/12/12/149

 I'm not sure the current reported issue of ceph was related to that though,
 but at least try testing with an upgraded or patched kernel could verify 
 it.
 :)

 Thanks,
>
>>
>> please try rc5 kernel without patches and DEBUG_VM=y
>>
>> Regards
>> Yan, Zheng
>
>
> The latest test with 4.4rc5 with CONFIG_DEBUG_VM=y has ran for over 36
> hours with no ERRORS or WARNINGS.  My plan is to install the 4.4rc6
> kernel from the Ubuntu kernel-ppa site once it is available, and rerun
> the tests.
>

The test has run for 2 days using the 4.4rc6 kernel from the Ubuntu
kernel-ppa site without error or warning.  Looks like it was a
4.4rc4 bug.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FreeBSD Building and Testing

2015-12-21 Thread Willem Jan Withagen

On 21-12-2015 01:45, Xinze Chi (信泽) wrote:

Sorry for the delayed reply. Please give this a try:
https://github.com/ceph/ceph/commit/ae4a8162eacb606a7f65259c6ac236e144bfef0a.


Tried this one first:

Testsuite summary for ceph 10.0.1

# TOTAL: 120
# PASS:  100
# SKIP:  0
# XFAIL: 0
# FAIL:  20
# XPASS: 0
# ERROR: 0


So that certainly helps.
Have not yet analyzed the log files... But it seems we are getting 
somewhere.

Needed to manually kill a rados access in:
 | | | \-+- 09792 wjw /bin/sh ../test-driver 
./test/ceph_objectstore_tool.py
 | | |   \-+- 09807 wjw python 
./test/ceph_objectstore_tool.py (python2.7)
 | | | \--- 11406 wjw 
/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/.libs/rados -p rep_pool -N put 
REPobject1 /tmp/data.9807/-REPobject1__head


But also 2 mon-osds were running, and perhaps one did not belong
to that test. So they could be in each other's way.

Found some failures in OSDs at:

./test-suite.log:osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())
./test-suite.log:osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())

struct OnRecoveryReadComplete :
  public GenContext<pair<RecoveryMessages *, ECBackend::read_result_t &> &> {
  ECBackend *pg;
  hobject_t hoid;
  set<int> want;
  OnRecoveryReadComplete(ECBackend *pg, const hobject_t &hoid)
    : pg(pg), hoid(hoid) {}
  void finish(pair<RecoveryMessages *, ECBackend::read_result_t &> &in) {
    ECBackend::read_result_t &res = in.second;
    // FIXME???
    assert(res.r == 0);
    assert(res.errors.empty());   // ECBackend.cc:201, the failing assert
    assert(res.returned.size() == 1);
    pg->handle_recovery_read_complete(
      hoid,
      res.returned.back(),
      res.attrs,
      in.first);
  }
};

Given the FIXME?? the code here could be fishy??

I would say that just this patch would be sufficient.
The second patch also looks like it could be useful since it
lowers the bar on being tested. And when alignment is only required
because of (a)iovec processing, that 4096 will likely suffice.

Thank you very much for the help.

--WjW



2015-12-21 0:10 GMT+08:00 Willem Jan Withagen :

Hi,

Most of Ceph is getting there, albeit in a crude and rough state.
So below is a status update on what is not working for me yet.

In particular, help with the alignment problem in os/FileJournal.cc would be
appreciated... It would allow me to run ceph-osd and run more tests to
completion.

What would happen if I comment out this test, and ignore the fact that
things might be unaligned?
Is it a performance/paging issue?
Or is data going to be corrupted?

--WjW

PASS: src/test/run-cli-tests

Testsuite summary for ceph 10.0.0

# TOTAL: 1
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0


gmake test:

Testsuite summary for ceph 10.0.0

# TOTAL: 119
# PASS:  95
# SKIP:  0
# XFAIL: 0
# FAIL:  24
# XPASS: 0
# ERROR: 0


The following notes can be made with this:
1) the run-cli-tests run to completion because I excluded the RBD tests
2) gmake test has the following tests FAIL:
FAIL: unittest_erasure_code_plugin
FAIL: ceph-detect-init/run-tox.sh
FAIL: test/erasure-code/test-erasure-code.sh
FAIL: test/erasure-code/test-erasure-eio.sh
FAIL: test/run-rbd-unit-tests.sh
FAIL: test/ceph_objectstore_tool.py
FAIL: test/test-ceph-helpers.sh
FAIL: test/cephtool-test-osd.sh
FAIL: test/cephtool-test-mon.sh
FAIL: test/cephtool-test-mds.sh
FAIL: test/cephtool-test-rados.sh
FAIL: test/mon/osd-crush.sh
FAIL: test/osd/osd-scrub-repair.sh
FAIL: test/osd/osd-scrub-snaps.sh
FAIL: test/osd/osd-config.sh
FAIL: test/osd/osd-bench.sh
FAIL: test/osd/osd-reactivate.sh
FAIL: test/osd/osd-copy-from.sh
FAIL: test/libradosstriper/rados-striper.sh
FAIL: test/test_objectstore_memstore.sh
FAIL: test/ceph-disk.sh
FAIL: test/pybind/test_ceph_argparse.py
FAIL: test/pybind/test_ceph_daemon.py
FAIL: ../qa/workunits/erasure-code/encode-decode-non-regression.sh

Most of the fails are because ceph-osd crashed consistently on:
-1 journal  bl.is_aligned(block_size) 0
bl.is_n_align_sized(CEPH_MINIMUM_BLOCK_SIZE) 1
-1 journal  block_size 131072 CEPH_MINIMUM_BLOCK_SIZE 4096
CEPH_PAGE_SIZE 4096 header.alignment 131072
bl buffer::list(len=131072, buffer::ptr(0~131072 0x805319000 in raw
0x805319000 len 131072 nref 1))
os/FileJournal.cc: In function 'void FileJournal::align_bl(off64_t,
bufferlist &)' thread 805217400 time 2015-12-19 13:43:06.706797
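
Judging purely by the names in that log (an assumption, not verified against
buffer.cc), is_aligned() checks the buffer's memory address against
header.alignment (131072 here) while is_n_align_sized() only checks that the
length is a multiple of 4096, so a page-aligned allocation can pass the second
check and still fail the first. A standalone sketch of that distinction, not
Ceph's bufferlist code:

#include <cstdint>
#include <cstdio>
#include <stdlib.h>

// Illustration only: address alignment vs. length being a multiple of N.
static bool addr_aligned(const void *p, size_t align) {
  return reinterpret_cast<uintptr_t>(p) % align == 0;
}
static bool len_multiple(size_t len, size_t align) {
  return len % align == 0;
}

int main() {
  const size_t page = 4096, journal_align = 131072, len = 131072;
  void *buf = nullptr;
  if (posix_memalign(&buf, page, len) != 0)   // page-aligned, like most allocators
    return 1;
  std::printf("len multiple of 4096:   %d\n", len_multiple(len, page));           // 1
  std::printf("addr aligned to 4096:   %d\n", addr_aligned(buf, page));           // 1
  std::printf("addr aligned to 131072: %d\n", addr_aligned(buf, journal_align));  // often 0
  free(buf);
  return 0;
}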

RBD performance with many childs and snapshots

2015-12-21 Thread Wido den Hollander
Hi,

While implementing the buildvolfrom method in libvirt for RBD I'm stuck
at some point.

$ virsh vol-clone --pool myrbdpool image1 image2

This would clone image1 to a new RBD image called 'image2'.

The code I've written now does:

1. Create a snapshot called image1@libvirt-
2. Protect the snapshot
3. Clone the snapshot to 'image1'

wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
rbdpool image1 image2
Vol image2 cloned from image1

wido@wido-desktop:~/repos/libvirt$

root@alpha:~# rbd -p libvirt info image2
rbd image 'image2':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.1976451ead36b
format: 2
features: layering, striping
flags:
parent: libvirt/image1@libvirt-1450724650
overlap: 10240 MB
stripe unit: 4096 kB
stripe count: 1
root@alpha:~#
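
For reference, the three numbered steps above map onto the librbd C API
roughly like this (a simplified sketch, not the actual libvirt patch; error
handling is omitted and the timestamped snapshot name is only assumed):

#include <rbd/librbd.h>

// Sketch of the snapshot / protect / clone sequence described above.
// 'io' is an open rados_ioctx_t for the pool; error checking is omitted.
int clone_volume(rados_ioctx_t io, const char *parent, const char *snap,
                 const char *child) {
  rbd_image_t img;
  int order = 0;                      // 0 = let librbd pick the default order
  rbd_open(io, parent, &img, NULL);   // open the parent, e.g. "image1"
  rbd_snap_create(img, snap);         // 1. create the snapshot
  rbd_snap_protect(img, snap);        // 2. protect it so it can be cloned
  rbd_close(img);
  // 3. clone the protected snapshot into the new volume, e.g. "image2"
  return rbd_clone(io, parent, snap, io, child,
                   RBD_FEATURE_LAYERING, &order);
}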

But this could potentially lead to a lot of snapshots with children on
'image1'.

image1 itself will probably never change, but I'm wondering about the
negative performance impact this might have on a OSD.

I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
into libvirt. There is however no way to pass something like a snapshot
name in libvirt when cloning.

Any bright suggestions? Or is it fine to create so many snapshots?

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is rbd_discard enough to wipe an RBD image?

2015-12-21 Thread Wido den Hollander
On 12/21/2015 04:50 PM, Josh Durgin wrote:
> On 12/21/2015 07:09 AM, Jason Dillaman wrote:
>> You will have to ensure that your writes are properly aligned with the
>> object size (or object set if fancy striping is used on the RBD
>> volume).  In that case, the discard is translated to remove operations
>> on each individual backing object.  The only time zeros are written to
>> disk is if you specify an offset somewhere in the middle of an object
>> (i.e. the whole object cannot be deleted nor can it be truncated) --
>> this is the partial discard case controlled by that configuration param.
>>
> 
> I'm curious what's using the virVolWipe stuff - it can't guarantee it's
> actually wiping the data in many common configurations, not just with
> ceph but with any kind of disk, since libvirt is usually not consuming
> raw disks, and with modern flash and smr drives even that is not enough.
> There's a recent patch improving the docs on this [1].
> 
> If the goal is just to make the data inaccessible to the libvirt user,
> removing the image is just as good.
> 
> That said, with rbd there's not much cost to zeroing the image with
> object map enabled - it's effectively just doing the data removal step
> of 'rbd rm' early.
> 

I was looking at the features the RBD storage pool driver is missing in
libvirt, and they are:

- Build from Volume. That's RBD cloning
- Uploading and Downloading Volume
- Wiping Volume

The thing about wiping in libvirt is that the volume still exists
afterwards, it is just empty.

My discard code now works, but I wanted to verify. If I understand Jason
correctly, it would be a matter of figuring out the 'order' of an image
and calling rbd_discard in a loop until you reach the end of the image.

I just want libvirt to be as feature complete as possible when it comes
to RBD.

Wido

> Josh
> 
> [1] http://comments.gmane.org/gmane.comp.emulators.libvirt/122235
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FreeBSD Building and Testing

2015-12-21 Thread Willem Jan Withagen

On 20-12-2015 17:10, Willem Jan Withagen wrote:

Hi,

Most of Ceph is getting there, albeit in a crude and rough state.
So below is a status update on what is not working for me yet.



Further:
A) unittest_erasure_code_plugin fails because a different error code is
returned when dlopen-ing a non-existent library.
load dlopen(.libs/libec_invalid.so): Cannot open
".libs/libec_invalid.so"load dlsym(.libs/libec_missing_version.so, _
_erasure_code_init): Undefined symbol
"__erasure_code_init"test/erasure-code/TestErasureCodePlugin.cc:88: Failure
Value of: instance.factory("missing_version", g_conf->erasure_code_dir,
profile, _code, )
   Actual: -2
Expected: -18


EXDEV is actually 18, so that part is correct.
But EXDEV is the cross-device link error.

Whereas the actual answer, -2, is factually correct:
#define ENOENT  2   /* No such file or directory */

So why does the test expect EXDEV instead of ENOENT?
Could be a typical Linux <> FreeBSD thingy.

--WjW
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is rbd_discard enough to wipe an RBD image?

2015-12-21 Thread Josh Durgin

On 12/21/2015 11:00 AM, Wido den Hollander wrote:

My discard code now works, but I wanted to verify. If I understand Jason
correctly, it would be a matter of figuring out the 'order' of an image
and calling rbd_discard in a loop until you reach the end of the image.


You'd need to get the order via rbd_stat(), convert it to object size 
(i.e. (1 << order)), and fetch stripe_count with rbd_get_stripe_count().


Then do the discards in (object size * stripe_count) chunks. This
ensures you discard entire objects. This is the size you'd want to use
for import/export as well, ideally.
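
A minimal sketch of that loop with the librbd C API (error handling trimmed;
a sketch only, not the eventual libvirt code):

#include <rbd/librbd.h>
#include <cstdint>

// Sketch of wiping an open image by discarding whole objects at a time:
// chunk = object size (1 << order) * stripe_count, as described above.
int wipe_image(rbd_image_t image) {
  rbd_image_info_t info;
  int r = rbd_stat(image, &info, sizeof(info));
  if (r < 0)
    return r;
  uint64_t stripe_count = 1;
  rbd_get_stripe_count(image, &stripe_count);   // stays 1 without fancy striping
  const uint64_t chunk = (uint64_t(1) << info.order) * stripe_count;
  for (uint64_t off = 0; off < info.size; off += chunk) {
    uint64_t len = (info.size - off < chunk) ? (info.size - off) : chunk;
    ssize_t ret = rbd_discard(image, off, len);
    if (ret < 0)
      return static_cast<int>(ret);
  }
  return 0;
}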


I just want libvirt to be as feature complete as possible when it comes
to RBD.


I see, makes sense.

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fwd: FileStore : no wait thread queue_sync

2015-12-21 Thread David Casier
FYI.
-- Forwarded message --
From: David Casier 
Date: 2015-12-21 23:19 GMT+01:00
Subject: FileStore : no wait thread queue_sync
To: Ceph Development , Sage Weil 
Cc: Benoît LORIOT , Sébastien VALSEMEY



Hi,
What do you think about:

if (!journal && m_filestore_direct) {
  apply_manager.commit_finish();
}

in FileStore::queue_transactions? For direct mode and no waiting (on the
sync_entry thread)?

I would also propose adding a parameter "m_omap_is_safe" to bypass
XATTR_SPILL_OUT_NAME and reduce IOPS on hard drives:

if ( !m_omap_is_safe) {
r = chain_fgetxattr(**o, XATTR_SPILL_OUT_NAME, buf, sizeof(buf));
if (r >= 0 && !strncmp(buf, XATTR_NO_SPILL_OUT,
sizeof(XATTR_NO_SPILL_OUT))) {
  r = chain_fsetxattr(**n, XATTR_SPILL_OUT_NAME, XATTR_NO_SPILL_OUT,
  sizeof(XATTR_NO_SPILL_OUT));
} else {
  r = chain_fsetxattr(**n, XATTR_SPILL_OUT_NAME, XATTR_SPILL_OUT,
  sizeof(XATTR_SPILL_OUT));
}
}

-- 

Cordialement,

David CASIER


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD performance with many childs and snapshots

2015-12-21 Thread Josh Durgin

On 12/21/2015 11:06 AM, Wido den Hollander wrote:

Hi,

While implementing the buildvolfrom method in libvirt for RBD I'm stuck
at some point.

$ virsh vol-clone --pool myrbdpool image1 image2

This would clone image1 to a new RBD image called 'image2'.

The code I've written now does:

1. Create a snapshot called image1@libvirt-
2. Protect the snapshot
3. Clone the snapshot to 'image1'

wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
rbdpool image1 image2
Vol image2 cloned from image1

wido@wido-desktop:~/repos/libvirt$

root@alpha:~# rbd -p libvirt info image2
rbd image 'image2':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.1976451ead36b
format: 2
features: layering, striping
flags:
parent: libvirt/image1@libvirt-1450724650
overlap: 10240 MB
stripe unit: 4096 kB
stripe count: 1
root@alpha:~#

But this could potentially lead to a lot of snapshots with children on
'image1'.

image1 itself will probably never change, but I'm wondering about the
negative performance impact this might have on a OSD.


Creating them isn't so bad; more snapshots that don't change don't have
much effect on the osds. Deleting them is what's expensive, since the
osds need to scan the objects to see which ones are part of the
snapshot and can be deleted. If you have too many snapshots created and
deleted, it can affect cluster load, so I'd rather avoid always
creating a snapshot.


I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
into libvirt. There is however no way to pass something like a snapshot
name in libvirt when cloning.

Any bright suggestions? Or is it fine to create so many snapshots?


You could have canonical names for the libvirt snapshots like you 
suggest, 'libvirt-', and check via rbd_diff_iterate2()

whether the parent image changed since the last snapshot. That's a bit
slower than plain cloning, but with object map + fast diff it's fast
again, since it doesn't need to scan all the objects anymore.

I think libvirt would need to expand its api a bit to be able to really
use it effectively to manage rbd. Hiding the snapshots becomes 
cumbersome if the application wants to use them too. If libvirt's

current model of clones lets parents be deleted before children,
that may be a hassle to hide too...

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is rbd_discard enough to wipe an RBD image?

2015-12-21 Thread Alexandre DERUMIER
>>I just want to know if this is sufficient to wipe a RBD image?

AFAIK, ceph writes zeroes to the rados objects when discard is used.

There is an option to skip writing zeroes if needed:

OPTION(rbd_skip_partial_discard, OPT_BOOL, false) // when trying to discard a 
range inside an object, set to true to skip zeroing the range.
- Mail original -
De: "Wido den Hollander" 
À: "ceph-devel" 
Envoyé: Dimanche 20 Décembre 2015 22:21:50
Objet: Is rbd_discard enough to wipe an RBD image?

Hi, 

I'm busy implementing the volume wiping method of the libvirt storage 
pool backend and instead of writing to the whole RBD image with zeroes 
I'm using rbd_discard. 

Using a 4MB length I'm starting at offset 0 and work my way through the 
whole RBD image. 

A quick try shows me that my partition table + filesystem are gone on 
the RBD image after I've run rbd_discard. 

I just want to know if this is sufficient to wipe a RBD image? Or would 
it be better to fully fill the image with zeroes? 

-- 
Wido den Hollander 
42on B.V. 
Ceph trainer and consultant 

Phone: +31 (0)20 700 9902 
Skype: contact42on 
-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
the body of a message to majord...@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: Client still connect failed leader after that mon down

2015-12-21 Thread Sage Weil
On Mon, 21 Dec 2015, Zhi Zhang wrote:
> Regards,
> Zhi Zhang (David)
> Contact: zhang.david2...@gmail.com
>   zhangz.da...@outlook.com
> 
> 
> 
> -- Forwarded message --
> From: Jaze Lee 
> Date: Mon, Dec 21, 2015 at 4:08 PM
> Subject: Re: Client still connect failed leader after that mon down
> To: Zhi Zhang 
> 
> 
> Hello,
> I am terribly sorry.
> I think we may not need to reconstruct monclient.{h,cc}; we find that
> the parameter mon_client_hunt_interval is very useful.
> When we set mon_client_hunt_interval = 0.5, the time to run a ceph
> command is very small even if it first connects to the down leader mon.
> 
> The first time I asked the question was because we found the parameter
> on the official site
> http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/.
> It is written there as
> 
> mon client hung interval

Yep, that's a typo. Do you mind submitting a patch to fix it?

Thanks!
sage


> 
> Description:The client will try a new monitor every N seconds until it
> establishes a connection.
> Type:Double
> Default:3.0
> 
> And we set it; it did not work.
> 
> I think maybe it is a slip of the pen?
> The right configuration parameter should be mon client hunt interval
> 
> Can someone please help me to fix this on the official site?
> 
> Thanks a lot.
> 
> 
> 
> 2015-12-21 14:00 GMT+08:00 Jaze Lee :
> > right now we use simple msg, and cpeh version is 0.80...
> >
> > 2015-12-21 10:55 GMT+08:00 Zhi Zhang :
> >> Which msg type and ceph version are you using?
> >>
> >> Once we used 0.94.1 with async msg, we encountered similar issue.
> >> Client was trying to connect a down monitor when it was just started
> >> and this connection would hung there. This is because previous async
> >> msg used blocking connection mode.
> >>
> >> After we back ported non-blocking mode of async msg from higher ceph
> >> version, we haven't encountered such issue yet.
> >>
> >>
> >> Regards,
> >> Zhi Zhang (David)
> >> Contact: zhang.david2...@gmail.com
> >>   zhangz.da...@outlook.com
> >>
> >>
> >> On Fri, Dec 18, 2015 at 11:41 AM, Jevon Qiao  wrote:
> >>> On 17/12/15 21:27, Sage Weil wrote:
> 
>  On Thu, 17 Dec 2015, Jaze Lee wrote:
> >
> > Hello cephers:
> >  In our test, there are three monitors. We find client run ceph
> > command will slow when the leader mon is down. Even after long time, a
> > client run ceph command will also slow in first time.
> > >From strace, we find that the client first to connect the leader, then
> > after 3s, it connect the second.
> > After some search we find that the quorum is not change, the leader is
> > still the down monitor.
> > Is that normal?  Or is there something i miss?
> 
>  It's normal.  Even when the quorum does change, the client doesn't
>  know that.  It should be contacting a random mon on startup, though, so I
>  would expect the 3s delay 1/3 of the time.
> >>>
> >>> That's because client randomly picks up a mon from Monmap. But what we
> >>> observed is that when a mon is down no change is made to monmap(neither 
> >>> the
> >>> epoch nor the members). Is it the culprit for this phenomenon?
> >>>
> >>> Thanks,
> >>> Jevon
> >>>
>  A long-standing low-priority feature request is to have the client 
>  contact
>  2 mons in parallel so that it can still connect quickly if one is down.
>  It's requires some non-trivial work in mon/MonClient.{cc,h} though and I
>  don't think anyone has looked at it seriously.
> 
>  sage
> 
>  --
>  To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>  the body of a message to majord...@vger.kernel.org
>  More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> the body of a message to majord...@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > 
> 
> 
> 
> --
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Issue with Ceph File System and LIO

2015-12-21 Thread Gregory Farnum
On Sun, Dec 20, 2015 at 6:38 PM, Eric Eastman
 wrote:
> On Fri, Dec 18, 2015 at 12:18 AM, Yan, Zheng  wrote:
>> On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman
>>  wrote:
 Hi Yan Zheng, Eric Eastman

 Similar bug was reported in f2fs, btrfs, it does affect 4.4-rc4, the fixing
 patch was merged into 4.4-rc5, dfd01f026058 ("sched/wait: Fix the signal
 handling fix").

 Related report & discussion was here:
 https://lkml.org/lkml/2015/12/12/149

 I'm not sure the current reported issue of ceph was related to that though,
 but at least try testing with an upgraded or patched kernel could verify 
 it.
 :)

 Thanks,
>
>>
>> please try rc5 kernel without patches and DEBUG_VM=y
>>
>> Regards
>> Yan, Zheng
>
>
> The latest test with 4.4rc5 with CONFIG_DEBUG_VM=y has ran for over 36
> hours with no ERRORS or WARNINGS.  My plan is to install the 4.4rc6
> kernel from the Ubuntu kernel-ppa site once it is available, and rerun
> the tests.
>
> Before running this test I had to rebuild the Ceph File System as
> after the last logged errors on Friday using the 4.4rc4 kernel, the
> Ceph File system hung accessing the exported image file.  After
> rebooting my iSCSI gateway using the Ceph File System, from / using
> command: strace du -a cephfs, the mount point, the hang happened on
> the newfsstatat call on my image file:
>
> write(1, "0\tcephfs/ctdb/.ctdb.lock\n", 250 cephfs/ctdb/.ctdb.lock
> ) = 25
> close(5)= 0
> write(1, "0\tcephfs/ctdb\n", 140 cephfs/ctdb
> )= 14
> newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896,
> ...}, AT_SYMLINK_NOFOLLOW) = 0
> openat(4, "iscsi", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 3
> fcntl(3, F_GETFD)   = 0
> fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
> fstat(3, {st_mode=S_IFDIR|0755, st_size=993814480896, ...}) = 0
> fcntl(3, F_GETFL)   = 0x38800 (flags
> O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW)
> fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
> newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896,
> ...}, AT_SYMLINK_NOFOLLOW) = 0
> fcntl(3, F_DUPFD, 3)= 5
> fcntl(5, F_GETFD)   = 0
> fcntl(5, F_SETFD, FD_CLOEXEC)   = 0
> getdents(3, /* 8 entries */, 65536) = 288
> getdents(3, /* 0 entries */, 65536) = 0
> close(3)= 0
> newfstatat(5, "iscsi900g.img", ^C
> ^C^C^C
> ^Z
> I could not break out with a ^C, and had to background the process to
> get my prompt back. The process would not die so I had to hard reset
> the system.
>
> This same hang happened on 2 other kernel mounted systems using a 4.3.0 
> kernel.
>
> On a separate system, I fuse mounted the file system and a du -a
> cephfs hung at the same point. Once again I could not break out of the
> hang, and had to hard reset the system.
>
> Restarting the MDS and Monitors did not clear the issue. Taking a
> quick look at the dumpcache showed it was large
>
> # ceph mds tell 0 dumpcache /tmp/dump.txt
> ok
> # wc /tmp/dump.txt
>   370556  5002449 59211054 /tmp/dump.txt
> # tail /tmp/dump.txt
> [inode 1259276 [...c4,head] ~mds0/stray0/1259276/ auth v977593
> snaprealm=0x561339e3fb00 f(v0 m2015-12-12 00:51:04.345614) n(v0
> rc2015-12-12 00:51:04.345614 1=0+1) (iversion lock) 0x561339c66228]
> [inode 120c1ba [...a6,head] ~mds0/stray0/120c1ba/ auth v742016
> snaprealm=0x56133ad19600 f(v0 m2015-12-10 18:25:55.880167) n(v0
> rc2015-12-10 18:25:55.880167 1=0+1) (iversion lock) 0x56133a5e0d88]
> [inode 10d0088 [...77,head] ~mds0/stray6/10d0088/ auth v292336
> snaprealm=0x5613537673c0 f(v0 m2015-12-08 19:23:20.269283) n(v0
> rc2015-12-08 19:23:20.269283 1=0+1) (iversion lock) 0x56134c2f7378]

These are deleted files that haven't been trimmed yet...

>
> I tried one more thing:
>
> ceph daemon mds.0 flush journal
>
> and restarted the MDS. Accessing the file system still locked up, but
> a du -a cephfs did not even get to the iscsi900g.img file. As I was
> running on a broken rc kernel, with snapshots turned on

...and I think we have some known issues in the tracker about snap
trimming and snapshotted inodes. So this is not entirely surprising.
:/
-Greg


>, when this
> corruption happened, I decided to recreated the file system and
> restarted the ESXi iSCSI test.
>
> Regards,
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RFC: tool for applying 'ceph daemon ' command to all OSDs

2015-12-21 Thread Dan Mick
I needed something to fetch current config values from all OSDs (sorta
the opposite of 'injectargs --key value), so I hacked it, and then
spiffed it up a bit.  Does this seem like something that would be useful
in this form in the upstream Ceph, or does anyone have any thoughts on
its design or structure?

It requires a locally-installed ceph CLI and a ceph.conf that points to
the cluster and any required keyrings.  You can also provide it with
a YAML file mapping host to osds if you want to save time collecting
that info for a statically-defined cluster, or if you want just a subset
of OSDs.

https://github.com/dmick/tools/blob/master/osd_daemon_cmd.py

Excerpt from usage:

Execute a Ceph osd daemon command on every OSD in a cluster with
one connection to each OSD host.

Usage:
osd_daemon_cmd [-c CONF] [-u USER] [-f FILE] (COMMAND | -k KEY)

Options:
   -c CONF   ceph.conf file to use [default: ./ceph.conf]
   -u USER   user to connect with ssh
   -f FILE   get names and osds from yaml
   COMMAND   command other than "config get" to execute
   -k KEYconfig key to retrieve with config get 

-- 
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   5   6   7   8   9   10   >