Re: Is BlueFS an alternative of BlueStore?

2016-01-07 Thread Sage Weil
On Thu, 7 Jan 2016, Javen Wu wrote:
> Hi Sage,
> 
> Sorry to bother you. I am not sure if it is appropriate to send email to you
> directly, but I cannot find any useful information on the Internet to address
> my confusion. I hope you can help me.
> 
> I heard by chance that you are going to start BlueFS to eliminate the
> redundancy between the XFS journal and the RocksDB WAL. I am a little confused.
> Is BlueFS only meant to host RocksDB for BlueStore, or is it an
> alternative to BlueStore?
> 
> I am a newcomer to Ceph, and I am not sure my understanding of BlueStore is
> correct. BlueStore in my mind is as below.
> 
>      BlueStore
>      =========
>    RocksDB
> +-----------+  +-----------+
> |   onode   |  |           |
> |    WAL    |  |           |
> |   omap    |  |           |
> +-----------+  |   bdev    |
> |           |  |           |
> |    XFS    |  |           |
> |           |  |           |
> +-----------+  +-----------+

This is the picture before BlueFS enters the picture.

> I am curious how BlueFS is able to host RocksDB; it is already a
> "filesystem" which has to maintain blockmap-style metadata on its own,
> WITHOUT the help of RocksDB.

Right.  BlueFS is a really simple "file system" that is *just* complicated 
enough to implement the rocksdb::Env interface, which is what rocksdb 
needs to store its log and sst files.  The after picture looks like

 +----------------+
 |   bluestore    |
 |  +----------+  |
 |  | rocksdb  |  |
 |  +----------+  |
 |  |  bluefs  |  |
 +--+----------+--+
 |  block device  |
 +----------------+

> The reason we care about the intention and the design target of BlueFS is
> that I had a discussion with my partner Peng.Hse about an idea to introduce a
> new ObjectStore using the ZFS library. I know Ceph already supports ZFS as a
> FileStore backend, but we had a different, immature idea: use libzpool to
> implement a new ObjectStore for Ceph entirely in userspace, without the SPL
> and ZOL kernel modules, so that we can align the Ceph transaction with the
> ZFS transaction and avoid the double write for the Ceph journal.
> The ZFS core, libzpool (DMU, metaslab, etc.), offers a dnode object store and
> is kernel/userspace independent. Another benefit of the idea is that we can
> extend our metadata without bothering any DB store.
> 
> Frankly, we are not sure yet whether our idea is realistic, but when I heard
> of BlueFS, I thought we needed to understand the BlueFS design goal.

I think it makes a lot of sense, but there are a few challenges.  One 
reason we use rocksdb (or a similar kv store) is that we need in-order 
enumeration of objects in order to do collection listing (needed for 
backfill, scrub, and omap).  You'll need something similar on top of zfs.  
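For illustration (this is not BlueStore's actual key schema), that in-order
property is what a sorted KV store gives you almost for free: key objects as
"<collection>/<object>" and a collection listing becomes a bounded, ordered
iteration. A minimal sketch against RocksDB, using a made-up "pg.1/" prefix:

#include <rocksdb/db.h>
#include <iostream>
#include <memory>

int main() {
  // Sketch only: a hypothetical "<collection>/<object>" key layout, not
  // BlueStore's real schema.  Listing a collection is then an in-order
  // prefix scan, which is the property backfill/scrub/omap need.
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* raw = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/coll-listing-demo", &raw);
  if (!s.ok()) { std::cerr << s.ToString() << std::endl; return 1; }
  std::unique_ptr<rocksdb::DB> db(raw);

  db->Put(rocksdb::WriteOptions(), "pg.1/obj-a", "");
  db->Put(rocksdb::WriteOptions(), "pg.1/obj-b", "");
  db->Put(rocksdb::WriteOptions(), "pg.2/obj-c", "");   // different collection

  const std::string prefix = "pg.1/";
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
  for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix); it->Next())
    std::cout << it->key().ToString() << std::endl;   // obj-a, obj-b, in order
  return 0;
}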

I suspect the simplest path would be to also implement the rocksdb::Env 
interface on top of the zfs libraries.  See BlueRocksEnv.{cc,h} to see the 
interface that has to be implemented...
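As a rough sketch of the shape of that work (this is not BlueRocksEnv, and the
exact set of virtual methods and their signatures on rocksdb::Env differs
between RocksDB versions, so treat it as an outline): derive from
rocksdb::EnvWrapper, override the file-creation calls to hand back your own
file implementations, and plug the Env in through Options::env. Here the
override only delegates and logs; a ZFS-backed Env would instead create the
WAL/sst files on top of libzpool:

#include <rocksdb/db.h>
#include <rocksdb/env.h>
#include <iostream>
#include <memory>

// Delegating Env: everything falls through to the default (POSIX) Env,
// but we can see which files rocksdb asks its Env to create.
class LoggingEnv : public rocksdb::EnvWrapper {
 public:
  explicit LoggingEnv(rocksdb::Env* base) : rocksdb::EnvWrapper(base) {}

  rocksdb::Status NewWritableFile(const std::string& fname,
                                  std::unique_ptr<rocksdb::WritableFile>* result,
                                  const rocksdb::EnvOptions& options) override {
    std::cerr << "NewWritableFile: " << fname << std::endl;  // WAL, sst, MANIFEST...
    return rocksdb::EnvWrapper::NewWritableFile(fname, result, options);
  }
};

int main() {
  LoggingEnv env(rocksdb::Env::Default());
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.env = &env;   // the hook BlueStore uses (via BlueRocksEnv) to point rocksdb at BlueFS

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/custom-env-demo", &db);
  if (!s.ok()) { std::cerr << s.ToString() << std::endl; return 1; }
  db->Put(rocksdb::WriteOptions(), "k", "v");
  delete db;
  return 0;
}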

sage


Re: Is BlueFS an alternative of BlueStore?

2016-01-07 Thread Javen Wu

Thanks Sage for your reply.

I am not sure I understand the challenges you mentioned about 
backfill/scrub.

I will investigate the code and let you know if we can overcome the
challenge by easy means.
Our rough ideas for ZFSStore are:
1. encapsulate the dnode object as an onode and add onode attributes.
2. use a ZAP object as a collection (a ZFS directory uses a ZAP object).
3. enumerate entries in the ZAP object to list the objects in a collection
   (see the sketch below).
4. create a new metaslab class to store the Ceph journal.
5. align the Ceph journal and the ZFS transaction.
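As a rough illustration of point 3 only (not a complete libzpool program: it
assumes libzpool is already initialized and an objset_t is open, and the exact
headers and prototypes should be double-checked against zap.h in the ZFS
tree), enumerating a ZAP object with the cursor API looks roughly like this.
Note that the cursor walks entries in ZAP hash order, not sorted order, so the
in-order enumeration Sage mentioned still needs handling on top:

extern "C" {
#include <sys/zap.h>   // zap_cursor_*, zap_attribute_t (libzpool userspace build)
}
#include <cstdio>

// Sketch: list the entries of a ZAP object, i.e. what "list objects in a
// collection" would map to if a collection is a ZAP object.
static void list_collection(objset_t *os, uint64_t collection_zap_obj) {
  zap_cursor_t zc;
  zap_attribute_t za;

  zap_cursor_init(&zc, os, collection_zap_obj);
  while (zap_cursor_retrieve(&zc, &za) == 0) {
    // za_name: the entry name (e.g. an object key); za_first_integer: its
    // value (e.g. the dnode object number backing that object).
    printf("%s -> %llu\n", za.za_name, (unsigned long long)za.za_first_integer);
    zap_cursor_advance(&zc);
  }
  zap_cursor_fini(&zc);
}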

Actually, we have talked about the possibility of building rocksdb::Env on top
of the ZFS libraries. It would have to align the ZIL (ZFS intent log) with the
RocksDB WAL; otherwise we still have the same problem as with XFS and RocksDB.

ZFS is a tree-style, log-structured-like file system: once a leaf block is
updated, the modification propagates from the leaf up to the root of the tree.
To batch writes and reduce the number of disk writes, ZFS persists
modifications to disk in a transaction group roughly every 5 seconds. Only
when an fsync/sync write arrives in the middle of those 5 seconds does ZFS
persist the record to the ZIL.
As I recall, RocksDB syncs after adding a log record, which means that if we
cannot align the ZIL and the WAL, the log write would first go to the ZIL,
then the ZIL would be applied to the log file, and finally RocksDB would
update the sst files. That is almost the same problem as with XFS, if my
understanding is correct.
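For reference, that kind of sync corresponds to writing with
WriteOptions::sync set: RocksDB then appends the record to its WAL and fsyncs
it before acknowledging the write, and on a conventional file system that
fsync is what drags the FS journal (or the ZIL) into every commit. A minimal
example:

#include <rocksdb/db.h>
#include <iostream>
#include <memory>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* raw = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/wal-sync-demo", &raw);
  if (!s.ok()) { std::cerr << s.ToString() << std::endl; return 1; }
  std::unique_ptr<rocksdb::DB> db(raw);

  rocksdb::WriteOptions wo;
  wo.sync = true;   // fsync the WAL before the write is acknowledged
  s = db->Put(wo, "pglog-entry", "payload");
  std::cout << (s.ok() ? "durable write acked" : s.ToString()) << std::endl;
  return 0;
}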

In my mind, aligning the ZIL and the WAL would need more modifications in RocksDB.

Thanks
Javen



Re: Is BlueFS an alternative of BlueStore?

2016-01-07 Thread peng.hse

Hi Sage,

Thanks for your quick response. Javen and I, who were once ZFS developers, are
currently focusing on how to leverage some of the ZFS ideas to improve the
Ceph backend performance in userspace.



Based on your encouraging reply, we have come up with two schemes for our
future work.


1. Scheme one: use an entirely new FS to replace rocksdb+bluefs; the FS itself
   handles the mapping of oid -> fs-object (a kind of ZFS dnode) and the
   corresponding attrs used by Ceph.
   Despite the implementation challenges you mentioned about the in-order
   enumeration of objects during backfill, scrub, etc. (we confronted the same
   situation in ZFS, and the ZAP features helped us a lot), from a performance
   and architecture point of view it looks cleaner and clearer; would you
   suggest we give it a try?


2. Scheme two: as you suspected would be the simplest path, we just implement
   a simple, temporary version of the FS which leverages libzpool ideas and
   plugs in underneath rocksdb, as your bluefs does.


We would appreciate your insightful reply.

Thanks





Re: FreeBSD Building and Testing

2016-01-06 Thread Willem Jan Withagen

On 6-1-2016 08:51, Mykola Golub wrote:

On Mon, Dec 28, 2015 at 05:53:04PM +0100, Willem Jan Withagen wrote:

Hi,

Can somebody try to help me and explain why

in test: Func: test/mon/osd-crash
Func: TEST_crush_reject_empty started

Fails with a python error which sort of startles me:
test/mon/osd-crush.sh:227: TEST_crush_reject_empty:  local
empty_map=testdir/osd-crush/empty_map
test/mon/osd-crush.sh:228: TEST_crush_reject_empty:  :
test/mon/osd-crush.sh:229: TEST_crush_reject_empty:  ./crushtool -c
testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.m
ap
test/mon/osd-crush.sh:230: TEST_crush_reject_empty:  expect_failure
testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testd
ir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1171: expect_failure:  local
dir=testdir/osd-crush
../qa/workunits/ceph-helpers.sh:1172: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1173: expect_failure:  local 'expected=Error
EINVAL'
../qa/workunits/ceph-helpers.sh:1174: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1175: expect_failure:  local success
../qa/workunits/ceph-helpers.sh:1176: expect_failure:  pwd
../qa/workunits/ceph-helpers.sh:1177: expect_failure:  printenv
../qa/workunits/ceph-helpers.sh:1178: expect_failure:  echo ./ceph osd
setcrushmap -i testdir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1180: expect_failure:  ./ceph osd
setcrushmap -i testdir/osd-crush/empty_map.map
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
Traceback (most recent call last):
   File "./ceph", line 936, in 
 retval = main()
   File "./ceph", line 874, in main
 sigdict, inbuf, verbose)
   File "./ceph", line 457, in new_style_command
 inbuf=inbuf)
   File "/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py",
line 1208, in json_command
 raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd
setcrushmap"}']": exception 'utf8' codec can't decode b
yte 0x86 in position 56: invalid start byte

Which is certainly not the type of error expected.
But it is hard to detect any 0x86 in the arguments.


Are you able to reproduce this problem manually? I.e. in src dir, start the
cluster using vstart.sh:

./vstart.sh -n

Check it is running:

./ceph -s

Repeat the test:

truncate -s 0 empty_map.txt
./crushtool -c empty_map.txt -o empty_map.map
./ceph osd setcrushmap -i empty_map.map

Expected output:

  "Error EINVAL: Failed crushmap test: ./crushtool: exit status: 1"



Hi all,

I've spent the Xmas days trying to learn more about Python.
(And catching up with old friends :) )

My heritage is from the days of assembler, shell script, C, Perl and the like.
So the pony had to learn a few new tricks (aka a new language).
I'm now trying to get the Python nose tests to actually work.

In the meantime I also found that FreeBSD has patches for Googletest that
actually make most of the DEATH tests work.

I think this Python stream parse error got resolved by rebuilding everything,
including the complete package environment, and upgrading the kernel and
tools... :) I think that cleaned out the Python environment, which was a bit
mixed up between different versions.

Now test/mon/osd-crush.sh returns OK, so I guess the setup of the environment
is relatively critical.

I also noted that some of the tests get more tests done IF I run them under
root privileges.

The last test run resulted in:
============================================
   ceph 10.0.1: src/test-suite.log
============================================

# TOTAL: 120
# PASS:  110
# SKIP:  0
# XFAIL: 0
# FAIL:  10
# XPASS: 0
# ERROR: 0

FAIL ceph-detect-init/run-tox.sh (exit status: 1)
FAIL test/run-rbd-unit-tests.sh (exit status: 138)
FAIL test/ceph_objectstore_tool.py (exit status: 1)
FAIL test/cephtool-test-mon.sh (exit status: 1)
FAIL test/cephtool-test-rados.sh (exit status: 1)
FAIL test/libradosstriper/rados-striper.sh (exit status: 1)
FAIL test/test_objectstore_memstore.sh (exit status: 127)
FAIL test/ceph-disk.sh (exit status: 1)
FAIL test/pybind/test_ceph_argparse.py (exit status: 127)
FAIL test/pybind/test_ceph_daemon.py (exit status: 127)

where the first and the last two actually don't work because of Python things
that are not working on FreeBSD and that I have to sort out:
ceph_detect_init.exc.UnsupportedPlatform: Platform is not supported.:
../test-driver: ./test/pybind/test_ceph_argparse.py: not found
FAIL test/pybind/test_ceph_argparse.py (exit status: 127)

I also have:
./test/test_objectstore_memstore.sh: ./ceph_test_objectstore: not found
FAIL test/test_objectstore_memstore.sh (exit status: 127)

Which is a weird one that needs some TLC.

So I'm slowly getting there...

--WjW


Re: FreeBSD Building and Testing

2016-01-06 Thread Willem Jan Withagen

On 5-1-2016 19:23, Gregory Farnum wrote:

On Mon, Dec 28, 2015 at 8:53 AM, Willem Jan Withagen  wrote:

Hi,

Can somebody try to help me and explain why

in test: Func: test/mon/osd-crash
Func: TEST_crush_reject_empty started

Fails with a python error which sort of startles me:
test/mon/osd-crush.sh:227: TEST_crush_reject_empty:  local
empty_map=testdir/osd-crush/empty_map
test/mon/osd-crush.sh:228: TEST_crush_reject_empty:  :
test/mon/osd-crush.sh:229: TEST_crush_reject_empty:  ./crushtool -c
testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.m
ap
test/mon/osd-crush.sh:230: TEST_crush_reject_empty:  expect_failure
testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testd
ir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1171: expect_failure:  local
dir=testdir/osd-crush
../qa/workunits/ceph-helpers.sh:1172: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1173: expect_failure:  local 'expected=Error
EINVAL'
../qa/workunits/ceph-helpers.sh:1174: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1175: expect_failure:  local success
../qa/workunits/ceph-helpers.sh:1176: expect_failure:  pwd
../qa/workunits/ceph-helpers.sh:1177: expect_failure:  printenv
../qa/workunits/ceph-helpers.sh:1178: expect_failure:  echo ./ceph osd
setcrushmap -i testdir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1180: expect_failure:  ./ceph osd
setcrushmap -i testdir/osd-crush/empty_map.map
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
Traceback (most recent call last):
   File "./ceph", line 936, in 
 retval = main()
   File "./ceph", line 874, in main
 sigdict, inbuf, verbose)
   File "./ceph", line 457, in new_style_command
 inbuf=inbuf)
   File "/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py",
line 1208, in json_command
 raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd
setcrushmap"}']": exception 'utf8' codec can't decode b
yte 0x86 in position 56: invalid start byte

Which is certainly not the type of error expected.
But it is hard to detect any 0x86 in the arguments.

And yes python is right, there are no UTF8 sequences that start with 0x86.
Question is:
 Why does it want to parse with UTF8?
 And how do I switch it off?
 Or how to I fix this error?


I've not handled this myself but we've seen this a few times. The
latest example in a quick email search was
http://tracker.ceph.com/issues/9405, and it was apparently having a
string which wasn't null-terminated.



Looks like in my case it was due to too big a mess in the Python environment.

But I'll keep this in mind, in case it comes back to haunt me again.

Thanx,
--WjW


Re: FreeBSD Building and Testing

2016-01-06 Thread Willem Jan Withagen
On 6-1-2016 08:51, Mykola Golub wrote:
> 
> Are you able to reproduce this problem manually? I.e. in src dir, start the
> cluster using vstart.sh:
> 
> ./vstart.sh -n
> 
> Check it is running:
> 
> ./ceph -s
> 
> Repeat the test:
> 
> truncate -s 0 empty_map.txt
> ./crushtool -c empty_map.txt -o empty_map.map
> ./ceph osd setcrushmap -i empty_map.map
> 
> Expected output:
> 
>  "Error EINVAL: Failed crushmap test: ./crushtool: exit status: 1"
> 

Oke thanx

Nice to have some of these examples...

--WjW



Re: 01/06/2016 Weekly Ceph Performance Meeting IS ON!

2016-01-06 Thread Robert LeBlanc

The last recording I'm seeing is for 10/07/15. Can we get the newer ones?

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jan 6, 2016 at 8:43 AM, Mark Nelson  wrote:
> 8AM PST as usual (ie in 18 minutes)! Discussion topics today include
> bluestore testing results and a potential performance regression in
> CentOS/RHEL 7.1 kernels.  Please feel free to add your own topics!
>
> Here's the links:
>
> Etherpad URL:
> http://pad.ceph.com/p/performance_weekly
>
> To join the Meeting:
> https://bluejeans.com/268261044
>
> To join via Browser:
> https://bluejeans.com/268261044/browser
>
> To join with Lync:
> https://bluejeans.com/268261044/lync
>
>
> To join via Room System:
> Video Conferencing System: bjn.vc -or- 199.48.152.152
> Meeting ID: 268261044
>
> To join via Phone:
> 1) Dial:
>   +1 408 740 7256
>   +1 888 240 2560(US Toll Free)
>   +1 408 317 9253(Alternate Number)
>   (see all numbers - http://bluejeans.com/numbers)
> 2) Enter Conference ID: 268261044
>
> Mark


Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy

2016-01-05 Thread Alfredo Deza
This is odd. We are signing all packages before publishing them on the
repository. These ceph-deploy releases follow a new release process, so I
will have to investigate where the disconnect is.

Thanks for letting us know.

On Tue, Jan 5, 2016 at 10:31 AM, Derek Yarnell  wrote:
> It looks like the ceph-deploy > 1.5.28 packages in the
> http://download.ceph.com/rpm-hammer/el6 and
> http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP
> signed.  What happened?  This is causing our yum updates to fail but may
> be a sign of something much more nefarious?
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.28-0.noarch.rpm
> 89021c040001020006050255fae0d5000a0910e84ac2c0460f3994203610009e284c0c6749f9d1ccd54aca8668e5f4148eb60f0ade762a5cb316926060d73a82490c41b8a5e9a5ebb8a7136a5ce294565cf8548dce160f7a577b623f12fb841b1656fba0b139404b4a074c076abf8c38f176bbecfc551567d22826d6c3ac2a67d8c8f4db67e3a2566272f492f3a1461b2c80bfc56f0c29e3a0c0e03fe50ee877d2d2b99963ea876914f5d85ae6fcf60c7c372040fcc82591552af21e152a37ab4103c3116ccd3a5f10992dc9ec483922212ef8ad8c37abbb6a751f6da2cc79567ed45e7bcb83d92aecc2a61d7584699183622714376bf3766e8781c7675834cce7d3e6c349bee6992872248fe7dd9f00248806e0c99f1a7010a8e77d13fefffeb142c1ee4ee8e55e53043fb89b7127a1c2282f4ab0fa3d19eccaa38194aa42310860bdd7746de8512b106d7923e9da9d1ad84b4ba1f8a3175b808d08f99ca5b737d4a7cba1f165b815187bec9ff1e0b5627e435ed869ae0bb16419e928e1a64413bb4dd62a6b1b049faa02eaa14bd6636b5f835bfef16acfd2daad82c1fed57a5e635971281367d2fe99c3b2b542490559d9b9b3f4295c86185aa3c4b4014da55c1b0ff68bc42c869729fee29472c413c911ea9bc5d58957bfb670ddc54d28fd8f30444969b790e53f9d34a1b2df9b
>
>  e2afe9d26
> d5be57b9fcd659c4880fad613ba5f175e4e3466dba4919a4656ffd228688a9c81d865e6df870ba33bbfc000
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.29-0.noarch.rpm
> (none)
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.30-0.noarch.rpm
> (none)
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.28-0.noarch.rpm
> 89021c040001020006050255fadf42000a0910e84ac2c0460f39943b131000cb7f253c91019b2f5993fd232c4369003d521538aa19f996717d2eee780fe2d7ed4e969418ce92d6ad4be69b3c5421b80d2241a9d6e72e758ba86f0360e24aadd63d89165b47a566bcd8bed39d7b37e809d7afdf6b38e5e014f98caca6df7da6278822e2457c627cdba505febc23edb32447e11c2878e79bf5f5690def708ed7d79d261a839d5808b177cb3d6a8bc62317441f3e1b5cf986aeb5cde98fc986c42af2761418e7e83309df9b8703648a8e6eefe83f9d3cbcfe371bc336320657f86343ab25df8bd578203b6f312746ebbe0da195adeb1087487d12d530281b5328731c54240b0c5c01f1648c8802231876a33a0835a553e1b84e6d8a15acdd5db6b6bf9c6dee84b22ae0e70dc0cf2acdd5779e510a248844bba0af87ae8d5a874502ec0e48b235926222cf3386c44e30e3af14dea6134a5873784013297fa19a09f439bc8a2b73f563fc6e5cfa60767629a37f3cd24762f7b14e5f7ce08adeed82da3effc59298359a9f7f0efab0e4e808a33ceb07431530e0c279462da043bbece02d3fdf6a96e5a813eea0bf0f73e84b7fac6e28449e1bf15ddc2fa692f641ce8d4d9ed4261ba2824adee47dad90993ebc46d6ee083e92c8f76aaf8428e274e48cb1a91d0a2eb15e8779289b3771ef71
>
>  1cd9cc7f2
> 8f7a3cde708e4577b0aad546024ee98646f4f543ee1e33d8c96a93cff9b48deefa5b3996f659b16786ff016
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.29-0.noarch.rpm
> (none)
>
> # rpm -qp --queryformat %{SIGPGP}
> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.30-0.noarch.rpm
> (none)
>
> --
> Derek T. Yarnell
> University of Maryland
> Institute for Advanced Computer Studies


Re: FreeBSD Building and Testing

2016-01-05 Thread Gregory Farnum
On Mon, Dec 28, 2015 at 8:53 AM, Willem Jan Withagen  wrote:
> Hi,
>
> Can somebody try to help me and explain why
>
> in test: Func: test/mon/osd-crash
> Func: TEST_crush_reject_empty started
>
> Fails with a python error which sort of startles me:
> test/mon/osd-crush.sh:227: TEST_crush_reject_empty:  local
> empty_map=testdir/osd-crush/empty_map
> test/mon/osd-crush.sh:228: TEST_crush_reject_empty:  :
> test/mon/osd-crush.sh:229: TEST_crush_reject_empty:  ./crushtool -c
> testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.m
> ap
> test/mon/osd-crush.sh:230: TEST_crush_reject_empty:  expect_failure
> testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testd
> ir/osd-crush/empty_map.map
> ../qa/workunits/ceph-helpers.sh:1171: expect_failure:  local
> dir=testdir/osd-crush
> ../qa/workunits/ceph-helpers.sh:1172: expect_failure:  shift
> ../qa/workunits/ceph-helpers.sh:1173: expect_failure:  local 'expected=Error
> EINVAL'
> ../qa/workunits/ceph-helpers.sh:1174: expect_failure:  shift
> ../qa/workunits/ceph-helpers.sh:1175: expect_failure:  local success
> ../qa/workunits/ceph-helpers.sh:1176: expect_failure:  pwd
> ../qa/workunits/ceph-helpers.sh:1177: expect_failure:  printenv
> ../qa/workunits/ceph-helpers.sh:1178: expect_failure:  echo ./ceph osd
> setcrushmap -i testdir/osd-crush/empty_map.map
> ../qa/workunits/ceph-helpers.sh:1180: expect_failure:  ./ceph osd
> setcrushmap -i testdir/osd-crush/empty_map.map
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> Traceback (most recent call last):
>   File "./ceph", line 936, in <module>
> retval = main()
>   File "./ceph", line 874, in main
> sigdict, inbuf, verbose)
>   File "./ceph", line 457, in new_style_command
> inbuf=inbuf)
>   File "/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py",
> line 1208, in json_command
> raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
> RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd
> setcrushmap"}']": exception 'utf8' codec can't decode b
> yte 0x86 in position 56: invalid start byte
>
> Which is certainly not the type of error expected.
> But it is hard to detect any 0x86 in the arguments.
>
> And yes python is right, there are no UTF8 sequences that start with 0x86.
> Question is:
> Why does it want to parse with UTF8?
> And how do I switch it off?
> Or how to I fix this error?

I've not handled this myself but we've seen this a few times. The
latest example in a quick email search was
http://tracker.ceph.com/issues/9405, and it was apparently caused by a
string which wasn't null-terminated.
-Greg


Re: CBT on an existing cluster

2016-01-05 Thread Gregory Farnum
On Tue, Jan 5, 2016 at 9:56 AM, Deneau, Tom  wrote:
> Having trouble getting a reply from c...@cbt.com so trying ceph-devel list...
>
> To get familiar with CBT, I first wanted to use it on an existing cluster.
> (i.e., not have CBT do any cluster setup).
>
> Is there a .yaml example that illustrates how to use cbt to run for example, 
> its radosbench benchmark on an existing cluster?

I dunno anything about CBT, but I don't see any emails from you on
that list and the correct address is c...@lists.ceph.com (rather than
the other way around), so let's try that. :)
-Greg

PS: next reply drop ceph-devel, please!


Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy

2016-01-05 Thread Alfredo Deza
It looks like this was only for ceph-deploy in Hammer. I verified that
this wasn't the case in, e.g., Infernalis.

I have ensured that the ceph-deploy packages in hammer are in fact
signed and coming from our builds.

Thanks again for reporting this!

On Tue, Jan 5, 2016 at 12:27 PM, Alfredo Deza  wrote:
> This is odd. We are signing all packages before publishing them on the
> repository. These ceph-deploy releases are following a new release
> process so I will
> have to investigate where is the disconnect.
>
> Thanks for letting us know.
>
> On Tue, Jan 5, 2016 at 10:31 AM, Derek Yarnell  wrote:
>> It looks like the ceph-deploy > 1.5.28 packages in the
>> http://download.ceph.com/rpm-hammer/el6 and
>> http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP
>> signed.  What happened?  This is causing our yum updates to fail but may
>> be a sign of something much more nefarious?
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.28-0.noarch.rpm
>> 89021c040001020006050255fae0d5000a0910e84ac2c0460f3994203610009e284c0c6749f9d1ccd54aca8668e5f4148eb60f0ade762a5cb316926060d73a82490c41b8a5e9a5ebb8a7136a5ce294565cf8548dce160f7a577b623f12fb841b1656fba0b139404b4a074c076abf8c38f176bbecfc551567d22826d6c3ac2a67d8c8f4db67e3a2566272f492f3a1461b2c80bfc56f0c29e3a0c0e03fe50ee877d2d2b99963ea876914f5d85ae6fcf60c7c372040fcc82591552af21e152a37ab4103c3116ccd3a5f10992dc9ec483922212ef8ad8c37abbb6a751f6da2cc79567ed45e7bcb83d92aecc2a61d7584699183622714376bf3766e8781c7675834cce7d3e6c349bee6992872248fe7dd9f00248806e0c99f1a7010a8e77d13fefffeb142c1ee4ee8e55e53043fb89b7127a1c2282f4ab0fa3d19eccaa38194aa42310860bdd7746de8512b106d7923e9da9d1ad84b4ba1f8a3175b808d08f99ca5b737d4a7cba1f165b815187bec9ff1e0b5627e435ed869ae0bb16419e928e1a64413bb4dd62a6b1b049faa02eaa14bd6636b5f835bfef16acfd2daad82c1fed57a5e635971281367d2fe99c3b2b542490559d9b9b3f4295c86185aa3c4b4014da55c1b0ff68bc42c869729fee29472c413c911ea9bc5d58957bfb670ddc54d28fd8f30444969b790e53f9d34a1b2df9b
>>
>>  e2afe9d26
>> d5be57b9fcd659c4880fad613ba5f175e4e3466dba4919a4656ffd228688a9c81d865e6df870ba33bbfc000
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.29-0.noarch.rpm
>> (none)
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.30-0.noarch.rpm
>> (none)
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.28-0.noarch.rpm
>> 89021c040001020006050255fadf42000a0910e84ac2c0460f39943b131000cb7f253c91019b2f5993fd232c4369003d521538aa19f996717d2eee780fe2d7ed4e969418ce92d6ad4be69b3c5421b80d2241a9d6e72e758ba86f0360e24aadd63d89165b47a566bcd8bed39d7b37e809d7afdf6b38e5e014f98caca6df7da6278822e2457c627cdba505febc23edb32447e11c2878e79bf5f5690def708ed7d79d261a839d5808b177cb3d6a8bc62317441f3e1b5cf986aeb5cde98fc986c42af2761418e7e83309df9b8703648a8e6eefe83f9d3cbcfe371bc336320657f86343ab25df8bd578203b6f312746ebbe0da195adeb1087487d12d530281b5328731c54240b0c5c01f1648c8802231876a33a0835a553e1b84e6d8a15acdd5db6b6bf9c6dee84b22ae0e70dc0cf2acdd5779e510a248844bba0af87ae8d5a874502ec0e48b235926222cf3386c44e30e3af14dea6134a5873784013297fa19a09f439bc8a2b73f563fc6e5cfa60767629a37f3cd24762f7b14e5f7ce08adeed82da3effc59298359a9f7f0efab0e4e808a33ceb07431530e0c279462da043bbece02d3fdf6a96e5a813eea0bf0f73e84b7fac6e28449e1bf15ddc2fa692f641ce8d4d9ed4261ba2824adee47dad90993ebc46d6ee083e92c8f76aaf8428e274e48cb1a91d0a2eb15e8779289b3771ef71
>>
>>  1cd9cc7f2
>> 8f7a3cde708e4577b0aad546024ee98646f4f543ee1e33d8c96a93cff9b48deefa5b3996f659b16786ff016
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.29-0.noarch.rpm
>> (none)
>>
>> # rpm -qp --queryformat %{SIGPGP}
>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.30-0.noarch.rpm
>> (none)
>>
>> --
>> Derek T. Yarnell
>> University of Maryland
>> Institute for Advanced Computer Studies


Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy

2016-01-05 Thread Derek Yarnell
Hi Alfredo,

I am still having a bit of trouble though with what looks like the
1.5.31 release.  With a `yum update ceph-deploy` I get the following
even after a full `yum clean all`.

http://ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.31-0.noarch.rpm:
[Errno -1] Package does not match intended download. Suggestion: run yum
--enablerepo=Ceph-noarch clean metadata

Thanks,
derek

On 1/5/16 1:25 PM, Alfredo Deza wrote:
> It looks like this was only for ceph-deploy in Hammer. I verified that
> this wasn't the case in e.g. Infernalis
> 
> I have ensured that the ceph-deploy packages in hammer are in fact
> signed and coming from our builds.
> 
> Thanks again for reporting this!
> 
> On Tue, Jan 5, 2016 at 12:27 PM, Alfredo Deza  wrote:
>> This is odd. We are signing all packages before publishing them on the
>> repository. These ceph-deploy releases are following a new release
>> process so I will
>> have to investigate where is the disconnect.
>>
>> Thanks for letting us know.
>>
>> On Tue, Jan 5, 2016 at 10:31 AM, Derek Yarnell  wrote:
>>> It looks like the ceph-deploy > 1.5.28 packages in the
>>> http://download.ceph.com/rpm-hammer/el6 and
>>> http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP
>>> signed.  What happened?  This is causing our yum updates to fail but may
>>> be a sign of something much more nefarious?
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.28-0.noarch.rpm
>>> 89021c040001020006050255fae0d5000a0910e84ac2c0460f3994203610009e284c0c6749f9d1ccd54aca8668e5f4148eb60f0ade762a5cb316926060d73a82490c41b8a5e9a5ebb8a7136a5ce294565cf8548dce160f7a577b623f12fb841b1656fba0b139404b4a074c076abf8c38f176bbecfc551567d22826d6c3ac2a67d8c8f4db67e3a2566272f492f3a1461b2c80bfc56f0c29e3a0c0e03fe50ee877d2d2b99963ea876914f5d85ae6fcf60c7c372040fcc82591552af21e152a37ab4103c3116ccd3a5f10992dc9ec483922212ef8ad8c37abbb6a751f6da2cc79567ed45e7bcb83d92aecc2a61d7584699183622714376bf3766e8781c7675834cce7d3e6c349bee6992872248fe7dd9f00248806e0c99f1a7010a8e77d13fefffeb142c1ee4ee8e55e53043fb89b7127a1c2282f4ab0fa3d19eccaa38194aa42310860bdd7746de8512b106d7923e9da9d1ad84b4ba1f8a3175b808d08f99ca5b737d4a7cba1f165b815187bec9ff1e0b5627e435ed869ae0bb16419e928e1a64413bb4dd62a6b1b049faa02eaa14bd6636b5f835bfef16acfd2daad82c1fed57a5e635971281367d2fe99c3b2b542490559d9b9b3f4295c86185aa3c4b4014da55c1b0ff68bc42c869729fee29472c413c911ea9bc5d58957bfb670ddc54d28fd8f30444969b790e53f9d34a1b2
 
 df9b
>>>
>>>  e2afe9d26
>>> d5be57b9fcd659c4880fad613ba5f175e4e3466dba4919a4656ffd228688a9c81d865e6df870ba33bbfc000
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.29-0.noarch.rpm
>>> (none)
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.30-0.noarch.rpm
>>> (none)
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.28-0.noarch.rpm
>>> 89021c040001020006050255fadf42000a0910e84ac2c0460f39943b131000cb7f253c91019b2f5993fd232c4369003d521538aa19f996717d2eee780fe2d7ed4e969418ce92d6ad4be69b3c5421b80d2241a9d6e72e758ba86f0360e24aadd63d89165b47a566bcd8bed39d7b37e809d7afdf6b38e5e014f98caca6df7da6278822e2457c627cdba505febc23edb32447e11c2878e79bf5f5690def708ed7d79d261a839d5808b177cb3d6a8bc62317441f3e1b5cf986aeb5cde98fc986c42af2761418e7e83309df9b8703648a8e6eefe83f9d3cbcfe371bc336320657f86343ab25df8bd578203b6f312746ebbe0da195adeb1087487d12d530281b5328731c54240b0c5c01f1648c8802231876a33a0835a553e1b84e6d8a15acdd5db6b6bf9c6dee84b22ae0e70dc0cf2acdd5779e510a248844bba0af87ae8d5a874502ec0e48b235926222cf3386c44e30e3af14dea6134a5873784013297fa19a09f439bc8a2b73f563fc6e5cfa60767629a37f3cd24762f7b14e5f7ce08adeed82da3effc59298359a9f7f0efab0e4e808a33ceb07431530e0c279462da043bbece02d3fdf6a96e5a813eea0bf0f73e84b7fac6e28449e1bf15ddc2fa692f641ce8d4d9ed4261ba2824adee47dad90993ebc46d6ee083e92c8f76aaf8428e274e48cb1a91d0a2eb15e8779289b3771
 
 ef71
>>>
>>>  1cd9cc7f2
>>> 8f7a3cde708e4577b0aad546024ee98646f4f543ee1e33d8c96a93cff9b48deefa5b3996f659b16786ff016
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.29-0.noarch.rpm
>>> (none)
>>>
>>> # rpm -qp --queryformat %{SIGPGP}
>>> http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.30-0.noarch.rpm
>>> (none)
>>>
>>> --
>>> Derek T. Yarnell
>>> University of Maryland
>>> Institute for Advanced Computer Studies

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies


Re: FreeBSD Building and Testing

2016-01-05 Thread Mykola Golub
On Mon, Dec 28, 2015 at 05:53:04PM +0100, Willem Jan Withagen wrote:
> Hi,
> 
> Can somebody try to help me and explain why
> 
> in test: Func: test/mon/osd-crash
> Func: TEST_crush_reject_empty started
> 
> Fails with a python error which sort of startles me:
> test/mon/osd-crush.sh:227: TEST_crush_reject_empty:  local
> empty_map=testdir/osd-crush/empty_map
> test/mon/osd-crush.sh:228: TEST_crush_reject_empty:  :
> test/mon/osd-crush.sh:229: TEST_crush_reject_empty:  ./crushtool -c
> testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.m
> ap
> test/mon/osd-crush.sh:230: TEST_crush_reject_empty:  expect_failure
> testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testd
> ir/osd-crush/empty_map.map
> ../qa/workunits/ceph-helpers.sh:1171: expect_failure:  local
> dir=testdir/osd-crush
> ../qa/workunits/ceph-helpers.sh:1172: expect_failure:  shift
> ../qa/workunits/ceph-helpers.sh:1173: expect_failure:  local 'expected=Error
> EINVAL'
> ../qa/workunits/ceph-helpers.sh:1174: expect_failure:  shift
> ../qa/workunits/ceph-helpers.sh:1175: expect_failure:  local success
> ../qa/workunits/ceph-helpers.sh:1176: expect_failure:  pwd
> ../qa/workunits/ceph-helpers.sh:1177: expect_failure:  printenv
> ../qa/workunits/ceph-helpers.sh:1178: expect_failure:  echo ./ceph osd
> setcrushmap -i testdir/osd-crush/empty_map.map
> ../qa/workunits/ceph-helpers.sh:1180: expect_failure:  ./ceph osd
> setcrushmap -i testdir/osd-crush/empty_map.map
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> Traceback (most recent call last):
>   File "./ceph", line 936, in <module>
> retval = main()
>   File "./ceph", line 874, in main
> sigdict, inbuf, verbose)
>   File "./ceph", line 457, in new_style_command
> inbuf=inbuf)
>   File "/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py",
> line 1208, in json_command
> raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
> RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd
> setcrushmap"}']": exception 'utf8' codec can't decode b
> yte 0x86 in position 56: invalid start byte
> 
> Which is certainly not the type of error expected.
> But it is hard to detect any 0x86 in the arguments.

Are you able to reproduce this problem manually? I.e. in src dir, start the
cluster using vstart.sh:

./vstart.sh -n

Check it is running:

./ceph -s

Repeat the test:

truncate -s 0 empty_map.txt
./crushtool -c empty_map.txt -o empty_map.map
./ceph osd setcrushmap -i empty_map.map

Expected output:

 "Error EINVAL: Failed crushmap test: ./crushtool: exit status: 1"

-- 
Mykola Golub


Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy

2016-01-05 Thread Alfredo Deza
It seems that the metadata didn't get updated.

I just tried it out and got the right version with no issues. Hopefully
*this* time it works for you.

Sorry for all the trouble.

On Tue, Jan 5, 2016 at 3:21 PM, Derek Yarnell  wrote:
> Hi Alfredo,
>
> I am still having a bit of trouble though with what looks like the
> 1.5.31 release.  With a `yum update ceph-deploy` I get the following
> even after a full `yum clean all`.
>
> http://ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.31-0.noarch.rpm:
> [Errno -1] Package does not match intended download. Suggestion: run yum
> --enablerepo=Ceph-noarch clean metadata
>
> Thanks,
> derek
>
> On 1/5/16 1:25 PM, Alfredo Deza wrote:
>> It looks like this was only for ceph-deploy in Hammer. I verified that
>> this wasn't the case in e.g. Infernalis
>>
>> I have ensured that the ceph-deploy packages in hammer are in fact
>> signed and coming from our builds.
>>
>> Thanks again for reporting this!
>>
>> On Tue, Jan 5, 2016 at 12:27 PM, Alfredo Deza  wrote:
>>> This is odd. We are signing all packages before publishing them on the
>>> repository. These ceph-deploy releases are following a new release
>>> process so I will
>>> have to investigate where is the disconnect.
>>>
>>> Thanks for letting us know.
>>>
>>> On Tue, Jan 5, 2016 at 10:31 AM, Derek Yarnell  wrote:
 It looks like the ceph-deploy > 1.5.28 packages in the
 http://download.ceph.com/rpm-hammer/el6 and
 http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP
 signed.  What happened?  This is causing our yum updates to fail but may
 be a sign of something much more nefarious?

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.28-0.noarch.rpm
 89021c040001020006050255fae0d5000a0910e84ac2c0460f3994203610009e284c0c6749f9d1ccd54aca8668e5f4148eb60f0ade762a5cb316926060d73a82490c41b8a5e9a5ebb8a7136a5ce294565cf8548dce160f7a577b623f12fb841b1656fba0b139404b4a074c076abf8c38f176bbecfc551567d22826d6c3ac2a67d8c8f4db67e3a2566272f492f3a1461b2c80bfc56f0c29e3a0c0e03fe50ee877d2d2b99963ea876914f5d85ae6fcf60c7c372040fcc82591552af21e152a37ab4103c3116ccd3a5f10992dc9ec483922212ef8ad8c37abbb6a751f6da2cc79567ed45e7bcb83d92aecc2a61d7584699183622714376bf3766e8781c7675834cce7d3e6c349bee6992872248fe7dd9f00248806e0c99f1a7010a8e77d13fefffeb142c1ee4ee8e55e53043fb89b7127a1c2282f4ab0fa3d19eccaa38194aa42310860bdd7746de8512b106d7923e9da9d1ad84b4ba1f8a3175b808d08f99ca5b737d4a7cba1f165b815187bec9ff1e0b5627e435ed869ae0bb16419e928e1a64413bb4dd62a6b1b049faa02eaa14bd6636b5f835bfef16acfd2daad82c1fed57a5e635971281367d2fe99c3b2b542490559d9b9b3f4295c86185aa3c4b4014da55c1b0ff68bc42c869729fee29472c413c911ea9bc5d58957bfb670ddc54d28fd8f30444969b790e53f9d34a1b2
>
>  df9b

  e2afe9d26
 d5be57b9fcd659c4880fad613ba5f175e4e3466dba4919a4656ffd228688a9c81d865e6df870ba33bbfc000

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.29-0.noarch.rpm
 (none)

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.30-0.noarch.rpm
 (none)

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.28-0.noarch.rpm
 89021c040001020006050255fadf42000a0910e84ac2c0460f39943b131000cb7f253c91019b2f5993fd232c4369003d521538aa19f996717d2eee780fe2d7ed4e969418ce92d6ad4be69b3c5421b80d2241a9d6e72e758ba86f0360e24aadd63d89165b47a566bcd8bed39d7b37e809d7afdf6b38e5e014f98caca6df7da6278822e2457c627cdba505febc23edb32447e11c2878e79bf5f5690def708ed7d79d261a839d5808b177cb3d6a8bc62317441f3e1b5cf986aeb5cde98fc986c42af2761418e7e83309df9b8703648a8e6eefe83f9d3cbcfe371bc336320657f86343ab25df8bd578203b6f312746ebbe0da195adeb1087487d12d530281b5328731c54240b0c5c01f1648c8802231876a33a0835a553e1b84e6d8a15acdd5db6b6bf9c6dee84b22ae0e70dc0cf2acdd5779e510a248844bba0af87ae8d5a874502ec0e48b235926222cf3386c44e30e3af14dea6134a5873784013297fa19a09f439bc8a2b73f563fc6e5cfa60767629a37f3cd24762f7b14e5f7ce08adeed82da3effc59298359a9f7f0efab0e4e808a33ceb07431530e0c279462da043bbece02d3fdf6a96e5a813eea0bf0f73e84b7fac6e28449e1bf15ddc2fa692f641ce8d4d9ed4261ba2824adee47dad90993ebc46d6ee083e92c8f76aaf8428e274e48cb1a91d0a2eb15e8779289b3771
>
>  ef71

  1cd9cc7f2
 8f7a3cde708e4577b0aad546024ee98646f4f543ee1e33d8c96a93cff9b48deefa5b3996f659b16786ff016

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.29-0.noarch.rpm
 (none)

 # rpm -qp --queryformat %{SIGPGP}
 http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.30-0.noarch.rpm
 (none)

 --
 Derek T. Yarnell
 University of Maryland
 Institute for Advanced Computer Studies

Re: Long peering - throttle at FileStore::queue_transactions

2016-01-05 Thread Guang Yang
On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil  wrote:
> On Mon, 4 Jan 2016, Guang Yang wrote:
>> Hi Cephers,
>> Happy New Year! I have a question regarding the long PG peering.
>>
>> Over the last several days I have been looking into the *long peering*
>> problem when we start an OSD / OSD host. What I observed was that the
>> two peering worker threads were throttled (stuck) when trying to
>> queue new transactions (writing the pg log), and thus the peering process
>> was dramatically slowed down.
>>
>> The first question came to me was, what were the transactions in the
>> queue? The major ones, as I saw, included:
>>
>> - The osd_map and incremental osd_map, this happens if the OSD had
>> been down for a while (in a large cluster), or when the cluster got
>> upgrade, which made the osd_map epoch the down OSD had, was far behind
>> the latest osd_map epoch. During the OSD booting, it would need to
>> persist all those osd_maps and generate lots of filestore transactions
>> (linear with the epoch gap).
>> > As the PG was not involved in most of those epochs, could we only take and 
>> > persist those osd_maps which matter to the PGs on the OSD?
>
> This part should happen before the OSD sends the MOSDBoot message, before
> anyone knows it exists.  There is a tunable threshold that controls how
> recent the map has to be before the OSD tries to boot.  If you're
> seeing this in the real world, we probably just need to adjust that value
> way down to something small(er).
It queues the transactions and then sends out MOSDBoot, so there is still
a chance that it contends with the peering ops (especially on large
clusters where there is a lot of activity generating many osdmap epochs).
Any chance we could change *queue_transactions* to *apply_transactions*,
so that we block there waiting for the osdmap to be persisted? At least we
may be able to do that during OSD booting. The concern is that if the OSD
is active, apply_transaction would take longer while holding the osd_lock.
I can't find such a tunable, could you elaborate? Thanks!
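For what it's worth, the distinction I have in mind, expressed generically
(this is not Ceph's ObjectStore API, just a sketch of queueing with a
completion versus blocking until the commit is durable):

#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

struct Transaction { std::function<void()> work; };   // stand-in for an osdmap write

// Generic sketch, not Ceph's ObjectStore: a backend thread commits queued
// transactions; "queue" returns immediately, "apply" waits for durability.
class Store {
  std::mutex m;
  std::condition_variable cv;
  std::queue<std::pair<Transaction, std::promise<void>>> q;
  bool stop = false;
  std::thread backend;

 public:
  Store() : backend([this] { run(); }) {}
  ~Store() {
    { std::lock_guard<std::mutex> l(m); stop = true; }
    cv.notify_all();
    backend.join();
  }

  // Fire-and-forget: the caller can race ahead of the commit.
  std::future<void> queue_transaction(Transaction t) {
    std::promise<void> p;
    std::future<void> f = p.get_future();
    { std::lock_guard<std::mutex> l(m); q.emplace(std::move(t), std::move(p)); }
    cv.notify_one();
    return f;
  }

  // Blocking variant: do not return until the transaction has been committed.
  void apply_transaction(Transaction t) { queue_transaction(std::move(t)).wait(); }

 private:
  void run() {
    std::unique_lock<std::mutex> l(m);
    while (!stop || !q.empty()) {
      if (q.empty()) { cv.wait(l); continue; }
      auto item = std::move(q.front());
      q.pop();
      l.unlock();
      item.first.work();        // "commit" (e.g. persist an osdmap epoch)
      item.second.set_value();  // signal the waiter, if any
      l.lock();
    }
  }
};

int main() {
  Store s;
  s.apply_transaction({[] { /* persist an osdmap epoch durably */ }});
  return 0;
}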
>
> sage
>


Re: hammer mon failure

2016-01-05 Thread Joao Eduardo Luis
On 01/05/2016 07:55 PM, Samuel Just wrote:
> http://tracker.ceph.com/issues/14236
> 
> New hammer mon failure in the nightlies (missing a map apparently?),
> can you take a look?
> -Sam

Will do.

  -Joao


Re: OSD data file are OSD logs

2016-01-04 Thread Samuel Just
IIRC, you are running giant.  I think that's the log rotate dangling
fd bug (not fixed in giant since giant is eol).  Fixed upstream
8778ab3a1ced7fab07662248af0c773df759653d, firefly backport is
b8e3f6e190809febf80af66415862e7c7e415214.
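For anyone curious how log text can end up inside an object data file at all,
a self-contained illustration of the stale-descriptor hazard (not the actual
Ceph code path): once a descriptor is closed, the kernel may hand the same
number to the next open, and anything still writing through the old integer
scribbles into whichever file owns it now.

#include <fcntl.h>
#include <unistd.h>

int main() {
  int log_fd = open("/tmp/osd.log", O_CREAT | O_WRONLY | O_APPEND, 0644);

  close(log_fd);   // "log rotation" closes the descriptor...

  // ...and the next open very likely reuses the same descriptor number.
  int data_fd = open("/tmp/object_data", O_CREAT | O_WRONLY, 0644);

  // A stale writer still holding the old integer now puts log lines into
  // the object data file -- the symptom reported above.
  ssize_t n = write(log_fd, "2016-01-03 07:30:01 filestore getattrs ...\n", 43);
  (void)n;

  close(data_fd);
  return 0;
}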
-Sam

On Mon, Jan 4, 2016 at 3:37 PM, Guang Yang  wrote:
> Hi Cephers,
> Before I open a tracker, I would like check if it is a known issue or not..
>
> On one of our clusters, there was an OSD crash during repair; the
> crash happened after we issued a PG repair for inconsistent PGs, which
> failed because the recorded file size (within xattr) mismatched
> the actual file size.
>
> The mismatch was caused by the fact that the content of the data file
> are OSD logs, following is from osd.354 on c003:
>
> -rw-r--r-- 1 yahoo root  75168 Jan  3 07:30
> default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7
> -bash-4.1$ head
> "default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7"
> 2016-01-03 07:30:01.600119 7f7fe2096700 15
> filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs
> 3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7
> 2016-01-03 07:30:01.604967 7f7fe2096700 10
> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, len is 494
> 2016-01-03 07:30:01.604984 7f7fe2096700 10
> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, got 247
> 2016-01-03 07:30:01.604986 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> '_user.rgw.idtag'
> 2016-01-03 07:30:01.604996 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_'
> 2016-01-03 07:30:01.605007 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> 'snapset'
> 2016-01-03 07:30:01.605013 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> '_user.rgw.manifest'
> 2016-01-03 07:30:01.605026 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> 'hinfo_key'
> 2016-01-03 07:30:01.605042 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> '_user.rgw.x-amz-meta-origin'
> 2016-01-03 07:30:01.605049 7f7fe2096700 20
> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
> '_user.rgw.acl'
>
>
> This only happens on the clusters we turned on the verbose log
> (debug_osd/filestore=20). And we are running ceph v0.87.
>
> Thanks,
> Guang


Re: OSD data file are OSD logs

2016-01-04 Thread Guang Yang
Thanks Sam for the confirmation.

Thanks,
Guang

On Mon, Jan 4, 2016 at 3:59 PM, Samuel Just  wrote:
> IIRC, you are running giant.  I think that's the log rotate dangling
> fd bug (not fixed in giant since giant is eol).  Fixed upstream
> 8778ab3a1ced7fab07662248af0c773df759653d, firefly backport is
> b8e3f6e190809febf80af66415862e7c7e415214.
> -Sam
>
> On Mon, Jan 4, 2016 at 3:37 PM, Guang Yang  wrote:
>> Hi Cephers,
>> Before I open a tracker, I would like check if it is a known issue or not..
>>
>> On one of our clusters, there was an OSD crash during repair; the
>> crash happened after we issued a PG repair for inconsistent PGs, which
>> failed because the recorded file size (within xattr) mismatched
>> the actual file size.
>>
>> The mismatch was caused by the fact that the content of the data file
>> are OSD logs, following is from osd.354 on c003:
>>
>> -rw-r--r-- 1 yahoo root  75168 Jan  3 07:30
>> default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7
>> -bash-4.1$ head
>> "default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7"
>> 2016-01-03 07:30:01.600119 7f7fe2096700 15
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs
>> 3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7
>> 2016-01-03 07:30:01.604967 7f7fe2096700 10
>> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, len is 494
>> 2016-01-03 07:30:01.604984 7f7fe2096700 10
>> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, got 247
>> 2016-01-03 07:30:01.604986 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.idtag'
>> 2016-01-03 07:30:01.604996 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_'
>> 2016-01-03 07:30:01.605007 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> 'snapset'
>> 2016-01-03 07:30:01.605013 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.manifest'
>> 2016-01-03 07:30:01.605026 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> 'hinfo_key'
>> 2016-01-03 07:30:01.605042 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.x-amz-meta-origin'
>> 2016-01-03 07:30:01.605049 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.acl'
>>
>>
>> This only happens on the clusters we turned on the verbose log
>> (debug_osd/filestore=20). And we are running ceph v0.87.
>>
>> Thanks,
>> Guang


Re: Long peering - throttle at FileStore::queue_transactions

2016-01-04 Thread Samuel Just
We need every OSDMap persisted before persisting later ones because we
rely on there being no holes for a bunch of reasons.

The deletion transactions are more interesting.  They are not part of the
boot process; these are deletions resulting from merging in a log from
a peer which logically removed an object.  It's more noticeable on
boot because all PGs will see these operations at once (if there are a
bunch of deletes happening).  We need to process these transactions
before we can serve reads (before we activate) currently since we use
the on disk state (modulo the objectcontext locks) as authoritative.
That transaction iirc also contains the updated PGLog.  We can't avoid
writing down the PGLog prior to activation, but we *can* delay the
deletes (and even batch/throttle them) if we do some work:
1) During activation, we need to maintain a set of to-be-deleted
objects.  For each of these objects, we need to populate the
objectcontext cache with an exists=false objectcontext so that we
don't erroneously read the deleted data.  Each of the entries in the
to-be-deleted object set would have a reference to the context to keep
it alive until the deletion is processed (a rough sketch follows this list).
2) Any write operation which references one of these objects needs to
be preceded by a delete if one has not yet been queued (and the
to-be-deleted set updated appropriately).  The tricky part is that the
primary and replicas may have different objects in this set...  The
replica would have to insert deletes ahead of any subop (or the ec
equivalent) it gets from the primary.  For that to work, it needs to
have something like the obc cache.  I have a wip-replica-read branch
which refactors object locking to allow the replica to maintain locks
(to avoid replica-reads conflicting with writes).  That machinery
would probably be the right place to put it.
3) We need to make sure that if a node restarts anywhere in this
process that it correctly repopulates the set of to be deleted
entries.  We might consider a deleted-to version in the log?  Not sure
about this one since it would be different on the replica and the
primary.
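
To make (1) and (2) a bit more concrete, here is a rough, illustrative-only
sketch of the bookkeeping (the names and types are hypothetical stand-ins,
not the actual OSD code):

#include <map>
#include <memory>
#include <string>

struct FakeObjectContext {            // stand-in for the real obc
  bool exists = false;                // logically deleted, not yet on disk
};
using FakeObjectContextRef = std::shared_ptr<FakeObjectContext>;

struct PendingDeletes {
  // objects logically removed by the merged log, deletion not yet queued
  std::map<std::string, FakeObjectContextRef> to_delete;

  void note_logical_delete(const std::string& oid) {
    auto obc = std::make_shared<FakeObjectContext>();
    obc->exists = false;              // reads must see "no such object"
    to_delete[oid] = obc;             // the ref keeps the context alive
  }

  // Called before queueing any write that touches oid: if a delete is
  // still pending for it, that delete has to be issued (or batched) first.
  bool must_delete_first(const std::string& oid) {
    return to_delete.erase(oid) > 0;
  }
};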

Anyway, it's actually more complicated than you'd expect and will
require more design (and probably depends on wip-replica-read
landing).
-Sam

On Mon, Jan 4, 2016 at 3:32 PM, Guang Yang  wrote:
> Hi Cephers,
> Happy New Year! I got question regards to the long PG peering..
>
> Over the last several days I have been looking into the *long peering*
> problem when we start a OSD / OSD host, what I observed was that the
> two peering working threads were throttled (stuck) when trying to
> queue new transactions (writing pg log), thus the peering process is
> dramatically slowed down.
>
> The first question came to me was, what were the transactions in the
> queue? The major ones, as I saw, included:
>
> - The osd_map and incremental osd_map, this happens if the OSD had
> been down for a while (in a large cluster), or when the cluster got
> upgrade, which made the osd_map epoch the down OSD had, was far behind
> the latest osd_map epoch. During the OSD booting, it would need to
> persist all those osd_maps and generate lots of filestore transactions
> (linear with the epoch gap).
>> As the PG was not involved in most of those epochs, could we only take and 
>> persist those osd_maps which matter to the PGs on the OSD?
>
> - There are lots of deletion transactions, and as the PG booting, it
> needs to merge the PG log from its peers, and for the deletion PG
> entry, it would need to queue the deletion transaction immediately.
>> Could we delay the queue of the transactions until all PGs on the host are 
>> peered?
>
> Thanks,
> Guang
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Long peering - throttle at FileStore::queue_transactions

2016-01-04 Thread Sage Weil
On Mon, 4 Jan 2016, Guang Yang wrote:
> Hi Cephers,
> Happy New Year! I got question regards to the long PG peering..
> 
> Over the last several days I have been looking into the *long peering*
> problem when we start a OSD / OSD host, what I observed was that the
> two peering working threads were throttled (stuck) when trying to
> queue new transactions (writing pg log), thus the peering process is
> dramatically slowed down.
> 
> The first question came to me was, what were the transactions in the
> queue? The major ones, as I saw, included:
> 
> - The osd_map and incremental osd_map, this happens if the OSD had
> been down for a while (in a large cluster), or when the cluster got
> upgrade, which made the osd_map epoch the down OSD had, was far behind
> the latest osd_map epoch. During the OSD booting, it would need to
> persist all those osd_maps and generate lots of filestore transactions
> (linear with the epoch gap).
> > As the PG was not involved in most of those epochs, could we only take and 
> > persist those osd_maps which matter to the PGs on the OSD?

This part should happen before the OSD sends the MOSDBoot message, before 
anyone knows it exists.  There is a tunable threshold that controls how 
recent the map has to be before the OSD tries to boot.  If you're 
seeing this in the real world, we probably just need to adjust that value 
way down to something small(er).

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Speeding up rbd_stat() in libvirt

2016-01-04 Thread Jason Dillaman
Short term, assuming there wouldn't be an objection from the libvirt community, 
I think spawning a thread pool and executing several rbd_stat 
calls concurrently would be the easiest and cleanest solution.  I wouldn't 
suggest trying to roll your own solution for retrieving image sizes for format 
1 and 2 RBD images directly within libvirt.
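
For illustration, a rough (untested) sketch of that short-term approach,
using the librbd/librados C++ API and std::thread -- the "rbd" pool name,
"admin" client id and thread count are placeholder assumptions:

#include <atomic>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>

int main() {
  librados::Rados cluster;
  if (cluster.init("admin") < 0 || cluster.conf_read_file(nullptr) < 0 ||
      cluster.connect() < 0)
    return 1;
  librados::IoCtx ioctx;
  if (cluster.ioctx_create("rbd", ioctx) < 0)
    return 1;

  librbd::RBD rbd;
  std::vector<std::string> names;
  rbd.list(ioctx, names);                  // one listing, as libvirt does today

  std::vector<uint64_t> sizes(names.size(), 0);
  std::atomic<size_t> next{0};
  auto worker = [&]() {
    // each worker pulls the next unprocessed image off the shared index
    for (size_t i = next++; i < names.size(); i = next++) {
      librbd::Image image;
      if (rbd.open(ioctx, image, names[i].c_str()) < 0)
        continue;                          // skip images we cannot open
      librbd::image_info_t info;
      if (image.stat(info, sizeof(info)) == 0)
        sizes[i] = info.size;              // the rbd_stat() part is cheap
      // the Image destructor closes it when it goes out of scope
    }
  };

  std::vector<std::thread> pool;
  for (int t = 0; t < 8; ++t)              // 8 concurrent open/stat calls
    pool.emplace_back(worker);
  for (auto &t : pool)
    t.join();

  for (size_t i = 0; i < names.size(); ++i)
    printf("%s: %llu bytes\n", names[i].c_str(),
           (unsigned long long)sizes[i]);
  cluster.shutdown();
  return 0;
}

(If sharing one IoCtx across threads turns out to be a concern, each worker
could create its own IoCtx from the same Rados handle.)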

Longer term, given this use case, perhaps it would make sense to add an async 
version of rbd_open.  The rbd_stat call itself just reads the data from memory 
initialized by rbd_open.  On the Jewel branch, librbd has had some major rework 
and image loading is asynchronous under the hood already.

-- 

Jason Dillaman 


- Original Message -
> From: "Wido den Hollander" 
> To: ceph-devel@vger.kernel.org
> Sent: Monday, December 28, 2015 8:48:40 AM
> Subject: Speeding up rbd_stat() in libvirt
> 
> Hi,
> 
> The storage pools of libvirt know a mechanism called 'refresh' which
> will scan a storage pool to refresh the contents.
> 
> The current implementation does:
> * List all images via rbd_list()
> * Call rbd_stat() on each image
> 
> Source:
> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=cdbfdee98505492407669130712046783223c3cf;hb=master#l329
> 
> This works, but a RBD pool with 10k images takes a couple of minutes to
> scan.
> 
> Now, Ceph is distributed, so this could be done in parallel, but before
> I start on this I was wondering if somebody had a good idea to fix this?
> 
> I don't know if it is allowed in libvirt to spawn multiple threads and
> have workers do this, but it was something which came to mind.
> 
> libvirt only wants to know the size of an image and this is not stored in
> the rbd_directory object, so the rbd_stat() is required.
> 
> Suggestions or ideas? I would like to have this process to be as fast as
> possible.
> 
> Wido
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Speeding up rbd_stat() in libvirt

2016-01-04 Thread Wido den Hollander


On 04-01-16 16:38, Jason Dillaman wrote:
> Short term, assuming there wouldn't be an objection from the libvirt 
> community, I think spawning a thread pool and executing several rbd_stat 
> calls concurrently would be the easiest and cleanest solution.  I 
> wouldn't suggest trying to roll your own solution for retrieving image sizes 
> for format 1 and 2 RBD images directly within libvirt.
> 

I'll ask in the libvirt community if they allow such a thing.

> Longer term, given this use case, perhaps it would make sense to add an async 
> version of rbd_open.  The rbd_stat call itself just reads the data from 
> memory initialized by rbd_open.  On the Jewel branch, librbd has had some 
> major rework and image loading is asynchronous under the hood already.
> 

Hmm, that would be nice. In the callback I could call rbd_stat() and
populate the volume list within libvirt.

I would very much like to go that route since it saves me a lot of code
inside libvirt ;)

Wido

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Create one millon empty files with cephfs

2016-01-04 Thread Gregory Farnum
On Tue, Dec 29, 2015 at 4:55 AM, Fengguang Gong  wrote:
> hi,
> We create one million empty files through filebench, here is the test env:
> MDS: one MDS
> MON: one MON
> OSD: two OSD, each with one Inter P3700; data on OSD with 2x replica
> Network: all nodes are connected through 10 gigabit network
>
> We use more than one client to create files, to test the scalability of
> MDS. Here are the results:
> IOPS under one client: 850
> IOPS under two client: 1150
> IOPS under four client: 1180
>
> As we can see, the IOPS remains almost unchanged when the number of
> clients increases from 2 to 4.
>
> CephFS may have low scalability under one MDS, and we think it is the big
> lock in
> MDSDaemon::ms_dispatch()::Mutex::Locker (every request acquires this lock)
> which limits the
> scalability of the MDS.
>
> We think this big lock could be removed through the following steps:
> 1. separate the processing of ClientRequests from other requests, so we can
> parallelize the processing
> of ClientRequests
> 2. use finer-grained locks instead of the big lock to ensure
> consistency
>
> Wondering if this idea is reasonable?

Parallelizing the MDS is probably a very big job; it's on our radar
but not for a while yet.

If one were to do it, yes, breaking down the big MDS lock would be the
way forward. I'm not sure entirely what that involves — you'd need to
significantly chunk up the locking on our more critical data
structures, most especially the MDCache. Luckily there is *some* help
there in terms of the file cap locking structures we already have in
place, but it's a *huge* project and not one to be undertaken lightly.
A special processing mechanism for ClientRequests versus other
requests is not an assumption I'd start with.

I think you'll find that file creates are just about the least
scalable thing you can do on CephFS right now, though, so there is
some easier ground. One obvious approach is to extend the current
inode preallocation — it already allocates inodes per-client and has a
fast path inside of the MDS for handing them back. It'd be great if
clients were aware of that preallocation and could create files
without waiting for the MDS to talk back to them! The issue with this
is two-fold:
1) need to update the cap flushing protocol to deal with files newly
created by the client
2) need to handle all the backtrace stuff normally performed by the
MDS on file create (which still needs to happen, on either the client
or the server)
There's also clean up in case of a client failure, but we've already
got a model for that in how we figure out real file sizes and things
based on max size.
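
To make the preallocation idea a bit more concrete, a toy sketch of what
"client-side use of a preallocated ino range" could look like (names here are
entirely hypothetical; the real prealloc/cap-flush machinery is more involved):

#include <cstdint>
#include <deque>
#include <stdexcept>

// Toy model: the MDS grants the client a range of preallocated inode
// numbers; the client can then "create" files locally and flush the new
// inodes back lazily, instead of one round trip per create.
struct PreallocatedInos {
  std::deque<uint64_t> free;          // inos granted by the MDS

  void grant(uint64_t start, uint64_t count) {   // from an MDS reply
    for (uint64_t i = 0; i < count; ++i)
      free.push_back(start + i);
  }

  uint64_t take_for_create() {        // local file create, no round trip
    if (free.empty())
      throw std::runtime_error("need to ask the MDS for more inos");
    uint64_t ino = free.front();
    free.pop_front();
    return ino;    // still has to be cap-flushed / backtraced later
  }
};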

I think there's a ticket about this somewhere, but I can't find it off-hand...
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 答复: Reboot blocked when undoing unmap op.

2016-01-04 Thread Ilya Dryomov
On Mon, Jan 4, 2016 at 10:51 AM, Wukongming  wrote:
> Hi, Ilya,
>
> It is an old problem.
> When you say "when you issue a reboot, daemons get killed and the kernel 
> client ends up waiting for them to come back, because of outstanding 
> writes issued by umount called by systemd (or whatever)."
>
> Do you mean that if the rbd is umounted successfully, the kernel client will 
> stop waiting? What kind of communication mechanism is there between libceph and 
> the daemons (or ceph userspace)?

If you umount the filesystem on top of rbd and unmap rbd image, there
won't be anything to wait for.  In fact, if there aren't any other rbd
images mapped, libceph will clean up after itself and exit.

If you umount the filesystem on top of rbd but don't unmap the image,
libceph will remain there, along with some amount of communication
(keepalive messages, watch requests, etc).  However, all of that is
internal and is unlikely to block reboot.

If you don't umount the filesystem, your init system will try to umount
it, issuing FS requests to the rbd device.  We don't want to drop those
requests, so, if daemons are gone by then, libceph ends up blocking.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to configure if there are tow network cards in Client

2015-12-31 Thread Linux Chips
It would certainly help those with less knowledge about networking in 
Linux, though I do not know how many people using Ceph are in this 
category. Sage and the others here may have a better idea about its 
feasibility.
But I usually use the rule-* and route-* files (on CentOS); they work with 
NetworkManager and are very easy to configure. On Ubuntu you can put them 
in the interfaces file, and they are just as easy. If such a tool were made, 
I think it should understand the ceph.conf file, but I doubt it could figure 
out the routes correctly without you putting the routes in yourself.


On 12/29/2015 03:58 PM, 蔡毅 wrote:

   Thanks for your replies.
   So would it be reasonable to write, as one of Ceph's tools, a script that 
binds a process to a specific IP and modifies the routing tables and rules? 
That way it would be convenient for users when they want to 
change the NIC that connects to the OSDs.



At 2015-12-29 18:21:21, "Linux Chips"  wrote:

On 12/28/2015 07:47 PM, Sage Weil wrote:

On Fri, 25 Dec 2015, 蔡毅 wrote:

Hi all,
  When we read the code, we haven't found a function that lets the client 
bind a specific IP. In Ceph's configuration, we could only find the parameter 
"public network", but it seems to act on the OSD and not on the client.
  There is a scenario where the client has two network cards named NIC1 and 
NIC2. NIC1 is responsible for communicating with the cluster (monitors and 
RADOS) and NIC2 carries other services besides Ceph's client. So we need the 
client to be able to bind a specific IP in order to separate the IP communicating with 
the cluster from another IP serving other applications. We want to know whether there is 
any configuration in Ceph to achieve this. If there is, how could we 
configure the IP? If not, could we add this function to Ceph? Thank you so much.

you can use routing tables plus routing rules. otherwise linux will just
use the default gateway.
or you can put the second interface on the same public net of ceph.
though that would break if you have multiple external nets.

Right.  There isn't a configurable to do this now--we've always just let
the kernel network layer sort it out. Is this just a matter of calling
bind on the socket before connecting? I've never done this before..

linux will send all packets to the default gateway even if an
application binds to an ip on a different interface; the packet will go
out with the source address as the bound one, but through your router.
the only solution, even if the bind function exists is to use the
routing tables and rules.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: how io works when backfill

2015-12-29 Thread Sage Weil
On Tue, 29 Dec 2015, Dong Wu wrote:
> if add in osd.7 and 7 becomes the primary: pg1.0 [1, 2, 3]  --> pg1.0
> [7, 2, 3],  is it similar with the example above?
> still install a pg_temp entry mapping the PG back to [1, 2, 3], then
> backfill happens to 7, normal io write to [1, 2, 3], if io to the
> portion of the PG that has already been backfilled will also be sent
> to osd.7?

Yes (although I forget how it picks the ordering of the osds in the temp 
mapping).  See PG::choose_acting() for the details.

> how about these examples about removing an osd:
> - pg1.0 [1, 2, 3]
> - osd.3 down and be removed
> - mapping changes to [1, 2, 5], but osd.5 has no data, then install a
> pg_temp mapping the PG back to [1, 2], then backfill happens to 5,
> - normal io write to [1, 2], if io hits object which has been
> backfilled to osd.5, io will also send to osd.5
> - when backfill completes, remove the pg_temp and mapping changes back
> to [1, 2, 5]

Yes

> another example:
> - pg1.0 [1, 2, 3]
> - osd.3 down and be removed
> - mapping changes to [5, 1, 2], but osd.5 has no data of the pg, then
> install a pg_temp mapping the PG back to [1, 2] which osd.1
> temporarily becomes the primary, then backfill happens to 5,
> - normal io write to [1, 2], if io hits object which has been
> backfilled to osd.5, io will also send to osd.5
> - when backfill completes, remove the pg_temp and mapping changes back
> to [5, 1, 2]
> 
> is my analysis right?

Yep!

sage

> 
> 2015-12-29 1:30 GMT+08:00 Sage Weil :
> > On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
> >> 2015-12-27 20:48 GMT+08:00 Dong Wu :
> >> > Hi,
> >> > When add osd or remove osd, ceph will backfill to rebalance data.
> >> > eg:
> >> > - pg1.0[1, 2, 3]
> >> > - add an osd(eg. osd.7)
> >> > - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7]
> >> > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now
> >> > object a is backfilling
> >> > - when a write io hits object a, then the io needs to wait for its
> >> > complete, then goes on.
> >> > - but if io hits object b which has not been backfilled, io reaches
> >> > osd.1, then osd.1 send the io to osd.2  and osd.7, but osd.7 does not
> >> > have object b, so osd.7 needs to wait for object b to backfilled, then
> >> > write. Is it right? Or osd.1 only send the io to osd.2, not both?
> >>
> >> I think in this case, when the write of object b reaches osd.1, it
> >> holds the client write, raises the priority of the recovery of object
> >> b, and kick off the recovery of it. When the recovery of object b is
> >> done, it requeue the client write, and then everything goes like
> >> usual.
> >
> > It's more complicated than that.  In a normal (log-based) recovery
> > situation, it is something like the above: if the acting set is [1,2,3]
> > but 3 is missing the latest copy of A, a write to A will block on the
> > primary while the primary initiates recovery of A immediately.  Once that
> > completes the IO will continue.
> >
> > For backfill, it's different.  In your example, you start with [1,2,3]
> > then add in osd.7.  The OSD will see that 7 has no data for the PG and
> > install a pg_temp entry mapping the PG back to [1,2,3] temporarily.  Then
> > things will proceed normally while backfill happens to 7.  Backfill won't
> > interfere with normal IO at all, except that IO to the portion of the PG
> > that has already been backfilled will also be sent to the backfill target
> > (7) so that it stays up to date.  Once it complets, the pg_temp entry is
> > removed and the mapping changes back to [1,2,7].  Then osd.3 is allowed to
> > remove it's copy of the PG.
> >
> > sage
> >
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to configure if there are tow network cards in Client

2015-12-29 Thread Linux Chips

On 12/28/2015 07:47 PM, Sage Weil wrote:

On Fri, 25 Dec 2015, 蔡毅 wrote:

Hi all,
 When we read the code, we haven't found a function that lets the client 
bind a specific IP. In Ceph's configuration, we could only find the parameter 
"public network", but it seems to act on the OSD and not on the client.
 There is a scenario where the client has two network cards named NIC1 and 
NIC2. NIC1 is responsible for communicating with the cluster (monitors and 
RADOS) and NIC2 carries other services besides Ceph's client. So we need the 
client to be able to bind a specific IP in order to separate the IP communicating with 
the cluster from another IP serving other applications. We want to know whether there is 
any configuration in Ceph to achieve this. If there is, how could we 
configure the IP? If not, could we add this function to Ceph? Thank you so much.
you can use routing tables plus routing rules. otherwise linux will just 
use the default gateway.
or you can put the second interface on the same public net of ceph. 
though that would break if you have multiple external nets.

Right.  There isn't a configurable to do this now--we've always just let
the kernel network layer sort it out. Is this just a matter of calling
bind on the socket before connecting? I've never done this before..
linux will send all packets to the default gateway even if an 
application binds to an ip on a different interface; the packet will go 
out with the source address as the bound one, but through your router. 
the only solution, even if the bind function exists is to use the 
routing tables and rules.


sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: how io works when backfill

2015-12-28 Thread Dong Wu
if we add in osd.7 and 7 becomes the primary: pg1.0 [1, 2, 3]  --> pg1.0
[7, 2, 3], is it similar to the example above?
We still install a pg_temp entry mapping the PG back to [1, 2, 3], then
backfill happens to 7, and normal io writes go to [1, 2, 3]; will io to the
portion of the PG that has already been backfilled also be sent
to osd.7?

how about these examples about removing an osd:
- pg1.0 [1, 2, 3]
- osd.3 down and be removed
- mapping changes to [1, 2, 5], but osd.5 has no data, then install a
pg_temp mapping the PG back to [1, 2], then backfill happens to 5,
- normal io write to [1, 2], if io hits object which has been
backfilled to osd.5, io will also send to osd.5
- when backfill completes, remove the pg_temp and mapping changes back
to [1, 2, 5]


another example:
- pg1.0 [1, 2, 3]
- osd.3 down and be removed
- mapping changes to [5, 1, 2], but osd.5 has no data of the pg, then
install a pg_temp mapping the PG back to [1, 2] which osd.1
temporarily becomes the primary, then backfill happens to 5,
- normal io write to [1, 2], if io hits object which has been
backfilled to osd.5, io will also send to osd.5
- when backfill completes, remove the pg_temp and mapping changes back
to [5, 1, 2]

is my analysis right?

2015-12-29 1:30 GMT+08:00 Sage Weil :
> On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
>> 2015-12-27 20:48 GMT+08:00 Dong Wu :
>> > Hi,
>> > When add osd or remove osd, ceph will backfill to rebalance data.
>> > eg:
>> > - pg1.0[1, 2, 3]
>> > - add an osd(eg. osd.7)
>> > - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7]
>> > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now
>> > object a is backfilling
>> > - when a write io hits object a, then the io needs to wait for its
>> > complete, then goes on.
>> > - but if io hits object b which has not been backfilled, io reaches
>> > osd.1, then osd.1 send the io to osd.2  and osd.7, but osd.7 does not
>> > have object b, so osd.7 needs to wait for object b to backfilled, then
>> > write. Is it right? Or osd.1 only send the io to osd.2, not both?
>>
>> I think in this case, when the write of object b reaches osd.1, it
>> holds the client write, raises the priority of the recovery of object
>> b, and kick off the recovery of it. When the recovery of object b is
>> done, it requeue the client write, and then everything goes like
>> usual.
>
> It's more complicated than that.  In a normal (log-based) recovery
> situation, it is something like the above: if the acting set is [1,2,3]
> but 3 is missing the latest copy of A, a write to A will block on the
> primary while the primary initiates recovery of A immediately.  Once that
> completes the IO will continue.
>
> For backfill, it's different.  In your example, you start with [1,2,3]
> > then add in osd.7.  The OSD will see that 7 has no data for the PG and
> install a pg_temp entry mapping the PG back to [1,2,3] temporarily.  Then
> things will proceed normally while backfill happens to 7.  Backfill won't
> interfere with normal IO at all, except that IO to the portion of the PG
> that has already been backfilled will also be sent to the backfill target
> (7) so that it stays up to date.  Once it complets, the pg_temp entry is
> removed and the mapping changes back to [1,2,7].  Then osd.3 is allowed to
> remove it's copy of the PG.
>
> sage
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FreeBSD Building and Testing

2015-12-28 Thread Willem Jan Withagen

Hi,

Can somebody try to help me and explain why

in test: Func: test/mon/osd-crash
Func: TEST_crush_reject_empty started

Fails with a python error which sort of startles me:
test/mon/osd-crush.sh:227: TEST_crush_reject_empty:  local 
empty_map=testdir/osd-crush/empty_map

test/mon/osd-crush.sh:228: TEST_crush_reject_empty:  :
test/mon/osd-crush.sh:229: TEST_crush_reject_empty:  ./crushtool -c 
testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.m

ap
test/mon/osd-crush.sh:230: TEST_crush_reject_empty:  expect_failure 
testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testd

ir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1171: expect_failure:  local 
dir=testdir/osd-crush

../qa/workunits/ceph-helpers.sh:1172: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1173: expect_failure:  local 
'expected=Error EINVAL'

../qa/workunits/ceph-helpers.sh:1174: expect_failure:  shift
../qa/workunits/ceph-helpers.sh:1175: expect_failure:  local success
../qa/workunits/ceph-helpers.sh:1176: expect_failure:  pwd
../qa/workunits/ceph-helpers.sh:1177: expect_failure:  printenv
../qa/workunits/ceph-helpers.sh:1178: expect_failure:  echo ./ceph osd 
setcrushmap -i testdir/osd-crush/empty_map.map
../qa/workunits/ceph-helpers.sh:1180: expect_failure:  ./ceph osd 
setcrushmap -i testdir/osd-crush/empty_map.map

*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
Traceback (most recent call last):
  File "./ceph", line 936, in 
retval = main()
  File "./ceph", line 874, in main
sigdict, inbuf, verbose)
  File "./ceph", line 457, in new_style_command
inbuf=inbuf)
  File 
"/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py", line 
1208, in json_command

raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": 
"osd setcrushmap"}']": exception 'utf8' codec can't decode b

yte 0x86 in position 56: invalid start byte

Which is certainly not the type of error expected.
But it is hard to detect any 0x86 in the arguments.

And yes python is right, there are no UTF8 sequences that start with 0x86.
Question is:
Why does it want to parse with UTF8?
And how do I switch it off?
Or how do I fix this error?

Thanx,
--WjW
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: how io works when backfill

2015-12-28 Thread Sage Weil
On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
> 2015-12-27 20:48 GMT+08:00 Dong Wu :
> > Hi,
> > When add osd or remove osd, ceph will backfill to rebalance data.
> > eg:
> > - pg1.0[1, 2, 3]
> > - add an osd(eg. osd.7)
> > - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7]
> > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now
> > object a is backfilling
> > - when a write io hits object a, then the io needs to wait for its
> > complete, then goes on.
> > - but if io hits object b which has not been backfilled, io reaches
> > osd.1, then osd.1 send the io to osd.2  and osd.7, but osd.7 does not
> > have object b, so osd.7 needs to wait for object b to backfilled, then
> > write. Is it right? Or osd.1 only send the io to osd.2, not both?
> 
> I think in this case, when the write of object b reaches osd.1, it
> holds the client write, raises the priority of the recovery of object
> b, and kick off the recovery of it. When the recovery of object b is
> done, it requeue the client write, and then everything goes like
> usual.

It's more complicated than that.  In a normal (log-based) recovery 
situation, it is something like the above: if the acting set is [1,2,3] 
but 3 is missing the latest copy of A, a write to A will block on the 
primary while the primary initiates recovery of A immediately.  Once that 
completes the IO will continue.

For backfill, it's different.  In your example, you start with [1,2,3] 
then add in osd.7.  The OSD will see that 7 has no data for the PG and 
install a pg_temp entry mapping the PG back to [1,2,3] temporarily.  Then 
things will proceed normally while backfill happens to 7.  Backfill won't 
interfere with normal IO at all, except that IO to the portion of the PG 
that has already been backfilled will also be sent to the backfill target 
(7) so that it stays up to date.  Once it completes, the pg_temp entry is 
removed and the mapping changes back to [1,2,7].  Then osd.3 is allowed to 
remove its copy of the PG.
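
(If it helps, the "already backfilled portion" rule boils down to something
like the following sketch -- illustrative only, not the actual code:)

#include <string>

// Objects are backfilled in sorted order and the primary remembers how
// far the backfill target has gotten.
struct BackfillProgress {
  std::string last_backfill;   // highest object fully copied to the target

  // A client write is also sent to the backfill target only if the object
  // has already been copied there, so that the copy stays up to date.
  bool send_to_backfill_target(const std::string& oid) const {
    return oid <= last_backfill;
  }
};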

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to configure if there are tow network cards in Client

2015-12-28 Thread Sage Weil
On Fri, 25 Dec 2015, 蔡毅 wrote:
> Hi all,
> When we read the code, we haven't found a function that lets the client 
> bind a specific IP. In Ceph's configuration, we could only find the parameter 
> "public network", but it seems to act on the OSD and not on the client.
> There is a scenario where the client has two network cards named NIC1 and 
> NIC2. NIC1 is responsible for communicating with the cluster (monitors and 
> RADOS) and NIC2 carries other services besides Ceph's client. So we need the 
> client to be able to bind a specific IP in order to separate the IP 
> communicating with the cluster from another IP serving other applications. We 
> want to know whether there is any configuration in Ceph to achieve this. If 
> there is, how could we configure the IP? If not, could we add this function 
> to Ceph? Thank you so much.

Right.  There isn't a configurable to do this now--we've always just let 
the kernel network layer sort it out. Is this just a matter of calling 
bind on the socket before connecting? I've never done this before..
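
For reference, the bind-before-connect part would look roughly like this
(plain BSD sockets, untested sketch; addresses are placeholders):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int connect_from(const char *src_ip, const char *dst_ip, int dst_port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0)
    return -1;

  sockaddr_in src{};
  src.sin_family = AF_INET;
  src.sin_port = 0;                                  // any local port
  inet_pton(AF_INET, src_ip, &src.sin_addr);
  if (bind(fd, (sockaddr *)&src, sizeof(src)) < 0) { // pick the source IP
    close(fd);
    return -1;
  }

  sockaddr_in dst{};
  dst.sin_family = AF_INET;
  dst.sin_port = htons(dst_port);
  inet_pton(AF_INET, dst_ip, &dst.sin_addr);
  if (connect(fd, (sockaddr *)&dst, sizeof(dst)) < 0) {
    close(fd);
    return -1;
  }
  // Note: packets now carry src_ip as the source address, but which interface
  // they leave through is still decided by the kernel routing tables, as
  // pointed out elsewhere in this thread.
  return fd;
}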

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CEPH build

2015-12-28 Thread Odintsov Vladislav
Hi,

resending my letter.
Thank you for the attention.


Best regards,

Vladislav Odintsov


From: Sage Weil <sw...@redhat.com>
Sent: Monday, December 28, 2015 19:49
To: Odintsov Vladislav
Subject: Re: CEPH build

Can you resend this to ceph-devel, and copy ad...@redhat.com?

On Fri, 25 Dec 2015, Odintsov Vladislav wrote:

>
> Hi, Sage!
>
>
> I'm working at Cloud provider as a system engineer, and now
> I'm trying to build different versions of CEPH (0.94, 9.2, 10.0) with libxio
> enabled, and I've got a problem with understanding, how do ceph maintainers
> create official tarballs and builds from git repo.
>
> I saw you as a maintainer of build-related files in the repo, and thought you
> could help me :) If I'm wrong, please tell me who can.
>
> I've found very many information sources with different description of ceph
> build process:
>
> - https://github.com/ceph/ceph-build
>
> - https://github.com/ceph/autobuild-ceph
>
> - documentation on ceph.docs.
>
>
> But I'm unable to get the same tarball as
> at http://download.ceph.com/tarballs/
>
> for example for version v0.94.5. What else should I read? Or, maybe there is
> some magic...)
>
>
> Actually, I want understand how official builds are made (which tools), I'd
> like to go through all build related steps by myself to understand the
> upstream building process.
>
>
> Thanks a lot for your help!
>
>
> 
> Best regards,
>
> Vladislav Odintsov
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-25 Thread Dong Wu
Thank you for your reply. I am looking forward to Sage's opinion too @sage.
Also I'll keep up with BlueStore's and KStore's progress.

Regards

2015-12-25 14:48 GMT+08:00 Ning Yao :
> Hi, Dong Wu,
>
> 1. As I currently work for other things, this proposal is abandon for
> a long time
> 2. This is a complicated task as we need to consider a lots such as
> (not just for writeOp, as well as truncate, delete) and also need to
> consider the different affects for different backends(Replicated, EC).
> 3. I don't think it is good time to redo this patch now, since the
> BlueStore and Kstore  is inprogress, and I'm afraid to bring some
> side-effect.  We may prepare and propose the whole design in next CDS.
> 4. Currently, we already have some tricks to deal with recovery (like
> throttle the max recovery op, set the priority for recovery and so
> on). So this kind of patch may not solve the critical problem but just
> make things better, and I am not quite sure that this will really
> bring a big improvement. Based on my previous test, it works
> excellently on slow disk (say hdd), and also for a short-time
> maintaining. Otherwise, it will trigger the backfill process.  So wait
> for Sage's opinion @sage
>
> If you are interest on this, we may cooperate to do this.
>
> Regards
> Ning Yao
>
>
> 2015-12-25 14:23 GMT+08:00 Dong Wu :
>> Thanks, from this pull request I learned that this issue is not
>> completed, is there any new progress of this issue?
>>
>> 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) :
>>> Yeah, This is good idea for recovery, but not for backfill.
>>> @YaoNing have pull a request about this
>>> https://github.com/ceph/ceph/pull/3837 this year.
>>>
>>> 2015-12-25 11:16 GMT+08:00 Dong Wu :
 Hi,
 I have doubt about pglog, the pglog contains (op,object,version) etc.
 when peering, use pglog to construct missing list,then recover the
 whole object in missing list even if different data among replicas is
 less then a whole object data(eg,4MB).
 why not add (offset,len) to pglog? If so, the missing list can contain
 (object, offset, len), then we can reduce recover data.
 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Xinze Chi
>> ___
>> ceph-users mailing list
>> ceph-us...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-25 Thread Sage Weil
On Fri, 25 Dec 2015, Ning Yao wrote:
> Hi, Dong Wu,
> 
> 1. As I currently work for other things, this proposal is abandon for
> a long time
> 2. This is a complicated task as we need to consider a lots such as
> (not just for writeOp, as well as truncate, delete) and also need to
> consider the different affects for different backends(Replicated, EC).
> 3. I don't think it is good time to redo this patch now, since the
> BlueStore and Kstore  is inprogress, and I'm afraid to bring some
> side-effect.  We may prepare and propose the whole design in next CDS.
> 4. Currently, we already have some tricks to deal with recovery (like
> throttle the max recovery op, set the priority for recovery and so
> on). So this kind of patch may not solve the critical problem but just
> make things better, and I am not quite sure that this will really
> bring a big improvement. Based on my previous test, it works
> excellently on slow disk (say hdd), and also for a short-time
> maintaining. Otherwise, it will trigger the backfill process.  So wait
> for Sage's opinion @sage
> 
> If you are interest on this, we may cooperate to do this.

I think it's a great idea.  We didn't do it before only because it is 
complicated.  The good news is that if we can't conclusively infer exactly 
which parts of the object need to be recovered from the log entry we can 
always just fall back to recovering the whole thing.  Also, the place 
where this is currently most visible is RBD small writes:

 - osd goes down
 - client sends a 4k overwrite and modifies an object
 - osd comes back up
 - client sends another 4k overwrite
 - client io blocks while osd recovers 4mb

So even if we initially ignore truncate and omap and EC and clones and 
anything else complicated I suspect we'll get a nice benefit.

I haven't thought about this too much, but my guess is that the hard part 
is making the primary's missing set representation include a partial delta 
(say, an interval_set<> indicating which ranges of the file have changed) 
in a way that gracefully degrades to recovering the whole object if we're 
not sure.
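
As a toy illustration of that kind of per-entry delta (not the tree's actual
interval_set<> or missing-set types, just the shape of the idea):

#include <algorithm>
#include <cstdint>
#include <map>

// Naive extent set: offset -> length.  Good enough for a sketch; the real
// interval_set<> merges and splits extents properly.
struct Extents {
  std::map<uint64_t, uint64_t> m;

  void insert(uint64_t off, uint64_t len) {
    m[off] = std::max(m[off], len);
  }
  void union_of(const Extents &o) {
    for (const auto &p : o.m)
      insert(p.first, p.second);
  }
};

// Per-object missing entry on the primary.
struct MissingEntry {
  bool whole_object = false;    // degrade to full-object recovery when unsure
  Extents dirty;

  void note_write(uint64_t off, uint64_t len) {   // plain overwrite log entry
    if (!whole_object)
      dirty.insert(off, len);
  }
  void note_unknown() {         // truncate, omap, clones, EC, ...: give up
    whole_object = true;
    dirty = Extents();
  }
};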

In any case, we should definitely have the design conversation!

sage

> 
> Regards
> Ning Yao
> 
> 
> 2015-12-25 14:23 GMT+08:00 Dong Wu :
> > Thanks, from this pull request I learned that this issue is not
> > completed, is there any new progress of this issue?
> >
> > 2015-12-25 12:30 GMT+08:00 Xinze Chi (??) :
> >> Yeah, This is good idea for recovery, but not for backfill.
> >> @YaoNing have pull a request about this
> >> https://github.com/ceph/ceph/pull/3837 this year.
> >>
> >> 2015-12-25 11:16 GMT+08:00 Dong Wu :
> >>> Hi,
> >>> I have doubt about pglog, the pglog contains (op,object,version) etc.
> >>> when peering, use pglog to construct missing list,then recover the
> >>> whole object in missing list even if different data among replicas is
> >>> less then a whole object data(eg,4MB).
> >>> why not add (offset,len) to pglog? If so, the missing list can contain
> >>> (object, offset, len), then we can reduce recover data.
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-us...@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Xinze Chi
> > ___
> > ceph-users mailing list
> > ceph-us...@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-24 Thread Ning Yao
Hi, Dong Wu,

1. As I am currently working on other things, this proposal has been
abandoned for a long time.
2. This is a complicated task, as we need to consider a lot of cases (not
just writeOp, but also truncate and delete) and also need to
consider the different effects on different backends (replicated, EC).
3. I don't think it is a good time to redo this patch now, since
BlueStore and KStore are in progress, and I'm afraid of bringing in
side effects.  We may prepare and propose the whole design at the next CDS.
4. Currently, we already have some tricks to deal with recovery (like
throttling the max recovery ops, setting the priority for recovery and so
on). So this kind of patch may not solve the critical problem but just
make things better, and I am not quite sure that this will really
bring a big improvement. Based on my previous test, it works
excellently on slow disks (say HDDs), and also for short-time
maintenance. Otherwise, it will trigger the backfill process.  So wait
for Sage's opinion @sage

If you are interested in this, we may cooperate to do it.

Regards
Ning Yao


2015-12-25 14:23 GMT+08:00 Dong Wu :
> Thanks, from this pull request I learned that this issue is not
> completed, is there any new progress of this issue?
>
> 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) :
>> Yeah, This is good idea for recovery, but not for backfill.
>> @YaoNing have pull a request about this
>> https://github.com/ceph/ceph/pull/3837 this year.
>>
>> 2015-12-25 11:16 GMT+08:00 Dong Wu :
>>> Hi,
>>> I have doubt about pglog, the pglog contains (op,object,version) etc.
>>> when peering, use pglog to construct missing list,then recover the
>>> whole object in missing list even if different data among replicas is
>>> less then a whole object data(eg,4MB).
>>> why not add (offset,len) to pglog? If so, the missing list can contain
>>> (object, offset, len), then we can reduce recover data.
>>> ___
>>> ceph-users mailing list
>>> ceph-us...@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Regards,
>> Xinze Chi
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-24 Thread Dong Wu
Thanks; from this pull request I learned that this issue is not
completed. Is there any new progress on this issue?

2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) :
> Yeah, This is good idea for recovery, but not for backfill.
> @YaoNing have pull a request about this
> https://github.com/ceph/ceph/pull/3837 this year.
>
> 2015-12-25 11:16 GMT+08:00 Dong Wu :
>> Hi,
>> I have doubt about pglog, the pglog contains (op,object,version) etc.
>> when peering, use pglog to construct missing list,then recover the
>> whole object in missing list even if different data among replicas is
>> less then a whole object data(eg,4MB).
>> why not add (offset,len) to pglog? If so, the missing list can contain
>> (object, offset, len), then we can reduce recover data.
>> ___
>> ceph-users mailing list
>> ceph-us...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Regards,
> Xinze Chi
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-24 Thread 信泽
Yeah, this is a good idea for recovery, but not for backfill.
@YaoNing opened a pull request about this
(https://github.com/ceph/ceph/pull/3837) this year.

2015-12-25 11:16 GMT+08:00 Dong Wu :
> Hi,
> I have doubt about pglog, the pglog contains (op,object,version) etc.
> when peering, use pglog to construct missing list,then recover the
> whole object in missing list even if different data among replicas is
> less then a whole object data(eg,4MB).
> why not add (offset,len) to pglog? If so, the missing list can contain
> (object, offset, len), then we can reduce recover data.
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Regards,
Xinze Chi
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: fixing jenkins builds on pull requests

2015-12-23 Thread Loic Dachary
Hi,

I triaged the jenkins related failures (from #24 to #49):

CentOS 6 not supported:

  https://jenkins.ceph.com/job/ceph-pull-requests/26/console
  https://jenkins.ceph.com/job/ceph-pull-requests/28/console
  https://jenkins.ceph.com/job/ceph-pull-requests/29/console
  https://jenkins.ceph.com/job/ceph-pull-requests/34/console
  https://jenkins.ceph.com/job/ceph-pull-requests/38/console
  https://jenkins.ceph.com/job/ceph-pull-requests/44/console
  https://jenkins.ceph.com/job/ceph-pull-requests/46/console
  https://jenkins.ceph.com/job/ceph-pull-requests/48/console
  https://jenkins.ceph.com/job/ceph-pull-requests/49/console

Ubuntu 12.04 not supported:

  https://jenkins.ceph.com/job/ceph-pull-requests/27/console
  https://jenkins.ceph.com/job/ceph-pull-requests/36/console

Failure to fetch from github

  https://jenkins.ceph.com/job/ceph-pull-requests/35/console

I've not been able to analyze more failures because it looks like only 30 jobs 
are kept. Here is an updated summary:

 * running on unsupported operating systems (CentOS 6, precise and maybe others)
 * leftovers from a previous test (which should be removed when a new slave is 
provisioned for each test)
 * keep the last 300 jobs for forensic analysis (about one week worth)
 * disable reporting to github pull requests until the above are resolved (all 
failures were false negative).

Cheers

On 23/12/2015 10:11, Loic Dachary wrote:
> Hi Alfredo,
> 
> I forgot to mention that the ./run-make-check.sh run currently has no known 
> false negative on CentOS 7. By that I mean that if run on master 100 times, 
> it will succeed 100 times. This is good to debug the jenkins builds on pull 
> requests as we know all problems either come from the infrastructure or the 
> pull request. We do not have to worry about random errors due to race 
> conditions in the tests or things like that.
> 
> I'll keep an eye on the test results and analyse each failure. For now it 
> would be best to disable reporting failures as they are almost entirely false 
> negative and will confuse the contributor. The failures come from:
> 
>  * running on unsupported operating systems (CentOS 6 and maybe others)
>  * leftovers from a previous test (which should be removed when a new slave 
> is provisioned for each test)
> 
> I'll add to this thread when / if I find more.
> 
> Cheers
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: [ceph-users] use object size of 32k rather than 4M

2015-12-23 Thread hzwulibin
Hi, Robert

Thanks for your quick reply. Yeah, the number of files really will be the 
potential problem. But if it is just a memory problem, we could use more memory in 
our OSD servers.

Also, I tested it on XFS using mdtest; here is the result:


$ sudo ~/wulb/bin/mdtest -I 1 -z 1 -b 1024 -R -F
--
[[10342,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: 10-180-0-34

Another transport will be used instead, although this may result in
lower performance.
--
-- started at 12/23/2015 18:59:16 --

mdtest-1.8.3 was launched with 1 total task(s) on 1 nodes
Command line used: /home/ceph/wulb/bin/mdtest -I 1 -z 1 -b 1024 -R -F
Path: /home/ceph
FS: 824.5 GiB   Used FS: 4.8%   Inodes: 52.4 Mi   Used Inodes: 0.6%
random seed: 1450868356

1 tasks, 1025 files

SUMMARY: (of 1 iterations)
   Operation  MaxMin   MeanStd Dev
   -  ------   ---
   File creation :  44660.505  44660.505  44660.505  0.000
   File stat : 693747.783 693747.783 693747.783  0.000
   File read : 365319.444 365319.444 365319.444  0.000
   File removal  :  62064.560  62064.560  62064.560  0.000
   Tree creation :  69680.729  69680.729  69680.729  0.000
   Tree removal  :352.905352.905352.905  0.000


From what I tested, the speed of file stat and file read does not slow down 
much.  So could I say that the speed of an op like looking up a file will not 
decrease much, and it is just the number of files that increases?


--   
hzwulibin
2015-12-23

-
From: "Van Leeuwen, Robert" <rovanleeu...@ebay.com>
Sent: 2015-12-23 20:57
To: hzwulibin, ceph-devel, ceph-users
Cc:
Subject: Re: [ceph-users] use object size of 32k rather than 4M


>In order to reduce the enlarge impact, we want to change the default size of 
>the object from 4M to 32k.
>
>We know that will increase the number of objects on one OSD and make 
>the remove process take longer.
>
>Hmm, here I want to ask you guys: are there any other potential problems that a 
>32k size will have? If there is no obvious problem, we could dive into
>it and do more testing on it.


I assume the objects on the OSDs filesystem will become 32k when you do this.
So if you have 1TB of data on one OSD you will have 31 million files == 31 
million inodes 
This is excluding the directory structure which also might be significant.
If you have 10 OSDs on a server you will easily hit 310 million inodes.
You will need a LOT of memory to make sure the inodes are cached but even then 
looking up the inode might add significant latency.

My guess is it will be fast in the beginning but it will grind to a halt when 
the cluster gets fuller due to inodes no longer being in memory.

Also this does not take in any other bottlenecks you might hit in ceph which 
other users can probably answer better.


Cheers,
Robert van Leeuwen



Re: Time to move the make check bot to jenkins.ceph.com

2015-12-23 Thread Ken Dreyer
This is really great. Thanks Loic and Alfredo!

- Ken

On Tue, Dec 22, 2015 at 11:23 AM, Loic Dachary  wrote:
> Hi,
>
> The make check bot moved to jenkins.ceph.com today and ran its first 
> successful job. You will no longer see comments from the bot: it will update 
> the github status instead, which is less intrusive.
>
> Cheers
>
> On 21/12/2015 11:13, Loic Dachary wrote:
>> Hi,
>>
>> The make check bot is broken in a way that I can't figure out right now. 
>> Maybe now is the time to move it to jenkins.ceph.com ? It should not be more 
>> difficult than launching the run-make-check.sh script. It does not need 
>> network or root access.
>>
>> Cheers
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] use object size of 32k rather than 4M

2015-12-23 Thread Van Leeuwen, Robert

>In order to reduce the enlarge impact, we want to change the default size of 
>the object from 4M to 32k.
>
>We know that will increase the number of objects on one OSD and make 
>the remove process take longer.
>
>Hmm, here I want to ask you guys: are there any other potential problems that a 
>32k size will have? If there is no obvious problem, we could dive into
>it and do more testing on it.


I assume the objects on the OSDs filesystem will become 32k when you do this.
So if you have 1TB of data on one OSD you will have 31 million files == 31 
million inodes 
This is excluding the directory structure which also might be significant.
If you have 10 OSDs on a server you will easily hit 310 million inodes.
You will need a LOT of memory to make sure the inodes are cached but even then 
looking up the inode might add significant latency.

My guess is it will be fast in the beginning but it will grind to a halt when 
the cluster gets fuller due to inodes no longer being in memory.

Also this does not take in any other bottlenecks you might hit in ceph which 
other users can probably answer better.


Cheers,
Robert van Leeuwen


Re: [ceph-users] use object size of 32k rather than 4M

2015-12-23 Thread Van Leeuwen, Robert
>Thanks for your quick reply. Yeah, the number of file really will be the 
>potential problem. But if just the memory problem, we could use more memory in 
>our OSD
>servers.

Adding more mem might not be a viable solution:
Ceph does not say how much data is stored in an inode, but the docs say the 
xattr of ext4 is not big enough.
Assuming xfs will use 512 bytes is probably very optimistic.
So for the e.g. 300 million inodes you are talking about, that is at least 150GB.

>
>Also, i tested it on XFS use mdtest, here is the result:
>
>
>FS: 824.5 GiB   Used FS: 4.8%   Inodes: 52.4 Mi   Used Inodes: 0.6%

52 million files without extended attributes is probably not a real life 
scenario for a filled up ceph node with multiple OSDs.

Cheers,
Robert van Leeuwen


Re: Let's Not Destroy the World in 2038

2015-12-23 Thread Adam C. Emerson
On 22/12/2015, Gregory Farnum wrote:
[snip]
> So I think we're stuck with creating a new utime_t and incrementing
> the struct_v on everything that contains them. :/
[snip]
> We'll also then need the full feature bit system to make
> sure we send the old encoding to clients which don't understand the
> new one, and to prevent a mid-upgrade cluster from writing data on a
> new node that gets moved to a new node which doesn't understand it.

That is my understanding. I have the impression that network communication
gets feature bits for the other nodes, and on-disk structures are explicitly
versioned. If I'm mistaken, please hurl corrections at me.

> Given that utime_t occurs in a lot of places, and really can't change
> *again* after this, we probably shouldn't set up the new version with
> versioned encoding?

You're overly pessimistic. I'm hoping our post-human descendants store
their unfathomably alien, reconstructed minds in some galaxy-spanning
descendant of Ceph and need more than a 64-bit second count.

However, I agree that the time value itself should not have an encoded
version tag.

To my intuition, the best way forward would be to:

(1) Add non-defaulted feature parameters on encode/decode of utime_t and
ceph::real_time. This will break everything that uses them.

(2) Add explicit encode_old/encode_new functions. That way, when we KNOW which
one we want at compile time, we don't have to pay for a runtime check. (A rough
sketch follows this list.)

(3) When we have feature bits, pass them in.

(4) When we have a version, bump it. For new versions, explicitly call
encode_new. When we know we want old, call old.

(5) If there are classes that we encode that have neither feature bits nor
versioning available, see what uses them and act accordingly. Hopefully the
special cases will be few.
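
A rough sketch of what (2)-(4) could look like, with placeholder types and a
placeholder feature bit (this is not the real bufferlist/feature machinery):

#include <cstdint>
#include <vector>

using Encoder = std::vector<uint8_t>;
constexpr uint64_t FEATURE_WIDE_TIME = 1ull << 42;   // placeholder bit

static void put_le(Encoder &bl, uint64_t v, int bytes) {
  for (int i = 0; i < bytes; ++i)
    bl.push_back(uint8_t(v >> (8 * i)));
}

struct wide_time {
  uint64_t sec;    // 64-bit seconds since the epoch, so 2038 is a non-event
  uint32_t nsec;

  void encode_old(Encoder &bl) const {   // legacy 32-bit-seconds wire format
    put_le(bl, sec & 0xffffffffu, 4);
    put_le(bl, nsec, 4);
  }
  void encode_new(Encoder &bl) const {   // full 64-bit seconds
    put_le(bl, sec, 8);
    put_le(bl, nsec, 4);
  }
  // Runtime-checked path, for when the peer's features are only known
  // at runtime.
  void encode(Encoder &bl, uint64_t features) const {
    if (features & FEATURE_WIDE_TIME)
      encode_new(bl);
    else
      encode_old(bl);
  }
};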

Does that seem reasonable?

I thank you.

And all hypothetical post-human Ceph users thank you.

-- 
Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9




Re: rgw: sticky user quota data on bucket removal

2015-12-23 Thread Yehuda Sadeh-Weinraub
On Wed, Dec 23, 2015 at 3:53 PM, Paul Von-Stamwitz
 wrote:
> Hi,
>
> We're testing user quotas on Hammer with civetweb and we're running into an 
> issue with user stats.
>
> If the user/admin removes a bucket using -force/-purge-objects options with 
> s3cmd/radosgw-admin respectively, the user stats will continue to reflect the 
> deleted objects for quota purposes, and there seems to be no way to reset 
> them. It appears that user stats need to be sync'ed prior to bucket removal. 
> Setting " rgw user quota bucket sync interval = 0" appears to solve the 
> problem.
>
> What is the downside to setting the interval to 0?

We'll update the buckets that are getting modified continuously,
instead of once every interval.

>
> I think the right solution is to have an implied sync-stats during bucket 
> removal. Other suggestions?
>

No, syncing the bucket stats on removal sounds right.

Yehuda

> All the best,
> Paul
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: rgw: sticky user quota data on bucket removal

2015-12-23 Thread Paul Von-Stamwitz
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Yehuda Sadeh-Weinraub
> Sent: Wednesday, December 23, 2015 5:02 PM
> To: Paul Von-Stamwitz
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: rgw: sticky user quota data on bucket removal
> 
> On Wed, Dec 23, 2015 at 3:53 PM, Paul Von-Stamwitz
> <pvonstamw...@us.fujitsu.com> wrote:
> > Hi,
> >
> > We're testing user quotas on Hammer with civetweb and we're running
> > into an issue with user stats.
> >
> > If the user/admin removes a bucket using -force/-purge-objects options
> > with s3cmd/radosgw-admin respectively, the user stats will continue to
> > reflect the deleted objects for quota purposes, and there seems to be no
> > way to reset them. It appears that user stats need to be sync'ed prior to
> > bucket removal. Setting " rgw user quota bucket sync interval = 0" appears 
> > to
> > solve the problem.
> >
> > What is the downside to setting the interval to 0?
> 
> We'll update the buckets that are getting modified continuously, instead of
> once every interval.
>

So, I presume this will impact performance on puts and deletes. We'll take 
a look at the impact of this.

> >
> > I think the right solution is to have an implied sync-stats during bucket
> > removal. Other suggestions?
> >
> 
> No, syncing the bucket stats on removal sounds right.
> 

Great. This would alleviate any performance impact on continuous updates.
Thanks!

> Yehuda
> 
> > All the best,
> > Paul
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majord...@vger.kernel.org More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


Re: New "make check" job for Ceph pull requests

2015-12-23 Thread Loic Dachary
Hi,

For the record the pending issues that prevent the "make check" job 
(https://jenkins.ceph.com/job/ceph-pull-requests/) from running can be found at 
http://tracker.ceph.com/issues/14172

Cheers

On 23/12/2015 21:05, Alfredo Deza wrote:
> Hi all,
> 
> As of yesterday (Tuesday Dec 22nd) we have the "make check" job
> running within our CI infrastructure, working very similarly as the
> previous check with a few differences:
> 
> * there are no longer comments added to the pull requests
> * notifications of success (or failure) are done inline in the same
> notification box for "This branch has no conflicts with the base
> branch"
> * All members of the Ceph organization can trigger a job with the
> following comment:
> test this please
> 
> Changes to the job should be done following our new process: anyone can open
> a pull request against the "ceph-pull-requests" job that configures/modifies
> it. This process is fairly minimal:
> 
> 1) *Jobs no longer require to make changes in the Jenkins UI*, they
> are rather plain text YAML files that live in the ceph/ceph-build.git
> repository and have a specific structure. Job changes (including
> scripts) are made directly on that repository via pull requests.
> 
> 2) As soon as a PR is merged the changes are automatically pushed to
> Jenkins. Regardless if this is a new or old job. All one needs for a
> new job to appear is a directory with a working YAML file (see links
> at the end on what this means)
> 
> Below, please find a list to resources on how to make changes to a
> Jenkins Job, and examples on how mostly anyone can provide changes:
> 
> * Format and configuration of YAML files are consumed by JJB (Jenkins
> Job builder), full docs are here:
> http://docs.openstack.org/infra/jenkins-job-builder/definition.html
> * Where does the make-check configuration lives?
> https://github.com/ceph/ceph-build/tree/master/ceph-pull-requests
> * Full documentation on Job structure and configuration:
> https://github.com/ceph/ceph-build#ceph-build
> * Everyone has READ permissions on jenkins.ceph.com (you can 'login'
> with your github account), current admin members (WRITE permissions)
> are: ktdreyer, alfredodeza, gregmeno, dmick, zmc, andrewschoen,
> ceph-jenkins, dachary, ldachary
> 
> If you have any questions, we can help and provide guidance and feedback. We
> highly encourage contributors to take ownership on this new tool and make it
> awesome!
> 
> Thanks,
> 
> 
> Alfredo
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: RBD performance with many childs and snapshots

2015-12-22 Thread Josh Durgin

On 12/22/2015 01:55 PM, Wido den Hollander wrote:

On 12/21/2015 11:51 PM, Josh Durgin wrote:

On 12/21/2015 11:06 AM, Wido den Hollander wrote:

Hi,

While implementing the buildvolfrom method in libvirt for RBD I'm stuck
at some point.

$ virsh vol-clone --pool myrbdpool image1 image2

This would clone image1 to a new RBD image called 'image2'.

The code I've written now does:

1. Create a snapshot called image1@libvirt-
2. Protect the snapshot
3. Clone the snapshot to 'image1'

wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
rbdpool image1 image2
Vol image2 cloned from image1

wido@wido-desktop:~/repos/libvirt$

root@alpha:~# rbd -p libvirt info image2
rbd image 'image2':
 size 10240 MB in 2560 objects
 order 22 (4096 kB objects)
 block_name_prefix: rbd_data.1976451ead36b
 format: 2
 features: layering, striping
 flags:
 parent: libvirt/image1@libvirt-1450724650
 overlap: 10240 MB
 stripe unit: 4096 kB
 stripe count: 1
root@alpha:~#

But this could potentially lead to a lot of snapshots with children on
'image1'.

image1 itself will probably never change, but I'm wondering about the
negative performance impact this might have on a OSD.


Creating them isn't so bad, more snapshots that don't change don't have
much affect on the osds. Deleting them is what's expensive, since the
osds need to scan the objects to see which ones are part of the
snapshot and can be deleted. If you have too many snapshots created and
deleted, it can affect cluster load, so I'd rather avoid always
creating a snapshot.


I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
into libvirt. There is however no way to pass something like a snapshot
name in libvirt when cloning.

Any bright suggestions? Or is it fine to create so many snapshots?


You could have canonical names for the libvirt snapshots like you
suggest, 'libvirt-', and check via rbd_diff_iterate2()
whether the parent image changed since the last snapshot. That's a bit
slower than plain cloning, but with object map + fast diff it's fast
again, since it doesn't need to scan all the objects anymore.

I think libvirt would need to expand its api a bit to be able to really
use it effectively to manage rbd. Hiding the snapshots becomes
cumbersome if the application wants to use them too. If libvirt's
current model of clones lets parents be deleted before children,
that may be a hassle to hide too...



I gave it a shot. callback functions are a bit new to me, but I gave it
a try:
https://github.com/wido/libvirt/commit/756dca8023027616f53c39fa73c52a6d8f86a223

Could you take a look?


Left some comments on the commits. Looks good in general.

Josh

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD performance with many childs and snapshots

2015-12-22 Thread Josh Durgin

On 12/22/2015 05:34 AM, Wido den Hollander wrote:



On 21-12-15 23:51, Josh Durgin wrote:

On 12/21/2015 11:06 AM, Wido den Hollander wrote:

Hi,

While implementing the buildvolfrom method in libvirt for RBD I'm stuck
at some point.

$ virsh vol-clone --pool myrbdpool image1 image2

This would clone image1 to a new RBD image called 'image2'.

The code I've written now does:

1. Create a snapshot called image1@libvirt-
2. Protect the snapshot
3. Clone the snapshot to 'image1'

wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
rbdpool image1 image2
Vol image2 cloned from image1

wido@wido-desktop:~/repos/libvirt$

root@alpha:~# rbd -p libvirt info image2
rbd image 'image2':
 size 10240 MB in 2560 objects
 order 22 (4096 kB objects)
 block_name_prefix: rbd_data.1976451ead36b
 format: 2
 features: layering, striping
 flags:
 parent: libvirt/image1@libvirt-1450724650
 overlap: 10240 MB
 stripe unit: 4096 kB
 stripe count: 1
root@alpha:~#

But this could potentially lead to a lot of snapshots with children on
'image1'.

image1 itself will probably never change, but I'm wondering about the
negative performance impact this might have on a OSD.


Creating them isn't so bad, more snapshots that don't change don't have
much affect on the osds. Deleting them is what's expensive, since the
osds need to scan the objects to see which ones are part of the
snapshot and can be deleted. If you have too many snapshots created and
deleted, it can affect cluster load, so I'd rather avoid always
creating a snapshot.


I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
into libvirt. There is however no way to pass something like a snapshot
name in libvirt when cloning.

Any bright suggestions? Or is it fine to create so many snapshots?


You could have canonical names for the libvirt snapshots like you
suggest, 'libvirt-', and check via rbd_diff_iterate2()
whether the parent image changed since the last snapshot. That's a bit
slower than plain cloning, but with object map + fast diff it's fast
again, since it doesn't need to scan all the objects anymore.



I'll give that a try, seems like a good suggestion!

I'll have to use rbd_diff_iterate() through since iterate2() is
post-hammer and that will not be available on all systems.


I think libvirt would need to expand its api a bit to be able to really
use it effectively to manage rbd. Hiding the snapshots becomes
cumbersome if the application wants to use them too. If libvirt's
current model of clones lets parents be deleted before children,
that may be a hassle to hide too...



Yes, I would love to see:

- vol-snap-list
- vol-snap-create
- vol-snap-delete
- vol-snap-revert

And then:

- vol-clone --snapshot  --pool  image1 image2

But this would need some more work inside libvirt. Would be very nice
though.


Yeah, those would be nice.


At CloudStack we want to do as much as possible using libvirt, the more
features it has there, the less we have to do in Java code :)


Dan Berrange has talked about using libvirt storage pools for managing
rbd and other storage from openstack nova too, for the same reason. I'm
not sure if there are any current plans for that, but you may want to
ask him about it on the libvirt list.

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to move the make check bot to jenkins.ceph.com

2015-12-22 Thread Loic Dachary
Hi,

The make check bot moved to jenkins.ceph.com today and ran its first 
successful job. You will no longer see comments from the bot: it will update 
the github status instead, which is less intrusive.

Cheers

On 21/12/2015 11:13, Loic Dachary wrote:
> Hi,
> 
> The make check bot is broken in a way that I can't figure out right now. 
> Maybe now is the time to move it to jenkins.ceph.com ? It should not be more 
> difficult than launching the run-make-check.sh script. It does not need 
> network or root access.
> 
> Cheers
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: RFC: tool for applying 'ceph daemon ' command to all OSDs

2015-12-22 Thread Dan Mick
On 12/21/2015 11:29 PM, Gregory Farnum wrote:
> On Mon, Dec 21, 2015 at 9:59 PM, Dan Mick  wrote:
>> I needed something to fetch current config values from all OSDs (sorta
>> the opposite of 'injectargs --key value), so I hacked it, and then
>> spiffed it up a bit.  Does this seem like something that would be useful
>> in this form in the upstream Ceph, or does anyone have any thoughts on
>> its design or structure?
>>
>> It requires a locally-installed ceph CLI and a ceph.conf that points to
>> the cluster and any required keyrings.  You can also provide it with
>> a YAML file mapping host to osds if you want to save time collecting
>> that info for a statically-defined cluster, or if you want just a subset
>> of OSDs.
>>
>> https://github.com/dmick/tools/blob/master/osd_daemon_cmd.py
>>
>> Excerpt from usage:
>>
>> Execute a Ceph osd daemon command on every OSD in a cluster with
>> one connection to each OSD host.
>>
>> Usage:
>> osd_daemon_cmd [-c CONF] [-u USER] [-f FILE] (COMMAND | -k KEY)
>>
>> Options:
>>-c CONF   ceph.conf file to use [default: ./ceph.conf]
>>-u USER   user to connect with ssh
>>-f FILE   get names and osds from yaml
>>COMMAND   command other than "config get" to execute
>>-k KEYconfig key to retrieve with config get 
> 
> I naively like the functionality being available, but if I'm skimming
> this correctly it looks like you're relying on the local node being
> able to passwordless-ssh to all of the nodes, and for that account to
> be able to access the ceph admin sockets. Granted we rely on the ssh
> for ceph-deploy as well, so maybe that's okay, but I'm not sure in
> this case since it implies a lot more network openness.

Yep; it's basically the same model and role assumed as "cluster destroyer".

> Relatedly (perhaps in an opposing direction), maybe we want anything
> exposed over the network to have some sort of explicit permissions
> model?

Well, I've heard that idea floated about the admin socket for years, but
I don't think anyone's hot to add cephx to it :)

> Maybe not and we should just ship the script for trusted users. I
> would have liked it on the long-running cluster I'm sure you built it
> for. ;)

it's like you're clairvoyant.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is rbd_discard enough to wipe an RBD image?

2015-12-22 Thread Wido den Hollander
On 12/21/2015 11:20 PM, Josh Durgin wrote:
> On 12/21/2015 11:00 AM, Wido den Hollander wrote:
>> My discard code now works, but I wanted to verify. If I understand Jason
>> correctly it would be a matter of figuring out the 'order' of a image
>> and call rbd_discard in a loop until you reach the end of the image.
> 
> You'd need to get the order via rbd_stat(), convert it to object size
> (i.e. (1 << order)), and fetch stripe_count with rbd_get_stripe_count().
> 
> Then do the discards in (object size * stripe_count) chunks. This
> ensures you discard entire objects. This is the size you'd want to use
> for import/export as well, ideally.
> 

Thanks! I just implemented this, could you take a look?

https://github.com/wido/libvirt/commit/b07925ad50fdb6683b5b21deefceb0829a7842dc

>> I just want libvirt to be as feature complete as possible when it comes
>> to RBD.
> 
> I see, makes sense.
> 
> Josh
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD performance with many childs and snapshots

2015-12-22 Thread Wido den Hollander
On 12/21/2015 11:51 PM, Josh Durgin wrote:
> On 12/21/2015 11:06 AM, Wido den Hollander wrote:
>> Hi,
>>
>> While implementing the buildvolfrom method in libvirt for RBD I'm stuck
>> at some point.
>>
>> $ virsh vol-clone --pool myrbdpool image1 image2
>>
>> This would clone image1 to a new RBD image called 'image2'.
>>
>> The code I've written now does:
>>
>> 1. Create a snapshot called image1@libvirt-
>> 2. Protect the snapshot
>> 3. Clone the snapshot to 'image1'
>>
>> wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
>> rbdpool image1 image2
>> Vol image2 cloned from image1
>>
>> wido@wido-desktop:~/repos/libvirt$
>>
>> root@alpha:~# rbd -p libvirt info image2
>> rbd image 'image2':
>> size 10240 MB in 2560 objects
>> order 22 (4096 kB objects)
>> block_name_prefix: rbd_data.1976451ead36b
>> format: 2
>> features: layering, striping
>> flags:
>> parent: libvirt/image1@libvirt-1450724650
>> overlap: 10240 MB
>> stripe unit: 4096 kB
>> stripe count: 1
>> root@alpha:~#
>>
>> But this could potentially lead to a lot of snapshots with children on
>> 'image1'.
>>
>> image1 itself will probably never change, but I'm wondering about the
>> negative performance impact this might have on a OSD.
> 
> Creating them isn't so bad, more snapshots that don't change don't have
> much affect on the osds. Deleting them is what's expensive, since the
> osds need to scan the objects to see which ones are part of the
> snapshot and can be deleted. If you have too many snapshots created and
> deleted, it can affect cluster load, so I'd rather avoid always
> creating a snapshot.
> 
>> I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
>> into libvirt. There is however no way to pass something like a snapshot
>> name in libvirt when cloning.
>>
>> Any bright suggestions? Or is it fine to create so many snapshots?
> 
> You could have canonical names for the libvirt snapshots like you
> suggest, 'libvirt-', and check via rbd_diff_iterate2()
> whether the parent image changed since the last snapshot. That's a bit
> slower than plain cloning, but with object map + fast diff it's fast
> again, since it doesn't need to scan all the objects anymore.
> 
> I think libvirt would need to expand its api a bit to be able to really
> use it effectively to manage rbd. Hiding the snapshots becomes
> cumbersome if the application wants to use them too. If libvirt's
> current model of clones lets parents be deleted before children,
> that may be a hassle to hide too...
> 

I gave it a shot. Callback functions are a bit new to me, but I gave it
a try:
https://github.com/wido/libvirt/commit/756dca8023027616f53c39fa73c52a6d8f86a223

Could you take a look?

> Josh
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Let's Not Destroy the World in 2038

2015-12-22 Thread Gregory Farnum
On Tue, Dec 22, 2015 at 12:10 PM, Adam C. Emerson  wrote:
> Comrades,
>
> Ceph's victory is assured. It will be the storage system of The Future.
> Matt Benjamin has reminded me that if we don't act fast¹ Ceph will be
> responsible for destroying the world.
>
> utime_t() uses a 32-bit second count internally. This isn't great, but it's
> something we can fix. ceph::real_time currently uses a 64-bit bit count of
> nanoseconds, which is better. And we can change it to something else without
> having to rewrite much other code.
>
> The problem lies in our encode/deocde functions for time (both utime_t
> and ceph::real_time, since I didn't want to break compatibility.) we
> use a 32-bit second count. I would like to change the wire and disk
> representation to a 64-bit second count and a 32-bit nanosecond count.
>
> Would there be resistance to a project to do this? I don't know if a
> FEATURE bit would help. A FEATURE bit to toggle the width of the second
> count would be ideal if it would work. Otherwise it looks like the best
> way to do this would be to find all the structures currently ::encoded
> that hold time values, bump the version number and have an 'old_utime'
> that we use for everything pre-change.

Unfortunately, we include utimes in structures that are written to
disk. So I think we're stuck with creating a new utime_t and
incrementing the struct_v on everything that contains them. :/

Of course, we'll also then need the full feature bit system to make
sure we send the old encoding to clients which don't understand the
new one, and to prevent a mid-upgrade cluster from writing data on a
new node that gets moved to an old node which doesn't understand it.

Given that utime_t occurs in a lot of places, and really can't change
*again* after this, we probably shouldn't set up the new version with
versioned encoding?
-Greg

>
> Thank you!
>
> ¹ Within the next twenty-three years. But that's not really a long time in the
>   larger scheme of things.
>
> --
> Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
> IRC: Aemerson@{RedHat, OFTC, Freenode}
> 0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: tool for applying 'ceph daemon ' command to all OSDs

2015-12-22 Thread igor.podo...@ts.fujitsu.com
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Dan Mick
> Sent: Tuesday, December 22, 2015 7:00 AM
> To: ceph-devel
> Subject: RFC: tool for applying 'ceph daemon ' command to all OSDs
> 
> I needed something to fetch current config values from all OSDs (sorta the
> opposite of 'injectargs --key value), so I hacked it, and then spiffed it up 
> a bit.
> Does this seem like something that would be useful in this form in the
> upstream Ceph, or does anyone have any thoughts on its design or
> structure?
>

You could do it using socat too:

Node1 has osd.0

Node1:
cd /var/run/ceph
sudo socat TCP-LISTEN:60100,fork unix-connect:ceph-osd.0.asok

Node2:
cd /var/run/ceph
sudo  socat unix-listen:ceph-osd.0.asok,fork TCP:Node1:60100

Node2:
sudo ceph daemon osd.0 help | head
{
"config diff": "dump diff of current config and default config",
"config get": "config get : get the config value",

This is more for development/test setup.

Regards,
Igor.

> It requires a locally-installed ceph CLI and a ceph.conf that points to the
> cluster and any required keyrings.  You can also provide it with a YAML file
> mapping host to osds if you want to save time collecting that info for a
> statically-defined cluster, or if you want just a subset of OSDs.
> 
> https://github.com/dmick/tools/blob/master/osd_daemon_cmd.py
> 
> Excerpt from usage:
> 
> Execute a Ceph osd daemon command on every OSD in a cluster with one
> connection to each OSD host.
> 
> Usage:
> osd_daemon_cmd [-c CONF] [-u USER] [-f FILE] (COMMAND | -k KEY)
> 
> Options:
>-c CONF   ceph.conf file to use [default: ./ceph.conf]
>-u USER   user to connect with ssh
>-f FILE   get names and osds from yaml
>COMMAND   command other than "config get" to execute
>-k KEYconfig key to retrieve with config get 
> 
> --
> Dan Mick
> Red Hat, Inc.
> Ceph docs: http://ceph.com/docs
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


Re: RBD performance with many childs and snapshots

2015-12-22 Thread Wido den Hollander


On 21-12-15 23:51, Josh Durgin wrote:
> On 12/21/2015 11:06 AM, Wido den Hollander wrote:
>> Hi,
>>
>> While implementing the buildvolfrom method in libvirt for RBD I'm stuck
>> at some point.
>>
>> $ virsh vol-clone --pool myrbdpool image1 image2
>>
>> This would clone image1 to a new RBD image called 'image2'.
>>
>> The code I've written now does:
>>
>> 1. Create a snapshot called image1@libvirt-
>> 2. Protect the snapshot
>> 3. Clone the snapshot to 'image1'
>>
>> wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
>> rbdpool image1 image2
>> Vol image2 cloned from image1
>>
>> wido@wido-desktop:~/repos/libvirt$
>>
>> root@alpha:~# rbd -p libvirt info image2
>> rbd image 'image2':
>> size 10240 MB in 2560 objects
>> order 22 (4096 kB objects)
>> block_name_prefix: rbd_data.1976451ead36b
>> format: 2
>> features: layering, striping
>> flags:
>> parent: libvirt/image1@libvirt-1450724650
>> overlap: 10240 MB
>> stripe unit: 4096 kB
>> stripe count: 1
>> root@alpha:~#
>>
>> But this could potentially lead to a lot of snapshots with children on
>> 'image1'.
>>
>> image1 itself will probably never change, but I'm wondering about the
>> negative performance impact this might have on a OSD.
> 
> Creating them isn't so bad, more snapshots that don't change don't have
> much affect on the osds. Deleting them is what's expensive, since the
> osds need to scan the objects to see which ones are part of the
> snapshot and can be deleted. If you have too many snapshots created and
> deleted, it can affect cluster load, so I'd rather avoid always
> creating a snapshot.
> 
>> I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
>> into libvirt. There is however no way to pass something like a snapshot
>> name in libvirt when cloning.
>>
>> Any bright suggestions? Or is it fine to create so many snapshots?
> 
> You could have canonical names for the libvirt snapshots like you
> suggest, 'libvirt-', and check via rbd_diff_iterate2()
> whether the parent image changed since the last snapshot. That's a bit
> slower than plain cloning, but with object map + fast diff it's fast
> again, since it doesn't need to scan all the objects anymore.
> 

I'll give that a try, seems like a good suggestion!

I'll have to use rbd_diff_iterate() though, since iterate2() is
post-hammer and will not be available on all systems.

> I think libvirt would need to expand its api a bit to be able to really
> use it effectively to manage rbd. Hiding the snapshots becomes
> cumbersome if the application wants to use them too. If libvirt's
> current model of clones lets parents be deleted before children,
> that may be a hassle to hide too...
> 

Yes, I would love to see:

- vol-snap-list
- vol-snap-create
- vol-snap-delete
- vol-snap-revert

And then:

- vol-clone --snapshot  --pool  image1 image2

But this would need some more work inside libvirt. Would be very nice
though.

At CloudStack we want to do as much as possible using libvirt, the more
features it has there, the less we have to do in Java code :)

Wido

> Josh
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Issue with Ceph File System and LIO

2015-12-22 Thread Eric Eastman
On Sun, Dec 20, 2015 at 7:38 PM, Eric Eastman
 wrote:
> On Fri, Dec 18, 2015 at 12:18 AM, Yan, Zheng  wrote:
>> On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman
>>  wrote:
 Hi Yan Zheng, Eric Eastman

 Similar bug was reported in f2fs, btrfs, it does affect 4.4-rc4, the fixing
 patch was merged into 4.4-rc5, dfd01f026058 ("sched/wait: Fix the signal
 handling fix").

 Related report & discussion was here:
 https://lkml.org/lkml/2015/12/12/149

 I'm not sure the current reported issue of ceph was related to that though,
 but at least try testing with an upgraded or patched kernel could verify 
 it.
 :)

 Thanks,
>
>>
>> please try rc5 kernel without patches and DEBUG_VM=y
>>
>> Regards
>> Yan, Zheng
>
>
> The latest test with 4.4rc5 with CONFIG_DEBUG_VM=y has ran for over 36
> hours with no ERRORS or WARNINGS.  My plan is to install the 4.4rc6
> kernel from the Ubuntu kernel-ppa site once it is available, and rerun
> the tests.
>

The test has run for 2 days using the 4.4rc6 kernel from the Ubuntu
kernel-ppa site without error or warning.  Looks like it was a
4.4rc4 bug.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FreeBSD Building and Testing

2015-12-21 Thread Willem Jan Withagen

On 21-12-2015 01:45, Xinze Chi (信泽) wrote:

sorry for delay reply. Please have a try
https://github.com/ceph/ceph/commit/ae4a8162eacb606a7f65259c6ac236e144bfef0a.


Tried this one first:

Testsuite summary for ceph 10.0.1

# TOTAL: 120
# PASS:  100
# SKIP:  0
# XFAIL: 0
# FAIL:  20
# XPASS: 0
# ERROR: 0


So that certainly helps.
Have not yet analyzed the log files... But it seems we are getting 
somewhere.

Needed to manually kill a rados access in:
 | | | \-+- 09792 wjw /bin/sh ../test-driver 
./test/ceph_objectstore_tool.py
 | | |   \-+- 09807 wjw python 
./test/ceph_objectstore_tool.py (python2.7)
 | | | \--- 11406 wjw 
/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/.libs/rados -p rep_pool -N put 
REPobject1 /tmp/data.9807/-REPobject1__head


But also 2 mon-osds were running, and perhaps one did not belong to
that test. So they could be in each other's way.

Found some failed asserts in OSDs at:

./test-suite.log:osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())
./test-suite.log:osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())

struct OnRecoveryReadComplete :
  public GenContext<pair<RecoveryMessages *, ECBackend::read_result_t &> &> {

  ECBackend *pg;
  hobject_t hoid;
  set<int> want;
  OnRecoveryReadComplete(ECBackend *pg, const hobject_t &hoid)
    : pg(pg), hoid(hoid) {}
  void finish(pair<RecoveryMessages *, ECBackend::read_result_t &> &in) {
    ECBackend::read_result_t &res = in.second;
    // FIXME???
    assert(res.r == 0);
201:assert(res.errors.empty());
    assert(res.returned.size() == 1);
    pg->handle_recovery_read_complete(
      hoid,
      res.returned.back(),
      res.attrs,
      in.first);
  }
};

Given the FIXME?? the code here could be fishy??

I would say that just this patch would be sufficient.
The second patch also looks like it could be useful, since it
lowers the bar on being tested. And when alignment is only required
because of (a)iovec processing, 4096 will likely suffice.

Thank you very much for the help.

--WjW



2015-12-21 0:10 GMT+08:00 Willem Jan Withagen :

Hi,

Most of the Ceph is getting there in the most crude and rough state.
So beneath is a status update on what is not working for me jet.

Especially help with the aligment problem in os/FileJournal.cc would be
appricated... It would allow me to run ceph-osd and run more tests to
completion.

What would happen if I comment out this test, and ignore the fact that
thing might be unaligned?
Is it a performance/paging issue?
Or is data going to be corrupted?

--WjW

PASS: src/test/run-cli-tests

Testsuite summary for ceph 10.0.0

# TOTAL: 1
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0


gmake test:

Testsuite summary for ceph 10.0.0

# TOTAL: 119
# PASS:  95
# SKIP:  0
# XFAIL: 0
# FAIL:  24
# XPASS: 0
# ERROR: 0


The folowing notes can be made with this:
1) the run-cli-tests run to completion because I excluded the RBD tests
2) gmake test has the following tests FAIL:
FAIL: unittest_erasure_code_plugin
FAIL: ceph-detect-init/run-tox.sh
FAIL: test/erasure-code/test-erasure-code.sh
FAIL: test/erasure-code/test-erasure-eio.sh
FAIL: test/run-rbd-unit-tests.sh
FAIL: test/ceph_objectstore_tool.py
FAIL: test/test-ceph-helpers.sh
FAIL: test/cephtool-test-osd.sh
FAIL: test/cephtool-test-mon.sh
FAIL: test/cephtool-test-mds.sh
FAIL: test/cephtool-test-rados.sh
FAIL: test/mon/osd-crush.sh
FAIL: test/osd/osd-scrub-repair.sh
FAIL: test/osd/osd-scrub-snaps.sh
FAIL: test/osd/osd-config.sh
FAIL: test/osd/osd-bench.sh
FAIL: test/osd/osd-reactivate.sh
FAIL: test/osd/osd-copy-from.sh
FAIL: test/libradosstriper/rados-striper.sh
FAIL: test/test_objectstore_memstore.sh
FAIL: test/ceph-disk.sh
FAIL: test/pybind/test_ceph_argparse.py
FAIL: test/pybind/test_ceph_daemon.py
FAIL: ../qa/workunits/erasure-code/encode-decode-non-regression.sh

Most of the fails are because ceph-osd crashed consistently on:
-1 journal  bl.is_aligned(block_size) 0
bl.is_n_align_sized(CEPH_MINIMUM_BLOCK_SIZE) 1
-1 journal  block_size 131072 CEPH_MINIMUM_BLOCK_SIZE 4096
CEPH_PAGE_SIZE 4096 header.alignment 131072
bl buffer::list(len=131072, buffer::ptr(0~131072 0x805319000 in raw
0x805319000 len 131072 nref 1))
os/FileJournal.cc: In function 'void FileJournal::align_bl(off64_t,
bufferlist &)' thread 805217400 time 2015-12-19 13:43:06.706797

Re: Is rbd_discard enough to wipe an RBD image?

2015-12-21 Thread Wido den Hollander
On 12/21/2015 04:50 PM, Josh Durgin wrote:
> On 12/21/2015 07:09 AM, Jason Dillaman wrote:
>> You will have to ensure that your writes are properly aligned with the
>> object size (or object set if fancy striping is used on the RBD
>> volume).  In that case, the discard is translated to remove operations
>> on each individual backing object.  The only time zeros are written to
>> disk is if you specify an offset somewhere in the middle of an object
>> (i.e. the whole object cannot be deleted nor can it be truncated) --
>> this is the partial discard case controlled by that configuration param.
>>
> 
> I'm curious what's using the virVolWipe stuff - it can't guarantee it's
> actually wiping the data in many common configurations, not just with
> ceph but with any kind of disk, since libvirt is usually not consuming
> raw disks, and with modern flash and smr drives even that is not enough.
> There's a recent patch improving the docs on this [1].
> 
> If the goal is just to make the data inaccessible to the libvirt user,
> removing the image is just as good.
> 
> That said, with rbd there's not much cost to zeroing the image with
> object map enabled - it's effectively just doing the data removal step
> of 'rbd rm' early.
> 

I was looking at the features the RBD storage pool driver is missing in
libvirt and it is:

- Build from Volume. That's RBD cloning
- Uploading and Downloading Volume
- Wiping Volume

The thing about wiping in libvirt is that the volume still exists
afterwards, it is just empty.

My discard code now works, but I wanted to verify. If I understand Jason
correctly it would be a matter of figuring out the 'order' of a image
and call rbd_discard in a loop until you reach the end of the image.

I just want libvirt to be as feature complete as possible when it comes
to RBD.

Wido

> Josh
> 
> [1] http://comments.gmane.org/gmane.comp.emulators.libvirt/122235
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FreeBSD Building and Testing

2015-12-21 Thread Willem Jan Withagen

On 20-12-2015 17:10, Willem Jan Withagen wrote:

Hi,

Most of the Ceph is getting there in the most crude and rough state.
So beneath is a status update on what is not working for me jet.



Further:
A) unittest_erasure_code_plugin fails because a different error code is
returned when dlopen-ing a non-existent library.
load dlopen(.libs/libec_invalid.so): Cannot open ".libs/libec_invalid.so"
load dlsym(.libs/libec_missing_version.so, __erasure_code_init): Undefined symbol "__erasure_code_init"
test/erasure-code/TestErasureCodePlugin.cc:88: Failure
Value of: instance.factory("missing_version", g_conf->erasure_code_dir, profile, _code, )
  Actual: -2
Expected: -18


EXDEV is actually 18, so that part is correct.
But EXDEV is the cross-device link error.

Whereas the actual answer, -2, is factually correct:
#define ENOENT  2   /* No such file or directory */

So why does the test expect EXDEV instead of ENOENT?
Could be a typical Linux <> FreeBSD thingy.

--WjW
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is rbd_discard enough to wipe an RBD image?

2015-12-21 Thread Josh Durgin

On 12/21/2015 11:00 AM, Wido den Hollander wrote:

My discard code now works, but I wanted to verify. If I understand Jason
correctly it would be a matter of figuring out the 'order' of a image
and call rbd_discard in a loop until you reach the end of the image.


You'd need to get the order via rbd_stat(), convert it to object size 
(i.e. (1 << order)), and fetch stripe_count with rbd_get_stripe_count().


Then do the discards in (object size * stripe_count) chunks. This
ensures you discard entire objects. This is the size you'd want to use
for import/export as well, ideally.
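
A minimal sketch of that loop using the librbd C API (assuming an
already-opened rbd_image_t, with error handling reduced to early returns;
this is an illustration, not libvirt's actual patch):

  #include <rbd/librbd.h>
  #include <cstdint>

  static int wipe_image(rbd_image_t image)
  {
    rbd_image_info_t info;
    int r = rbd_stat(image, &info, sizeof(info));
    if (r < 0)
      return r;

    uint64_t stripe_count = 1;                     // plain striping fallback
    if (rbd_get_stripe_count(image, &stripe_count) < 0 || stripe_count == 0)
      stripe_count = 1;

    const uint64_t chunk = info.obj_size * stripe_count;  // obj_size == 1 << order
    for (uint64_t off = 0; off < info.size; off += chunk) {
      uint64_t len = (off + chunk > info.size) ? info.size - off : chunk;
      r = rbd_discard(image, off, len);            // whole-object, aligned discards
      if (r < 0)
        return r;
    }
    return 0;
  }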


I just want libvirt to be as feature complete as possible when it comes
to RBD.


I see, makes sense.

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD performance with many childs and snapshots

2015-12-21 Thread Josh Durgin

On 12/21/2015 11:06 AM, Wido den Hollander wrote:

Hi,

While implementing the buildvolfrom method in libvirt for RBD I'm stuck
at some point.

$ virsh vol-clone --pool myrbdpool image1 image2

This would clone image1 to a new RBD image called 'image2'.

The code I've written now does:

1. Create a snapshot called image1@libvirt-
2. Protect the snapshot
3. Clone the snapshot to 'image1'

wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
rbdpool image1 image2
Vol image2 cloned from image1

wido@wido-desktop:~/repos/libvirt$

root@alpha:~# rbd -p libvirt info image2
rbd image 'image2':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.1976451ead36b
format: 2
features: layering, striping
flags:
parent: libvirt/image1@libvirt-1450724650
overlap: 10240 MB
stripe unit: 4096 kB
stripe count: 1
root@alpha:~#

But this could potentially lead to a lot of snapshots with children on
'image1'.

image1 itself will probably never change, but I'm wondering about the
negative performance impact this might have on a OSD.


Creating them isn't so bad, more snapshots that don't change don't have
much affect on the osds. Deleting them is what's expensive, since the
osds need to scan the objects to see which ones are part of the
snapshot and can be deleted. If you have too many snapshots created and
deleted, it can affect cluster load, so I'd rather avoid always
creating a snapshot.


I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
into libvirt. There is however no way to pass something like a snapshot
name in libvirt when cloning.

Any bright suggestions? Or is it fine to create so many snapshots?


You could have canonical names for the libvirt snapshots like you 
suggest, 'libvirt-', and check via rbd_diff_iterate2()

whether the parent image changed since the last snapshot. That's a bit
slower than plain cloning, but with object map + fast diff it's fast
again, since it doesn't need to scan all the objects anymore.
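
A rough sketch of such a check with the librbd C API (the helper name is
hypothetical, the image is assumed to be open, and its size is assumed to
come from a prior rbd_stat() call):

  #include <rbd/librbd.h>
  #include <cstdint>

  static int diff_cb(uint64_t /*off*/, size_t /*len*/, int /*exists*/, void *arg)
  {
    *static_cast<bool *>(arg) = true;   // at least one extent differs
    return 0;
  }

  // Returns 1 if the image changed since 'snapname', 0 if unchanged, <0 on error.
  static int changed_since_snap(rbd_image_t image, const char *snapname,
                                uint64_t image_size)
  {
    bool changed = false;
    int r = rbd_diff_iterate2(image, snapname, 0, image_size,
                              0 /* include_parent */, 1 /* whole_object */,
                              diff_cb, &changed);
    if (r < 0)
      return r;
    return changed ? 1 : 0;
  }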

I think libvirt would need to expand its api a bit to be able to really
use it effectively to manage rbd. Hiding the snapshots becomes 
cumbersome if the application wants to use them too. If libvirt's

current model of clones lets parents be deleted before children,
that may be a hassle to hide too...

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is rbd_discard enough to wipe an RBD image?

2015-12-21 Thread Alexandre DERUMIER
>>I just want to know if this is sufficient to wipe a RBD image?

AFAIK, ceph writes zeroes in the rados objects when discard is used.

There is an option to skip writing zeroes if needed:

OPTION(rbd_skip_partial_discard, OPT_BOOL, false) // when trying to discard a 
range inside an object, set to true to skip zeroing the range.
- Original Message -
From: "Wido den Hollander" 
To: "ceph-devel" 
Sent: Sunday, December 20, 2015 22:21:50
Subject: Is rbd_discard enough to wipe an RBD image?

Hi, 

I'm busy implementing the volume wiping method of the libvirt storage 
pool backend and instead of writing to the whole RBD image with zeroes 
I'm using rbd_discard. 

Using a 4MB length I'm starting at offset 0 and work my way through the 
whole RBD image. 

A quick try shows me that my partition table + filesystem are gone on 
the RBD image after I've run rbd_discard. 

I just want to know if this is sufficient to wipe a RBD image? Or would 
it be better to fully fill the image with zeroes? 

-- 
Wido den Hollander 
42on B.V. 
Ceph trainer and consultant 

Phone: +31 (0)20 700 9902 
Skype: contact42on 
-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
the body of a message to majord...@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: Client still connect failed leader after that mon down

2015-12-21 Thread Sage Weil
On Mon, 21 Dec 2015, Zhi Zhang wrote:
> Regards,
> Zhi Zhang (David)
> Contact: zhang.david2...@gmail.com
>   zhangz.da...@outlook.com
> 
> 
> 
> -- Forwarded message --
> From: Jaze Lee <jaze...@gmail.com>
> Date: Mon, Dec 21, 2015 at 4:08 PM
> Subject: Re: Client still connect failed leader after that mon down
> To: Zhi Zhang <zhang.david2...@gmail.com>
> 
> 
> Hello,
> I am terrible sorry.
> I think we may not need to reconstruct the monclient.{h,cc}, we find
> the parameter is mon_client_hunt_interval is very usefull.
> When we set mon_client_hunt_interval = 0.5? the time to run a ceph
> command is very small even it first connects the down leader mon.
> 
> The first time i ask the question was because we find the parameter
> from official site
> http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/.
> It is write in this
> 
> mon client hung interval

Yep, that's a typo. Do you mind submitting a patch to fix it?

Thanks!
sage


> 
> Description:The client will try a new monitor every N seconds until it
> establishes a connection.
> Type:Double
> Default:3.0
> 
> And we set it. it is not work.
> 
> I think may be it is a slip of pen?
> The right configuration parameter should be mon client hunt interval
> 
> Can someone please help me to fix this in official site?
> 
> Thanks a lot.
> 
> 
> 
> 2015-12-21 14:00 GMT+08:00 Jaze Lee <jaze...@gmail.com>:
> > right now we use simple msg, and cpeh version is 0.80...
> >
> > 2015-12-21 10:55 GMT+08:00 Zhi Zhang <zhang.david2...@gmail.com>:
> >> Which msg type and ceph version are you using?
> >>
> >> Once we used 0.94.1 with async msg, we encountered similar issue.
> >> Client was trying to connect a down monitor when it was just started
> >> and this connection would hung there. This is because previous async
> >> msg used blocking connection mode.
> >>
> >> After we back ported non-blocking mode of async msg from higher ceph
> >> version, we haven't encountered such issue yet.
> >>
> >>
> >> Regards,
> >> Zhi Zhang (David)
> >> Contact: zhang.david2...@gmail.com
> >>   zhangz.da...@outlook.com
> >>
> >>
> >> On Fri, Dec 18, 2015 at 11:41 AM, Jevon Qiao <scaleq...@gmail.com> wrote:
> >>> On 17/12/15 21:27, Sage Weil wrote:
> >>>>
> >>>> On Thu, 17 Dec 2015, Jaze Lee wrote:
> >>>>>
> >>>>> Hello cephers:
> >>>>>  In our test, there are three monitors. We find client run ceph
> >>>>> command will slow when the leader mon is down. Even after long time, a
> >>>>> client run ceph command will also slow in first time.
> >>>>> >From strace, we find that the client first to connect the leader, then
> >>>>> after 3s, it connect the second.
> >>>>> After some search we find that the quorum is not change, the leader is
> >>>>> still the down monitor.
> >>>>> Is that normal?  Or is there something i miss?
> >>>>
> >>>> It's normal.  Even when the quorum does change, the client doesn't
> >>>> know that.  It should be contacting a random mon on startup, though, so I
> >>>> would expect the 3s delay 1/3 of the time.
> >>>
> >>> That's because client randomly picks up a mon from Monmap. But what we
> >>> observed is that when a mon is down no change is made to monmap(neither 
> >>> the
> >>> epoch nor the members). Is it the culprit for this phenomenon?
> >>>
> >>> Thanks,
> >>> Jevon
> >>>
> >>>> A long-standing low-priority feature request is to have the client 
> >>>> contact
> >>>> 2 mons in parallel so that it can still connect quickly if one is down.
> >>>> It's requires some non-trivial work in mon/MonClient.{cc,h} though and I
> >>>> don't think anyone has looked at it seriously.
> >>>>
> >>>> sage
> >>>>
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>> the body of a message to majord...@vger.kernel.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> the body of a message to majord...@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > 
> 
> 
> 
> --
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Issue with Ceph File System and LIO

2015-12-21 Thread Gregory Farnum
On Sun, Dec 20, 2015 at 6:38 PM, Eric Eastman
 wrote:
> On Fri, Dec 18, 2015 at 12:18 AM, Yan, Zheng  wrote:
>> On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman
>>  wrote:
 Hi Yan Zheng, Eric Eastman

 Similar bug was reported in f2fs, btrfs, it does affect 4.4-rc4, the fixing
 patch was merged into 4.4-rc5, dfd01f026058 ("sched/wait: Fix the signal
 handling fix").

 Related report & discussion was here:
 https://lkml.org/lkml/2015/12/12/149

 I'm not sure the current reported issue of ceph was related to that though,
 but at least try testing with an upgraded or patched kernel could verify 
 it.
 :)

 Thanks,
>
>>
>> please try rc5 kernel without patches and DEBUG_VM=y
>>
>> Regards
>> Yan, Zheng
>
>
> The latest test with 4.4rc5 with CONFIG_DEBUG_VM=y has ran for over 36
> hours with no ERRORS or WARNINGS.  My plan is to install the 4.4rc6
> kernel from the Ubuntu kernel-ppa site once it is available, and rerun
> the tests.
>
> Before running this test I had to rebuild the Ceph File System as
> after the last logged errors on Friday using the 4.4rc4 kernel, the
> Ceph File system hung accessing the exported image file.  After
> rebooting my iSCSI gateway using the Ceph File System, from / using
> command: strace du -a cephfs, the mount point, the hang happened on
> the newfsstatat call on my image file:
>
> write(1, "0\tcephfs/ctdb/.ctdb.lock\n", 250 cephfs/ctdb/.ctdb.lock
> ) = 25
> close(5)= 0
> write(1, "0\tcephfs/ctdb\n", 140 cephfs/ctdb
> )= 14
> newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896,
> ...}, AT_SYMLINK_NOFOLLOW) = 0
> openat(4, "iscsi", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 3
> fcntl(3, F_GETFD)   = 0
> fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
> fstat(3, {st_mode=S_IFDIR|0755, st_size=993814480896, ...}) = 0
> fcntl(3, F_GETFL)   = 0x38800 (flags
> O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW)
> fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
> newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896,
> ...}, AT_SYMLINK_NOFOLLOW) = 0
> fcntl(3, F_DUPFD, 3)= 5
> fcntl(5, F_GETFD)   = 0
> fcntl(5, F_SETFD, FD_CLOEXEC)   = 0
> getdents(3, /* 8 entries */, 65536) = 288
> getdents(3, /* 0 entries */, 65536) = 0
> close(3)= 0
> newfstatat(5, "iscsi900g.img", ^C
> ^C^C^C
> ^Z
> I could not break out with a ^C, and had to background the process to
> get my prompt back. The process would not die so I had to hard reset
> the system.
>
> This same hang happened on 2 other kernel mounted systems using a 4.3.0 
> kernel.
>
> On a separate system, I fuse mounted the file system and a du -a
> cephfs hung at the same point. Once again I could not break out of the
> hang, and had to hard reset the system.
>
> Restarting the MDS and Monitors did not clear the issue. Taking a
> quick look at the dumpcache showed it was large
>
> # ceph mds tell 0 dumpcache /tmp/dump.txt
> ok
> # wc /tmp/dump.txt
>   370556  5002449 59211054 /tmp/dump.txt
> # tail /tmp/dump.txt
> [inode 1259276 [...c4,head] ~mds0/stray0/1259276/ auth v977593
> snaprealm=0x561339e3fb00 f(v0 m2015-12-12 00:51:04.345614) n(v0
> rc2015-12-12 00:51:04.345614 1=0+1) (iversion lock) 0x561339c66228]
> [inode 120c1ba [...a6,head] ~mds0/stray0/120c1ba/ auth v742016
> snaprealm=0x56133ad19600 f(v0 m2015-12-10 18:25:55.880167) n(v0
> rc2015-12-10 18:25:55.880167 1=0+1) (iversion lock) 0x56133a5e0d88]
> [inode 10d0088 [...77,head] ~mds0/stray6/10d0088/ auth v292336
> snaprealm=0x5613537673c0 f(v0 m2015-12-08 19:23:20.269283) n(v0
> rc2015-12-08 19:23:20.269283 1=0+1) (iversion lock) 0x56134c2f7378]

These are deleted files that haven't been trimmed yet...

>
> I tried one more thing:
>
> ceph daemon mds.0 flush journal
>
> and restarted the MDS. Accessing the file system still locked up, but
> a du -a cephfs did not even get to the iscsi900g.img file. As I was
> running on a broken rc kernel, with snapshots turned on

...and I think we have some known issues in the tracker about snap
trimming and snapshotted inodes. So this is not entirely surprising.
:/
-Greg


>, when this
> corruption happened, I decided to recreated the file system and
> restarted the ESXi iSCSI test.
>
> Regards,
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Improving Data-At-Rest encryption in Ceph

2015-12-21 Thread Adam Kupczyk
On Wed, Dec 16, 2015 at 11:33 PM, Sage Weil  wrote:
> On Wed, 16 Dec 2015, Adam Kupczyk wrote:
>> On Tue, Dec 15, 2015 at 3:23 PM, Lars Marowsky-Bree  wrote:
>> > On 2015-12-14T14:17:08, Radoslaw Zarzynski  wrote:
>> >
>> > Hi all,
>> >
>> > great to see this revived.
>> >
>> > However, I have come to see some concerns with handling the encryption
>> > within Ceph itself.
>> >
>> > The key part to any such approach is formulating the threat scenario.
>> > For the use cases we have seen, the data-at-rest encryption matters so
>> > they can confidently throw away disks without leaking data. It's not
>> > meant as a defense against an online attacker. There usually is no
>> > problem with "a few" disks being privileged, or one or two nodes that
>> > need an admin intervention for booting (to enter some master encryption
>> > key somehow, somewhere).
>> >
>> > However, that requires *all* data on the OSDs to be encrypted.
>> >
>> > Crucially, that includes not just the file system meta data (so not just
>> > the data), but also the root and especially the swap partition. Those
>> > potentially include swapped out data, coredumps, logs, etc.
>> >
>> > (As an optional feature, it'd be cool if an OSD could be moved to a
>> > different chassis and continue operating there, to speed up recovery.
>> > Another optional feature would be to eventually be able, for those
>> > customers that trust them ;-), supply the key to the on-disk encryption
>> > (OPAL et al).)
>> >
>> > The proposal that Joshua posted a while ago essentially remained based
>> > on dm-crypt, but put in simple hooks to retrieve the keys from some
>> > "secured" server via sftp/ftps instead of loading them from the root fs.
>> > Similar to deo, that ties the key to being on the network and knowing
>> > the OSD UUID.
>> >
>> > This would then also be somewhat easily extensible to utilize the same
>> > key management server via initrd/dracut.
>> >
>> > Yes, this means that each OSD disk is separately encrypted, but given
>> > modern CPUs, this is less of a problem. It does have the benefit of
>> > being completely transparent to Ceph, and actually covering the whole
>> > node.
>> Agreed, if encryption is infinitely fast dm-crypt is best solution.
>> Below is short analysis of encryption burden for dm-crypt and
>> OSD-encryption when using replicated pools.
>>
>> Summary:
>> OSD encryption requires 2.6 times less crypto operations then dm-crypt.
>
> Yeah, I believe that, but
>
>> Crypto ops are bottleneck.
>
> is this really true?  I don't think we've tried to measure performance
> with dm-crypt, but I also have never heard anyone complain about the
> additional CPU utilization or performance impact.  Have you observed this?
I ran tests, mostly on my i7-4910MQ 2.9GHz (4 cores) with an SSD.
The results for write were appallingly low, I guess due to kernel
problems with multi-CPU kcrypto [1]. I will not mention them, as those
results would only obfuscate the discussion. And newer kernels (>4.0.2)
do fix the issue.

The results for read were 350MB/s, but CPU utilization was 44% in the
kcrypto kernel worker (single core). That is effectively 11% of total
crypto capacity (44% of one core out of four), because the Intel-optimized
AES-NI instructions are used almost every cycle, making hyperthreading useless.

[1] 
http://unix.stackexchange.com/questions/203677/abysmal-general-dm-crypt-luks-write-performance
>
> sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: tool for applying 'ceph daemon ' command to all OSDs

2015-12-21 Thread Gregory Farnum
On Mon, Dec 21, 2015 at 9:59 PM, Dan Mick  wrote:
> I needed something to fetch current config values from all OSDs (sorta
> the opposite of 'injectargs --key value), so I hacked it, and then
> spiffed it up a bit.  Does this seem like something that would be useful
> in this form in the upstream Ceph, or does anyone have any thoughts on
> its design or structure?
>
> It requires a locally-installed ceph CLI and a ceph.conf that points to
> the cluster and any required keyrings.  You can also provide it with
> a YAML file mapping host to osds if you want to save time collecting
> that info for a statically-defined cluster, or if you want just a subset
> of OSDs.
>
> https://github.com/dmick/tools/blob/master/osd_daemon_cmd.py
>
> Excerpt from usage:
>
> Execute a Ceph osd daemon command on every OSD in a cluster with
> one connection to each OSD host.
>
> Usage:
> osd_daemon_cmd [-c CONF] [-u USER] [-f FILE] (COMMAND | -k KEY)
>
> Options:
>-c CONF   ceph.conf file to use [default: ./ceph.conf]
>-u USER   user to connect with ssh
>-f FILE   get names and osds from yaml
>COMMAND   command other than "config get" to execute
>-k KEYconfig key to retrieve with config get 

I naively like the functionality being available, but if I'm skimming
this correctly it looks like you're relying on the local node being
able to passwordless-ssh to all of the nodes, and for that account to
be able to access the ceph admin sockets. Granted we rely on the ssh
for ceph-deploy as well, so maybe that's okay, but I'm not sure in
this case since it implies a lot more network openness.

Relatedly (perhaps in an opposing direction), maybe we want anything
exposed over the network to have some sort of explicit permissions
model?

Maybe not and we should just ship the script for trusted users. I
would have liked it on the long-running cluster I'm sure you built it
for. ;)
Indecisively yours,
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is rbd_discard enough to wipe an RBD image?

2015-12-21 Thread Josh Durgin

On 12/21/2015 07:09 AM, Jason Dillaman wrote:

You will have to ensure that your writes are properly aligned with the object 
size (or object set if fancy striping is used on the RBD volume).  In that 
case, the discard is translated to remove operations on each individual backing 
object.  The only time zeros are written to disk is if you specify an offset 
somewhere in the middle of an object (i.e. the whole object cannot be deleted 
nor can it be truncated) -- this is the partial discard case controlled by that 
configuration param.



I'm curious what's using the virVolWipe stuff - it can't guarantee it's
actually wiping the data in many common configurations, not just with
ceph but with any kind of disk, since libvirt is usually not consuming
raw disks, and with modern flash and SMR drives even that is not enough.
There's a recent patch improving the docs on this [1].

If the goal is just to make the data inaccessible to the libvirt user,
removing the image is just as good.

That said, with rbd there's not much cost to zeroing the image with
object map enabled - it's effectively just doing the data removal step
of 'rbd rm' early.

Josh

[1] http://comments.gmane.org/gmane.comp.emulators.libvirt/122235
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is rbd_discard enough to wipe an RBD image?

2015-12-21 Thread Jason Dillaman
You will have to ensure that your writes are properly aligned with the object 
size (or object set if fancy striping is used on the RBD volume).  In that 
case, the discard is translated to remove operations on each individual backing 
object.  The only time zeros are written to disk is if you specify an offset 
somewhere in the middle of an object (i.e. the whole object cannot be deleted 
nor can it be truncated) -- this is the partial discard case controlled by that 
configuration param.  
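
For illustration, a minimal sketch of the wiping loop being discussed, using
the rados/rbd Python bindings and a 4 MiB step as Wido describes below; the
pool and image names are placeholders and error handling is omitted:

    import rados
    import rbd

    CHUNK = 4 * 1024 * 1024  # 4 MiB steps, as in Wido's test

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")           # placeholder pool name
    image = rbd.Image(ioctx, "test-image")      # placeholder image name
    try:
        size = image.size()
        offset = 0
        while offset < size:
            # whole backing objects are removed/truncated; only ranges in the
            # middle of an object are zeroed (cf. rbd_skip_partial_discard)
            image.discard(offset, min(CHUNK, size - offset))
            offset += CHUNK
    finally:
        image.close()
        ioctx.close()
        cluster.shutdown()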

-- 

Jason Dillaman 


- Original Message -
> From: "Alexandre DERUMIER" <aderum...@odiso.com>
> To: "Wido den Hollander" <w...@42on.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Monday, December 21, 2015 9:25:15 AM
> Subject: Re: Is rbd_discard enough to wipe an RBD image?
> 
> >>I just want to know if this is sufficient to wipe a RBD image?
> 
> AFAIK, ceph writes zeroes in the rados objects when discard is used.
> 
> There is an option to skip writing zeroes if needed:
> 
> OPTION(rbd_skip_partial_discard, OPT_BOOL, false) // when trying to discard a
> range inside an object, set to true to skip zeroing the range.
> - Mail original -
> De: "Wido den Hollander" <w...@42on.com>
> À: "ceph-devel" <ceph-devel@vger.kernel.org>
> Envoyé: Dimanche 20 Décembre 2015 22:21:50
> Objet: Is rbd_discard enough to wipe an RBD image?
> 
> Hi,
> 
> I'm busy implementing the volume wiping method of the libvirt storage
> pool backend and instead of writing to the whole RBD image with zeroes
> I'm using rbd_discard.
> 
> Using a 4MB length I'm starting at offset 0 and work my way through the
> whole RBD image.
> 
> A quick try shows me that my partition table + filesystem are gone on
> the RBD image after I've run rbd_discard.
> 
> I just want to know if this is sufficient to wipe a RBD image? Or would
> it be better to fully fill the image with zeroes?
> 
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FreeBSD Building and Testing

2015-12-20 Thread 信泽
Sorry for the delayed reply. Please give this a try:
https://github.com/ceph/ceph/commit/ae4a8162eacb606a7f65259c6ac236e144bfef0a.

2015-12-21 0:10 GMT+08:00 Willem Jan Withagen :
> Hi,
>
> Most of Ceph is getting there, in a rather crude and rough state.
> So below is a status update on what is not yet working for me.
>
> Help with the alignment problem in os/FileJournal.cc would be especially
> appreciated... It would allow me to run ceph-osd and run more tests to
> completion.
>
> What would happen if I commented out this check and ignored the fact that
> things might be unaligned?
> Is it a performance/paging issue?
> Or is data going to be corrupted?
>
> --WjW
>
> PASS: src/test/run-cli-tests
> 
> Testsuite summary for ceph 10.0.0
> 
> # TOTAL: 1
> # PASS:  1
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  0
> # XPASS: 0
> # ERROR: 0
> 
>
> gmake test:
> 
> Testsuite summary for ceph 10.0.0
> 
> # TOTAL: 119
> # PASS:  95
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  24
> # XPASS: 0
> # ERROR: 0
> 
>
> The following notes can be made about this:
> 1) the run-cli-tests run to completion because I excluded the RBD tests
> 2) gmake test has the following tests FAIL:
> FAIL: unittest_erasure_code_plugin
> FAIL: ceph-detect-init/run-tox.sh
> FAIL: test/erasure-code/test-erasure-code.sh
> FAIL: test/erasure-code/test-erasure-eio.sh
> FAIL: test/run-rbd-unit-tests.sh
> FAIL: test/ceph_objectstore_tool.py
> FAIL: test/test-ceph-helpers.sh
> FAIL: test/cephtool-test-osd.sh
> FAIL: test/cephtool-test-mon.sh
> FAIL: test/cephtool-test-mds.sh
> FAIL: test/cephtool-test-rados.sh
> FAIL: test/mon/osd-crush.sh
> FAIL: test/osd/osd-scrub-repair.sh
> FAIL: test/osd/osd-scrub-snaps.sh
> FAIL: test/osd/osd-config.sh
> FAIL: test/osd/osd-bench.sh
> FAIL: test/osd/osd-reactivate.sh
> FAIL: test/osd/osd-copy-from.sh
> FAIL: test/libradosstriper/rados-striper.sh
> FAIL: test/test_objectstore_memstore.sh
> FAIL: test/ceph-disk.sh
> FAIL: test/pybind/test_ceph_argparse.py
> FAIL: test/pybind/test_ceph_daemon.py
> FAIL: ../qa/workunits/erasure-code/encode-decode-non-regression.sh
>
> Most of the failures are because ceph-osd crashed consistently on:
> -1 journal  bl.is_aligned(block_size) 0
> bl.is_n_align_sized(CEPH_MINIMUM_BLOCK_SIZE) 1
> -1 journal  block_size 131072 CEPH_MINIMUM_BLOCK_SIZE 4096
> CEPH_PAGE_SIZE 4096 header.alignment 131072
> bl buffer::list(len=131072, buffer::ptr(0~131072 0x805319000 in raw
> 0x805319000 len 131072 nref 1))
> os/FileJournal.cc: In function 'void FileJournal::align_bl(off64_t,
> bufferlist &)' thread 805217400 time 2015-12-19 13:43:06.706797
> os/FileJournal.cc: 1045: FAILED assert(0 == "bl should be align")
>
> This has been bugging me for a few days already, but I haven't found an easy
> way to debug this, either live in gdb or post-mortem.
>
> Further:
> A) unittest_erasure_code_plugin fails because a different error code is
> returned when dlopen-ing a non-existent library.
> load dlopen(.libs/libec_invalid.so): Cannot open
> ".libs/libec_invalid.so"load dlsym(.libs/libec_missing_version.so, _
> _erasure_code_init): Undefined symbol
> "__erasure_code_init"test/erasure-code/TestErasureCodePlugin.cc:88: Failure
> Value of: instance.factory("missing_version", g_conf->erasure_code_dir,
> profile, _code, )
>   Actual: -2
> Expected: -18
> load dlsym(.libs/libec_missing_entry_point.so, __erasure_code_init):
> Undefined symbol "__erasure_code_init"erasure_co
> de_init(fail_to_initialize,.libs): (3) No such processload
> __erasure_code_init()did not register fail_to_registerload
> : example erasure_code_init(example,.libs): (17) File existsload:
> example [  FAILED  ] ErasureCodePluginRegistryTest.
> all (330 ms)
>
> B) ceph-detect-init/run-tox.sh fails because I still need to add FreeBSD
> support to the tests.
>
> C) ./gtest/include/gtest/internal/gtest-port.h:1358:: Condition
> has_owner_ && pthread_equal(owner_, pthread_se
> lf()) failed. The current thread is not holding the mutex @0x161ef20
> ./test/run-rbd-unit-tests.sh: line 9: 78053 Abort trap
> (core dumped) unittest_librbd
>
> I think I found some commit comments about this, in either trac or git,
> about FreeBSD not being able to do certain things to its own thread. Got to
> look into this.
>
> D) Fix some of the other python code to work as expected.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Regards,
Xinze Chi

Re: Issue with Ceph File System and LIO

2015-12-20 Thread Eric Eastman
On Fri, Dec 18, 2015 at 12:18 AM, Yan, Zheng  wrote:
> On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman
>  wrote:
>>> Hi Yan Zheng, Eric Eastman
>>>
>>> Similar bug was reported in f2fs, btrfs, it does affect 4.4-rc4, the fixing
>>> patch was merged into 4.4-rc5, dfd01f026058 ("sched/wait: Fix the signal
>>> handling fix").
>>>
>>> Related report & discussion was here:
>>> https://lkml.org/lkml/2015/12/12/149
>>>
>>> I'm not sure the current reported issue of ceph was related to that though,
>>> but at least try testing with an upgraded or patched kernel could verify it.
>>> :)
>>>
>>> Thanks,

>
> please try rc5 kernel without patches and DEBUG_VM=y
>
> Regards
> Yan, Zheng


The latest test with 4.4rc5 with CONFIG_DEBUG_VM=y has run for over 36
hours with no ERRORS or WARNINGS.  My plan is to install the 4.4rc6
kernel from the Ubuntu kernel-ppa site once it is available, and rerun
the tests.

Before running this test I had to rebuild the Ceph File System, as
after the last logged errors on Friday with the 4.4rc4 kernel, the
Ceph File System hung while accessing the exported image file.  After
rebooting my iSCSI gateway that uses the Ceph File System, I ran, from /,
the command strace du -a cephfs (the mount point); the hang happened on
the newfstatat call on my image file:

write(1, "0\tcephfs/ctdb/.ctdb.lock\n", 250 cephfs/ctdb/.ctdb.lock
) = 25
close(5)= 0
write(1, "0\tcephfs/ctdb\n", 140 cephfs/ctdb
)= 14
newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896,
...}, AT_SYMLINK_NOFOLLOW) = 0
openat(4, "iscsi", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 3
fcntl(3, F_GETFD)   = 0
fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
fstat(3, {st_mode=S_IFDIR|0755, st_size=993814480896, ...}) = 0
fcntl(3, F_GETFL)   = 0x38800 (flags
O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW)
fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896,
...}, AT_SYMLINK_NOFOLLOW) = 0
fcntl(3, F_DUPFD, 3)= 5
fcntl(5, F_GETFD)   = 0
fcntl(5, F_SETFD, FD_CLOEXEC)   = 0
getdents(3, /* 8 entries */, 65536) = 288
getdents(3, /* 0 entries */, 65536) = 0
close(3)= 0
newfstatat(5, "iscsi900g.img", ^C
^C^C^C
^Z
I could not break out with a ^C, and had to background the process to
get my prompt back. The process would not die so I had to hard reset
the system.

This same hang happened on 2 other kernel mounted systems using a 4.3.0 kernel.

On a separate system, I fuse mounted the file system and a du -a
cephfs hung at the same point. Once again I could not break out of the
hang, and had to hard reset the system.

Restarting the MDS and Monitors did not clear the issue. Taking a
quick look at the dumpcache showed it was large

# ceph mds tell 0 dumpcache /tmp/dump.txt
ok
# wc /tmp/dump.txt
  370556  5002449 59211054 /tmp/dump.txt
# tail /tmp/dump.txt
[inode 1259276 [...c4,head] ~mds0/stray0/1259276/ auth v977593
snaprealm=0x561339e3fb00 f(v0 m2015-12-12 00:51:04.345614) n(v0
rc2015-12-12 00:51:04.345614 1=0+1) (iversion lock) 0x561339c66228]
[inode 120c1ba [...a6,head] ~mds0/stray0/120c1ba/ auth v742016
snaprealm=0x56133ad19600 f(v0 m2015-12-10 18:25:55.880167) n(v0
rc2015-12-10 18:25:55.880167 1=0+1) (iversion lock) 0x56133a5e0d88]
[inode 10d0088 [...77,head] ~mds0/stray6/10d0088/ auth v292336
snaprealm=0x5613537673c0 f(v0 m2015-12-08 19:23:20.269283) n(v0
rc2015-12-08 19:23:20.269283 1=0+1) (iversion lock) 0x56134c2f7378]

I tried one more thing:

ceph daemon mds.0 flush journal

and restarted the MDS. Accessing the file system still locked up, but
a du -a cephfs did not even get to the iscsi900g.img file. As I was
running on a broken rc kernel, with snapshots turned on, when this
corruption happened, I decided to recreate the file system and
restarted the ESXi iSCSI test.

Regards,
Eric
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Client still connect failed leader after that mon down

2015-12-20 Thread Zhi Zhang
Which msg type and ceph version are you using?

When we used 0.94.1 with the async messenger, we encountered a similar issue.
The client tried to connect to a down monitor right after it started,
and the connection would hang there, because the earlier async
messenger used a blocking connection mode.

After we backported the non-blocking mode of the async messenger from a later
ceph version, we haven't encountered this issue again.


Regards,
Zhi Zhang (David)
Contact: zhang.david2...@gmail.com
  zhangz.da...@outlook.com


On Fri, Dec 18, 2015 at 11:41 AM, Jevon Qiao  wrote:
> On 17/12/15 21:27, Sage Weil wrote:
>>
>> On Thu, 17 Dec 2015, Jaze Lee wrote:
>>>
>>> Hello cephers:
>>>  In our test, there are three monitors. We find that running a ceph
>>> command on a client is slow when the leader mon is down. Even after a long
>>> time, the first ceph command a client runs is still slow.
>>> From strace, we find that the client first tries to connect to the leader,
>>> then after 3s it connects to the second monitor.
>>> After some searching we find that the quorum has not changed; the leader is
>>> still the down monitor.
>>> Is that normal?  Or is there something I miss?
>>
>> It's normal.  Even when the quorum does change, the client doesn't
>> know that.  It should be contacting a random mon on startup, though, so I
>> would expect the 3s delay 1/3 of the time.
>
> That's because the client randomly picks a mon from the monmap. But what we
> observed is that when a mon is down no change is made to the monmap (neither
> the epoch nor the members). Is that the culprit for this phenomenon?
>
> Thanks,
> Jevon
>
>> A long-standing low-priority feature request is to have the client contact
>> 2 mons in parallel so that it can still connect quickly if one is down.
>> It's requires some non-trivial work in mon/MonClient.{cc,h} though and I
>> don't think anyone has looked at it seriously.
>>
>> sage
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: puzzling disapearance of /dev/sdc1

2015-12-18 Thread Loic Dachary
Hi Ilya,

It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds 
them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way partprobe 
does. It is used intensively by ceph-disk and inevitably leads to races where a 
device temporarily disappears. The same command (sgdisk 0.8.8) on Ubuntu 14.04 
with a 3.13.0-62-generic kernel only generates two udev change events and does 
not remove / add partitions. The source code between sgdisk 0.8.6 and sgdisk 
0.8.8 did not change in a significant way and the output of strace -e ioctl 
sgdisk -i 2 /dev/vdb is identical in both environments.

ioctl(3, BLKGETSIZE, 20971520)  = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, BLKGETSIZE, 20971520)  = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKGETSIZE, 20971520)  = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0

This leads me to the conclusion that the difference is in how the kernel reacts 
to these ioctl.

What do you think ? 

Cheers

On 17/12/2015 17:26, Ilya Dryomov wrote:
> On Thu, Dec 17, 2015 at 3:10 PM, Loic Dachary <l...@dachary.org> wrote:
>> Hi Sage,
>>
>> On 17/12/2015 14:31, Sage Weil wrote:
>>> On Thu, 17 Dec 2015, Loic Dachary wrote:
>>>> Hi Ilya,
>>>>
>>>> This is another puzzling behavior (the log of all commands is at
>>>> http://tracker.ceph.com/issues/14094#note-4). in a nutshell, after a
>>>> series of sgdisk -i commands to examine various devices including
>>>> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will showup
>>>> again although I don't have a definitive proof of this).
>>>>
>>>> It looks like a side effect of a previous partprobe command, the only
>>>> command I can think of that removes / re-adds devices. I thought calling
>>>> udevadm settle after running partprobe would be enough to ensure
>>>> partprobe completed (and since it takes as much as 2mn30 to return, I
>>>> would be shocked if it does not ;-).
> 
> Yeah, IIRC partprobe goes through every slot in the partition table,
> trying to first remove and then add the partition back.  But, I don't
> see any mention of partprobe in the log you referred to.
> 
> Should udevadm settle for a few vd* devices be taking that much time?
> I'd investigate that regardless of the issue at hand.
> 
>>>>
>>>> Any idea ? I desperately try to find a consistent behavior, something
>>>> reliable that we could use to say : "wait for the partition table to be
>>>> up to date in the kernel and all udev events generated by the partition
>>>> table update to complete".
>>>
>>> I wonder if the underlying issue is that we shouldn't be calling udevadm
>>> settle from something running from udev.  Instead, of a udev-triggered
>>> run of ceph-disk does something that changes the partitions, it
>>> should just exit and let udevadm run ceph-disk again on the new
>>> devices...?
> 
>>
>> Unless I missed something this is on CentOS 7 and ceph-disk is only called 
>> from udev as ceph-disk trigger which does nothing else but asynchronously 
>> delegate the work to systemd. Therefore there is no udevadm settle from 
>> within udev (which would deadlock and timeout every time... I hope ;-).
> 
> That's a sure lockup, until one of them times out.
> 
> How are you delegating to systemd?  Is it to avoid long-running udev
> events?  I'm probably missing something - udevadm settle wouldn't block
> on anything other than udev, so if you are shipping work off to
> somewhere else, udev can't be relied upon for waiting.
> 
> Thanks,
> 
> Ilya
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


Re: puzzling disapearance of /dev/sdc1

2015-12-18 Thread Ilya Dryomov
On Fri, Dec 18, 2015 at 1:38 PM, Loic Dachary <l...@dachary.org> wrote:
> Hi Ilya,
>
> It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds 
> them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way partprobe 
> does. It is used intensively by ceph-disk and inevitably leads to races where 
> a device temporarily disappears. The same command (sgdisk 0.8.8) on Ubuntu 
> 14.04 with a 3.13.0-62-generic kernel only generates two udev change events 
> and does not remove / add partitions. The source code between sgdisk 0.8.6 
> and sgdisk 0.8.8 did not change in a significant way and the output of strace 
> -e ioctl sgdisk -i 2 /dev/vdb is identical in both environments.
>
> ioctl(3, BLKGETSIZE, 20971520)  = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
> ioctl(3, BLKGETSIZE, 20971520)  = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKGETSIZE, 20971520)  = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
>
> This leads me to the conclusion that the difference is in how the kernel 
> reacts to these ioctl.

I'm pretty sure it's not the kernel versions that matter here, but
systemd versions.  Those are all get-property ioctls, and I don't think
sgdisk -i does anything with the partition table.

What it probably does though is it opens the disk for write for some
reason.  When it closes it, udevd (systemd-udevd process) picks that
close up via inotify and issues the BLKRRPART ioctl, instructing the
kernel to re-read the partition table.  Technically, that's different
from what partprobe does, but it still generates those udev events you
are seeing in the monitor.

AFAICT udevd started doing this in v214.

Thanks,

Ilya
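
For illustration only: the partition-table rescan referred to above can also
be requested by hand with the BLKRRPART ioctl. A minimal sketch, assuming root
and an otherwise idle scratch device (the device name is a placeholder):

    import fcntl
    import os

    BLKRRPART = 0x125f  # _IO(0x12, 95) from <linux/fs.h>: re-read partition table

    fd = os.open("/dev/vdb", os.O_RDONLY)   # placeholder scratch device
    try:
        fcntl.ioctl(fd, BLKRRPART)          # fails with EBUSY if a partition is in use
    finally:
        os.close(fd)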
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: puzzling disapearance of /dev/sdc1

2015-12-18 Thread Loic Dachary


On 18/12/2015 16:31, Ilya Dryomov wrote:
> On Fri, Dec 18, 2015 at 1:38 PM, Loic Dachary <l...@dachary.org> wrote:
>> Hi Ilya,
>>
>> It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds 
>> them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way 
>> partprobe does. It is used intensively by ceph-disk and inevitably leads to 
>> races where a device temporarily disappears. The same command (sgdisk 0.8.8) 
>> on Ubuntu 14.04 with a 3.13.0-62-generic kernel only generates two udev 
>> change events and does not remove / add partitions. The source code between 
>> sgdisk 0.8.6 and sgdisk 0.8.8 did not change in a significant way and the 
>> output of strace -e ioctl sgdisk -i 2 /dev/vdb is identical in both 
>> environments.
>>
>> ioctl(3, BLKGETSIZE, 20971520)  = 0
>> ioctl(3, BLKGETSIZE64, 10737418240) = 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
>> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
>> ioctl(3, BLKGETSIZE, 20971520)  = 0
>> ioctl(3, BLKGETSIZE64, 10737418240) = 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKGETSIZE, 20971520)  = 0
>> ioctl(3, BLKGETSIZE64, 10737418240) = 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>>
>> This leads me to the conclusion that the difference is in how the kernel 
>> reacts to these ioctl.
> 
> I'm pretty sure it's not the kernel versions that matter here, but
> systemd versions.  Those are all get-property ioctls, and I don't think
> sgdisk -i does anything with the partition table.
> 
> What it probably does though is it opens the disk for write for some
> reason.  When it closes it, udevd (systemd-udevd process) picks that
> close up via inotify and issues the BLKRRPART ioctl, instructing the
> kernel to re-read the partition table.  Technically, that's different
> from what partprobe does, but it still generates those udev events you
> are seeing in the monitor.
> 
> AFAICT udevd started doing this in v214.

That explains everything indeed.

# strace -f -e open sgdisk -i 2 /dev/vdb
...
open("/dev/vdb", O_RDONLY)  = 4
open("/dev/vdb", O_WRONLY|O_CREAT, 0644) = 4
open("/dev/vdb", O_RDONLY)  = 4
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 7BBAA731-AA45-47B8-8661-B4FAA53C4162
First sector: 2048 (at 1024.0 KiB)
Last sector: 204800 (at 100.0 MiB)
Partition size: 202753 sectors (99.0 MiB)
Attribute flags: 
Partition name: 'ceph journal'

# strace -f -e open blkid /dev/vdb2
...
open("/etc/blkid.conf", O_RDONLY)   = 4
open("/dev/.blkid.tab", O_RDONLY)   = 4
open("/dev/vdb2", O_RDONLY) = 4
open("/sys/dev/block/253:18", O_RDONLY) = 5
open("/sys/block/vdb/dev", O_RDONLY)= 6
open("/dev/.blkid.tab-hVvwJi", O_RDWR|O_CREAT|O_EXCL, 0600) = 4

blkid does not open the device for write, hence the different behavior. 
Switching from sgdisk to blkid fixes the issue.

Nice catch !

> Thanks,
> 
> Ilya
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


Re: puzzling disapearance of /dev/sdc1

2015-12-18 Thread Loic Dachary
Nevermind, got it:

CHANGES WITH 214:

* As an experimental feature, udev now tries to lock the
  disk device node (flock(LOCK_SH|LOCK_NB)) while it
  executes events for the disk or any of its partitions.
  Applications like partitioning programs can lock the
  disk device node (flock(LOCK_EX)) and claim temporary
  device ownership that way; udev will entirely skip all event
  handling for this disk and its partitions. If the disk
  was opened for writing, the close will trigger a partition
  table rescan in udev's "watch" facility, and if needed
  synthesize "change" events for the disk and all its partitions.
  This is now unconditionally enabled, and if it turns out to
  cause major problems, we might turn it on only for specific
  devices, or might need to disable it entirely. Device Mapper
  devices are excluded from this logic.
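
A minimal sketch of the locking protocol described in that changelog entry,
from the point of view of a partitioning tool (the device name is a
placeholder):

    import fcntl
    import os

    fd = os.open("/dev/vdb", os.O_RDWR)   # placeholder whole-disk device node
    fcntl.flock(fd, fcntl.LOCK_EX)        # claim the disk; udev skips its event handling
    try:
        pass  # ... rewrite the partition table here (sgdisk, parted, direct writes, ...)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)   # close after write: udev's watch rescans and synthesizes
                       # "change" events for the disk and its partitions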


On 18/12/2015 17:32, Loic Dachary wrote:
> 
>>> AFAICT udevd started doing this in v214.
> 
> Do you have a specific commit / changelog entry in mind? I'd like to add it
> as a reference to the commit message fixing the problem.
> 
> Thanks !
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


Re: Issue with Ceph File System and LIO

2015-12-18 Thread Mike Christie
Eric,

Do you have iSCSI data digests on?

On 12/15/2015 12:08 AM, Eric Eastman wrote:
> I am testing Linux Target SCSI, LIO, with a Ceph File System backstore
> and I am seeing this error on my LIO gateway.  I am using Ceph v9.2.0
> on a 4.4rc4 Kernel, on Trusty, using a kernel mounted Ceph File
> System.  A file on the Ceph File System is exported via iSCSI to a
> VMware ESXi 5.0 server, and I am seeing this error when doing a lot of
> I/O on the ESXi server.   Is this a LIO or a Ceph issue?
> 
> [Tue Dec 15 00:46:55 2015] [ cut here ]
> [Tue Dec 15 00:46:55 2015] WARNING: CPU: 0 PID: 1123421 at
> /home/kernel/COD/linux/fs/ceph/addr.c:125
> ceph_set_page_dirty+0x230/0x240 [ceph]()
> [Tue Dec 15 00:46:55 2015] Modules linked in: iptable_filter ip_tables
> x_tables xfs rbd iscsi_target_mod vhost_scsi tcm_qla2xxx ib_srpt
> tcm_fc tcm_usb_gadget tcm_loop target_core_file target_core_iblock
> target_core_pscsi target_core_user target_core_mod ipmi_devintf vhost
> qla2xxx ib_cm ib_sa ib_mad ib_core ib_addr libfc scsi_transport_fc
> libcomposite udc_core uio configfs ipmi_ssif ttm drm_kms_helper
> gpio_ich drm i2c_algo_bit fb_sys_fops coretemp syscopyarea ipmi_si
> sysfillrect ipmi_msghandler sysimgblt kvm acpi_power_meter 8250_fintek
> irqbypass hpilo shpchp input_leds serio_raw lpc_ich i7core_edac
> edac_core mac_hid ceph libceph libcrc32c fscache bonding lp parport
> mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core hid_generic
> usbhid hid hpsa mlx4_core psmouse bnx2 scsi_transport_sas fjes [last
> unloaded: target_core_mod]
> [Tue Dec 15 00:46:55 2015] CPU: 0 PID: 1123421 Comm: iscsi_trx
> Tainted: GW I 4.4.0-040400rc4-generic #201512061930
> [Tue Dec 15 00:46:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
> P64 01/22/2015
> [Tue Dec 15 00:46:55 2015]   fdc0ce43
> 880bf38c38c0 813c8ab4
> [Tue Dec 15 00:46:55 2015]   880bf38c38f8
> 8107d772 ea00127a8680
> [Tue Dec 15 00:46:55 2015]  8804e52c1448 8804e52c15b0
> 8804e52c10f0 0200
> [Tue Dec 15 00:46:55 2015] Call Trace:
> [Tue Dec 15 00:46:55 2015]  [] dump_stack+0x44/0x60
> [Tue Dec 15 00:46:55 2015]  [] 
> warn_slowpath_common+0x82/0xc0
> [Tue Dec 15 00:46:55 2015]  [] warn_slowpath_null+0x1a/0x20
> [Tue Dec 15 00:46:55 2015]  []
> ceph_set_page_dirty+0x230/0x240 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> pagecache_get_page+0x150/0x1c0
> [Tue Dec 15 00:46:55 2015]  [] ?
> ceph_pool_perm_check+0x48/0x700 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] set_page_dirty+0x3d/0x70
> [Tue Dec 15 00:46:55 2015]  []
> ceph_write_end+0x5e/0x180 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> iov_iter_copy_from_user_atomic+0x156/0x220
> [Tue Dec 15 00:46:55 2015]  []
> generic_perform_write+0x114/0x1c0
> [Tue Dec 15 00:46:55 2015]  []
> ceph_write_iter+0xf8a/0x1050 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> ceph_put_cap_refs+0x143/0x320 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> check_preempt_wakeup+0xfa/0x220
> [Tue Dec 15 00:46:55 2015]  [] ? zone_statistics+0x7c/0xa0
> [Tue Dec 15 00:46:55 2015]  [] ? copy_page_to_iter+0x5e/0xa0
> [Tue Dec 15 00:46:55 2015]  [] ?
> skb_copy_datagram_iter+0x122/0x250
> [Tue Dec 15 00:46:55 2015]  [] vfs_iter_write+0x76/0xc0
> [Tue Dec 15 00:46:55 2015]  []
> fd_do_rw.isra.5+0xd8/0x1e0 [target_core_file]
> [Tue Dec 15 00:46:55 2015]  []
> fd_execute_rw+0xc5/0x2a0 [target_core_file]
> [Tue Dec 15 00:46:55 2015]  []
> sbc_execute_rw+0x22/0x30 [target_core_mod]
> [Tue Dec 15 00:46:55 2015]  []
> __target_execute_cmd+0x1f/0x70 [target_core_mod]
> [Tue Dec 15 00:46:55 2015]  []
> target_execute_cmd+0x195/0x2a0 [target_core_mod]
> [Tue Dec 15 00:46:55 2015]  []
> iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  []
> iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  []
> iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  [] ? __switch_to+0x1dc/0x5a0
> [Tue Dec 15 00:46:55 2015]  [] ?
> iscsi_target_tx_thread+0x1e0/0x1e0 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  [] kthread+0xd8/0xf0
> [Tue Dec 15 00:46:55 2015]  [] ?
> kthread_create_on_node+0x1a0/0x1a0
> [Tue Dec 15 00:46:55 2015]  [] ret_from_fork+0x3f/0x70
> [Tue Dec 15 00:46:55 2015]  [] ?
> kthread_create_on_node+0x1a0/0x1a0
> [Tue Dec 15 00:46:55 2015] ---[ end trace 4079437668c77cbb ]---
> [Tue Dec 15 00:47:45 2015] ABORT_TASK: Found referenced iSCSI task_tag: 
> 95784927
> [Tue Dec 15 00:47:45 2015] ABORT_TASK: ref_tag: 95784927 already
> complete, skipping
> 
> If it is a Ceph File System issue, let me know and I will open a bug.
> 
> Thanks
> 
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: Best way to measure client and recovery I/O

2015-12-18 Thread Kyle Bader
> I've been working with Sam Just today and we would like to get some
> performance data around client I/O and recovery I/O to test the new Op
> queue I've been working on. I know that we can just set an OSD out/in
> and such, but there seems like there could be a lot of variation in
> the results making it difficult to come to a good conclusion. We could
> just run the test many times, but I'd love to spend my time doing
> other things.

CBT [1] can do failure simulations while pushing load against the
cluster, here is a config to get you started:

https://gist.github.com/mmgaggle/471cd4227e961a243b22

The osds array in the recovery test portion is the list of osd ids
that you want to mark out during the test.

CBT requires a bit of setup, but there is a script that can do most of
it on an rpm-based system. Make sure that your cbt head node has
keyless ssh to itself, the mons, clients, and osd hosts (including
accepting host keys). Let me know if you need help setting it up!

[1] https://github.com/ceph/cbt

-- 

Kyle Bader
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Issue with Ceph File System and LIO

2015-12-18 Thread Eric Eastman
Hi Mike,

On the ESXi server both Header Digest and Data Digest are set to Prohibited.

Eric

On Fri, Dec 18, 2015 at 2:54 PM, Mike Christie  wrote:
> Eric,
>
> Do you have iSCSI data digests on?
>
> On 12/15/2015 12:08 AM, Eric Eastman wrote:
>> I am testing Linux Target SCSI, LIO, with a Ceph File System backstore
>> and I am seeing this error on my LIO gateway.  I am using Ceph v9.2.0
>> on a 4.4rc4 Kernel, on Trusty, using a kernel mounted Ceph File
>> System.  A file on the Ceph File System is exported via iSCSI to a
>> VMware ESXi 5.0 server, and I am seeing this error when doing a lot of
>> I/O on the ESXi server.   Is this a LIO or a Ceph issue?
>>
>> [Tue Dec 15 00:46:55 2015] [ cut here ]
>> [Tue Dec 15 00:46:55 2015] WARNING: CPU: 0 PID: 1123421 at
>> /home/kernel/COD/linux/fs/ceph/addr.c:125
>> ceph_set_page_dirty+0x230/0x240 [ceph]()
>> [Tue Dec 15 00:46:55 2015] Modules linked in: iptable_filter ip_tables
>> x_tables xfs rbd iscsi_target_mod vhost_scsi tcm_qla2xxx ib_srpt
>> tcm_fc tcm_usb_gadget tcm_loop target_core_file target_core_iblock
>> target_core_pscsi target_core_user target_core_mod ipmi_devintf vhost
>> qla2xxx ib_cm ib_sa ib_mad ib_core ib_addr libfc scsi_transport_fc
>> libcomposite udc_core uio configfs ipmi_ssif ttm drm_kms_helper
>> gpio_ich drm i2c_algo_bit fb_sys_fops coretemp syscopyarea ipmi_si
>> sysfillrect ipmi_msghandler sysimgblt kvm acpi_power_meter 8250_fintek
>> irqbypass hpilo shpchp input_leds serio_raw lpc_ich i7core_edac
>> edac_core mac_hid ceph libceph libcrc32c fscache bonding lp parport
>> mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core hid_generic
>> usbhid hid hpsa mlx4_core psmouse bnx2 scsi_transport_sas fjes [last
>> unloaded: target_core_mod]
>> [Tue Dec 15 00:46:55 2015] CPU: 0 PID: 1123421 Comm: iscsi_trx
>> Tainted: GW I 4.4.0-040400rc4-generic #201512061930
>> [Tue Dec 15 00:46:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
>> P64 01/22/2015
>> [Tue Dec 15 00:46:55 2015]   fdc0ce43
>> 880bf38c38c0 813c8ab4
>> [Tue Dec 15 00:46:55 2015]   880bf38c38f8
>> 8107d772 ea00127a8680
>> [Tue Dec 15 00:46:55 2015]  8804e52c1448 8804e52c15b0
>> 8804e52c10f0 0200
>> [Tue Dec 15 00:46:55 2015] Call Trace:
>> [Tue Dec 15 00:46:55 2015]  [] dump_stack+0x44/0x60
>> [Tue Dec 15 00:46:55 2015]  [] 
>> warn_slowpath_common+0x82/0xc0
>> [Tue Dec 15 00:46:55 2015]  [] warn_slowpath_null+0x1a/0x20
>> [Tue Dec 15 00:46:55 2015]  []
>> ceph_set_page_dirty+0x230/0x240 [ceph]
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> pagecache_get_page+0x150/0x1c0
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> ceph_pool_perm_check+0x48/0x700 [ceph]
>> [Tue Dec 15 00:46:55 2015]  [] set_page_dirty+0x3d/0x70
>> [Tue Dec 15 00:46:55 2015]  []
>> ceph_write_end+0x5e/0x180 [ceph]
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> iov_iter_copy_from_user_atomic+0x156/0x220
>> [Tue Dec 15 00:46:55 2015]  []
>> generic_perform_write+0x114/0x1c0
>> [Tue Dec 15 00:46:55 2015]  []
>> ceph_write_iter+0xf8a/0x1050 [ceph]
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> ceph_put_cap_refs+0x143/0x320 [ceph]
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> check_preempt_wakeup+0xfa/0x220
>> [Tue Dec 15 00:46:55 2015]  [] ? zone_statistics+0x7c/0xa0
>> [Tue Dec 15 00:46:55 2015]  [] ? 
>> copy_page_to_iter+0x5e/0xa0
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> skb_copy_datagram_iter+0x122/0x250
>> [Tue Dec 15 00:46:55 2015]  [] vfs_iter_write+0x76/0xc0
>> [Tue Dec 15 00:46:55 2015]  []
>> fd_do_rw.isra.5+0xd8/0x1e0 [target_core_file]
>> [Tue Dec 15 00:46:55 2015]  []
>> fd_execute_rw+0xc5/0x2a0 [target_core_file]
>> [Tue Dec 15 00:46:55 2015]  []
>> sbc_execute_rw+0x22/0x30 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015]  []
>> __target_execute_cmd+0x1f/0x70 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015]  []
>> target_execute_cmd+0x195/0x2a0 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015]  []
>> iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015]  []
>> iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015]  []
>> iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015]  [] ? __switch_to+0x1dc/0x5a0
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> iscsi_target_tx_thread+0x1e0/0x1e0 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015]  [] kthread+0xd8/0xf0
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> kthread_create_on_node+0x1a0/0x1a0
>> [Tue Dec 15 00:46:55 2015]  [] ret_from_fork+0x3f/0x70
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> kthread_create_on_node+0x1a0/0x1a0
>> [Tue Dec 15 00:46:55 2015] ---[ end trace 4079437668c77cbb ]---
>> [Tue Dec 15 00:47:45 2015] ABORT_TASK: Found referenced iSCSI task_tag: 
>> 95784927
>> [Tue Dec 15 00:47:45 2015] ABORT_TASK: ref_tag: 95784927 already
>> complete, skipping
>>
>> If it is a Ceph File System issue, let me know and I will open a bug.
>>
>> Thanks
>>
>> Eric

Re: Issue with Ceph File System and LIO

2015-12-17 Thread Eric Eastman
I patched the 4.4rc4 kernel source and restarted the test.  Shortly
after starting it, this showed up in dmesg:

[Thu Dec 17 03:29:55 2015] WARNING: CPU: 0 PID: 2547 at
fs/ceph/addr.c:1162 ceph_write_begin+0xfb/0x120 [ceph]()
[Thu Dec 17 03:29:55 2015] Modules linked in: iscsi_target_mod
vhost_scsi tcm_qla2xxx ib_srpt tcm_fc tcm_usb_gadget tcm_loop
target_core_file target_core_iblock target_core_pscsi target_core_user
target_core_mod ipmi_devintf vhost qla2xxx ib_cm ib_sa ib_mad ib_core
ib_addr libfc scsi_transport_fc libcomposite udc_core uio configfs ttm
ipmi_ssif drm_kms_helper drm coretemp kvm gpio_ich i2c_algo_bit
i7core_edac fb_sys_fops syscopyarea edac_core sysfillrect sysimgblt
ipmi_si input_leds hpilo ipmi_msghandler shpchp acpi_power_meter
irqbypass serio_raw 8250_fintek lpc_ich mac_hid ceph bonding libceph
lp parport libcrc32c fscache mlx4_en vxlan ip6_udp_tunnel udp_tunnel
ptp pps_core hid_generic usbhid hid mlx4_core hpsa psmouse bnx2 fjes
scsi_transport_sas [last unloaded: target_core_mod]
[Thu Dec 17 03:29:55 2015] CPU: 0 PID: 2547 Comm: iscsi_trx Tainted: G
   W I 4.4.0-rc4-ede1 #1
[Thu Dec 17 03:29:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
P64 01/22/2015
[Thu Dec 17 03:29:55 2015]  c020cd47 8805f1e97958
813ad644 
[Thu Dec 17 03:29:55 2015]  8805f1e97990 81079702
8805f1e97a50 015dd000
[Thu Dec 17 03:29:55 2015]  880c034df800 0200
eab26a80 8805f1e979a0
[Thu Dec 17 03:29:55 2015] Call Trace:
[Thu Dec 17 03:29:55 2015]  [] dump_stack+0x44/0x60
[Thu Dec 17 03:29:55 2015]  [] warn_slowpath_common+0x82/0xc0
[Thu Dec 17 03:29:55 2015]  [] warn_slowpath_null+0x1a/0x20
[Thu Dec 17 03:29:55 2015]  []
ceph_write_begin+0xfb/0x120 [ceph]
[Thu Dec 17 03:29:55 2015]  []
generic_perform_write+0xbf/0x1a0
[Thu Dec 17 03:29:55 2015]  []
ceph_write_iter+0xf5c/0x1010 [ceph]
[Thu Dec 17 03:29:55 2015]  [] ? __enqueue_entity+0x6c/0x70
[Thu Dec 17 03:29:55 2015]  [] ?
iov_iter_get_pages+0x113/0x210
[Thu Dec 17 03:29:55 2015]  [] ?
skb_copy_datagram_iter+0x122/0x250
[Thu Dec 17 03:29:55 2015]  [] vfs_iter_write+0x63/0xa0
[Thu Dec 17 03:29:55 2015]  []
fd_do_rw.isra.5+0xc9/0x1b0 [target_core_file]
[Thu Dec 17 03:29:55 2015]  []
fd_execute_rw+0xc5/0x2a0 [target_core_file]
[Thu Dec 17 03:29:55 2015]  []
sbc_execute_rw+0x22/0x30 [target_core_mod]
[Thu Dec 17 03:29:55 2015]  []
__target_execute_cmd+0x1f/0x70 [target_core_mod]
[Thu Dec 17 03:29:55 2015]  []
target_execute_cmd+0x195/0x2a0 [target_core_mod]
[Thu Dec 17 03:29:55 2015]  []
iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
[Thu Dec 17 03:29:55 2015]  []
iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
[Thu Dec 17 03:29:55 2015]  []
iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
[Thu Dec 17 03:29:55 2015]  [] ? __switch_to+0x1cd/0x570
[Thu Dec 17 03:29:55 2015]  [] ?
iscsi_target_tx_thread+0x1c0/0x1c0 [iscsi_target_mod]
[Thu Dec 17 03:29:55 2015]  [] kthread+0xc9/0xe0
[Thu Dec 17 03:29:55 2015]  [] ?
kthread_create_on_node+0x180/0x180
[Thu Dec 17 03:29:55 2015]  [] ret_from_fork+0x3f/0x70
[Thu Dec 17 03:29:55 2015]  [] ?
kthread_create_on_node+0x180/0x180
[Thu Dec 17 03:29:55 2015] ---[ end trace 382a45986961da4e ]---

There are WARNINGs on both line 125 and 1162. I will attached the
whole set of dmesg output to the tracker ticket 14086

I wanted to note that file system snapshots are enabled and being used
on this file system.

Thanks
Eric

On Wed, Dec 16, 2015 at 8:15 AM, Eric Eastman
 wrote:
>>>
>> This warning is really strange. Could you try the attached debug patch.
>>
>> Regards
>> Yan, Zheng
>
> I will try the patch and get back to the list.
>
> Eric
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Issue with Ceph File System and LIO

2015-12-17 Thread Yan, Zheng
On Thu, Dec 17, 2015 at 4:56 PM, Eric Eastman
 wrote:
> I patched the 4.4rc4 kernel source and restarted the test.  Shortly
> after starting it, this showed up in dmesg:
>
> [Thu Dec 17 03:29:55 2015] WARNING: CPU: 0 PID: 2547 at
> fs/ceph/addr.c:1162 ceph_write_begin+0xfb/0x120 [ceph]()
> [Thu Dec 17 03:29:55 2015] Modules linked in: iscsi_target_mod
> vhost_scsi tcm_qla2xxx ib_srpt tcm_fc tcm_usb_gadget tcm_loop
> target_core_file target_core_iblock target_core_pscsi target_core_user
> target_core_mod ipmi_devintf vhost qla2xxx ib_cm ib_sa ib_mad ib_core
> ib_addr libfc scsi_transport_fc libcomposite udc_core uio configfs ttm
> ipmi_ssif drm_kms_helper drm coretemp kvm gpio_ich i2c_algo_bit
> i7core_edac fb_sys_fops syscopyarea edac_core sysfillrect sysimgblt
> ipmi_si input_leds hpilo ipmi_msghandler shpchp acpi_power_meter
> irqbypass serio_raw 8250_fintek lpc_ich mac_hid ceph bonding libceph
> lp parport libcrc32c fscache mlx4_en vxlan ip6_udp_tunnel udp_tunnel
> ptp pps_core hid_generic usbhid hid mlx4_core hpsa psmouse bnx2 fjes
> scsi_transport_sas [last unloaded: target_core_mod]
> [Thu Dec 17 03:29:55 2015] CPU: 0 PID: 2547 Comm: iscsi_trx Tainted: G
>W I 4.4.0-rc4-ede1 #1
> [Thu Dec 17 03:29:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
> P64 01/22/2015
> [Thu Dec 17 03:29:55 2015]  c020cd47 8805f1e97958
> 813ad644 
> [Thu Dec 17 03:29:55 2015]  8805f1e97990 81079702
> 8805f1e97a50 015dd000
> [Thu Dec 17 03:29:55 2015]  880c034df800 0200
> eab26a80 8805f1e979a0
> [Thu Dec 17 03:29:55 2015] Call Trace:
> [Thu Dec 17 03:29:55 2015]  [] dump_stack+0x44/0x60
> [Thu Dec 17 03:29:55 2015]  [] 
> warn_slowpath_common+0x82/0xc0
> [Thu Dec 17 03:29:55 2015]  [] warn_slowpath_null+0x1a/0x20
> [Thu Dec 17 03:29:55 2015]  []
> ceph_write_begin+0xfb/0x120 [ceph]
> [Thu Dec 17 03:29:55 2015]  []
> generic_perform_write+0xbf/0x1a0
> [Thu Dec 17 03:29:55 2015]  []
> ceph_write_iter+0xf5c/0x1010 [ceph]
> [Thu Dec 17 03:29:55 2015]  [] ? __enqueue_entity+0x6c/0x70
> [Thu Dec 17 03:29:55 2015]  [] ?
> iov_iter_get_pages+0x113/0x210
> [Thu Dec 17 03:29:55 2015]  [] ?
> skb_copy_datagram_iter+0x122/0x250
> [Thu Dec 17 03:29:55 2015]  [] vfs_iter_write+0x63/0xa0
> [Thu Dec 17 03:29:55 2015]  []
> fd_do_rw.isra.5+0xc9/0x1b0 [target_core_file]
> [Thu Dec 17 03:29:55 2015]  []
> fd_execute_rw+0xc5/0x2a0 [target_core_file]
> [Thu Dec 17 03:29:55 2015]  []
> sbc_execute_rw+0x22/0x30 [target_core_mod]
> [Thu Dec 17 03:29:55 2015]  []
> __target_execute_cmd+0x1f/0x70 [target_core_mod]
> [Thu Dec 17 03:29:55 2015]  []
> target_execute_cmd+0x195/0x2a0 [target_core_mod]
> [Thu Dec 17 03:29:55 2015]  []
> iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  []
> iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  []
> iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  [] ? __switch_to+0x1cd/0x570
> [Thu Dec 17 03:29:55 2015]  [] ?
> iscsi_target_tx_thread+0x1c0/0x1c0 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  [] kthread+0xc9/0xe0
> [Thu Dec 17 03:29:55 2015]  [] ?
> kthread_create_on_node+0x180/0x180
> [Thu Dec 17 03:29:55 2015]  [] ret_from_fork+0x3f/0x70
> [Thu Dec 17 03:29:55 2015]  [] ?
> kthread_create_on_node+0x180/0x180
> [Thu Dec 17 03:29:55 2015] ---[ end trace 382a45986961da4e ]---


Could you please apply the new incremental patch and try again.


Regards
Yan, Zheng


>
> There are WARNINGs on both line 125 and 1162. I will attached the
> whole set of dmesg output to the tracker ticket 14086
>
> I wanted to note that file system snapshots are enabled and being used
> on this file system.
>
> Thanks
> Eric
>
> On Wed, Dec 16, 2015 at 8:15 AM, Eric Eastman
>  wrote:

>>> This warning is really strange. Could you try the attached debug patch.
>>>
>>> Regards
>>> Yan, Zheng
>>
>> I will try the patch and get back to the list.
>>
>> Eric


cephfs1.patch
Description: Binary data


Re: puzzling disapearance of /dev/sdc1

2015-12-17 Thread Loic Dachary
Hi Sage,

On 17/12/2015 14:31, Sage Weil wrote:
> On Thu, 17 Dec 2015, Loic Dachary wrote:
>> Hi Ilya,
>>
>> This is another puzzling behavior (the log of all commands is at 
>> http://tracker.ceph.com/issues/14094#note-4). in a nutshell, after a 
>> series of sgdisk -i commands to examine various devices including 
>> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will showup 
>> again although I don't have a definitive proof of this).
>>
>> It looks like a side effect of a previous partprobe command, the only 
>> command I can think of that removes / re-adds devices. I thought calling 
>> udevadm settle after running partprobe would be enough to ensure 
>> partprobe completed (and since it takes as much as 2mn30 to return, I 
>> would be shocked if it does not ;-).
>>
>> Any idea ? I desperately try to find a consistent behavior, something 
>> reliable that we could use to say : "wait for the partition table to be 
>> up to date in the kernel and all udev events generated by the partition 
>> table update to complete".
> 
> I wonder if the underlying issue is that we shouldn't be calling udevadm 
> settle from something running from udev.  Instead, of a udev-triggered 
> run of ceph-disk does something that changes the partitions, it 
> should just exit and let udevadm run ceph-disk again on the new 
> devices...?

Unless I missed something this is on CentOS 7 and ceph-disk is only called from 
udev as ceph-disk trigger which does nothing else but asynchronously delegate 
the work to systemd. Therefore there is no udevadm settle from within udev 
(which would deadlock and timeout every time... I hope ;-).
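
To make the delegation concrete, one generic way to hand work from a
udev-triggered helper off to systemd is a transient unit; this is a sketch of
the pattern only, not necessarily the exact mechanism ceph-disk trigger uses:

    import subprocess
    import sys

    def trigger(dev):
        # return immediately; systemd runs the activation in a transient unit,
        # so nothing long-running (and no udevadm settle) happens inside udev
        subprocess.check_call(
            ["systemd-run", "/usr/sbin/ceph-disk", "activate", dev])

    if __name__ == "__main__":
        trigger(sys.argv[1])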

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


Re: Client still connect failed leader after that mon down

2015-12-17 Thread Sage Weil
On Thu, 17 Dec 2015, Jaze Lee wrote:
> Hello cephers:
> In our test, there are three monitors. We find that running a ceph
> command on a client is slow when the leader mon is down. Even after a long
> time, the first ceph command a client runs is still slow.
> From strace, we find that the client first tries to connect to the leader,
> then after 3s it connects to the second monitor.
> After some searching we find that the quorum has not changed; the leader is
> still the down monitor.
> Is that normal?  Or is there something I miss?

It's normal.  Even when the quorum does change, the client doesn't 
know that.  It should be contacting a random mon on startup, though, so I 
would expect the 3s delay 1/3 of the time.

A long-standing low-priority feature request is to have the client contact 
2 mons in parallel so that it can still connect quickly if one is down.  
It's requires some non-trivial work in mon/MonClient.{cc,h} though and I 
don't think anyone has looked at it seriously.
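
A generic illustration of that idea (not MonClient code): race connection
attempts to two monitors and take whichever answers first. The addresses are
placeholders and a plain TCP connect stands in for the real monitor handshake:

    import socket
    from concurrent.futures import ThreadPoolExecutor, as_completed

    MONS = [("10.0.0.1", 6789), ("10.0.0.2", 6789)]   # placeholder addresses

    def probe(addr):
        s = socket.create_connection(addr, timeout=3)
        s.close()
        return addr

    with ThreadPoolExecutor(max_workers=len(MONS)) as pool:
        futures = [pool.submit(probe, m) for m in MONS]
        for fut in as_completed(futures):
            try:
                print("first responsive mon:", fut.result())
                break
            except OSError:
                continue  # that mon is down; wait for the other attempt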

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: puzzling disapearance of /dev/sdc1

2015-12-17 Thread Sage Weil
On Thu, 17 Dec 2015, Loic Dachary wrote:
> Hi Ilya,
> 
> This is another puzzling behavior (the log of all commands is at 
> http://tracker.ceph.com/issues/14094#note-4). in a nutshell, after a 
> series of sgdisk -i commands to examine various devices including 
> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will showup 
> again although I don't have a definitive proof of this).
> 
> It looks like a side effect of a previous partprobe command, the only 
> command I can think of that removes / re-adds devices. I thought calling 
> udevadm settle after running partprobe would be enough to ensure 
> partprobe completed (and since it takes as much as 2mn30 to return, I 
> would be shocked if it does not ;-).
> 
> Any idea ? I desperately try to find a consistent behavior, something 
> reliable that we could use to say : "wait for the partition table to be 
> up to date in the kernel and all udev events generated by the partition 
> table update to complete".

I wonder if the underlying issue is that we shouldn't be calling udevadm 
settle from something running from udev.  Instead, of a udev-triggered 
run of ceph-disk does something that changes the partitions, it 
should just exit and let udevadm run ceph-disk again on the new 
devices...?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: puzzling disapearance of /dev/sdc1

2015-12-17 Thread Ilya Dryomov
On Thu, Dec 17, 2015 at 3:10 PM, Loic Dachary <l...@dachary.org> wrote:
> Hi Sage,
>
> On 17/12/2015 14:31, Sage Weil wrote:
>> On Thu, 17 Dec 2015, Loic Dachary wrote:
>>> Hi Ilya,
>>>
>>> This is another puzzling behavior (the log of all commands is at
>>> http://tracker.ceph.com/issues/14094#note-4). in a nutshell, after a
>>> series of sgdisk -i commands to examine various devices including
>>> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will showup
>>> again although I don't have a definitive proof of this).
>>>
>>> It looks like a side effect of a previous partprobe command, the only
>>> command I can think of that removes / re-adds devices. I thought calling
>>> udevadm settle after running partprobe would be enough to ensure
>>> partprobe completed (and since it takes as much as 2mn30 to return, I
>>> would be shocked if it does not ;-).

Yeah, IIRC partprobe goes through every slot in the partition table,
trying to first remove and then add the partition back.  But, I don't
see any mention of partprobe in the log you referred to.

Should udevadm settle for a few vd* devices be taking that much time?
I'd investigate that regardless of the issue at hand.

>>>
>>> Any idea ? I desperately try to find a consistent behavior, something
>>> reliable that we could use to say : "wait for the partition table to be
>>> up to date in the kernel and all udev events generated by the partition
>>> table update to complete".
>>
>> I wonder if the underlying issue is that we shouldn't be calling udevadm
>> settle from something running from udev.  Instead, of a udev-triggered
>> run of ceph-disk does something that changes the partitions, it
>> should just exit and let udevadm run ceph-disk again on the new
>> devices...?

>
> Unless I missed something this is on CentOS 7 and ceph-disk is only called 
> from udev as ceph-disk trigger which does nothing else but asynchronously 
> delegate the work to systemd. Therefore there is no udevadm settle from 
> within udev (which would deadlock and timeout every time... I hope ;-).

That's a sure lockup, until one of them times out.

How are you delegating to systemd?  Is it to avoid long-running udev
events?  I'm probably missing something - udevadm settle wouldn't block
on anything other than udev, so if you are shipping work off to
somewhere else, udev can't be relied upon for waiting.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Issue with Ceph File System and LIO

2015-12-17 Thread Minfei Huang
Hi.

It may help address this issue if we turn on the debug options.

Thanks
Minfei

On 12/17/15 at 01:56P, Eric Eastman wrote:
> I patched the 4.4rc4 kernel source and restarted the test.  Shortly
> after starting it, this showed up in dmesg:
> 
> [Thu Dec 17 03:29:55 2015] WARNING: CPU: 0 PID: 2547 at
> fs/ceph/addr.c:1162 ceph_write_begin+0xfb/0x120 [ceph]()
> [Thu Dec 17 03:29:55 2015] Modules linked in: iscsi_target_mod
> vhost_scsi tcm_qla2xxx ib_srpt tcm_fc tcm_usb_gadget tcm_loop
> target_core_file target_core_iblock target_core_pscsi target_core_user
> target_core_mod ipmi_devintf vhost qla2xxx ib_cm ib_sa ib_mad ib_core
> ib_addr libfc scsi_transport_fc libcomposite udc_core uio configfs ttm
> ipmi_ssif drm_kms_helper drm coretemp kvm gpio_ich i2c_algo_bit
> i7core_edac fb_sys_fops syscopyarea edac_core sysfillrect sysimgblt
> ipmi_si input_leds hpilo ipmi_msghandler shpchp acpi_power_meter
> irqbypass serio_raw 8250_fintek lpc_ich mac_hid ceph bonding libceph
> lp parport libcrc32c fscache mlx4_en vxlan ip6_udp_tunnel udp_tunnel
> ptp pps_core hid_generic usbhid hid mlx4_core hpsa psmouse bnx2 fjes
> scsi_transport_sas [last unloaded: target_core_mod]
> [Thu Dec 17 03:29:55 2015] CPU: 0 PID: 2547 Comm: iscsi_trx Tainted: G
>W I 4.4.0-rc4-ede1 #1
> [Thu Dec 17 03:29:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
> P64 01/22/2015
> [Thu Dec 17 03:29:55 2015]  c020cd47 8805f1e97958
> 813ad644 
> [Thu Dec 17 03:29:55 2015]  8805f1e97990 81079702
> 8805f1e97a50 015dd000
> [Thu Dec 17 03:29:55 2015]  880c034df800 0200
> eab26a80 8805f1e979a0
> [Thu Dec 17 03:29:55 2015] Call Trace:
> [Thu Dec 17 03:29:55 2015]  [] dump_stack+0x44/0x60
> [Thu Dec 17 03:29:55 2015]  [] 
> warn_slowpath_common+0x82/0xc0
> [Thu Dec 17 03:29:55 2015]  [] warn_slowpath_null+0x1a/0x20
> [Thu Dec 17 03:29:55 2015]  []
> ceph_write_begin+0xfb/0x120 [ceph]
> [Thu Dec 17 03:29:55 2015]  []
> generic_perform_write+0xbf/0x1a0
> [Thu Dec 17 03:29:55 2015]  []
> ceph_write_iter+0xf5c/0x1010 [ceph]
> [Thu Dec 17 03:29:55 2015]  [] ? __enqueue_entity+0x6c/0x70
> [Thu Dec 17 03:29:55 2015]  [] ?
> iov_iter_get_pages+0x113/0x210
> [Thu Dec 17 03:29:55 2015]  [] ?
> skb_copy_datagram_iter+0x122/0x250
> [Thu Dec 17 03:29:55 2015]  [] vfs_iter_write+0x63/0xa0
> [Thu Dec 17 03:29:55 2015]  []
> fd_do_rw.isra.5+0xc9/0x1b0 [target_core_file]
> [Thu Dec 17 03:29:55 2015]  []
> fd_execute_rw+0xc5/0x2a0 [target_core_file]
> [Thu Dec 17 03:29:55 2015]  []
> sbc_execute_rw+0x22/0x30 [target_core_mod]
> [Thu Dec 17 03:29:55 2015]  []
> __target_execute_cmd+0x1f/0x70 [target_core_mod]
> [Thu Dec 17 03:29:55 2015]  []
> target_execute_cmd+0x195/0x2a0 [target_core_mod]
> [Thu Dec 17 03:29:55 2015]  []
> iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  []
> iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  []
> iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  [] ? __switch_to+0x1cd/0x570
> [Thu Dec 17 03:29:55 2015]  [] ?
> iscsi_target_tx_thread+0x1c0/0x1c0 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  [] kthread+0xc9/0xe0
> [Thu Dec 17 03:29:55 2015]  [] ?
> kthread_create_on_node+0x180/0x180
> [Thu Dec 17 03:29:55 2015]  [] ret_from_fork+0x3f/0x70
> [Thu Dec 17 03:29:55 2015]  [] ?
> kthread_create_on_node+0x180/0x180
> [Thu Dec 17 03:29:55 2015] ---[ end trace 382a45986961da4e ]---
> 
> There are WARNINGs on both lines 125 and 1162. I will attach the
> whole set of dmesg output to tracker ticket 14086.
> 
> I wanted to note that file system snapshots are enabled and being used
> on this file system.
> 
> Thanks
> Eric
> 
> On Wed, Dec 16, 2015 at 8:15 AM, Eric Eastman
>  wrote:
> >>>
> >> This warning is really strange. Could you try the attached debug patch.
> >>
> >> Regards
> >> Yan, Zheng
> >
> > I will try the patch and get back to the list.
> >
> > Eric


Re: understanding partprobe failure

2015-12-17 Thread Ilya Dryomov
On Thu, Dec 17, 2015 at 1:19 PM, Loic Dachary  wrote:
> Hi Ilya,
>
> I'm seeing a partprobe failure right after a disk was zapped with sgdisk 
> --clear --mbrtogpt -- /dev/vdb:
>
> partprobe /dev/vdb failed : Error: Partition(s) 1 on /dev/vdb have been 
> written, but we have been unable to inform the kernel of the change, probably 
> because it/they are in use. As a result, the old partition(s) will remain in 
> use. You should reboot now before making further changes.
>
> waiting 60 seconds (see the log below) and trying again succeeds. The 
> partprobe call is guarded by udevadm settle to prevent udev actions from 
> racing and nothing else goes on in the machine.
>
> Any idea how that could happen ?
>
> Cheers
>
> 2015-12-17 11:46:10,356.356 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:get_dm_uuid
>  /dev/vdb uuid path is /sys/dev/block/253:16/dm/uuid
> 2015-12-17 11:46:10,357.357 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:Zapping
>  partition table on /dev/vdb
> 2015-12-17 11:46:10,358.358 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running
>  command: /usr/sbin/sgdisk --zap-all -- /dev/vdb
> 2015-12-17 11:46:10,365.365 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Caution:
>  invalid backup GPT header, but valid main header; regenerating
> 2015-12-17 11:46:10,366.366 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:backup 
> header from main header.
> 2015-12-17 11:46:10,366.366 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:10,366.366 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning!
>  Main and backup partition tables differ! Use the 'c' and 'e' options
> 2015-12-17 11:46:10,367.367 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:on the 
> recovery & transformation menu to examine the two tables.
> 2015-12-17 11:46:10,367.367 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:10,367.367 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning!
>  One or more CRCs don't match. You should repair the disk!
> 2015-12-17 11:46:10,368.368 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:11,413.413 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:11,414.414 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Caution:
>  Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
> 2015-12-17 11:46:11,414.414 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:verification
>  and recovery are STRONGLY recommended.
> 2015-12-17 11:46:11,414.414 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:11,415.415 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning:
>  The kernel is still using the old partition table.
> 2015-12-17 11:46:11,415.415 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The new 
> table will be used at the next reboot.
> 2015-12-17 11:46:11,416.416 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:GPT 
> data structures destroyed! You may now partition the disk using fdisk or
> 2015-12-17 11:46:11,416.416 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:other 
> utilities.
> 2015-12-17 11:46:11,416.416 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running
>  command: /usr/sbin/sgdisk --clear --mbrtogpt -- /dev/vdb
> 2015-12-17 11:46:12,504.504 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Creating
>  new GPT entries.
> 2015-12-17 11:46:12,505.505 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning:
>  The kernel is still using the old partition table.
> 2015-12-17 11:46:12,505.505 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The new 
> table will be used at the next reboot.
> 2015-12-17 11:46:12,505.505 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The 
> operation has completed successfully.
> 2015-12-17 11:46:12,506.506 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:Calling
>  partprobe on zapped device /dev/vdb
> 2015-12-17 11:46:12,507.507 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running
>  command: /usr/bin/udevadm settle --timeout=600
> 2015-12-17 11:46:15,427.427 
> 

Re: rgw subuser create and admin api

2015-12-17 Thread Yehuda Sadeh-Weinraub
On Thu, Dec 17, 2015 at 9:04 AM, Derek Yarnell  wrote:
> I am having an issue with the 'radosgw-admin subuser create' command
> doing something different than the '/{admin}/user?subuser=json'
> admin API.  I want to leverage subusers in S3, which looks to be possible
> in my testing, for a bit more control without resorting to ACLs.
>
> radosgw-admin subuser create --uid=-staff --subuser=test1
> --access-key=a --secret=z --access=read
>
> This command works and creates both a subuser -staff:test1 with
> read permission and an S3 key with the correct access and secret key set.
>
> The Admin API will not allow me to do this, it would seem: the
> following is accepted and a subuser is created, however a swift_key is
> created instead.
>
> DEBUG:requests.packages.urllib3.connectionpool:"PUT
> /admin/user?subuser=json=-staff=test2=b=cc=read
> HTTP/1.1" 200 130
>
> The documentation for the admin API[0] does not seem to indicate that
> access-key is accepted at all.  Also if you pass key-type=s3 it will
> return a 400 with InvalidArgument although the documentation says it
> should accept the key type s3.
>
> Bug? Design?

Somewhat of a bug. Subusers that use S3 were unintentional, so when
creating the subuser API we didn't think of needing the access key.
For some reason we do get the key type, though. Can you open a ceph
tracker issue for that?

You can try using the metadata API to modify the user once it has been
created (GET the user info, add the S3 key to the structure, then PUT
the user info back).
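
Something along these lines, for example (just a sketch using
python-requests plus requests-aws for the signing; the endpoint,
credentials, uid/subuser names and even the exact query parameters are
placeholders/assumptions here, not verified against radosgw):

    import json
    import requests
    from awsauth import S3Auth          # pip install requests-aws

    host = 'rgw.example.com:7480'
    auth = S3Auth('ADMIN_ACCESS_KEY', 'ADMIN_SECRET_KEY', service_url=host)
    url = 'http://%s/admin/metadata/user' % host

    # 1. fetch the full user metadata blob
    md = requests.get(url, params={'key': 'cephtest', 'format': 'json'},
                      auth=auth).json()

    # 2. add an S3 key for the subuser to the structure
    md['data']['keys'].append({'user': 'cephtest:test1',
                               'access_key': 'NEWACCESSKEY',
                               'secret_key': 'NEWSECRETKEY'})

    # 3. put the blob back (a 204 No Content response means success)
    r = requests.put(url, params={'key': 'cephtest'},
                     data=json.dumps(md),
                     headers={'Content-Type': 'application/json'},
                     auth=auth)
    assert r.status_code in (200, 204)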

>
> One other issue is that the --purge-keys option to radosgw-admin seems
> to have no effect.  The following command removes the subuser but
> leaves its swift keys (and any S3 keys too).
>
> radosgw-admin subuser rm --uid=-staff --subuser=test2 --purge-keys
>

It's a known issue, and it will be fixed soon (so it seems).

Thanks,
Yehuda

>
> [0] - http://docs.ceph.com/docs/master/radosgw/adminops/#create-subuser
>
>
> --
> Derek T. Yarnell
> University of Maryland
> Institute for Advanced Computer Studies


Re: rgw subuser create and admin api

2015-12-17 Thread Yehuda Sadeh-Weinraub
On Thu, Dec 17, 2015 at 12:06 PM, Derek Yarnell  wrote:
> On 12/17/15 2:36 PM, Yehuda Sadeh-Weinraub wrote:
>> Try 'section=user=cephtests'
>
> Doesn't seem to work either.
>
> # radosgw-admin metadata get user:cephtest
> {
> "key": "user:cephtest",
> "ver": {
> "tag": "_dhpzgdOjqJI-OsR1MsYV5-p",
> "ver": 1
> },
> "mtime": 1450378246,
> "data": {
> "user_id": "cephtest",
> "display_name": "Ceph Test",
> "email": "",
> "suspended": 0,
> "max_buckets": 1000,
> "auid": 0,
> "subusers": [],
> "keys": [
> {
> "user": "cephtest",
> "access_key": "eee",
> "secret_key": ""
> },
> {
> "user": "cephtest",
> "access_key": "aaa",
> "secret_key": ""
> }
> ],
> "swift_keys": [],
> "caps": [],
> "op_mask": "read, write, delete",
> "default_placement": "",
> "placement_tags": [],
> "bucket_quota": {
> "enabled": false,
> "max_size_kb": -1,
> "max_objects": -1
> },
> "user_quota": {
> "enabled": false,
> "max_size_kb": -1,
> "max_objects": -1
> },
> "temp_url_keys": []
> }
> }
>
>
> 2015-12-17 15:03:41.126024 7f88ef7e6700 20 RGWEnv::set(): HTTP_HOST:
> localhost:7480
> 2015-12-17 15:03:41.126056 7f88ef7e6700 20 RGWEnv::set(): HTTP_DATE:
> Thu, 17 Dec 2015 20:03:41 GMT
> 2015-12-17 15:03:41.126059 7f88ef7e6700 20 RGWEnv::set(): HTTP_ACCEPT: */*
> 2015-12-17 15:03:41.126064 7f88ef7e6700 20 RGWEnv::set():
> HTTP_ACCEPT_ENCODING: gzip, deflate
> 2015-12-17 15:03:41.126066 7f88ef7e6700 20 RGWEnv::set():
> HTTP_AUTHORIZATION: AWS RTJ1TL13CH613JRU2PJD:F6wMKxSrrFhl2m3fyo/M0yXIGT8=
> 2015-12-17 15:03:41.126070 7f88ef7e6700 20 RGWEnv::set():
> HTTP_USER_AGENT: python-requests/2.3.0 CPython/2.7.10 Darwin/14.5.0
> 2015-12-17 15:03:41.126071 7f88ef7e6700 20 RGWEnv::set():
> HTTP_X_FORWARDED_FOR: 192.168.86.254
> 2015-12-17 15:03:41.126073 7f88ef7e6700 20 RGWEnv::set():
> HTTP_X_FORWARDED_HOST: ceph.umiacs.umd.edu
> 2015-12-17 15:03:41.126075 7f88ef7e6700 20 RGWEnv::set():
> HTTP_X_FORWARDED_SERVER: cephproxy00.umiacs.umd.edu
> 2015-12-17 15:03:41.126077 7f88ef7e6700 20 RGWEnv::set():
> HTTP_CONNECTION: Keep-Alive
> 2015-12-17 15:03:41.126079 7f88ef7e6700 20 RGWEnv::set():
> REQUEST_METHOD: GET
> 2015-12-17 15:03:41.126080 7f88ef7e6700 20 RGWEnv::set(): REQUEST_URI:
> /admin/metadata/get
> 2015-12-17 15:03:41.126081 7f88ef7e6700 20 RGWEnv::set(): QUERY_STRING:
> section=user=cephtest
> 2015-12-17 15:03:41.126082 7f88ef7e6700 20 RGWEnv::set(): REMOTE_USER:
> 2015-12-17 15:03:41.126083 7f88ef7e6700 20 RGWEnv::set(): SCRIPT_URI:
> /admin/metadata/get
> 2015-12-17 15:03:41.126089 7f88ef7e6700 20 RGWEnv::set(): SERVER_PORT: 7480
> 2015-12-17 15:03:41.126090 7f88ef7e6700 20 HTTP_ACCEPT=*/*
> 2015-12-17 15:03:41.126093 7f88ef7e6700 20 HTTP_ACCEPT_ENCODING=gzip,
> deflate
> 2015-12-17 15:03:41.126094 7f88ef7e6700 20 HTTP_AUTHORIZATION=AWS
> RTJ1TL13CH613JRU2PJD:F6wMKxSrrFhl2m3fyo/M0yXIGT8=
> 2015-12-17 15:03:41.126094 7f88ef7e6700 20 HTTP_CONNECTION=Keep-Alive
> 2015-12-17 15:03:41.126095 7f88ef7e6700 20 HTTP_DATE=Thu, 17 Dec 2015
> 20:03:41 GMT
> 2015-12-17 15:03:41.126095 7f88ef7e6700 20 HTTP_HOST=localhost:7480
> 2015-12-17 15:03:41.126096 7f88ef7e6700 20
> HTTP_USER_AGENT=python-requests/2.3.0 CPython/2.7.10 Darwin/14.5.0
> 2015-12-17 15:03:41.126097 7f88ef7e6700 20
> HTTP_X_FORWARDED_FOR=192.168.86.254
> 2015-12-17 15:03:41.126097 7f88ef7e6700 20
> HTTP_X_FORWARDED_HOST=ceph.umiacs.umd.edu
> 2015-12-17 15:03:41.126098 7f88ef7e6700 20
> HTTP_X_FORWARDED_SERVER=cephproxy00.umiacs.umd.edu
> 2015-12-17 15:03:41.126099 7f88ef7e6700 20
> QUERY_STRING=section=user=cephtest
> 2015-12-17 15:03:41.126099 7f88ef7e6700 20 REMOTE_USER=
> 2015-12-17 15:03:41.126100 7f88ef7e6700 20 REQUEST_METHOD=GET
> 2015-12-17 15:03:41.126101 7f88ef7e6700 20 REQUEST_URI=/admin/metadata/get
> 2015-12-17 15:03:41.126101 7f88ef7e6700 20 SCRIPT_URI=/admin/metadata/get
> 2015-12-17 15:03:41.126102 7f88ef7e6700 20 SERVER_PORT=7480
> 2015-12-17 15:03:41.126104 7f88ef7e6700 20 RGWEnv::set(): HTTP_HOST:
> localhost:7480
> 2015-12-17 15:03:41.126105 7f88ef7e6700 20 RGWEnv::set(): HTTP_DATE:
> Thu, 17 Dec 2015 20:03:41 GMT
> 2015-12-17 15:03:41.126107 7f88ef7e6700 20 RGWEnv::set(): HTTP_ACCEPT: */*
> 2015-12-17 15:03:41.126108 7f88ef7e6700 20 RGWEnv::set():
> HTTP_ACCEPT_ENCODING: gzip, deflate
> 2015-12-17 15:03:41.126110 7f88ef7e6700 20 RGWEnv::set():
> HTTP_AUTHORIZATION: AWS RTJ1TL13CH613JRU2PJD:F6wMKxSrrFhl2m3fyo/M0yXIGT8=
> 2015-12-17 15:03:41.126113 7f88ef7e6700 20 RGWEnv::set():
> HTTP_USER_AGENT: python-requests/2.3.0 CPython/2.7.10 

Re: rgw subuser create and admin api

2015-12-17 Thread Derek Yarnell
On 12/17/15 2:36 PM, Yehuda Sadeh-Weinraub wrote:
> Try 'section=user=cephtests'

Doesn't seem to work either.

# radosgw-admin metadata get user:cephtest
{
"key": "user:cephtest",
"ver": {
"tag": "_dhpzgdOjqJI-OsR1MsYV5-p",
"ver": 1
},
"mtime": 1450378246,
"data": {
"user_id": "cephtest",
"display_name": "Ceph Test",
"email": "",
"suspended": 0,
"max_buckets": 1000,
"auid": 0,
"subusers": [],
"keys": [
{
"user": "cephtest",
"access_key": "eee",
"secret_key": ""
},
{
"user": "cephtest",
"access_key": "aaa",
"secret_key": ""
}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"temp_url_keys": []
}
}


2015-12-17 15:03:41.126024 7f88ef7e6700 20 RGWEnv::set(): HTTP_HOST:
localhost:7480
2015-12-17 15:03:41.126056 7f88ef7e6700 20 RGWEnv::set(): HTTP_DATE:
Thu, 17 Dec 2015 20:03:41 GMT
2015-12-17 15:03:41.126059 7f88ef7e6700 20 RGWEnv::set(): HTTP_ACCEPT: */*
2015-12-17 15:03:41.126064 7f88ef7e6700 20 RGWEnv::set():
HTTP_ACCEPT_ENCODING: gzip, deflate
2015-12-17 15:03:41.126066 7f88ef7e6700 20 RGWEnv::set():
HTTP_AUTHORIZATION: AWS RTJ1TL13CH613JRU2PJD:F6wMKxSrrFhl2m3fyo/M0yXIGT8=
2015-12-17 15:03:41.126070 7f88ef7e6700 20 RGWEnv::set():
HTTP_USER_AGENT: python-requests/2.3.0 CPython/2.7.10 Darwin/14.5.0
2015-12-17 15:03:41.126071 7f88ef7e6700 20 RGWEnv::set():
HTTP_X_FORWARDED_FOR: 192.168.86.254
2015-12-17 15:03:41.126073 7f88ef7e6700 20 RGWEnv::set():
HTTP_X_FORWARDED_HOST: ceph.umiacs.umd.edu
2015-12-17 15:03:41.126075 7f88ef7e6700 20 RGWEnv::set():
HTTP_X_FORWARDED_SERVER: cephproxy00.umiacs.umd.edu
2015-12-17 15:03:41.126077 7f88ef7e6700 20 RGWEnv::set():
HTTP_CONNECTION: Keep-Alive
2015-12-17 15:03:41.126079 7f88ef7e6700 20 RGWEnv::set():
REQUEST_METHOD: GET
2015-12-17 15:03:41.126080 7f88ef7e6700 20 RGWEnv::set(): REQUEST_URI:
/admin/metadata/get
2015-12-17 15:03:41.126081 7f88ef7e6700 20 RGWEnv::set(): QUERY_STRING:
section=user=cephtest
2015-12-17 15:03:41.126082 7f88ef7e6700 20 RGWEnv::set(): REMOTE_USER:
2015-12-17 15:03:41.126083 7f88ef7e6700 20 RGWEnv::set(): SCRIPT_URI:
/admin/metadata/get
2015-12-17 15:03:41.126089 7f88ef7e6700 20 RGWEnv::set(): SERVER_PORT: 7480
2015-12-17 15:03:41.126090 7f88ef7e6700 20 HTTP_ACCEPT=*/*
2015-12-17 15:03:41.126093 7f88ef7e6700 20 HTTP_ACCEPT_ENCODING=gzip,
deflate
2015-12-17 15:03:41.126094 7f88ef7e6700 20 HTTP_AUTHORIZATION=AWS
RTJ1TL13CH613JRU2PJD:F6wMKxSrrFhl2m3fyo/M0yXIGT8=
2015-12-17 15:03:41.126094 7f88ef7e6700 20 HTTP_CONNECTION=Keep-Alive
2015-12-17 15:03:41.126095 7f88ef7e6700 20 HTTP_DATE=Thu, 17 Dec 2015
20:03:41 GMT
2015-12-17 15:03:41.126095 7f88ef7e6700 20 HTTP_HOST=localhost:7480
2015-12-17 15:03:41.126096 7f88ef7e6700 20
HTTP_USER_AGENT=python-requests/2.3.0 CPython/2.7.10 Darwin/14.5.0
2015-12-17 15:03:41.126097 7f88ef7e6700 20
HTTP_X_FORWARDED_FOR=192.168.86.254
2015-12-17 15:03:41.126097 7f88ef7e6700 20
HTTP_X_FORWARDED_HOST=ceph.umiacs.umd.edu
2015-12-17 15:03:41.126098 7f88ef7e6700 20
HTTP_X_FORWARDED_SERVER=cephproxy00.umiacs.umd.edu
2015-12-17 15:03:41.126099 7f88ef7e6700 20
QUERY_STRING=section=user=cephtest
2015-12-17 15:03:41.126099 7f88ef7e6700 20 REMOTE_USER=
2015-12-17 15:03:41.126100 7f88ef7e6700 20 REQUEST_METHOD=GET
2015-12-17 15:03:41.126101 7f88ef7e6700 20 REQUEST_URI=/admin/metadata/get
2015-12-17 15:03:41.126101 7f88ef7e6700 20 SCRIPT_URI=/admin/metadata/get
2015-12-17 15:03:41.126102 7f88ef7e6700 20 SERVER_PORT=7480
2015-12-17 15:03:41.126104 7f88ef7e6700 20 RGWEnv::set(): HTTP_HOST:
localhost:7480
2015-12-17 15:03:41.126105 7f88ef7e6700 20 RGWEnv::set(): HTTP_DATE:
Thu, 17 Dec 2015 20:03:41 GMT
2015-12-17 15:03:41.126107 7f88ef7e6700 20 RGWEnv::set(): HTTP_ACCEPT: */*
2015-12-17 15:03:41.126108 7f88ef7e6700 20 RGWEnv::set():
HTTP_ACCEPT_ENCODING: gzip, deflate
2015-12-17 15:03:41.126110 7f88ef7e6700 20 RGWEnv::set():
HTTP_AUTHORIZATION: AWS RTJ1TL13CH613JRU2PJD:F6wMKxSrrFhl2m3fyo/M0yXIGT8=
2015-12-17 15:03:41.126113 7f88ef7e6700 20 RGWEnv::set():
HTTP_USER_AGENT: python-requests/2.3.0 CPython/2.7.10 Darwin/14.5.0
2015-12-17 15:03:41.126115 7f88ef7e6700 20 RGWEnv::set():
HTTP_X_FORWARDED_FOR: 192.168.86.254
2015-12-17 15:03:41.126117 7f88ef7e6700 20 RGWEnv::set():
HTTP_X_FORWARDED_HOST: ceph.umiacs.umd.edu
2015-12-17 15:03:41.126119 7f88ef7e6700 20 RGWEnv::set():
HTTP_X_FORWARDED_SERVER: cephproxy00.umiacs.umd.edu

Re: rgw subuser create and admin api

2015-12-17 Thread Derek Yarnell
On 12/17/15 1:09 PM, Yehuda Sadeh-Weinraub wrote:
>> Bug? Design?
> 
> Somewhat a bug. The whole subusers that use s3 was unintentional, so
> when creating the subuser api, we didn't think of needing the access
> key. For some reason we do get the key type. Can you open a ceph
> tracker issue for that?
> 
> You can try using the metadata api to modify the user once it has been
> created (need to get the user info, add the s3 key to the structure,
> put the user info).
> 

This will actually create a subuser and the S3 keys correctly (but not
let you specify the access_key and secret_key).

DEBUG:requests.packages.urllib3.connectionpool:"PUT
/admin/user?subuser=json=-staff=test3=s3=read=True
HTTP/1.1" 200 87

I know about a 'GET /admin/metadata/user?format=json' to get the list of
users from the adminops.  I see I can do things like 'radosgw-admin
metadata get user:cephtest' but I can't seem to get something like this
to work.

DEBUG:requests.packages.urllib3.connectionpool:"GET
/admin/metadata/get?format=json=user%3Acephtest HTTP/1.1" 404 20
ERROR:rgwadmin.rgw:{u'Code': u'NoSuchKey'}


-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies


Re: rgw subuser create and admin api

2015-12-17 Thread Yehuda Sadeh-Weinraub
On Thu, Dec 17, 2015 at 11:05 AM, Derek Yarnell  wrote:
> On 12/17/15 1:09 PM, Yehuda Sadeh-Weinraub wrote:
>>> Bug? Design?
>>
>> Somewhat a bug. The whole subusers that use s3 was unintentional, so
>> when creating the subuser api, we didn't think of needing the access
>> key. For some reason we do get the key type. Can you open a ceph
>> tracker issue for that?
>>
>> You can try using the metadata api to modify the user once it has been
>> created (need to get the user info, add the s3 key to the structure,
>> put the user info).
>>
>
> This will actually create a subuser and the S3 keys correctly (but not
> let you specify the access_key and secret_key).
>
> DEBUG:requests.packages.urllib3.connectionpool:"PUT
> /admin/user?subuser=json=-staff=test3=s3=read=True
> HTTP/1.1" 200 87
>
> I know about a 'GET /admin/metadata/user?format=json' to get the list of
> users from the adminops.  I see I can do things like 'radosgw-admin
> metadata get user:cephtest' but I can't seem to get something like this
> to work.
>
> DEBUG:requests.packages.urllib3.connectionpool:"GET
> /admin/metadata/get?format=json=user%3Acephtest HTTP/1.1" 404 20
> ERROR:rgwadmin.rgw:{u'Code': u'NoSuchKey'}
>

Try 'section=user=cephtests'

>
> --
> Derek T. Yarnell
> University of Maryland
> Institute for Advanced Computer Studies


Re: Issue with Ceph File System and LIO

2015-12-17 Thread Eric Eastman
With cephfs.patch and cephfs1.patch applied, I am now seeing:

[Thu Dec 17 14:27:59 2015] [ cut here ]
[Thu Dec 17 14:27:59 2015] WARNING: CPU: 0 PID: 3036 at
fs/ceph/addr.c:1171 ceph_write_begin+0xfb/0x120 [ceph]()
[Thu Dec 17 14:27:59 2015] Modules linked in: iscsi_target_mod
vhost_scsi tcm_qla2xxx ib_srpt tcm_fc tcm_usb_gadget tcm_loop
target_core_file target_core_iblock target_core_pscsi target_core_user
target_core_mod ipmi_devintf vhost qla2xxx ib_cm ib_sa ib_mad ib_core
ib_addr libfc scsi_transport_fc libcomposite udc_core uio configfs ttm
drm_kms_helper drm ipmi_ssif coretemp gpio_ich i2c_algo_bit kvm
fb_sys_fops syscopyarea sysfillrect sysimgblt shpchp input_leds ceph
irqbypass i7core_edac serio_raw hpilo edac_core ipmi_si
ipmi_msghandler 8250_fintek lpc_ich acpi_power_meter libceph mac_hid
libcrc32c fscache bonding lp parport mlx4_en vxlan ip6_udp_tunnel
udp_tunnel ptp pps_core hid_generic usbhid hid mlx4_core hpsa psmouse
bnx2 fjes scsi_transport_sas [last unloaded: target_core_mod]
[Thu Dec 17 14:27:59 2015] CPU: 0 PID: 3036 Comm: iscsi_trx Tainted: G
   W I 4.4.0-rc4-ede2 #1
[Thu Dec 17 14:27:59 2015] Hardware name: HP ProLiant DL360 G6, BIOS
P64 01/22/2015
[Thu Dec 17 14:27:59 2015]  c02b2e37 880c0289b958
813ad644 
[Thu Dec 17 14:27:59 2015]  880c0289b990 81079702
880c0289ba50 000846c21000
[Thu Dec 17 14:27:59 2015]  880c009ea200 1000
ea00122ed700 880c0289b9a0
[Thu Dec 17 14:27:59 2015] Call Trace:
[Thu Dec 17 14:27:59 2015]  [] dump_stack+0x44/0x60
[Thu Dec 17 14:27:59 2015]  [] warn_slowpath_common+0x82/0xc0
[Thu Dec 17 14:27:59 2015]  [] warn_slowpath_null+0x1a/0x20
[Thu Dec 17 14:27:59 2015]  []
ceph_write_begin+0xfb/0x120 [ceph]
[Thu Dec 17 14:27:59 2015]  []
generic_perform_write+0xbf/0x1a0
[Thu Dec 17 14:27:59 2015]  []
ceph_write_iter+0xf5c/0x1010 [ceph]
[Thu Dec 17 14:27:59 2015]  [] ? __schedule+0x386/0x9c0
[Thu Dec 17 14:27:59 2015]  [] ? schedule+0x35/0x80
[Thu Dec 17 14:27:59 2015]  [] ? __slab_free+0xb5/0x290
[Thu Dec 17 14:27:59 2015]  [] ?
iov_iter_get_pages+0x113/0x210
[Thu Dec 17 14:27:59 2015]  [] vfs_iter_write+0x63/0xa0
[Thu Dec 17 14:27:59 2015]  []
fd_do_rw.isra.5+0xc9/0x1b0 [target_core_file]
[Thu Dec 17 14:27:59 2015]  []
fd_execute_rw+0xc5/0x2a0 [target_core_file]
[Thu Dec 17 14:27:59 2015]  []
sbc_execute_rw+0x22/0x30 [target_core_mod]
[Thu Dec 17 14:27:59 2015]  []
__target_execute_cmd+0x1f/0x70 [target_core_mod]
[Thu Dec 17 14:27:59 2015]  []
target_execute_cmd+0x195/0x2a0 [target_core_mod]
[Thu Dec 17 14:27:59 2015]  []
iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
[Thu Dec 17 14:27:59 2015]  []
iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
[Thu Dec 17 14:27:59 2015]  []
iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
[Thu Dec 17 14:27:59 2015]  [] ? __switch_to+0x1cd/0x570
[Thu Dec 17 14:27:59 2015]  [] ?
iscsi_target_tx_thread+0x1c0/0x1c0 [iscsi_target_mod]
[Thu Dec 17 14:27:59 2015]  [] kthread+0xc9/0xe0
[Thu Dec 17 14:27:59 2015]  [] ?
kthread_create_on_node+0x180/0x180
[Thu Dec 17 14:27:59 2015]  [] ret_from_fork+0x3f/0x70
[Thu Dec 17 14:27:59 2015]  [] ?
kthread_create_on_node+0x180/0x180
[Thu Dec 17 14:27:59 2015] ---[ end trace 8346192e3f29ed5d ]---

Each WARNING on line 1171 is followed by a WARNING on line 125.
The dmesg output is attached to tracker ticket 14086.

Regards,
Eric

On Thu, Dec 17, 2015 at 2:38 AM, Yan, Zheng  wrote:
> On Thu, Dec 17, 2015 at 4:56 PM, Eric Eastman
>  wrote:
>> I patched the 4.4rc4 kernel source and restarted the test.  Shortly
>> after starting it, this showed up in dmesg:
>>
>> [Thu Dec 17 03:29:55 2015] WARNING: CPU: 0 PID: 2547 at
>> fs/ceph/addr.c:1162 ceph_write_begin+0xfb/0x120 [ceph]()
>> [Thu Dec 17 03:29:55 2015] Modules linked in: iscsi_target_mod
>> vhost_scsi tcm_qla2xxx ib_srpt tcm_fc tcm_usb_gadget tcm_loop
>> target_core_file target_core_iblock target_core_pscsi target_core_user
>> target_core_mod ipmi_devintf vhost qla2xxx ib_cm ib_sa ib_mad ib_core
>> ib_addr libfc scsi_transport_fc libcomposite udc_core uio configfs ttm
>> ipmi_ssif drm_kms_helper drm coretemp kvm gpio_ich i2c_algo_bit
>> i7core_edac fb_sys_fops syscopyarea edac_core sysfillrect sysimgblt
>> ipmi_si input_leds hpilo ipmi_msghandler shpchp acpi_power_meter
>> irqbypass serio_raw 8250_fintek lpc_ich mac_hid ceph bonding libceph
>> lp parport libcrc32c fscache mlx4_en vxlan ip6_udp_tunnel udp_tunnel
>> ptp pps_core hid_generic usbhid hid mlx4_core hpsa psmouse bnx2 fjes
>> scsi_transport_sas [last unloaded: target_core_mod]
>> [Thu Dec 17 03:29:55 2015] CPU: 0 PID: 2547 Comm: iscsi_trx Tainted: G
>>W I 4.4.0-rc4-ede1 #1
>> [Thu Dec 17 03:29:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
>> P64 01/22/2015
>> [Thu Dec 17 03:29:55 2015]  c020cd47 8805f1e97958
>> 813ad644 
>> [Thu Dec 17 03:29:55 2015]  

Re: Issue with Ceph File System and LIO

2015-12-17 Thread Yan, Zheng
On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman
<eric.east...@keepertech.com> wrote:
>> Hi Yan Zheng, Eric Eastman
>>
>> Similar bug was reported in f2fs, btrfs, it does affect 4.4-rc4, the fixing
>> patch was merged into 4.4-rc5, dfd01f026058 ("sched/wait: Fix the signal
>> handling fix").
>>
>> Related report & discussion was here:
>> https://lkml.org/lkml/2015/12/12/149
>>
>> I'm not sure the current reported issue of ceph was related to that though,
>> but at least try testing with an upgraded or patched kernel could verify it.
>> :)
>>
>> Thanks,
>>
>>> -Original Message-
>>> From: ceph-devel-ow...@vger.kernel.org 
>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of
>>> Yan, Zheng
>>> Sent: Friday, December 18, 2015 12:05 PM
>>> To: Eric Eastman
>>> Cc: Ceph Development
>>> Subject: Re: Issue with Ceph File System and LIO
>>>
>>> On Fri, Dec 18, 2015 at 3:49 AM, Eric Eastman
>>> <eric.east...@keepertech.com> wrote:
>>> > With cephfs.patch and cephfs1.patch applied and I am now seeing:
>>> >
>>> > [Thu Dec 17 14:27:59 2015] [ cut here ]
>>> > [Thu Dec 17 14:27:59 2015] WARNING: CPU: 0 PID: 3036 at
>>> > fs/ceph/addr.c:1171 ceph_write_begin+0xfb/0x120 [ceph]()
>>> > [Thu Dec 17 14:27:59 2015] Modules linked in: iscsi_target_mod
> ...
>>> >
>>>
>>> The page gets unlocked mystically. I still don't find any clue. Could
>>> you please try the new patch (not incremental patch). Besides, please
>>> enable CONFIG_DEBUG_VM when compiling the kernel.
>>>
>>> Thanks you very much
>>> Yan, Zheng
>>
> I have just installed the cephfs_new.patch and have set
> CONFIG_DEBUG_VM=y on a new 4.4rc4 kernel and restarted the ESXi iSCSI
> test to my Ceph File System gateway.  I plan to let it run overnight
> and report the status tomorrow.
>
> Let me know if I should move on to 4.4rc5 with or without patches and
> with or without  CONFIG_DEBUG_VM=y
>

please try rc5 kernel without patches and DEBUG_VM=y

Regards
Yan, Zheng


> Looking at the network traffic stats on my iSCSI gateway, with
> CONFIG_DEBUG_VM=y, throughput seems to be down by a factor of at least
> 10 compared to my last test without setting CONFIG_DEBUG_VM=y
>
> Regards,
> Eric


Re: Issue with Ceph File System and LIO

2015-12-17 Thread Eric Eastman
> Hi Yan Zheng, Eric Eastman
>
> Similar bug was reported in f2fs, btrfs, it does affect 4.4-rc4, the fixing
> patch was merged into 4.4-rc5, dfd01f026058 ("sched/wait: Fix the signal
> handling fix").
>
> Related report & discussion was here:
> https://lkml.org/lkml/2015/12/12/149
>
> I'm not sure the current reported issue of ceph was related to that though,
> but at least try testing with an upgraded or patched kernel could verify it.
> :)
>
> Thanks,
>
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org 
>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of
>> Yan, Zheng
>> Sent: Friday, December 18, 2015 12:05 PM
>> To: Eric Eastman
>> Cc: Ceph Development
>> Subject: Re: Issue with Ceph File System and LIO
>>
>> On Fri, Dec 18, 2015 at 3:49 AM, Eric Eastman
>> <eric.east...@keepertech.com> wrote:
>> > With cephfs.patch and cephfs1.patch applied and I am now seeing:
>> >
>> > [Thu Dec 17 14:27:59 2015] [ cut here ]
>> > [Thu Dec 17 14:27:59 2015] WARNING: CPU: 0 PID: 3036 at
>> > fs/ceph/addr.c:1171 ceph_write_begin+0xfb/0x120 [ceph]()
>> > [Thu Dec 17 14:27:59 2015] Modules linked in: iscsi_target_mod
...
>> >
>>
>> The page gets unlocked mystically. I still don't find any clue. Could
>> you please try the new patch (not incremental patch). Besides, please
>> enable CONFIG_DEBUG_VM when compiling the kernel.
>>
>> Thanks you very much
>> Yan, Zheng
>
I have just installed the cephfs_new.patch and have set
CONFIG_DEBUG_VM=y on a new 4.4rc4 kernel and restarted the ESXi iSCSI
test to my Ceph File System gateway.  I plan to let it run overnight
and report the status tomorrow.

Let me know if I should move on to 4.4rc5 with or without patches and
with or without  CONFIG_DEBUG_VM=y

Looking at the network traffic stats on my iSCSI gateway, with
CONFIG_DEBUG_VM=y, throughput seems to be down by a factor of at least
10 compared to my last test without setting CONFIG_DEBUG_VM=y

Regards,
Eric


Re: rgw subuser create and admin api

2015-12-17 Thread Derek Yarnell
On 12/17/15 3:15 PM, Yehuda Sadeh-Weinraub wrote:
> 
> Right. Reading the code again:
> 
> Try:
> GET /admin/metadata/user=cephtest

Thanks, this is very helpful and works, and I was also able to get the PUT
working.  Only question: is it expected to return a 204 No Content?

2015-12-17 17:42:39.422612 7f88f47f0700 20 RGWEnv::set(): HTTP_HOST:
localhost:7480
2015-12-17 17:42:39.422619 7f88f47f0700 20 RGWEnv::set():
HTTP_ACCEPT_ENCODING: gzip, deflate
2015-12-17 17:42:39.422621 7f88f47f0700 20 RGWEnv::set(): HTTP_ACCEPT: */*
2015-12-17 17:42:39.422623 7f88f47f0700 20 RGWEnv::set():
HTTP_USER_AGENT: python-requests/2.3.0 CPython/2.7.10 Darwin/14.5.0
2015-12-17 17:42:39.422625 7f88f47f0700 20 RGWEnv::set(): HTTP_DATE:
Thu, 17 Dec 2015 22:42:39 GMT
2015-12-17 17:42:39.422627 7f88f47f0700 20 RGWEnv::set(): CONTENT_TYPE:
application/json
2015-12-17 17:42:39.422629 7f88f47f0700 20 RGWEnv::set():
HTTP_AUTHORIZATION: AWS RTJ1TL13CH613JRU2PJD:K3xaPHDy6t3r0COfjwl9rAUsUfY=
2015-12-17 17:42:39.422630 7f88f47f0700 20 RGWEnv::set():
HTTP_X_FORWARDED_FOR: 192.168.86.254
2015-12-17 17:42:39.422632 7f88f47f0700 20 RGWEnv::set():
HTTP_X_FORWARDED_HOST: ceph.umiacs.umd.edu
2015-12-17 17:42:39.422634 7f88f47f0700 20 RGWEnv::set():
HTTP_X_FORWARDED_SERVER: cephproxy00.umiacs.umd.edu
2015-12-17 17:42:39.422636 7f88f47f0700 20 RGWEnv::set():
HTTP_CONNECTION: Keep-Alive
2015-12-17 17:42:39.422637 7f88f47f0700 20 RGWEnv::set():
CONTENT_LENGTH: 1531
2015-12-17 17:42:39.422638 7f88f47f0700 20 RGWEnv::set():
REQUEST_METHOD: PUT
2015-12-17 17:42:39.422640 7f88f47f0700 20 RGWEnv::set(): REQUEST_URI:
/admin/metadata/user
2015-12-17 17:42:39.422641 7f88f47f0700 20 RGWEnv::set(): QUERY_STRING:
key=-staff
2015-12-17 17:42:39.422643 7f88f47f0700 20 RGWEnv::set(): REMOTE_USER:
2015-12-17 17:42:39.422644 7f88f47f0700 20 RGWEnv::set(): SCRIPT_URI:
/admin/metadata/user
2015-12-17 17:42:39.422651 7f88f47f0700 20 RGWEnv::set(): SERVER_PORT: 7480
2015-12-17 17:42:39.422652 7f88f47f0700 20 CONTENT_LENGTH=1531
2015-12-17 17:42:39.422654 7f88f47f0700 20 CONTENT_TYPE=application/json
2015-12-17 17:42:39.422655 7f88f47f0700 20 HTTP_ACCEPT=*/*
2015-12-17 17:42:39.422655 7f88f47f0700 20 HTTP_ACCEPT_ENCODING=gzip,
deflate
2015-12-17 17:42:39.422656 7f88f47f0700 20 HTTP_AUTHORIZATION=AWS
RTJ1TL13CH613JRU2PJD:K3xaPHDy6t3r0COfjwl9rAUsUfY=
2015-12-17 17:42:39.422657 7f88f47f0700 20 HTTP_CONNECTION=Keep-Alive
2015-12-17 17:42:39.422658 7f88f47f0700 20 HTTP_DATE=Thu, 17 Dec 2015
22:42:39 GMT
2015-12-17 17:42:39.422658 7f88f47f0700 20 HTTP_HOST=localhost:7480
2015-12-17 17:42:39.422659 7f88f47f0700 20
HTTP_USER_AGENT=python-requests/2.3.0 CPython/2.7.10 Darwin/14.5.0
2015-12-17 17:42:39.422660 7f88f47f0700 20
HTTP_X_FORWARDED_FOR=192.168.86.254
2015-12-17 17:42:39.422660 7f88f47f0700 20
HTTP_X_FORWARDED_HOST=ceph.umiacs.umd.edu
2015-12-17 17:42:39.422661 7f88f47f0700 20
HTTP_X_FORWARDED_SERVER=cephproxy00.umiacs.umd.edu
2015-12-17 17:42:39.422662 7f88f47f0700 20 QUERY_STRING=key=-staff
2015-12-17 17:42:39.422662 7f88f47f0700 20 REMOTE_USER=
2015-12-17 17:42:39.422663 7f88f47f0700 20 REQUEST_METHOD=PUT
2015-12-17 17:42:39.422664 7f88f47f0700 20 REQUEST_URI=/admin/metadata/user
2015-12-17 17:42:39.422664 7f88f47f0700 20 SCRIPT_URI=/admin/metadata/user
2015-12-17 17:42:39.422665 7f88f47f0700 20 SERVER_PORT=7480
2015-12-17 17:42:39.422667 7f88f47f0700 20 RGWEnv::set(): HTTP_HOST:
localhost:7480
2015-12-17 17:42:39.422668 7f88f47f0700 20 RGWEnv::set():
HTTP_ACCEPT_ENCODING: gzip, deflate
2015-12-17 17:42:39.422670 7f88f47f0700 20 RGWEnv::set(): HTTP_ACCEPT: */*
2015-12-17 17:42:39.422671 7f88f47f0700 20 RGWEnv::set():
HTTP_USER_AGENT: python-requests/2.3.0 CPython/2.7.10 Darwin/14.5.0
2015-12-17 17:42:39.422672 7f88f47f0700 20 RGWEnv::set(): HTTP_DATE:
Thu, 17 Dec 2015 22:42:39 GMT
2015-12-17 17:42:39.422673 7f88f47f0700 20 RGWEnv::set(): CONTENT_TYPE:
application/json
2015-12-17 17:42:39.422674 7f88f47f0700 20 RGWEnv::set():
HTTP_AUTHORIZATION: AWS RTJ1TL13CH613JRU2PJD:K3xaPHDy6t3r0COfjwl9rAUsUfY=
2015-12-17 17:42:39.422676 7f88f47f0700 20 RGWEnv::set():
HTTP_X_FORWARDED_FOR: 192.168.86.254
2015-12-17 17:42:39.422677 7f88f47f0700 20 RGWEnv::set():
HTTP_X_FORWARDED_HOST: ceph.umiacs.umd.edu
2015-12-17 17:42:39.422678 7f88f47f0700 20 RGWEnv::set():
HTTP_X_FORWARDED_SERVER: cephproxy00.umiacs.umd.edu
2015-12-17 17:42:39.422679 7f88f47f0700 20 RGWEnv::set():
HTTP_CONNECTION: Keep-Alive
2015-12-17 17:42:39.422680 7f88f47f0700 20 RGWEnv::set():
CONTENT_LENGTH: 1531
2015-12-17 17:42:39.422681 7f88f47f0700 20 RGWEnv::set():
REQUEST_METHOD: PUT
2015-12-17 17:42:39.422682 7f88f47f0700 20 RGWEnv::set(): REQUEST_URI:
/admin/metadata/user
2015-12-17 17:42:39.422683 7f88f47f0700 20 RGWEnv::set(): QUERY_STRING:
key=-staff
2015-12-17 17:42:39.422684 7f88f47f0700 20 RGWEnv::set(): REMOTE_USER:
2015-12-17 17:42:39.422685 7f88f47f0700 20 RGWEnv::set(): SCRIPT_URI:
/admin/metadata/user
2015-12-17 17:42:39.422686 7f88f47f0700 20 RGWEnv::set(): SERVER_PORT: 7480
2015-12-17 

Re: understanding partprobe failure

2015-12-17 Thread Loic Dachary


On 17/12/2015 16:49, Ilya Dryomov wrote:
> On Thu, Dec 17, 2015 at 1:19 PM, Loic Dachary  wrote:
>> Hi Ilya,
>>
>> I'm seeing a partprobe failure right after a disk was zapped with sgdisk 
>> --clear --mbrtogpt -- /dev/vdb:
>>
>> partprobe /dev/vdb failed : Error: Partition(s) 1 on /dev/vdb have been 
>> written, but we have been unable to inform the kernel of the change, 
>> probably because it/they are in use. As a result, the old partition(s) will 
>> remain in use. You should reboot now before making further changes.
>>
>> waiting 60 seconds (see the log below) and trying again succeeds. The 
>> partprobe call is guarded by udevadm settle to prevent udev actions from 
>> racing and nothing else goes on in the machine.
>>
>> Any idea how that could happen ?
>>
>> Cheers
>>
>> 2015-12-17 11:46:10,356.356 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:get_dm_uuid
>>  /dev/vdb uuid path is /sys/dev/block/253:16/dm/uuid
>> 2015-12-17 11:46:10,357.357 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:Zapping
>>  partition table on /dev/vdb
>> 2015-12-17 11:46:10,358.358 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running
>>  command: /usr/sbin/sgdisk --zap-all -- /dev/vdb
>> 2015-12-17 11:46:10,365.365 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Caution:
>>  invalid backup GPT header, but valid main header; regenerating
>> 2015-12-17 11:46:10,366.366 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:backup 
>> header from main header.
>> 2015-12-17 11:46:10,366.366 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
>> 2015-12-17 11:46:10,366.366 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning!
>>  Main and backup partition tables differ! Use the 'c' and 'e' options
>> 2015-12-17 11:46:10,367.367 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:on the 
>> recovery & transformation menu to examine the two tables.
>> 2015-12-17 11:46:10,367.367 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
>> 2015-12-17 11:46:10,367.367 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning!
>>  One or more CRCs don't match. You should repair the disk!
>> 2015-12-17 11:46:10,368.368 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
>> 2015-12-17 11:46:11,413.413 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
>> 2015-12-17 11:46:11,414.414 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Caution:
>>  Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
>> 2015-12-17 11:46:11,414.414 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:verification
>>  and recovery are STRONGLY recommended.
>> 2015-12-17 11:46:11,414.414 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
>> 2015-12-17 11:46:11,415.415 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning:
>>  The kernel is still using the old partition table.
>> 2015-12-17 11:46:11,415.415 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The 
>> new table will be used at the next reboot.
>> 2015-12-17 11:46:11,416.416 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:GPT 
>> data structures destroyed! You may now partition the disk using fdisk or
>> 2015-12-17 11:46:11,416.416 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:other 
>> utilities.
>> 2015-12-17 11:46:11,416.416 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running
>>  command: /usr/sbin/sgdisk --clear --mbrtogpt -- /dev/vdb
>> 2015-12-17 11:46:12,504.504 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Creating
>>  new GPT entries.
>> 2015-12-17 11:46:12,505.505 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning:
>>  The kernel is still using the old partition table.
>> 2015-12-17 11:46:12,505.505 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The 
>> new table will be used at the next reboot.
>> 2015-12-17 11:46:12,505.505 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The 
>> operation has completed successfully.
>> 2015-12-17 11:46:12,506.506 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:Calling
>>  partprobe on zapped device /dev/vdb
>> 2015-12-17 11:46:12,507.507 
>> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running
>>  command: /usr/bin/udevadm settle 

Re: rgw subuser create and admin api

2015-12-17 Thread Yehuda Sadeh-Weinraub
On Thu, Dec 17, 2015 at 2:44 PM, Derek Yarnell  wrote:
> On 12/17/15 3:15 PM, Yehuda Sadeh-Weinraub wrote:
>>
>> Right. Reading the code again:
>>
>> Try:
>> GET /admin/metadata/user=cephtest
>
> Thanks this is very helpful and works and I was able to also get the PUT
> working.  Only question is that is it expected to return a 204 no content?

Yes, it's expected.

Yehuda

>
> 2015-12-17 17:42:39.422612 7f88f47f0700 20 RGWEnv::set(): HTTP_HOST:
> localhost:7480
> 2015-12-17 17:42:39.422619 7f88f47f0700 20 RGWEnv::set():
> HTTP_ACCEPT_ENCODING: gzip, deflate
> 2015-12-17 17:42:39.422621 7f88f47f0700 20 RGWEnv::set(): HTTP_ACCEPT: */*
> 2015-12-17 17:42:39.422623 7f88f47f0700 20 RGWEnv::set():
> HTTP_USER_AGENT: python-requests/2.3.0 CPython/2.7.10 Darwin/14.5.0
> 2015-12-17 17:42:39.422625 7f88f47f0700 20 RGWEnv::set(): HTTP_DATE:
> Thu, 17 Dec 2015 22:42:39 GMT
> 2015-12-17 17:42:39.422627 7f88f47f0700 20 RGWEnv::set(): CONTENT_TYPE:
> application/json
> 2015-12-17 17:42:39.422629 7f88f47f0700 20 RGWEnv::set():
> HTTP_AUTHORIZATION: AWS RTJ1TL13CH613JRU2PJD:K3xaPHDy6t3r0COfjwl9rAUsUfY=
> 2015-12-17 17:42:39.422630 7f88f47f0700 20 RGWEnv::set():
> HTTP_X_FORWARDED_FOR: 192.168.86.254
> 2015-12-17 17:42:39.422632 7f88f47f0700 20 RGWEnv::set():
> HTTP_X_FORWARDED_HOST: ceph.umiacs.umd.edu
> 2015-12-17 17:42:39.422634 7f88f47f0700 20 RGWEnv::set():
> HTTP_X_FORWARDED_SERVER: cephproxy00.umiacs.umd.edu
> 2015-12-17 17:42:39.422636 7f88f47f0700 20 RGWEnv::set():
> HTTP_CONNECTION: Keep-Alive
> 2015-12-17 17:42:39.422637 7f88f47f0700 20 RGWEnv::set():
> CONTENT_LENGTH: 1531
> 2015-12-17 17:42:39.422638 7f88f47f0700 20 RGWEnv::set():
> REQUEST_METHOD: PUT
> 2015-12-17 17:42:39.422640 7f88f47f0700 20 RGWEnv::set(): REQUEST_URI:
> /admin/metadata/user
> 2015-12-17 17:42:39.422641 7f88f47f0700 20 RGWEnv::set(): QUERY_STRING:
> key=-staff
> 2015-12-17 17:42:39.422643 7f88f47f0700 20 RGWEnv::set(): REMOTE_USER:
> 2015-12-17 17:42:39.422644 7f88f47f0700 20 RGWEnv::set(): SCRIPT_URI:
> /admin/metadata/user
> 2015-12-17 17:42:39.422651 7f88f47f0700 20 RGWEnv::set(): SERVER_PORT: 7480
> 2015-12-17 17:42:39.422652 7f88f47f0700 20 CONTENT_LENGTH=1531
> 2015-12-17 17:42:39.422654 7f88f47f0700 20 CONTENT_TYPE=application/json
> 2015-12-17 17:42:39.422655 7f88f47f0700 20 HTTP_ACCEPT=*/*
> 2015-12-17 17:42:39.422655 7f88f47f0700 20 HTTP_ACCEPT_ENCODING=gzip,
> deflate
> 2015-12-17 17:42:39.422656 7f88f47f0700 20 HTTP_AUTHORIZATION=AWS
> RTJ1TL13CH613JRU2PJD:K3xaPHDy6t3r0COfjwl9rAUsUfY=
> 2015-12-17 17:42:39.422657 7f88f47f0700 20 HTTP_CONNECTION=Keep-Alive
> 2015-12-17 17:42:39.422658 7f88f47f0700 20 HTTP_DATE=Thu, 17 Dec 2015
> 22:42:39 GMT
> 2015-12-17 17:42:39.422658 7f88f47f0700 20 HTTP_HOST=localhost:7480
> 2015-12-17 17:42:39.422659 7f88f47f0700 20
> HTTP_USER_AGENT=python-requests/2.3.0 CPython/2.7.10 Darwin/14.5.0
> 2015-12-17 17:42:39.422660 7f88f47f0700 20
> HTTP_X_FORWARDED_FOR=192.168.86.254
> 2015-12-17 17:42:39.422660 7f88f47f0700 20
> HTTP_X_FORWARDED_HOST=ceph.umiacs.umd.edu
> 2015-12-17 17:42:39.422661 7f88f47f0700 20
> HTTP_X_FORWARDED_SERVER=cephproxy00.umiacs.umd.edu
> 2015-12-17 17:42:39.422662 7f88f47f0700 20 QUERY_STRING=key=-staff
> 2015-12-17 17:42:39.422662 7f88f47f0700 20 REMOTE_USER=
> 2015-12-17 17:42:39.422663 7f88f47f0700 20 REQUEST_METHOD=PUT
> 2015-12-17 17:42:39.422664 7f88f47f0700 20 REQUEST_URI=/admin/metadata/user
> 2015-12-17 17:42:39.422664 7f88f47f0700 20 SCRIPT_URI=/admin/metadata/user
> 2015-12-17 17:42:39.422665 7f88f47f0700 20 SERVER_PORT=7480
> 2015-12-17 17:42:39.422667 7f88f47f0700 20 RGWEnv::set(): HTTP_HOST:
> localhost:7480
> 2015-12-17 17:42:39.422668 7f88f47f0700 20 RGWEnv::set():
> HTTP_ACCEPT_ENCODING: gzip, deflate
> 2015-12-17 17:42:39.422670 7f88f47f0700 20 RGWEnv::set(): HTTP_ACCEPT: */*
> 2015-12-17 17:42:39.422671 7f88f47f0700 20 RGWEnv::set():
> HTTP_USER_AGENT: python-requests/2.3.0 CPython/2.7.10 Darwin/14.5.0
> 2015-12-17 17:42:39.422672 7f88f47f0700 20 RGWEnv::set(): HTTP_DATE:
> Thu, 17 Dec 2015 22:42:39 GMT
> 2015-12-17 17:42:39.422673 7f88f47f0700 20 RGWEnv::set(): CONTENT_TYPE:
> application/json
> 2015-12-17 17:42:39.422674 7f88f47f0700 20 RGWEnv::set():
> HTTP_AUTHORIZATION: AWS RTJ1TL13CH613JRU2PJD:K3xaPHDy6t3r0COfjwl9rAUsUfY=
> 2015-12-17 17:42:39.422676 7f88f47f0700 20 RGWEnv::set():
> HTTP_X_FORWARDED_FOR: 192.168.86.254
> 2015-12-17 17:42:39.422677 7f88f47f0700 20 RGWEnv::set():
> HTTP_X_FORWARDED_HOST: ceph.umiacs.umd.edu
> 2015-12-17 17:42:39.422678 7f88f47f0700 20 RGWEnv::set():
> HTTP_X_FORWARDED_SERVER: cephproxy00.umiacs.umd.edu
> 2015-12-17 17:42:39.422679 7f88f47f0700 20 RGWEnv::set():
> HTTP_CONNECTION: Keep-Alive
> 2015-12-17 17:42:39.422680 7f88f47f0700 20 RGWEnv::set():
> CONTENT_LENGTH: 1531
> 2015-12-17 17:42:39.422681 7f88f47f0700 20 RGWEnv::set():
> REQUEST_METHOD: PUT
> 2015-12-17 17:42:39.422682 7f88f47f0700 20 RGWEnv::set(): REQUEST_URI:
> /admin/metadata/user
> 2015-12-17 17:42:39.422683 7f88f47f0700 20 

Re: [ceph-users] v10.0.0 released

2015-12-17 Thread Loic Dachary
The script handles UTF-8 fine, the copy/paste is at fault here ;-)

On 24/11/2015 07:59, piotr.da...@ts.fujitsu.com wrote:
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>> ow...@vger.kernel.org] On Behalf Of Sage Weil
>> Sent: Monday, November 23, 2015 5:08 PM
>>
>> This is the first development release for the Jewel cycle.  We are off to a
>> good start, with lots of performance improvements flowing into the tree.
>> We are targetting sometime in Q1 2016 for the final Jewel.
>>
>> [..]
>> (`pr#5853 `_, Piotr Dałek)
> 
> Hopefully at that point the script that generates this list will learn how to 
> handle UTF-8 ;-)
> 
> 
> With best regards / Pozdrawiam
> Piotr Dałek
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: Client still connect failed leader after that mon down

2015-12-17 Thread Jevon Qiao

On 17/12/15 21:27, Sage Weil wrote:
> On Thu, 17 Dec 2015, Jaze Lee wrote:
>> Hello cephers:
>>  In our test, there are three monitors. We find that running a ceph
>> command on a client is slow when the leader mon is down. Even after a
>> long time, the first ceph command a client runs is still slow.
>> From strace, we find that the client first tries to connect to the
>> leader, and only after 3s does it connect to the second mon.
>> After some searching we find that the quorum has not changed and the
>> leader is still the down monitor.
>> Is that normal?  Or is there something I missed?
>
> It's normal.  Even when the quorum does change, the client doesn't
> know that.  It should be contacting a random mon on startup, though, so I
> would expect the 3s delay 1/3 of the time.
That's because the client randomly picks a mon from the monmap. But what we
observed is that when a mon is down, no change is made to the monmap (neither
the epoch nor the members). Is that the culprit for this behaviour?

Thanks,
Jevon

> A long-standing low-priority feature request is to have the client contact
> 2 mons in parallel so that it can still connect quickly if one is down.
> It requires some non-trivial work in mon/MonClient.{cc,h}, though, and I
> don't think anyone has looked at it seriously.
>
> sage
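
As a rough illustration of that "contact two mons in parallel" idea, here
is a toy sketch in Python with placeholder addresses (it is not the actual
MonClient code, which lives in C++ in mon/MonClient.{cc,h}); the point is
just to race two random monitors and keep whichever connection completes
first:

    import asyncio
    import random

    MONS = ['10.0.0.1:6789', '10.0.0.2:6789', '10.0.0.3:6789']

    async def probe(addr, timeout=3.0):
        host, port = addr.rsplit(':', 1)
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, int(port)), timeout)
        return addr, reader, writer

    async def connect_any():
        tasks = [asyncio.ensure_future(probe(a))
                 for a in random.sample(MONS, 2)]   # hunt two mons at once
        while tasks:
            done, tasks = await asyncio.wait(
                tasks, return_when=asyncio.FIRST_COMPLETED)
            for t in done:
                if not t.exception():               # first healthy mon wins
                    for p in tasks:
                        p.cancel()
                    return t.result()
        raise ConnectionError('no monitor reachable')

    # addr, reader, writer = asyncio.get_event_loop().run_until_complete(connect_any())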



Re: Issue with Ceph File System and LIO

2015-12-17 Thread Yan, Zheng
On Fri, Dec 18, 2015 at 3:49 AM, Eric Eastman
 wrote:
> With cephfs.patch and cephfs1.patch applied and I am now seeing:
>
> [Thu Dec 17 14:27:59 2015] [ cut here ]
> [Thu Dec 17 14:27:59 2015] WARNING: CPU: 0 PID: 3036 at
> fs/ceph/addr.c:1171 ceph_write_begin+0xfb/0x120 [ceph]()
> [Thu Dec 17 14:27:59 2015] Modules linked in: iscsi_target_mod
> vhost_scsi tcm_qla2xxx ib_srpt tcm_fc tcm_usb_gadget tcm_loop
> target_core_file target_core_iblock target_core_pscsi target_core_user
> target_core_mod ipmi_devintf vhost qla2xxx ib_cm ib_sa ib_mad ib_core
> ib_addr libfc scsi_transport_fc libcomposite udc_core uio configfs ttm
> drm_kms_helper drm ipmi_ssif coretemp gpio_ich i2c_algo_bit kvm
> fb_sys_fops syscopyarea sysfillrect sysimgblt shpchp input_leds ceph
> irqbypass i7core_edac serio_raw hpilo edac_core ipmi_si
> ipmi_msghandler 8250_fintek lpc_ich acpi_power_meter libceph mac_hid
> libcrc32c fscache bonding lp parport mlx4_en vxlan ip6_udp_tunnel
> udp_tunnel ptp pps_core hid_generic usbhid hid mlx4_core hpsa psmouse
> bnx2 fjes scsi_transport_sas [last unloaded: target_core_mod]
> [Thu Dec 17 14:27:59 2015] CPU: 0 PID: 3036 Comm: iscsi_trx Tainted: G
>W I 4.4.0-rc4-ede2 #1
> [Thu Dec 17 14:27:59 2015] Hardware name: HP ProLiant DL360 G6, BIOS
> P64 01/22/2015
> [Thu Dec 17 14:27:59 2015]  c02b2e37 880c0289b958
> 813ad644 
> [Thu Dec 17 14:27:59 2015]  880c0289b990 81079702
> 880c0289ba50 000846c21000
> [Thu Dec 17 14:27:59 2015]  880c009ea200 1000
> ea00122ed700 880c0289b9a0
> [Thu Dec 17 14:27:59 2015] Call Trace:
> [Thu Dec 17 14:27:59 2015]  [] dump_stack+0x44/0x60
> [Thu Dec 17 14:27:59 2015]  [] 
> warn_slowpath_common+0x82/0xc0
> [Thu Dec 17 14:27:59 2015]  [] warn_slowpath_null+0x1a/0x20
> [Thu Dec 17 14:27:59 2015]  []
> ceph_write_begin+0xfb/0x120 [ceph]
> [Thu Dec 17 14:27:59 2015]  []
> generic_perform_write+0xbf/0x1a0
> [Thu Dec 17 14:27:59 2015]  []
> ceph_write_iter+0xf5c/0x1010 [ceph]
> [Thu Dec 17 14:27:59 2015]  [] ? __schedule+0x386/0x9c0
> [Thu Dec 17 14:27:59 2015]  [] ? schedule+0x35/0x80
> [Thu Dec 17 14:27:59 2015]  [] ? __slab_free+0xb5/0x290
> [Thu Dec 17 14:27:59 2015]  [] ?
> iov_iter_get_pages+0x113/0x210
> [Thu Dec 17 14:27:59 2015]  [] vfs_iter_write+0x63/0xa0
> [Thu Dec 17 14:27:59 2015]  []
> fd_do_rw.isra.5+0xc9/0x1b0 [target_core_file]
> [Thu Dec 17 14:27:59 2015]  []
> fd_execute_rw+0xc5/0x2a0 [target_core_file]
> [Thu Dec 17 14:27:59 2015]  []
> sbc_execute_rw+0x22/0x30 [target_core_mod]
> [Thu Dec 17 14:27:59 2015]  []
> __target_execute_cmd+0x1f/0x70 [target_core_mod]
> [Thu Dec 17 14:27:59 2015]  []
> target_execute_cmd+0x195/0x2a0 [target_core_mod]
> [Thu Dec 17 14:27:59 2015]  []
> iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
> [Thu Dec 17 14:27:59 2015]  []
> iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
> [Thu Dec 17 14:27:59 2015]  []
> iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
> [Thu Dec 17 14:27:59 2015]  [] ? __switch_to+0x1cd/0x570
> [Thu Dec 17 14:27:59 2015]  [] ?
> iscsi_target_tx_thread+0x1c0/0x1c0 [iscsi_target_mod]
> [Thu Dec 17 14:27:59 2015]  [] kthread+0xc9/0xe0
> [Thu Dec 17 14:27:59 2015]  [] ?
> kthread_create_on_node+0x180/0x180
> [Thu Dec 17 14:27:59 2015]  [] ret_from_fork+0x3f/0x70
> [Thu Dec 17 14:27:59 2015]  [] ?
> kthread_create_on_node+0x180/0x180
> [Thu Dec 17 14:27:59 2015] ---[ end trace 8346192e3f29ed5d ]---
>

The page gets unlocked mysteriously; I still can't find any clue. Could
you please try the new patch (not the incremental patch)? Also, please
enable CONFIG_DEBUG_VM when compiling the kernel.

Thank you very much
Yan, Zheng


cephfs_new.patch
Description: Binary data

