Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Stefan Priebe - Profihost AG
Am 29.05.2012 23:41, schrieb Mark Nelson:

 When you are using 1 thread, you are hitting a ~40MB/s limit (probably
 networking related) before the data gets to the journal.
My network is capable of at least 130MB/s, and I get 130MB/s with 3.0.30 using
16 threads. I don't get why I should hit a limit here.

 Because (in
 this case) the filestore data disk can handle that throughput,
 everything looks nice and consistent.
osd bench, fio and dd tell me the underlying disks can handle
260MB/s (Intel SSD).

 In this case, that 40MB/s limit with 1 thread has increased.  Now more
 data is getting fed into the journal than the filestore can write out to
 disk.  Eventually writes stall while the data is being written out.

I don't want to argue, but why should this only happen with 3.4.0 and NOT
with 3.0.30? It also does not matter which underlying FS I use;
it is the same with XFS AND btrfs.

Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD deadlock with cephfs client and OSD on same machine

2012-05-30 Thread Amon Ott
On Tuesday 29 May 2012 you wrote:
 On Tue, 29 May 2012, Amon Ott wrote:
  Conclusion: If you want to run OSD and cephfs kernel client on the same
  Linux server and have a libc6 before 2.14 (e.g. Debian's newest in
  experimental is 2.13) or a kernel before 2.6.39, either do not use ext4
  (but btrfs is still unstable) or risk data loss by missing syncs through
  the workaround of forcing filestore_fsync_flushes_journal_data to true.

 Note that fsync_flushed_journal_data should only be set to true with ext3
 and the 'data=ordered' or 'data=journal' mount option.  It is an
 implementation artifact only that fsync() will flush all previous writes.

I am fully aware of that; this is why I mentioned the risk of data loss.
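
(For anyone wondering whether their own nodes are affected, the two
prerequisites can be checked quickly; the thresholds are the ones quoted
above, and the commands are just the standard version queries:

getconf GNU_LIBC_VERSION   # needs glibc >= 2.14 for the syncfs() wrapper
uname -r                   # needs kernel >= 2.6.39 for the syncfs() syscall
)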

  Please consider putting out a fat warning at least at build time, if
  syncfs() is not available, e.g. "No syncfs() syscall, please expect a
  deadlock when running osd on non-btrfs together with a local cephfs
  mount." Even better would be a quick runtime test for missing syncfs()
  and storage on non-btrfs that spits out a warning, if deadlock is
  possible.

 I think a runtime warning makes more sense; nobody will see the build time
 warning (e.g., those who installed debs).

Yes, fully agreed.

  As a side effect, the experienced lockup seems to be a good way to
  reproduce the long standing bug 1047 - when our cluster tried to recover,
  all MDS instances died with those symptoms. It seems that a partial sync
  of journal or data partition causes that broken state.

 Interesting!  If you could also note on that bug what the metadata
 workload was (what was making hard links?), that would be great!

We are auto-creating up to 200 preconfigured home directories on all four
nodes; each home dir consists of ca. 400 dirs and files with ca. 16 MB of
data. AFAIK, there are no hard links involved. So it is a massively parallel
creation of many small files, probably with lots of metadata for them.

I will add that as a note to the bug, too.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH   Tel: +49 30 24342334
Am Köllnischen Park 1Fax: +49 30 24342336
10179 Berlin http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Stefan Priebe - Profihost AG
Am 30.05.2012 09:01, schrieb Stefan Majer:
 Hi Stefan,
 
 what is your replication factor? If it is set to 2 and your OSDs have a
 single 1Gbit/s link, you will never see more than 120MB/s; I suspect
 much less, because every write has to go over the same wire twice from
 each OSD.
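
(Rough arithmetic behind that number, as a back-of-envelope sketch assuming
replication 2 and a single full-duplex 1Gbit/s link per OSD:

 1 Gbit/s is ~125 MB/s raw, more like 110-120 MB/s after protocol overhead.
 With replication 2, each client byte crosses an OSD's wire once for the
 primary copy and once more for the replica, so a single saturated link
 caps client-visible throughput at roughly 120 / 2 = ~60 MB/s.)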
Sure, but right now I see 10MB/s with kernel 3.4 and 170MB/s with
3.0.30 using bonded 2x 1Gbit/s links.

Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


building test cluster : missing /etc/ceph/client.admin.keyring, need help

2012-05-30 Thread Alexandre DERUMIER

Hi, 
I'm building my rados test cluster, 


3 servers, with on each server: 1 mon + 5 osd

The mon and osd daemons are started, but when I use the ceph command, the
client.admin.keyring is missing:

root@cephtest1:/etc/ceph# ceph -w
2012-05-30 09:05:35.255619 7fd1e9cfa760 -1 auth: failed to open keyring from 
/etc/ceph/client.admin.keyring
2012-05-30 09:05:35.255631 7fd1e9cfa760 -1 monclient(hunting): failed to open 
keyring: (2) No such file or directory
2012-05-30 09:05:35.255693 7fd1e9cfa760 -1 ceph_tool_common_init failed.


root@cephtest1:/etc/ceph# ls /etc/ceph/
ceph.conf  osd.0.keyring  osd.1.keyring  osd.2.keyring  osd.3.keyring  
osd.4.keyring

Do I need to generate a keyring? How can I do it? 






/etc/ceph.conf 


[global] 
; use cephx or none 
auth supported = cephx 
keyring = /etc/ceph/$name.keyring 


[mon] 
mon data = /srv/mon.$id 


[mds] 


[osd] 
osd data = /srv/osd.$id 
osd journal = /srv/osd.$id.journal 
osd journal size = 1000 
; uncomment the following line if you are mounting with ext4 
; filestore xattr use omap = true 


[mon.a] 
host = cephtest1 
mon addr = 10.3.94.27:6789 


[mon.b] 
host = cephtest2 
mon addr = 10.3.94.28:6789 


[mon.c] 
host = cephtest3 
mon addr = 10.3.94.29:6789 


[osd.0] 
host = cephtest1 
addr = 10.3.94.27 


[osd.1] 
host = cephtest1 
addr = 10.3.94.27 


[osd.2] 
host = cephtest1 
addr = 10.3.94.27 


[osd.3] 
host = cephtest1 
addr = 10.3.94.27 


[osd.4] 
host = cephtest1 
addr = 10.3.94.27 


[osd.5] 
host = cephtest2 
addr = 10.3.94.28 


[osd.6] 
host = cephtest2 
addr = 10.3.94.28 


[osd.7] 
host = cephtest2 
addr = 10.3.94.28 


[osd.8] 
host = cephtest2 
addr = 10.3.94.28 


[osd.9] 
host = cephtest2 
addr = 10.3.94.28 


[osd.10] 
host = cephtest3 
addr = 10.3.94.29 

[osd.11] 
host = cephtest3 
addr = 10.3.94.29 


[osd.12] 
host = cephtest3 
addr = 10.3.94.29 


[osd.13] 
host = cephtest3 
addr = 10.3.94.29 


[osd.14] 
host = cephtest3 
addr = 10.3.94.29 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: building test cluster : missing /etc/ceph/client.admin.keyring, need help

2012-05-30 Thread Stefan Priebe - Profihost AG
Am 30.05.2012 09:20, schrieb Alexandre DERUMIER:
 
 Hi, 
 I'm building my rados test cluster, 
 
 
 3 servers,with on each server : 1 mon - 5 osd
 
 mon daemon and osd are started, but when i use ceph command, it's missing 
 client.admin.keyring
 
 root@cephtest1:/etc/ceph# ceph -w
 2012-05-30 09:05:35.255619 7fd1e9cfa760 -1 auth: failed to open keyring from 
 /etc/ceph/client.admin.keyring
 2012-05-30 09:05:35.255631 7fd1e9cfa760 -1 monclient(hunting): failed to open 
 keyring: (2) No such file or directory
 2012-05-30 09:05:35.255693 7fd1e9cfa760 -1 ceph_tool_common_init failed.
Just run:
mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/client.admin.keyring

and it will create the admin key for you.
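
If mkcephfs was already run with a different -k path, pointing the ceph tool
at the existing keyring should also work; for example (the path here is only
illustrative):

ceph -c /etc/ceph/ceph.conf -k /srv/ceph.keyring -w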

Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: building test cluster : missing /etc/ceph/client.admin.keyring, need help

2012-05-30 Thread Alexandre DERUMIER
OK, thanks.

I had created the cluster, following the official doc
http://ceph.com/docs/master/config-cluster/deploying-ceph-with-mkcephfs/
with

mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring

and file was created in /srv

# cat /srv/ceph.keyring 
[client.admin]
key = AQCQwcVPGIAwHhAAuS5Veg7GoOyzh59zq2TKag==


Is it an error in the documentation?


- Mail original - 

De: Stefan Priebe - Profihost AG s.pri...@profihost.ag 
À: Alexandre DERUMIER aderum...@odiso.com 
Cc: ceph-devel@vger.kernel.org 
Envoyé: Mercredi 30 Mai 2012 09:25:56 
Objet: Re: building test cluster : missing /etc/ceph/client.admin.keyring, need 
help 

Am 30.05.2012 09:20, schrieb Alexandre DERUMIER: 
 
 Hi, 
 I'm building my rados test cluster, 
 
 
 3 servers,with on each server : 1 mon - 5 osd 
 
 mon daemon and osd are started, but when i use ceph command, it's missing 
 client.admin.keyring 
 
 root@cephtest1:/etc/ceph# ceph -w 
 2012-05-30 09:05:35.255619 7fd1e9cfa760 -1 auth: failed to open keyring from 
 /etc/ceph/client.admin.keyring 
 2012-05-30 09:05:35.255631 7fd1e9cfa760 -1 monclient(hunting): failed to open 
 keyring: (2) No such file or directory 
 2012-05-30 09:05:35.255693 7fd1e9cfa760 -1 ceph_tool_common_init failed. 
Just run: 
mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/client.admin.keyring 

and it will create the admin key for you. 

Stefan 



-- 

-- 




Alexandre D erumier 
Ingénieur Système 
Fixe : 03 20 68 88 90 
Fax : 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: building test cluster : missing /etc/ceph/client.admin.keyring, need help

2012-05-30 Thread Alexandre DERUMIER
root@cephtest1:/srv# cp /srv/ceph.keyring /etc/ceph/client.admin.keyring
root@cephtest1:/srv# ceph -w
2012-05-30 09:26:40.336175    pg v572: 2880 pgs: 2880 active+clean; 0 bytes 
data, 544 MB used, 2039 GB / 2039 GB avail
2012-05-30 09:26:40.342175   mds e1: 0/0/1 up
2012-05-30 09:26:40.342207   osd e17: 15 osds: 15 up, 15 in
2012-05-30 09:26:40.342331   log 2012-05-30 09:06:35.419340 osd.9 
10.3.94.28:6812/13794 260 : [INF] 2.3bb scrub ok
2012-05-30 09:26:40.342424   mon e1: 3 mons at 
{a=10.3.94.27:6789/0,b=10.3.94.28:6789/0,c=10.3.94.29:6789/0}

Ok, the fun will begin now :)


- Mail original - 

De: Alexandre DERUMIER aderum...@odiso.com 
À: Stefan Priebe - Profihost AG s.pri...@profihost.ag 
Cc: ceph-devel@vger.kernel.org 
Envoyé: Mercredi 30 Mai 2012 09:33:40 
Objet: Re: building test cluster : missing /etc/ceph/client.admin.keyring, need 
help 

ok ,thanks 

I had created the cluster, following the official doc 
http://ceph.com/docs/master/config-cluster/deploying-ceph-with-mkcephfs/ 
with 

mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring 

and file was created in /srv 

# cat /srv/ceph.keyring 
[client.admin] 
key = AQCQwcVPGIAwHhAAuS5Veg7GoOyzh59zq2TKag== 


is it an error in documentation ? 


- Mail original - 

De: Stefan Priebe - Profihost AG s.pri...@profihost.ag 
À: Alexandre DERUMIER aderum...@odiso.com 
Cc: ceph-devel@vger.kernel.org 
Envoyé: Mercredi 30 Mai 2012 09:25:56 
Objet: Re: building test cluster : missing /etc/ceph/client.admin.keyring, need 
help 

Am 30.05.2012 09:20, schrieb Alexandre DERUMIER: 
 
 Hi, 
 I'm building my rados test cluster, 
 
 
 3 servers,with on each server : 1 mon - 5 osd 
 
 mon daemon and osd are started, but when i use ceph command, it's missing 
 client.admin.keyring 
 
 root@cephtest1:/etc/ceph# ceph -w 
 2012-05-30 09:05:35.255619 7fd1e9cfa760 -1 auth: failed to open keyring from 
 /etc/ceph/client.admin.keyring 
 2012-05-30 09:05:35.255631 7fd1e9cfa760 -1 monclient(hunting): failed to open 
 keyring: (2) No such file or directory 
 2012-05-30 09:05:35.255693 7fd1e9cfa760 -1 ceph_tool_common_init failed. 
Just run: 
mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/client.admin.keyring 

and it will create the admin key for you. 

Stefan 



-- 

-- 




Alexandre D erumier 
Ingénieur Système 
Fixe : 03 20 68 88 90 
Fax : 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 

-- 
To unsubscribe from this list: send the line unsubscribe ceph-devel in 
the body of a message to majord...@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 



-- 

-- 




Alexandre D erumier 
Ingénieur Système 
Fixe : 03 20 68 88 90 
Fax : 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


rbd command : is it possible to pass authkey in argument ?

2012-05-30 Thread Alexandre DERUMIER
Hi,

I'm writing rbd module for proxmox kvm distribution,

Is it possible to pass the auth key as an argument on the rbd command line? (I
can do it with the qemu-rbd drive option.)

Or does it need to use a keyring file ?

Regards,

Alexandre

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Stefan Priebe - Profihost AG
Am 30.05.2012 09:19, schrieb Stefan Majer:
 Hi,
 
 ok, so your replication level is 2 and you have 2*1GB/sec right ?
Generally yes - but for this new test it was just 1*1GB/s (see below).

 do you have a iostat -x 3 output and or a dstat from all effected
 machines during your rados bench runs as well ?

As the output looks exactly the same on all OSDs here is it from ONE osd:

Kernel 3.4:
http://pastebin.com/raw.php?i=sV9vKsWy

Kernel 3.0:
http://pastebin.com/raw.php?i=eafjpPpK

Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problems when benchmarking Ceph

2012-05-30 Thread Nam Dang
Hi,

I just figured out one problem in my benchmark: all the concurrent
threads use the same file layer provided by the OS, so this can probably
become a bottleneck when the number of threads increases.
I wonder if I can connect directly to the MDS and access the
underlying file system through some library? Sorry for my inexperience,
but I haven't found any mention of IO operations for files in the
API. Did I miss something?

Best regards,

Nam Dang
Email: n...@de.cs.titech.ac.jp
Tokyo Institute of Technology
Tokyo, Japan


On Wed, May 30, 2012 at 7:28 PM, Nam Dang n...@de.cs.titech.ac.jp wrote:
 Dear all,

 I am using Ceph as a baseline for Hadoop. In Hadoop there is a
 NNThroughputBenchmark, which tries to test the upper limit of the
 namenode (a.k.a MDS in Ceph).
 This NNThroughputBenchmark basically creates a master node, and
 creates many threads that sends requests to the master node as
 possible. This approach minimizes communication overhead when
 employing actual clients. The code can be found here:
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.19/src/test/org/apache/hadoop/hdfs/NNThroughputBenchmark.java

 So I'm testing Ceph in similar manner:
 - Mount Ceph to a folder
 - Create many threads and send requests to the MDS on the MDS (so no
 network communication)
 - I do not write any data to the files: just mere file creation.

 However, I notice very poor performance on Ceph (only about 485
 ops/sec, as opposed to 8000 ops/sec in Hadoop), and I'm not sure why. I
 also noticed that when I tried to remove the folder created by the
 interrupted test of the benchmark mentioned above, it took so long I
 had to Ctrl+Break out of the rm program. I'm thinking that the reason
 could be that I'm using Java IO instead of Ceph direct data
 manipulation code. Also, I didn't write any data, so there shouldn't be
 any overhead of communicating with the OSDs (or is my assumption
 wrong?)

 So do you have any idea on this?

 My configuration at the moment:
 - Ceph 0.47.1
 - Intel Xeon 5 2.4Ghz, 4x2 cores
 - 24GB of RAM
 - One node for Monitor, One for MDS, 5 for OSD (of the same configuration)
 - I mount Ceph to a folder on the MDS and run the simulation on that
 folder (creating, opening, deleting files) - Right now I'm just
 working on creating files so I haven't tested with others.

 And I'm wondering if there is any way I can use the API to manipulate
 the file system directly instead of mounting through the OS and using
 the OS's basic file manipulation layer.
 I checked the API doc at http://ceph.com/docs/master/api/librados/ and
 it appears that there is no clear way of accessing Ceph's file
 system directly, only the object-based storage system.

 Thank you very much for your help!

 Below is the configuration of my Ceph installation:

 ; disable authentication
 [mon]
        mon data = /home/namd/ceph/mon

 [osd]
        osd data = /home/namd/ceph/osd
        osd journal = /home/namd/ceph/osd.journal
        osd journal size = 1000
        osd min rep = 3
        osd max rep = 3
        ; the following line is for ext4 partition
        filestore xattr use omap = true

 [mds.1]
        host=sakura09

 [mds.2]
        host=sakura10

 [mds.3]
        host=sakura11

 [mds.4]
        host=sakura12

 [mon.0]
        host=sakura08
        mon addr=192.168.172.178:6789

 [osd.0]
        host=sakura13

 [osd.1]
        host=sakura14

 [osd.2]
        host=sakura15

 [osd.3]
        host=sakura16

 [osd.4]
        host=sakura17

 [osd.5]
        host=sakura18



 Best regards,

 Nam Dang
 Email: n...@de.cs.titech.ac.jp
 Tokyo Institute of Technology
 Tokyo, Japan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Stefan Priebe - Profihost AG
Am 30.05.2012 13:20, schrieb Stefan Majer:
 H,
 
 
 On Wed, May 30, 2012 at 1:04 PM, Stefan Priebe - Profihost AG
 s.pri...@profihost.ag mailto:s.pri...@profihost.ag wrote:
 
 Am 30.05.2012 09:19, schrieb Stefan Majer:
  Hi,
 
  ok, so your replication level is 2 and you have 2*1GB/sec right ?
 Generally yes - but for this new test it was just 1*1GB/s (see below).
 
  do you have a iostat -x 3 output and or a dstat from all effected
  machines during your rados bench runs as well ?
 
 As the output looks exactly the same on all OSDs here is it from ONE
 osd:
 
 Kernel 3.4:
 http://pastebin.com/raw.php?i=sV9vKsWy
 
 This is strange, looks like a real regression in 3.4? But I guess it is
 only possible to track this down by doing
 git bisect on the kernel sources :-(
I also tried 3.3 and 3.2; it's the same... (haven't tested 3.1).

 Kernel 3.0:
 http://pastebin.com/raw.php?i=eafjpPpK
 
 
 Here you can see a constant rate to disk of ~50-70 MByte/s with
 about 10-15% utilization on them. So this test is not disk bound; I
 guess your network is saturated. Can you run dstat during this test as
 well, to see the network bandwidth used?
Absolutely correct, I'm aware of this. I just want to have this result
with 3.4 so that I can use btrfs.

Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd command : is it possible to pass authkey in argument ?

2012-05-30 Thread Wido den Hollander

Hi,

On 05/30/2012 11:15 AM, Alexandre DERUMIER wrote:

Hi,

I'm writing rbd module for proxmox kvm distribution,

Is it possible to pass authkey as argument in rbd command line ? (I can do in 
with qemu-rbd drive option)


I'm not sure yet; have you tried the --key argument?

I tried and that failed, and looking through the source code it seems 
that it isn't possible yet.


Adding it seems rather simple however, so it could be added.

Wido



Or does it need to use a keyring file ?

Regards,

Alexandre

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Mark Nelson

On 05/30/2012 01:33 AM, Stefan Priebe - Profihost AG wrote:

I setup some tests today to try to replicate your findings (and also
check results against some previous ones I've done).  I don't think I'm
seeing exactly the same results as you, but I definitely see xfs
performing worse in this specific test than btrfs.  I've included the
results here.

Full results are available here:
http://nhm.ceph.com/results/mailinglist-tests/

But these tests show exactly the same bad behaviour I'm seeing. Instead
of having a constant sequential write rate, you have heavily jumping
values. Are you able to test with XFS and 3.0.32? You'll then probably
see an absolutely constant write rate.

Greets,
Stefan


The jumping around is due to the writes to the underlying OSD disk not 
being able to keep up with the journal.  I think it's more a symptom of 
the problem rather than the problem itself.  Presumably the OSD data 
disk is performing slowly because of the number of seeks that are 
happening (In my tests almost always between 40-60 on XFS, and growing 
over time on btrfs).  It's entirely possible that something changed 
going from 3.0 to 3.4 that is causing the seek behavior to be worse.  
I'll try the test again on a 3.0 kernel and record seekwatcher results 
to see if the write patterns look any different.


Btw, I apologize if you mentioned this already, but are you running MONs 
on the OSD nodes?  Also, what version of glibc do you have?


Thanks,
Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Stefan Priebe - Profihost AG
Hi Mark,

I didn't have the time to answer your mails, but I will get to this one first.

 Would you mind installing blktrace and running blktrace -o test-3.4 -d
 /dev/sdb on the OSD node during a short (say 60s) test on 3.4?
sure no problem.

here it is:
http://www.mediafire.com/?6cw87btn7mzco25

Output:
=== sdb ===
  CPU  0:18075 events,  848 KiB data
  CPU  1:10738 events,  504 KiB data
  CPU  2: 8639 events,  405 KiB data
  CPU  3: 8614 events,  404 KiB data
  CPU  4:0 events,0 KiB data
  CPU  5:0 events,0 KiB data
  CPU  6:  143 events,7 KiB data
  CPU  7:0 events,0 KiB data
  Total: 46209 events (dropped 0), 2167 KiB data

 If you could archive/send me the results, that might help us get an idea
 of what is actually getting sent out to the disk.  Your data disk
 throughput on 3.0 looks pretty close to what I normally get (including
 on 3.4).  I'm guessing the issue you are seeing on 3.4 is probably not
 the seek problem I mentioned earlier (unless something is causing so
 many seeks that it more or less paralyzes the disk).
As I have an SSD, I can't believe seeks can be a problem.

Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Mark Nelson

On 5/30/12 7:41 AM, Stefan Priebe - Profihost AG wrote:

Hi Mark,

didn't had the time to answer your mails - but i will get on this one first.


Would you mind installing blktrace and running blktrace -o test-3.4 -d
/dev/sdb on the OSD node during a short (say 60s) test on 3.4?

sure no problem.

here it is:
http://www.mediafire.com/?6cw87btn7mzco25

Output:
=== sdb ===
   CPU  0:18075 events,  848 KiB data
   CPU  1:10738 events,  504 KiB data
   CPU  2: 8639 events,  405 KiB data
   CPU  3: 8614 events,  404 KiB data
   CPU  4:0 events,0 KiB data
   CPU  5:0 events,0 KiB data
   CPU  6:  143 events,7 KiB data
   CPU  7:0 events,0 KiB data
   Total: 46209 events (dropped 0), 2167 KiB data


Great, thanks.  I'll try to look at the results later this morning.  If 
you want to look at them yourself you can open them with the blkparse 
program (and seekwatcher too, though there is a bug in the src you have 
to fix to make it work right)
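
For reference, the basic invocations would be something like the following
(from memory, so double-check against the man pages; the output name is just
an example):

blkparse -i test-3.4 | less              # text dump of test-3.4.blktrace.*
seekwatcher -t test-3.4 -o test-3.4.png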



If you could archive/send me the results, that might help us get an idea
of what is actually getting sent out to the disk.  Your data disk
throughput on 3.0 looks pretty close to what I normally get (including
on 3.4).  I'm guessing the issue you are seeing on 3.4 is probably not
the seek problem I mentioned earlier (unless something is causing so
many seeks that it more or less paralyzes the disk).

As i have a SSD i can't believe seeks can be a problem.


Ah, sorry, I forgot you were on SSD. Honestly I'm surprised that with 
3.0 you weren't getting better performance. Something to look into once 
we figure out why your 3.4 performance is so bad!

Stefan


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Stefan Priebe - Profihost AG
Am 30.05.2012 15:27, schrieb Mark Nelson:
 Great, thanks.  I'll try to look at the results later this morning.  If
 you want to look at them yourself you can open them with the blkparse
 program (and seekwatcher too, though there is a bug in the src you have
 to fix to make it work right)
I've no idea about blkparse and seekwatcher, so I don't know what I
should do with the output...

 If you could archive/send me the results, that might help us get an idea
 of what is actually getting sent out to the disk.  Your data disk
 throughput on 3.0 looks pretty close to what I normally get (including
 on 3.4).  I'm guessing the issue you are seeing on 3.4 is probably not
 the seek problem I mentioned earlier (unless something is causing so
 many seeks that it more or less paralyzes the disk).
 As i have a SSD i can't believe seeks can be a problem.
 
 Ah, sorry. I  forgot you were on SSD.  Honestly I'm surpised that with
 3.0 you weren't getting better performance.  Something to look into once
 we figure out why your 3.4 performance is so bad!
Yes i think this is another problem.

Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Stefan Priebe - Profihost AG
Am 30.05.2012 15:38, schrieb Stefan Majer:
 There is a small howto from linus:
 http://kerneltrap.org/node/11753
 
 you basically need to be able to compile the kernel from source and
 start in the freshly checked out source 
 git bisect good v3.0
 git bisect bad v3.2
  
 Then git will pick a version in between and you can compile this, deploy it
 to your machine and check whether it is good or bad.
 Then tell git whether it was bad or good, and git will again choose a version
 between the two. So you will end up with a single commit or a handful of
 commits which are probably the cause of the problem.
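
A sketch of that workflow with the kernel versions mentioned in this thread
(the build/install/boot step is whatever you normally use):

cd linux
git bisect start
git bisect bad v3.2        # first version known to show the problem
git bisect good v3.0       # last version known to be good
# build and boot the commit git checks out, rerun the rados bench test, then:
git bisect good            # or: git bisect bad
# repeat until git prints the first bad commit, then:
git bisect reset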
Thanks will try that after mark has looked into the blktrace ;-)

Thanks,
Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RBD operations, pinging client that serves lingering tid

2012-05-30 Thread Guido Winkelmann
Hi,

Whenever I'm doing any operations on rbd volumes (like import, copy) using the 
rbd command line client, I'm getting these messages every couple of seconds:

2012-05-30 15:53:08.010326 7f027aa47700  0 client.4159.objecter  pinging osd 
that serves lingering tid 1 (osd.2)
2012-05-30 15:53:08.010344 7f027aa47700  0 client.4159.objecter  pinging osd 
that serves lingering tid 2 (osd.0)

What does this mean? Is that anything to worry about?

Yesterday, these messages were only mentioning osd.2, not osd.0...

Guido
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Mark Nelson

On 5/30/12 7:41 AM, Stefan Priebe - Profihost AG wrote:

Hi Mark,

didn't had the time to answer your mails - but i will get on this one first.


Would you mind installing blktrace and running blktrace -o test-3.4 -d
/dev/sdb on the OSD node during a short (say 60s) test on 3.4?

sure no problem.

here it is:
http://www.mediafire.com/?6cw87btn7mzco25

Output:
=== sdb ===
   CPU  0:18075 events,  848 KiB data
   CPU  1:10738 events,  504 KiB data
   CPU  2: 8639 events,  405 KiB data
   CPU  3: 8614 events,  404 KiB data
   CPU  4:0 events,0 KiB data
   CPU  5:0 events,0 KiB data
   CPU  6:  143 events,7 KiB data
   CPU  7:0 events,0 KiB data
   Total: 46209 events (dropped 0), 2167 KiB data


If you could archive/send me the results, that might help us get an idea
of what is actually getting sent out to the disk.  Your data disk
throughput on 3.0 looks pretty close to what I normally get (including
on 3.4).  I'm guessing the issue you are seeing on 3.4 is probably not
the seek problem I mentioned earlier (unless something is causing so
many seeks that it more or less paralyzes the disk).

As i have a SSD i can't believe seeks can be a problem.

Stefan

Ok, I put up a seekwatcher movie showing the writes going to your SSD:

http://nhm.ceph.com/movies/mailinglist-tests/stefan.mpg

Some quick observations:

In your blktrace results there are some really big gaps after cfq 
schedule dispatch:



  8,16   0        0    11.386025866     0  m   N cfq schedule dispatch
  8,16   2      975    12.393446988  3074  A  WS 176147976 + 8 <- (8,17) 176145928

  8,16   0        0    12.762164080     0  m   N cfq schedule dispatch
  8,16   0     2193    13.355165118  3312  A WSM 175875008 + 227 <- (8,17) 175872960


Specifically, the gap in the movie where there is no write activity 
around second 30 correlates in the blktrace results with one of these 
stalls:

  8,16   0        0    29.548567957     0  m   N cfq schedule dispatch
  8,16   2     2185    34.548923918  2688  A   W 2192 + 8 <- (8,17) 144


As to why this is happening, I don't know yet.  I'll have more later.

Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problems when benchmarking Ceph

2012-05-30 Thread Gregory Farnum
The library you're looking for is libceph. It does exist and it's fairly 
full-featured, but it's not nearly as well documented as the librados C api is. 
However, you'll probably get more use out of one of the Hadoop bindings. If you 
check out the git repository you'll find one set in src/client/hadoop, with 
limited instructions, that I believe currently apply to the .20/1.0.x branch. 
Or you might look at Noah's ongoing work on a much cleaner set of bindings at 
https://github.com/noahdesu/ceph/tree/wip-java-cephfs

I dunno about the MDS performance, though; there are too many possibilities 
with that. Maybe try out these options and see how they do first?
-Greg


On Wednesday, May 30, 2012 at 4:15 AM, Nam Dang wrote:

 Hi,
 
 i just figured out one problem in my benchmark: all the concurrent
 threads use the same file layer provided by the OS - so probably this
 can be a bottleneck when the number of threads increases.
 I wonder if I can connect directly to the MDS and access the the
 underlying file system through some library? Sorry for my inexperience
 but I haven't found any mentioning of IO operations for files in the
 API. Did I miss something?
 
 Best regards,
 
 Nam Dang
 Email: n...@de.cs.titech.ac.jp (mailto:n...@de.cs.titech.ac.jp)
 Tokyo Institute of Technology
 Tokyo, Japan
 
 
 On Wed, May 30, 2012 at 7:28 PM, Nam Dang n...@de.cs.titech.ac.jp 
 (mailto:n...@de.cs.titech.ac.jp) wrote:
  Dear all,
  
  I am using Ceph as a baseline for Hadoop. In Hadoop there is a
  NNThroughputBenchmark, which tries to test the upper limit of the
  namenode (a.k.a MDS in Ceph).
  This NNThroughputBenchmark basically creates a master node, and
  creates many threads that sends requests to the master node as
  possible. This approach minimizes communication overhead when
  employing actual clients. The code can be found here:
  https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.19/src/test/org/apache/hadoop/hdfs/NNThroughputBenchmark.java
  
  So I'm testing Ceph in similar manner:
  - Mount Ceph to a folder
  - Create many threads and send requests to the MDS on the MDS (so no
  network communication)
  - I do not write any data to the files: just mere file creation.
  
  However, I notice very poor performance on Ceph (only about 485
  ops/sec, as oppose to 8000ops/sec in Hadoop, and I'm not sure why. I
  also notice that when I tried to remove the folder created by the
  interrupted test of the benchmark mentioned above, I took too long I
  had to Ctrl+Break out of the rm program. I'm thinking that the reason
  could be that I'm using Java IO instead of Ceph direct data
  manipulation code. Also, I didn't write any data so there shouldn't be
  any overhead of communicating with the OSDs (or is my assumption
  wrong?)
  
  So do you have any idea on this?
  
  My configuration at the moment:
  - Ceph 0.47.1
  - Intel Xeon 5 2.4Ghz, 4x2 cores
  - 24GB of RAM
  - One node for Monitor, One for MDS, 5 for OSD (of the same configuration)
  - I mount Ceph to a folder on the MDS and run the simulation on that
  folder (creating, opening, deleting files) - Right now I'm just
  working on creating files so I haven't tested with others.
  
  And I'm wondering if there is anyway I can use the API to manipulate
  the file system directly instead of mounting through the OS and use
  the OS's basic file manipulation layer.
  I checked the API doc at http://ceph.com/docs/master/api/librados/ and
  it appears that there is no clear way of accessing the Ceph's file
  system directly, only object-based storage system.
  
  Thank you very much for your help!
  
  Below is the configuration of my Ceph installation:
  
  ; disable authentication
  [mon]
  mon data = /home/namd/ceph/mon
  
  [osd]
  osd data = /home/namd/ceph/osd
  osd journal = /home/namd/ceph/osd.journal
  osd journal size = 1000
  osd min rep = 3
  osd max rep = 3
  ; the following line is for ext4 partition
  filestore xattr use omap = true
  
  [mds.1]
  host=sakura09
  
  [mds.2]
  host=sakura10
  
  [mds.3]
  host=sakura11
  
  [mds.4]
  host=sakura12
  
  [mon.0]
  host=sakura08
  mon addr=192.168.172.178:6789
  
  [osd.0]
  host=sakura13
  
  [osd.1]
  host=sakura14
  
  [osd.2]
  host=sakura15
  
  [osd.3]
  host=sakura16
  
  [osd.4]
  host=sakura17
  
  [osd.5]
  host=sakura18
  
  
  
  Best regards,
  
  Nam Dang
  Email: n...@de.cs.titech.ac.jp (mailto:n...@de.cs.titech.ac.jp)
  Tokyo Institute of Technology
  Tokyo, Japan
 
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org 
 (mailto:majord...@vger.kernel.org)
 More majordomo info at http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Mark Nelson

On 05/30/2012 09:53 AM, Stefan Priebe wrote:

Am 30.05.2012 16:49, schrieb Mark Nelson:

On 05/30/2012 08:38 AM, Stefan Majer wrote:

No i dont think so either, this was just a example. Maybe it is totaly
different.


You could try setting up a pool with a replication level of 1 and see
how that does. It will be faster in any event, but it would be
interesting to see how much faster.

is there an easier way than modifying the crush map?

PS: i also tested noop scheduler - same result.

Stefan


something like:

ceph osd pool create POOL [pg_num [pgp_num]]

then:

ceph osd pool set POOL size VALUE
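
For example (pool name, PG count and bench parameters are only illustrative):

ceph osd pool create benchpool 128
ceph osd pool set benchpool size 1
rados -p benchpool bench 60 write -t 16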


Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd command : is it possible to pass authkey in argument ?

2012-05-30 Thread Sage Weil
On Wed, 30 May 2012, Wido den Hollander wrote:
 Hi,
 
 On 05/30/2012 11:15 AM, Alexandre DERUMIER wrote:
  Hi,
  
  I'm writing rbd module for proxmox kvm distribution,
  
  Is it possible to pass authkey as argument in rbd command line ? (I can do
  in with qemu-rbd drive option)
 
 I'm not sure yet, have you tried with they argument --key ?

Yep, --key is what you want.

sage
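
For example, assuming your build accepts it (Wido reports above that his did
not; the key is the test-cluster key posted earlier today, purely
illustrative):

rbd --key AQCQwcVPGIAwHhAAuS5Veg7GoOyzh59zq2TKag== ls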

 
 I tried and that failed and looking through the source-code it seems that it
 isn't possible yet.
 
 Adding it seems rather simple however, so it could be added.
 
 Wido
 
  
  Or does it need to use a keyring file ?
  
  Regards,
  
  Alexandre
  
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph.spec does not list libuuid as build time dependency

2012-05-30 Thread Gregory Farnum
Thanks for the bug report. I created a tracker entry for it: 
http://tracker.newdream.net/issues/2484
I don't imagine it will take long for somebody who knows how to handle a .spec 
to fix. ;)
-Greg
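
For reference, the fix is probably a one-line addition to ceph.spec along the
lines of "BuildRequires: libuuid-devel" (assuming the Fedora package name
Guido mentions below).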


On Friday, May 25, 2012 at 6:33 AM, Guido Winkelmann wrote:

 Hi,
 
 Building ceph 0.47.2 from the included ceph.spec file fails:
 
 checking for uuid_parse in -luuid... no
 configure: error: in `/home/guido/rpmbuild/BUILD/ceph-0.47.2':
 configure: error: libuuid not found
 See `config.log' for more details
 error: Bad exit status from /var/tmp/rpm-tmp.SdK9Ms (%build)
 
 
 RPM build errors:
 Bad exit status from /var/tmp/rpm-tmp.SdK9Ms (%build)
 
 It works after installing libuuid-devel. Maybe that package ought to be 
 listed 
 as a dependency in ceph.spec.
 
 Regards,
 Guido
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org 
 (mailto:majord...@vger.kernel.org)
 More majordomo info at http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: distributed cluster

2012-05-30 Thread Tommi Virtanen
On Wed, May 30, 2012 at 4:47 AM, Jerker Nyberg jer...@update.uu.se wrote:
 I am waiting as fast as I can for it to be production ready. :-)

I feel like starting a quote of the week collection ;)

One more thing I remember is worth mentioning: Ceph doesn't place
objects near you, CRUSH is completely deterministic based on the
object name. Hence, your worst case may actually look like this:

sites: west, east
servers: a,b in west; c,d in east
client: x in west

Write from client, with bad luck, will go
x->d, replication: d->a, d->b

Now you've used 3x bandwidth on the WAN.


Currently, the only way to work around this is with pools, and there's
nothing automatic about that.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fwd: OSD per disk.

2012-05-30 Thread Tommi Virtanen
For the benefit of the mailing list: there's an assert(nlock > 0)
crash in there.

-- Forwarded message --
From: chandrashekhar mekala chandub...@gmail.com
Date: Tue, May 29, 2012 at 9:51 PM
Subject: Re: OSD per disk.
To: Tommi Virtanen t...@inktank.com


Yes , I am using mkcephfs to create  cluster.

osd.1.log
-
2012-05-30 10:17:58.948426 7f98b4856780 journal close /data/osd.1/osd.1.journal
./common/Mutex.h: In function 'void Mutex::Unlock()' thread
7f98b4856780 time 2012-05-30 10:17:58.948800
./common/Mutex.h: 117: FAILED assert(nlock > 0)
 ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
 1: (Mutex::Unlock()+0x93) [0x4f5c73]
 2: (OSD::init()+0x459) [0x546b79]
 3: (main()+0x2496) [0x4a8bc6]
 4: (__libc_start_main()+0xed) [0x7f98b2d2676d]
 5: /usr/bin/ceph-osd() [0x4aa68d]
 ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
 1: (Mutex::Unlock()+0x93) [0x4f5c73]
 2: (OSD::init()+0x459) [0x546b79]
 3: (main()+0x2496) [0x4a8bc6]
 4: (__libc_start_main()+0xed) [0x7f98b2d2676d]
 5: /usr/bin/ceph-osd() [0x4aa68d]
*** Caught signal (Aborted) **
 in thread 7f98b4856780
 ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
 1: /usr/bin/ceph-osd() [0x5fb1d6]
 2: (()+0xfcb0) [0x7f98b3ef2cb0]
 3: (gsignal()+0x35) [0x7f98b2d3b445]
 4: (abort()+0x17b) [0x7f98b2d3ebab]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f98b368969d]
 6: (()+0xb5846) [0x7f98b3687846]
 7: (()+0xb5873) [0x7f98b3687873]
 8: (()+0xb596e) [0x7f98b368796e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x200) [0x5ce010]
 10: (Mutex::Unlock()+0x93) [0x4f5c73]
 11: (OSD::init()+0x459) [0x546b79]
 12: (main()+0x2496) [0x4a8bc6]
 13: (__libc_start_main()+0xed) [0x7f98b2d2676d]
 14: /usr/bin/ceph-osd() [0x4aa68d]




On Tue, May 29, 2012 at 9:56 PM, Tommi Virtanen t...@inktank.com wrote:

 On Mon, May 28, 2012 at 2:25 AM, chandrashekhar chandub...@gmail.com wrote:
  Thanks Alexandre,  I created four directories in /data  
  (osd0,osd1,osd2,osd3)
  and mounted as below:
 
  /dev/sdb1 - /data/osd1
  /dev/sdc1 - /data/osd2
  /dev/sdd1 - /data/osd3
 
 
  But when I start ceph its starting mons and md daemons but not osds. Please 
  help
  me to get this working.

 How did you create the cluster? mkcephfs?

 Do you see log entries in /var/log/ceph/*osd*.log ? What do they say?
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD deadlock with cephfs client and OSD on same machine

2012-05-30 Thread Tommi Virtanen
On Tue, May 29, 2012 at 11:59 PM, Amon Ott a@m-privacy.de wrote:
 AFAIR, when the deadlocks came, there were some GB of the 12 GB RAM still
 unused, not even for caching. But it might be a problem with low memory,
 because we are running with 32 Bit.

 Would it be possible to preallocate a significant amount of RAM for the
 purpose of syncing? I would not mind reserving a few 100 MB for that, but
 deadlocks must not happen in any case. Can the size of the journal give a
 hint on how much is needed?

The code and complexity overhead of managing that reserved buffer has so
far prevented that approach from being really adopted, anywhere in the
Linux kernel community, as far as I know.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD per disk.

2012-05-30 Thread Sam Just
That bug was fixed in f859f25d7eaec01fb6a82b88409d005c8177fea2.

It actually appears to mean that the osd failed to authenticate.
-Sam

On Wed, May 30, 2012 at 9:53 AM, Tommi Virtanen t...@inktank.com wrote:
 For the benefit of the mailing list: there's an assert(nlock > 0)
 crash in there.

 -- Forwarded message --
 From: chandrashekhar mekala chandub...@gmail.com
 Date: Tue, May 29, 2012 at 9:51 PM
 Subject: Re: OSD per disk.
 To: Tommi Virtanen t...@inktank.com


 Yes , I am using mkcephfs to create  cluster.

 osd.1.log
 -
 2012-05-30 10:17:58.948426 7f98b4856780 journal close 
 /data/osd.1/osd.1.journal
 ./common/Mutex.h: In function 'void Mutex::Unlock()' thread
 7f98b4856780 time 2012-05-30 10:17:58.948800
 ./common/Mutex.h: 117: FAILED assert(nlock > 0)
  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
  1: (Mutex::Unlock()+0x93) [0x4f5c73]
  2: (OSD::init()+0x459) [0x546b79]
  3: (main()+0x2496) [0x4a8bc6]
  4: (__libc_start_main()+0xed) [0x7f98b2d2676d]
  5: /usr/bin/ceph-osd() [0x4aa68d]
  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
  1: (Mutex::Unlock()+0x93) [0x4f5c73]
  2: (OSD::init()+0x459) [0x546b79]
  3: (main()+0x2496) [0x4a8bc6]
  4: (__libc_start_main()+0xed) [0x7f98b2d2676d]
  5: /usr/bin/ceph-osd() [0x4aa68d]
 *** Caught signal (Aborted) **
  in thread 7f98b4856780
  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
  1: /usr/bin/ceph-osd() [0x5fb1d6]
  2: (()+0xfcb0) [0x7f98b3ef2cb0]
  3: (gsignal()+0x35) [0x7f98b2d3b445]
  4: (abort()+0x17b) [0x7f98b2d3ebab]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f98b368969d]
  6: (()+0xb5846) [0x7f98b3687846]
  7: (()+0xb5873) [0x7f98b3687873]
  8: (()+0xb596e) [0x7f98b368796e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x200) [0x5ce010]
  10: (Mutex::Unlock()+0x93) [0x4f5c73]
  11: (OSD::init()+0x459) [0x546b79]
  12: (main()+0x2496) [0x4a8bc6]
  13: (__libc_start_main()+0xed) [0x7f98b2d2676d]
  14: /usr/bin/ceph-osd() [0x4aa68d]




 On Tue, May 29, 2012 at 9:56 PM, Tommi Virtanen t...@inktank.com wrote:

 On Mon, May 28, 2012 at 2:25 AM, chandrashekhar chandub...@gmail.com wrote:
  Thanks Alexandre,  I created four directories in /data  
  (osd0,osd1,osd2,osd3)
  and mounted as below:
 
  /dev/sdb1 - /data/osd1
  /dev/sdc1 - /data/osd2
  /dev/sdd1 - /data/osd3
 
 
  But when I start ceph its starting mons and md daemons but not osds. 
  Please help
  me to get this working.

 How did you create the cluster? mkcephfs?

 Do you see log entries in /var/log/ceph/*osd*.log ? What do they say?
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Kernel crash

2012-05-30 Thread Guido Winkelmann
Hi,

I just saw a kernel crash on one of my machines. It had the cephfs from the 
ceph cluster mounted using the in-kernel client:

[522247.751071]  [814d3383] ? release_sock+0xe3/0x110
[522247.751182]  [815ea438] __bad_area_nosemaphore+0x1d1/0x1f0
[522247.751290]  [815ea46a] bad_area_nosemaphore+0x13/0x15
[522247.751397]  [815f7b76] do_page_fault+0x416/0x4f0
[522247.751503]  [814ce5dd] ? sock_recvmsg+0x11d/0x140
[522247.751611]  [812c0ea6] ? cpumask_next_and+0x36/0x50
[522247.751718]  [815f4475] page_fault+0x25/0x30
[522247.751828]  [a03ccba4] ? ceph_x_destroy_authorizer+0x14/0x40 
[libceph]
[522247.751995]  [a040f9be] get_authorizer+0x6e/0x140 [ceph]
[522247.752104]  [814ce646] ? kernel_recvmsg+0x46/0x60
[522247.752213]  [a03b969a] prepare_write_connect+0x17a/0x270 
[libceph]
[522247.752378]  [a03bba75] con_work+0x755/0x2c40 [libceph]
[522247.752486]  [810876a3] ? update_rq_clock+0x43/0x1b0
[522247.752598]  [a03bb320] ? ceph_msg_new+0x2d0/0x2d0 [libceph]
[522247.752707]  [810747ae] process_one_work+0x11e/0x470
[522247.752815]  [810755bf] worker_thread+0x15f/0x360
[522247.752925]  [81075460] ? manage_workers+0x230/0x230
[522247.753032]  [81079da3] kthread+0x93/0xa0
[522247.753137]  [815fd2a4] kernel_thread_helper+0x4/0x10
[522247.753245]  [81079d10] ? 
kthread_freezable_should_stop+0x70/0x70
[522247.753355]  [815fd2a0] ? gs_change+0x13/0x13
[522247.753459] ---[ end trace b9ba686594d99f89 ]---

These lines are all that I could still read on the screen. (Good thing there
are Open Source OCR programs out there...) I do not know how to extract more 
information about that crash (scrolling up does not work), but I'm leaving the 
machine like that overnight in case someone can tell me.

Kernel version was 3.3.6-3.fc16.x86_64, Ceph cluster is version 0.47.2. The 
crash happened after I issued an rbd command. Another thing that might be 
related is that I stopped and restarted the entire cluster twice since 
mounting the cephfs. The first time, I disabled cephx, the second time I 
enabled it again.

Regards,
Guido

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd command : is it possible to pass authkey in argument ?

2012-05-30 Thread Tommi Virtanen
On Wed, May 30, 2012 at 2:15 AM, Alexandre DERUMIER aderum...@odiso.com wrote:
 Is it possible to pass authkey as argument in rbd command line ? (I can do in 
 with qemu-rbd drive option)

I see you got an answer for your actual question. I wanted to take a
different angle.

You should really avoid putting secrets on command lines, or in
process environment. Those are readable to all local users. This is
why I advocate keyring files.

Alternatively, with qemu, the monitor command mechanism they have
would let you add the drives before starting up the VM, without the
secrets being visible to others.
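
A minimal sketch of the keyring-file route (the path is illustrative, the key
is the one posted earlier today, and this assumes rbd accepts the usual
--keyring/-n options):

ceph-authtool --create-keyring /etc/ceph/keyring.admin -n client.admin \
    -a AQCQwcVPGIAwHhAAuS5Veg7GoOyzh59zq2TKag==
chmod 600 /etc/ceph/keyring.admin
rbd -n client.admin --keyring /etc/ceph/keyring.admin ls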
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Stefan Priebe

Hi Mark,

Am 30.05.2012 16:56, schrieb Mark Nelson:

On 05/30/2012 09:53 AM, Stefan Priebe wrote:

Am 30.05.2012 16:49, schrieb Mark Nelson:

You could try setting up a pool with a replication level of 1 and see
how that does. It will be faster in any event, but it would be
interesting to see how much faster.

is there an easier way than modifying the crush map?



something like:
ceph osd pool create POOL [pg_num [pgp_num]]
then:
ceph osd pool set POOL size VALUE


With pool size 1 the writes are constant at around 112MB/s:
http://pastebin.com/raw.php?i=haDPNTfQ

So does it have something to do with the replication?

Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/13] libceph: cleanups preparing for state cleanup

2012-05-30 Thread Alex Elder

I am working on some fairly big changes to the ceph/RADOS client
messenger code.  The ultimate goal is to simplify the code by
having it more obviously follow a state machine, but getting to
that point isn't necessarily easy.

My approach is to make evolutionary changes, making a long series
of small changes, each of which either produces code that works
identically to before, or which very explicitly fixes a bug or adds
or modifies a feature while continuing to provide functionality
that is equivalent and/or meets what is required.

This series contains a few small batches of changes to begin this
process.  In general they're dependent on each other, so they are
being provided in one series, but I group them below into some
smaller logical subsets.

-Alex

[PATCH 01/13] libceph: eliminate connection state DEAD
[PATCH 02/13] libceph: kill bad_proto ceph connection op
[PATCH 03/13] libceph: delete useless SOCK_CLOSED manipulations
These three delete dead/unused code

[PATCH 04/13] libceph: rename socket callbacks
[PATCH 05/13] libceph: rename kvec_reset and kvec_add functions
These two simply rename some symbols.

[PATCH 06/13] libceph: embed ceph messenger structure in ceph_client
[PATCH 07/13] libceph: embed ceph connection structure in mon_client
These two each change a structure definition so that what was
once a pointer to a structure becomes instead an embedded
structure of the pointed-to type.  Doing this makes it obvious
that the relationship between the containing structure and the
embedded one is purely one-to-one.

[PATCH 08/13] libceph: start separating connection flags from state
This identifies a set of values kept in the state field and
records them instead in a new flags field, so it is obvious
the role each plays (whether it's a state diagram state, or
whether it's a Boolean flag).

[PATCH 09/13] libceph: start tracking connection socket state
This adds code to explicitly track the state of the socket
used by a ceph connection.  It begins the process of trying
to clean up some fuzziness in how the overall state of a
ceph connection is tracked.

[PATCH 10/13] libceph: provide osd number when creating osd
[PATCH 11/13] libceph: init monitor connection when opening
[PATCH 12/13] libceph: fully initialize connection in con_init()
[PATCH 13/13] libceph: set CLOSED state bit in con_init
This series moves things around a bit so that all ceph
connection initialization is done at one time, by code
defined with the net/ceph/messenger.c source file.  It
also makes explicit that a newly initialized ceph connection
is in CLOSED state.

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/13] libceph: eliminate connection state DEAD

2012-05-30 Thread Alex Elder

The ceph connection state DEAD is never set and is therefore not
needed.  Eliminate it.

Signed-off-by: Alex Elder el...@inktank.com
---
 include/linux/ceph/messenger.h |1 -
 net/ceph/messenger.c   |6 --
 2 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 2521a95..aa506ca 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -119,7 +119,6 @@ struct ceph_msg_pos {
 #define CLOSED		10 /* we've closed the connection */
 #define SOCK_CLOSED	11 /* socket state changed to closed */
 #define OPENING	13 /* open connection w/ (possibly new) peer */
-#define DEAD		14 /* dead, about to kfree */
 #define BACKOFF	15

 /*
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 1a80907..42ca8aa 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2087,12 +2087,6 @@ bad_tag:
  */
 static void queue_con(struct ceph_connection *con)
 {
-	if (test_bit(DEAD, &con->state)) {
-		dout("queue_con %p ignoring: DEAD\n",
-		     con);
-		return;
-	}
-
 	if (!con->ops->get(con)) {
 		dout("queue_con %p ref count 0\n", con);
 		return;
--
1.7.5.4


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/13] libceph: kill bad_proto ceph connection op

2012-05-30 Thread Alex Elder

No code sets a bad_proto method in its ceph connection operations
vector, so just get rid of it.

Signed-off-by: Alex Elder el...@inktank.com
---
 include/linux/ceph/messenger.h |3 ---
 net/ceph/messenger.c   |5 -
 2 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index aa506ca..74f6c9b 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -31,9 +31,6 @@ struct ceph_connection_operations {
 	int (*verify_authorizer_reply) (struct ceph_connection *con, int len);
 	int (*invalidate_authorizer)(struct ceph_connection *con);
 
-	/* protocol version mismatch */
-	void (*bad_proto) (struct ceph_connection *con);
-
 	/* there was some error on the socket (disconnect, whatever) */
 	void (*fault) (struct ceph_connection *con);

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 42ca8aa..07af994 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1356,11 +1356,6 @@ static void fail_protocol(struct ceph_connection *con)
 {
 	reset_connection(con);
 	set_bit(CLOSED, &con->state);  /* in case there's queued work */
-
-	mutex_unlock(&con->mutex);
-	if (con->ops->bad_proto)
-		con->ops->bad_proto(con);
-	mutex_lock(&con->mutex);
 }

 static int process_connect(struct ceph_connection *con)
--
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/13] libceph: delete useless SOCK_CLOSED manipulations

2012-05-30 Thread Alex Elder

In con_close_socket(), SOCK_CLOSED is set in the connection state,
then cleared again after shutting down the socket.  Nothing between
the setting and clearing of that bit will ever be affected by it,
so there's no point in setting/clearing it at all.  So don't.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 07af994..fe3c2a1 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -338,11 +338,9 @@ static int con_close_socket(struct ceph_connection *con)
 	dout("con_close_socket on %p sock %p\n", con, con->sock);
 	if (!con->sock)
 		return 0;
-	set_bit(SOCK_CLOSED, &con->state);
 	rc = con->sock->ops->shutdown(con->sock, SHUT_RDWR);
 	sock_release(con->sock);
 	con->sock = NULL;
-	clear_bit(SOCK_CLOSED, &con->state);
 	return rc;
 }

--
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/13] libceph: rename kvec_reset and kvec_add functions

2012-05-30 Thread Alex Elder

The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add()
are entirely private functions, so drop the ceph_ prefix from their
names to make them slightly more wieldy.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |   48 


 1 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 5ad1f0a..2e9054f 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -484,14 +484,14 @@ static u32 get_global_seq(struct ceph_messenger 
*msgr, u32 gt)

return ret;
 }

-static void ceph_con_out_kvec_reset(struct ceph_connection *con)
+static void con_out_kvec_reset(struct ceph_connection *con)
 {
con-out_kvec_left = 0;
con-out_kvec_bytes = 0;
con-out_kvec_cur = con-out_kvec[0];
 }

-static void ceph_con_out_kvec_add(struct ceph_connection *con,
+static void con_out_kvec_add(struct ceph_connection *con,
size_t size, void *data)
 {
int index;
@@ -532,7 +532,7 @@ static void prepare_write_message(struct 
ceph_connection *con)

struct ceph_msg *m;
u32 crc;

-   ceph_con_out_kvec_reset(con);
+   con_out_kvec_reset(con);
con-out_kvec_is_msg = true;
con-out_msg_done = false;

@@ -540,9 +540,9 @@ static void prepare_write_message(struct 
ceph_connection *con)

 * TCP packet that's a good thing. */
if (con-in_seq  con-in_seq_acked) {
con-in_seq_acked = con-in_seq;
-   ceph_con_out_kvec_add(con, sizeof (tag_ack), tag_ack);
+   con_out_kvec_add(con, sizeof (tag_ack), tag_ack);
con-out_temp_ack = cpu_to_le64(con-in_seq_acked);
-   ceph_con_out_kvec_add(con, sizeof (con-out_temp_ack),
+   con_out_kvec_add(con, sizeof (con-out_temp_ack),
con-out_temp_ack);
}

@@ -570,12 +570,12 @@ static void prepare_write_message(struct 
ceph_connection *con)

BUG_ON(le32_to_cpu(m-hdr.front_len) != m-front.iov_len);

/* tag + hdr + front + middle */
-   ceph_con_out_kvec_add(con, sizeof (tag_msg), tag_msg);
-   ceph_con_out_kvec_add(con, sizeof (m-hdr), m-hdr);
-   ceph_con_out_kvec_add(con, m-front.iov_len, m-front.iov_base);
+   con_out_kvec_add(con, sizeof (tag_msg), tag_msg);
+   con_out_kvec_add(con, sizeof (m-hdr), m-hdr);
+   con_out_kvec_add(con, m-front.iov_len, m-front.iov_base);

if (m-middle)
-   ceph_con_out_kvec_add(con, m-middle-vec.iov_len,
+   con_out_kvec_add(con, m-middle-vec.iov_len,
m-middle-vec.iov_base);

/* fill in crc (except data pages), footer */
@@ -624,12 +624,12 @@ static void prepare_write_ack(struct 
ceph_connection *con)

 con-in_seq_acked, con-in_seq);
con-in_seq_acked = con-in_seq;

-   ceph_con_out_kvec_reset(con);
+   con_out_kvec_reset(con);

-   ceph_con_out_kvec_add(con, sizeof (tag_ack), tag_ack);
+   con_out_kvec_add(con, sizeof (tag_ack), tag_ack);

con-out_temp_ack = cpu_to_le64(con-in_seq_acked);
-   ceph_con_out_kvec_add(con, sizeof (con-out_temp_ack),
+   con_out_kvec_add(con, sizeof (con-out_temp_ack),
con-out_temp_ack);

con-out_more = 1;  /* more will follow.. eventually.. */
@@ -642,8 +642,8 @@ static void prepare_write_ack(struct ceph_connection 
*con)

 static void prepare_write_keepalive(struct ceph_connection *con)
 {
dout(prepare_write_keepalive %p\n, con);
-   ceph_con_out_kvec_reset(con);
-   ceph_con_out_kvec_add(con, sizeof (tag_keepalive), tag_keepalive);
+   con_out_kvec_reset(con);
+   con_out_kvec_add(con, sizeof (tag_keepalive), tag_keepalive);
set_bit(WRITE_PENDING, con-state);
 }

@@ -688,8 +688,8 @@ static struct ceph_auth_handshake 
*get_connect_authorizer(struct ceph_connection

  */
 static void prepare_write_banner(struct ceph_connection *con)
 {
-   ceph_con_out_kvec_add(con, strlen(CEPH_BANNER), CEPH_BANNER);
-   ceph_con_out_kvec_add(con, sizeof (con-msgr-my_enc_addr),
+   con_out_kvec_add(con, strlen(CEPH_BANNER), CEPH_BANNER);
+   con_out_kvec_add(con, sizeof (con-msgr-my_enc_addr),
con-msgr-my_enc_addr);

con-out_more = 0;
@@ -736,10 +736,10 @@ static int prepare_write_connect(struct 
ceph_connection *con)

con-out_connect.authorizer_len = auth ?
cpu_to_le32(auth-authorizer_buf_len) : 0;

-   ceph_con_out_kvec_add(con, sizeof (con-out_connect),
+   con_out_kvec_add(con, sizeof (con-out_connect),
con-out_connect);
if (auth  auth-authorizer_buf_len)
-   ceph_con_out_kvec_add(con, auth-authorizer_buf_len,
+   con_out_kvec_add(con, auth-authorizer_buf_len,

[PATCH 06/13] libceph: embed ceph messenger structure in ceph_client

2012-05-30 Thread Alex Elder

A ceph client has a pointer to a ceph messenger structure in it.
There is always exactly one ceph messenger for a ceph client, so
there is no need to allocate it separately from the ceph client
structure.

Switch the ceph_client structure to embed its ceph_messenger
structure.

Signed-off-by: Alex Elder el...@inktank.com
---
 fs/ceph/mds_client.c   |2 +-
 include/linux/ceph/libceph.h   |2 +-
 include/linux/ceph/messenger.h |9 +
 net/ceph/ceph_common.c |   18 +-
 net/ceph/messenger.c   |   30 +-
 net/ceph/mon_client.c  |6 +++---
 net/ceph/osd_client.c  |4 ++--
 7 files changed, 26 insertions(+), 45 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 200bc87..ad30261 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -394,7 +394,7 @@ static struct ceph_mds_session 
*register_session(struct ceph_mds_client *mdsc,

s-s_seq = 0;
mutex_init(s-s_mutex);

-   ceph_con_init(mdsc-fsc-client-msgr, s-s_con);
+   ceph_con_init(mdsc-fsc-client-msgr, s-s_con);
s-s_con.private = s;
s-s_con.ops = mds_con_ops;
s-s_con.peer_name.type = CEPH_ENTITY_TYPE_MDS;
diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index 92eef7c..927361c 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -131,7 +131,7 @@ struct ceph_client {
u32 supported_features;
u32 required_features;

-   struct ceph_messenger *msgr;   /* messenger instance */
+   struct ceph_messenger msgr;   /* messenger instance */
struct ceph_mon_client monc;
struct ceph_osd_client osdc;

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 74f6c9b..3fbd4be 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -211,10 +211,11 @@ extern int ceph_msgr_init(void);
 extern void ceph_msgr_exit(void);
 extern void ceph_msgr_flush(void);

-extern struct ceph_messenger *ceph_messenger_create(
-   struct ceph_entity_addr *myaddr,
-   u32 features, u32 required);
-extern void ceph_messenger_destroy(struct ceph_messenger *);
+extern void ceph_messenger_init(struct ceph_messenger *msgr,
+   struct ceph_entity_addr *myaddr,
+   u32 supported_features,
+   u32 required_features,
+   bool nocrc);

 extern void ceph_con_init(struct ceph_messenger *msgr,
  struct ceph_connection *con);
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index cc91319..2de3ea1 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -468,19 +468,15 @@ struct ceph_client *ceph_create_client(struct 
ceph_options *opt, void *private,

/* msgr */
if (ceph_test_opt(client, MYIP))
myaddr = client-options-my_addr;
-   client-msgr = ceph_messenger_create(myaddr,
-client-supported_features,
-client-required_features);
-   if (IS_ERR(client-msgr)) {
-   err = PTR_ERR(client-msgr);
-   goto fail;
-   }
-   client-msgr-nocrc = ceph_test_opt(client, NOCRC);
+   ceph_messenger_init(client-msgr, myaddr,
+   client-supported_features,
+   client-required_features,
+   ceph_test_opt(client, NOCRC));

/* subsystems */
err = ceph_monc_init(client-monc, client);
if (err  0)
-   goto fail_msgr;
+   goto fail;
err = ceph_osdc_init(client-osdc, client);
if (err  0)
goto fail_monc;
@@ -489,8 +485,6 @@ struct ceph_client *ceph_create_client(struct 
ceph_options *opt, void *private,


 fail_monc:
ceph_monc_stop(client-monc);
-fail_msgr:
-   ceph_messenger_destroy(client-msgr);
 fail:
kfree(client);
return ERR_PTR(err);
@@ -515,8 +509,6 @@ void ceph_destroy_client(struct ceph_client *client)

ceph_debugfs_client_cleanup(client);

-   ceph_messenger_destroy(client-msgr);
-
ceph_destroy_options(client-options);

kfree(client);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 2e9054f..19f1948 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2243,18 +2243,14 @@ out:


 /*
- * create a new messenger instance
+ * initialize a new messenger instance
  */
-struct ceph_messenger *ceph_messenger_create(struct ceph_entity_addr 
*myaddr,

-u32 supported_features,
-u32 required_features)
+void ceph_messenger_init(struct ceph_messenger *msgr,
+   struct ceph_entity_addr *myaddr,
+   u32 supported_features,
+   u32 required_features,
+   bool nocrc)
 {
-  

[PATCH 07/13] libceph: embed ceph connection structure in mon_client

2012-05-30 Thread Alex Elder

A monitor client has a pointer to a ceph connection structure in it.
This is the only one of the three ceph client types that does it this
way; the OSD and MDS clients embed the connection into their main
structures.  There is always exactly one ceph connection for a
monitor client, so there is no need to allocate it separately from
the monitor client structure.

So switch the ceph_mon_client structure to embed its
ceph_connection structure.

Signed-off-by: Alex Elder el...@inktank.com
---
 include/linux/ceph/mon_client.h |2 +-
 net/ceph/mon_client.c   |   47 
--

 2 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/include/linux/ceph/mon_client.h 
b/include/linux/ceph/mon_client.h

index 545f859..2113e38 100644
--- a/include/linux/ceph/mon_client.h
+++ b/include/linux/ceph/mon_client.h
@@ -70,7 +70,7 @@ struct ceph_mon_client {
bool hunting;
int cur_mon;   /* last monitor i contacted */
unsigned long sub_sent, sub_renew_after;
-   struct ceph_connection *con;
+   struct ceph_connection con;
bool have_fsid;

/* pending generic requests */
diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
index 704dc95..ac4d6b1 100644
--- a/net/ceph/mon_client.c
+++ b/net/ceph/mon_client.c
@@ -106,9 +106,9 @@ static void __send_prepared_auth_request(struct 
ceph_mon_client *monc, int len)

monc-pending_auth = 1;
monc-m_auth-front.iov_len = len;
monc-m_auth-hdr.front_len = cpu_to_le32(len);
-   ceph_con_revoke(monc-con, monc-m_auth);
+   ceph_con_revoke(monc-con, monc-m_auth);
ceph_msg_get(monc-m_auth);  /* keep our ref */
-   ceph_con_send(monc-con, monc-m_auth);
+   ceph_con_send(monc-con, monc-m_auth);
 }

 /*
@@ -117,8 +117,8 @@ static void __send_prepared_auth_request(struct 
ceph_mon_client *monc, int len)

 static void __close_session(struct ceph_mon_client *monc)
 {
dout(__close_session closing mon%d\n, monc-cur_mon);
-   ceph_con_revoke(monc-con, monc-m_auth);
-   ceph_con_close(monc-con);
+   ceph_con_revoke(monc-con, monc-m_auth);
+   ceph_con_close(monc-con);
monc-cur_mon = -1;
monc-pending_auth = 0;
ceph_auth_reset(monc-auth);
@@ -142,9 +142,9 @@ static int __open_session(struct ceph_mon_client *monc)
monc-want_next_osdmap = !!monc-want_next_osdmap;

dout(open_session mon%d opening\n, monc-cur_mon);
-   monc-con-peer_name.type = CEPH_ENTITY_TYPE_MON;
-   monc-con-peer_name.num = cpu_to_le64(monc-cur_mon);
-   ceph_con_open(monc-con,
+   monc-con.peer_name.type = CEPH_ENTITY_TYPE_MON;
+   monc-con.peer_name.num = cpu_to_le64(monc-cur_mon);
+   ceph_con_open(monc-con,
  monc-monmap-mon_inst[monc-cur_mon].addr);

/* initiatiate authentication handshake */
@@ -226,8 +226,8 @@ static void __send_subscribe(struct ceph_mon_client 
*monc)


msg-front.iov_len = p - msg-front.iov_base;
msg-hdr.front_len = cpu_to_le32(msg-front.iov_len);
-   ceph_con_revoke(monc-con, msg);
-   ceph_con_send(monc-con, ceph_msg_get(msg));
+   ceph_con_revoke(monc-con, msg);
+   ceph_con_send(monc-con, ceph_msg_get(msg));

monc-sub_sent = jiffies | 1;  /* never 0 */
}
@@ -247,7 +247,7 @@ static void handle_subscribe_ack(struct 
ceph_mon_client *monc,

if (monc-hunting) {
pr_info(mon%d %s session established\n,
monc-cur_mon,
-   ceph_pr_addr(monc-con-peer_addr.in_addr));
+   ceph_pr_addr(monc-con.peer_addr.in_addr));
monc-hunting = false;
}
dout(handle_subscribe_ack after %d seconds\n, seconds);
@@ -461,7 +461,7 @@ static int do_generic_request(struct ceph_mon_client 
*monc,

req-request-hdr.tid = cpu_to_le64(req-tid);
__insert_generic_request(monc, req);
monc-num_generic_requests++;
-   ceph_con_send(monc-con, ceph_msg_get(req-request));
+   ceph_con_send(monc-con, ceph_msg_get(req-request));
mutex_unlock(monc-mutex);

err = wait_for_completion_interruptible(req-completion);
@@ -684,8 +684,8 @@ static void __resend_generic_request(struct 
ceph_mon_client *monc)


for (p = rb_first(monc-generic_request_tree); p; p = rb_next(p)) {
req = rb_entry(p, struct ceph_mon_generic_request, node);
-   ceph_con_revoke(monc-con, req-request);
-   ceph_con_send(monc-con, ceph_msg_get(req-request));
+   ceph_con_revoke(monc-con, req-request);
+   ceph_con_send(monc-con, ceph_msg_get(req-request));
}
 }

@@ -705,7 +705,7 @@ static void delayed_work(struct work_struct *work)
__close_session(monc);

[PATCH 08/13] libceph: start separating connection flags from state

2012-05-30 Thread Alex Elder

A ceph_connection holds a mixture of connection state (as in state
machine state) and connection flags in a single state field.  To
make the distinction more clear, define a new flags field and use
it rather than the state field to hold Boolean flag values.

Signed-off-by: Alex Elder el...@inktank.com
---
 include/linux/ceph/messenger.h |   18 +
 net/ceph/messenger.c   |   50 


 2 files changed, 37 insertions(+), 31 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 3fbd4be..920235e 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -103,20 +103,25 @@ struct ceph_msg_pos {
 #define MAX_DELAY_INTERVAL (5 * 60 * HZ)

 /*
- * ceph_connection state bit flags
+ * ceph_connection flag bits
  */
+
 #define LOSSYTX 0  /* we can close channel or drop messages on 
errors */

-#define CONNECTING 1
-#define NEGOTIATING2
 #define KEEPALIVE_PENDING  3
 #define WRITE_PENDING  4  /* we have data ready to send */
+#define SOCK_CLOSED11 /* socket state changed to closed */
+#define BACKOFF 15
+
+/*
+ * ceph_connection states
+ */
+#define CONNECTING 1
+#define NEGOTIATING2
 #define STANDBY8  /* no outgoing messages, socket closed.  we 
keep
* the ceph_connection around to maintain shared
* state with the peer. */
 #define CLOSED 10 /* we've closed the connection */
-#define SOCK_CLOSED11 /* socket state changed to closed */
 #define OPENING 13 /* open connection w/ (possibly new) peer */
-#define BACKOFF 15

 /*
  * A single connection with another host.
@@ -133,7 +138,8 @@ struct ceph_connection {

struct ceph_messenger *msgr;
struct socket *sock;
-   unsigned long state;/* connection state (see flags above) */
+   unsigned long flags;
+   unsigned long state;
const char *error_msg;  /* error message, if any */

struct ceph_entity_addr peer_addr; /* peer address */
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 19f1948..29055df 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -176,7 +176,7 @@ static void ceph_sock_write_space(struct sock *sk)
 * buffer. See net/ipv4/tcp_input.c:tcp_check_space()
 * and net/core/stream.c:sk_stream_write_space().
 */
-   if (test_bit(WRITE_PENDING, con-state)) {
+   if (test_bit(WRITE_PENDING, con-flags)) {
if (sk_stream_wspace(sk) = sk_stream_min_wspace(sk)) {
dout(%s %p queueing write work\n, __func__, con);
clear_bit(SOCK_NOSPACE, sk-sk_socket-flags);
@@ -203,7 +203,7 @@ static void ceph_sock_state_change(struct sock *sk)
dout(%s TCP_CLOSE\n, __func__);
case TCP_CLOSE_WAIT:
dout(%s TCP_CLOSE_WAIT\n, __func__);
-   if (test_and_set_bit(SOCK_CLOSED, con-state) == 0) {
+   if (test_and_set_bit(SOCK_CLOSED, con-flags) == 0) {
if (test_bit(CONNECTING, con-state))
con-error_msg = connection failed;
else
@@ -393,9 +393,9 @@ void ceph_con_close(struct ceph_connection *con)
 ceph_pr_addr(con-peer_addr.in_addr));
set_bit(CLOSED, con-state);  /* in case there's queued work */
clear_bit(STANDBY, con-state);  /* avoid connect_seq bump */
-   clear_bit(LOSSYTX, con-state);  /* so we retry next connect */
-   clear_bit(KEEPALIVE_PENDING, con-state);
-   clear_bit(WRITE_PENDING, con-state);
+   clear_bit(LOSSYTX, con-flags);  /* so we retry next connect */
+   clear_bit(KEEPALIVE_PENDING, con-flags);
+   clear_bit(WRITE_PENDING, con-flags);
mutex_lock(con-mutex);
reset_connection(con);
con-peer_global_seq = 0;
@@ -612,7 +612,7 @@ static void prepare_write_message(struct 
ceph_connection *con)

prepare_write_message_footer(con);
}

-   set_bit(WRITE_PENDING, con-state);
+   set_bit(WRITE_PENDING, con-flags);
 }

 /*
@@ -633,7 +633,7 @@ static void prepare_write_ack(struct ceph_connection 
*con)

con-out_temp_ack);

con-out_more = 1;  /* more will follow.. eventually.. */
-   set_bit(WRITE_PENDING, con-state);
+   set_bit(WRITE_PENDING, con-flags);
 }

 /*
@@ -644,7 +644,7 @@ static void prepare_write_keepalive(struct 
ceph_connection *con)

dout(prepare_write_keepalive %p\n, con);
con_out_kvec_reset(con);
con_out_kvec_add(con, sizeof (tag_keepalive), tag_keepalive);
-   set_bit(WRITE_PENDING, con-state);
+   set_bit(WRITE_PENDING, con-flags);
 }

 /*
@@ -673,7 +673,7 @@ static struct ceph_auth_handshake 
*get_connect_authorizer(struct ceph_connection


if (IS_ERR(auth))
return auth;
-   if 

[PATCH 09/13] libceph: start tracking connection socket state

2012-05-30 Thread Alex Elder

Start explicitly keeping track of the state of a ceph connection's
socket, separate from the state of the connection itself.  Create
placeholder functions to encapsulate the state transitions.


        --------
        | NEW* |  transient initial state
        --------
            | con_sock_state_init()
            v
        ----------
        | CLOSED |  initialized, but no socket (and no
        ----------  TCP connection)
         ^      \
         |       \ con_sock_state_connecting()
         |        ----------------------
         |                              \
         + con_sock_state_closed()       \
         |\                               \
         | \                               \
         |  -----------                     \
         |  | CLOSING |  socket event;       \
         |  -----------  await close          \
         |       ^                            |
         |       |                            |
         |       + con_sock_state_closing()   |
         |      / \                           |
         |     /   ---------------            |
         |    /                   \           v
         |   /                    --------------
         |  /    -----------------| CONNECTING |  socket created, TCP
         |  |   /                 --------------  connect initiated
         |  |   | con_sock_state_connected()
         |  |   v
        -------------
        | CONNECTED |  TCP connection established
        -------------

Make the socket state an atomic variable, reinforcing that it's a
distinct transition with no possible intermediate/both states.
This is almost certainly overkill at this point, though the
transitions into CONNECTED and CLOSING state do get called via
socket callback (the rest of the transitions occur with the
connection mutex held).  We can back out the atomicity later.

Signed-off-by: Alex Elder el...@inktank.com
---
 include/linux/ceph/messenger.h |8 -
 net/ceph/messenger.c   |   63 


 2 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 920235e..5e852f4 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -137,14 +137,18 @@ struct ceph_connection {
const struct ceph_connection_operations *ops;

struct ceph_messenger *msgr;
+
+   atomic_t sock_state;
struct socket *sock;
+   struct ceph_entity_addr peer_addr; /* peer address */
+   struct ceph_entity_addr peer_addr_for_me;
+
unsigned long flags;
unsigned long state;
const char *error_msg;  /* error message, if any */

-   struct ceph_entity_addr peer_addr; /* peer address */
struct ceph_entity_name peer_name; /* peer name */
-   struct ceph_entity_addr peer_addr_for_me;
+
unsigned peer_features;
u32 connect_seq;  /* identify the most recent connection
 attempt for this connection, client */
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 29055df..7e11b07 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -29,6 +29,14 @@
  * the sender.
  */

+/* State values for ceph_connection-sock_state; NEW is assumed to be 0 */
+
+#define CON_SOCK_STATE_NEW		0	/* -> CLOSED */
+#define CON_SOCK_STATE_CLOSED		1	/* -> CONNECTING */
+#define CON_SOCK_STATE_CONNECTING	2	/* -> CONNECTED or -> CLOSING */
+#define CON_SOCK_STATE_CONNECTED	3	/* -> CLOSING or -> CLOSED */
+#define CON_SOCK_STATE_CLOSING		4	/* -> CLOSED */
+
 /* static tag bytes (protocol control messages) */
 static char tag_msg = CEPH_MSGR_TAG_MSG;
 static char tag_ack = CEPH_MSGR_TAG_ACK;
@@ -147,6 +155,54 @@ void ceph_msgr_flush(void)
 }
 EXPORT_SYMBOL(ceph_msgr_flush);

+/* Connection socket state transition functions */
+
+static void con_sock_state_init(struct ceph_connection *con)
+{
+	int old_state;
+
+	old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CLOSED);
+	if (WARN_ON(old_state != CON_SOCK_STATE_NEW))
+		printk("%s: unexpected old state %d\n", __func__, old_state);
+}
+
+static void con_sock_state_connecting(struct ceph_connection *con)
+{
+	int old_state;
+
+	old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CONNECTING);
+	if (WARN_ON(old_state != CON_SOCK_STATE_CLOSED))
+		printk("%s: unexpected old state %d\n", __func__, old_state);
+}
+
+static void con_sock_state_connected(struct ceph_connection *con)
+{
+	int old_state;
+
+	old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CONNECTED);
+	if (WARN_ON(old_state != CON_SOCK_STATE_CONNECTING))
+		printk("%s: unexpected old state %d\n", __func__, old_state);
+}
+
+static void con_sock_state_closing(struct ceph_connection *con)
+{
+   int old_state;
+
+   old_state = atomic_xchg(con-sock_state, CON_SOCK_STATE_CLOSING);
+   if (WARN_ON(old_state != CON_SOCK_STATE_CONNECTING 
+   old_state != 

[PATCH 10/13] libceph: provide osd number when creating osd

2012-05-30 Thread Alex Elder

Pass the osd number to the create_osd() routine, and move the
initialization of fields that depend on it therein.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/osd_client.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index cca4c7f..e30efbc 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -624,7 +624,7 @@ static void osd_reset(struct ceph_connection *con)
 /*
  * Track open sessions with osds.
  */
-static struct ceph_osd *create_osd(struct ceph_osd_client *osdc)
+static struct ceph_osd *create_osd(struct ceph_osd_client *osdc, int onum)
 {
struct ceph_osd *osd;

@@ -634,6 +634,7 @@ static struct ceph_osd *create_osd(struct 
ceph_osd_client *osdc)


atomic_set(osd-o_ref, 1);
osd-o_osdc = osdc;
+   osd-o_osd = onum;
INIT_LIST_HEAD(osd-o_requests);
INIT_LIST_HEAD(osd-o_linger_requests);
INIT_LIST_HEAD(osd-o_osd_lru);
@@ -643,6 +644,7 @@ static struct ceph_osd *create_osd(struct 
ceph_osd_client *osdc)

osd-o_con.private = osd;
osd-o_con.ops = osd_con_ops;
osd-o_con.peer_name.type = CEPH_ENTITY_TYPE_OSD;
+   osd-o_con.peer_name.num = cpu_to_le64(onum);

INIT_LIST_HEAD(osd-o_keepalive_item);
return osd;
@@ -998,15 +1000,13 @@ static int __map_request(struct ceph_osd_client 
*osdc,

req-r_osd = __lookup_osd(osdc, o);
if (!req-r_osd  o = 0) {
err = -ENOMEM;
-   req-r_osd = create_osd(osdc);
+   req-r_osd = create_osd(osdc, o);
if (!req-r_osd) {
list_move(req-r_req_lru_item, osdc-req_notarget);
goto out;
}

dout(map_request osd %p is osd%d\n, req-r_osd, o);
-   req-r_osd-o_osd = o;
-   req-r_osd-o_con.peer_name.num = cpu_to_le64(o);
__insert_osd(osdc, req-r_osd);

ceph_con_open(req-r_osd-o_con, osdc-osdmap-osd_addr[o]);
--
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/13] libceph: init monitor connection when opening

2012-05-30 Thread Alex Elder

Hold off initializing a monitor client's connection until just
before it gets opened for use.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/mon_client.c |   13 ++---
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
index ac4d6b1..77da480 100644
--- a/net/ceph/mon_client.c
+++ b/net/ceph/mon_client.c
@@ -119,6 +119,7 @@ static void __close_session(struct ceph_mon_client 
*monc)

dout(__close_session closing mon%d\n, monc-cur_mon);
ceph_con_revoke(monc-con, monc-m_auth);
ceph_con_close(monc-con);
+   monc-con.private = NULL;
monc-cur_mon = -1;
monc-pending_auth = 0;
ceph_auth_reset(monc-auth);
@@ -141,9 +142,13 @@ static int __open_session(struct ceph_mon_client *monc)
monc-sub_renew_after = jiffies;  /* i.e., expired */
monc-want_next_osdmap = !!monc-want_next_osdmap;

-   dout(open_session mon%d opening\n, monc-cur_mon);
+   ceph_con_init(monc-client-msgr, monc-con);
+   monc-con.private = monc;
+   monc-con.ops = mon_con_ops;
monc-con.peer_name.type = CEPH_ENTITY_TYPE_MON;
monc-con.peer_name.num = cpu_to_le64(monc-cur_mon);
+
+   dout(open_session mon%d opening\n, monc-cur_mon);
ceph_con_open(monc-con,
  monc-monmap-mon_inst[monc-cur_mon].addr);

@@ -760,10 +765,6 @@ int ceph_monc_init(struct ceph_mon_client *monc, 
struct ceph_client *cl)

goto out;

/* connection */
-   ceph_con_init(monc-client-msgr, monc-con);
-   monc-con.private = monc;
-   monc-con.ops = mon_con_ops;
-
/* authentication */
monc-auth = ceph_auth_init(cl-options-name,
cl-options-key);
@@ -836,8 +837,6 @@ void ceph_monc_stop(struct ceph_mon_client *monc)
mutex_lock(monc-mutex);
__close_session(monc);

-   monc-con.private = NULL;
-
mutex_unlock(monc-mutex);

ceph_auth_destroy(monc-auth);
--
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/13] libceph: fully initialize connection in con_init()

2012-05-30 Thread Alex Elder

Move the initialization of a ceph connection's private pointer,
operations vector pointer, and peer name information into
ceph_con_init().  Rearrange the arguments so the connection pointer
is first.  Hide the byte-swapping of the peer entity number inside
ceph_con_init().

Signed-off-by: Alex Elder el...@inktank.com
---
 fs/ceph/mds_client.c   |7 ++-
 include/linux/ceph/messenger.h |6 --
 net/ceph/messenger.c   |9 -
 net/ceph/mon_client.c  |8 +++-
 net/ceph/osd_client.c  |7 ++-
 5 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ad30261..ecd7f15 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -394,11 +394,8 @@ static struct ceph_mds_session 
*register_session(struct ceph_mds_client *mdsc,

s-s_seq = 0;
mutex_init(s-s_mutex);

-   ceph_con_init(mdsc-fsc-client-msgr, s-s_con);
-   s-s_con.private = s;
-   s-s_con.ops = mds_con_ops;
-   s-s_con.peer_name.type = CEPH_ENTITY_TYPE_MDS;
-   s-s_con.peer_name.num = cpu_to_le64(mds);
+   ceph_con_init(s-s_con, s, mds_con_ops, mdsc-fsc-client-msgr,
+   CEPH_ENTITY_TYPE_MDS, mds);

spin_lock_init(s-s_gen_ttl_lock);
s-s_cap_gen = 0;
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 5e852f4..dd27837 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -227,8 +227,10 @@ extern void ceph_messenger_init(struct 
ceph_messenger *msgr,

u32 required_features,
bool nocrc);

-extern void ceph_con_init(struct ceph_messenger *msgr,
- struct ceph_connection *con);
+extern void ceph_con_init(struct ceph_connection *con, void *private,
+   const struct ceph_connection_operations *ops,
+   struct ceph_messenger *msgr, __u8 entity_type,
+   __u64 entity_num);
 extern void ceph_con_open(struct ceph_connection *con,
  struct ceph_entity_addr *addr);
 extern bool ceph_con_opened(struct ceph_connection *con);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 7e11b07..cdf8299 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -514,15 +514,22 @@ void ceph_con_put(struct ceph_connection *con)
 /*
  * initialize a new connection.
  */
-void ceph_con_init(struct ceph_messenger *msgr, struct ceph_connection 
*con)

+void ceph_con_init(struct ceph_connection *con, void *private,
+   const struct ceph_connection_operations *ops,
+   struct ceph_messenger *msgr, __u8 entity_type, __u64 entity_num)
 {
dout(con_init %p\n, con);
memset(con, 0, sizeof(*con));
+   con-private = private;
atomic_set(con-nref, 1);
+   con-ops = ops;
con-msgr = msgr;

con_sock_state_init(con);

+   con-peer_name.type = (__u8) entity_type;
+   con-peer_name.num = cpu_to_le64(entity_num);
+
mutex_init(con-mutex);
INIT_LIST_HEAD(con-out_queue);
INIT_LIST_HEAD(con-out_sent);
diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
index 77da480..9b4cef9 100644
--- a/net/ceph/mon_client.c
+++ b/net/ceph/mon_client.c
@@ -142,11 +142,9 @@ static int __open_session(struct ceph_mon_client *monc)
monc-sub_renew_after = jiffies;  /* i.e., expired */
monc-want_next_osdmap = !!monc-want_next_osdmap;

-   ceph_con_init(monc-client-msgr, monc-con);
-   monc-con.private = monc;
-   monc-con.ops = mon_con_ops;
-   monc-con.peer_name.type = CEPH_ENTITY_TYPE_MON;
-   monc-con.peer_name.num = cpu_to_le64(monc-cur_mon);
+   ceph_con_init(monc-con, monc, mon_con_ops,
+   monc-client-msgr,
+   CEPH_ENTITY_TYPE_MON, monc-cur_mon);

dout(open_session mon%d opening\n, monc-cur_mon);
ceph_con_open(monc-con,
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index e30efbc..1f3951a 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -640,11 +640,8 @@ static struct ceph_osd *create_osd(struct 
ceph_osd_client *osdc, int onum)

INIT_LIST_HEAD(osd-o_osd_lru);
osd-o_incarnation = 1;

-   ceph_con_init(osdc-client-msgr, osd-o_con);
-   osd-o_con.private = osd;
-   osd-o_con.ops = osd_con_ops;
-   osd-o_con.peer_name.type = CEPH_ENTITY_TYPE_OSD;
-   osd-o_con.peer_name.num = cpu_to_le64(onum);
+   ceph_con_init(osd-o_con, osd, osd_con_ops, osdc-client-msgr,
+   CEPH_ENTITY_TYPE_OSD, onum);

INIT_LIST_HEAD(osd-o_keepalive_item);
return osd;
--
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 13/13] libceph: set CLOSED state bit in con_init

2012-05-30 Thread Alex Elder

Once a connection is fully initialized, it is really in a CLOSED
state, so make that explicit by setting the bit in its state field.

It is possible for a connection in NEGOTIATING state to get a
failure, leading to ceph_fault() and ultimately ceph_con_close().
Clear that bit if it is set in that case, to reflect that the
connection truly is closed and is no longer participating in a
connect sequence.

Issue a warning if ceph_con_open() is called on a connection that
is not in CLOSED state.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index cdf8299..85bfe12 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -452,10 +452,13 @@ void ceph_con_close(struct ceph_connection *con)
 	dout("con_close %p peer %s\n", con,
 	     ceph_pr_addr(&con->peer_addr.in_addr));
 	set_bit(CLOSED, &con->state);  /* in case there's queued work */
+	clear_bit(NEGOTIATING, &con->state);
 	clear_bit(STANDBY, &con->state);  /* avoid connect_seq bump */
+
 	clear_bit(LOSSYTX, &con->flags);  /* so we retry next connect */
 	clear_bit(KEEPALIVE_PENDING, &con->flags);
 	clear_bit(WRITE_PENDING, &con->flags);
+
 	mutex_lock(&con->mutex);
 	reset_connection(con);
 	con->peer_global_seq = 0;
@@ -472,7 +475,8 @@ void ceph_con_open(struct ceph_connection *con, struct ceph_entity_addr *addr)
 {
 	dout("con_open %p %s\n", con, ceph_pr_addr(&addr->in_addr));
 	set_bit(OPENING, &con->state);
-	clear_bit(CLOSED, &con->state);
+	WARN_ON(!test_and_clear_bit(CLOSED, &con->state));
+
 	memcpy(&con->peer_addr, addr, sizeof(*addr));
 	con->delay = 0;  /* reset backoff memory */
 	queue_con(con);
@@ -534,6 +538,8 @@ void ceph_con_init(struct ceph_connection *con, void *private,
 	INIT_LIST_HEAD(&con->out_queue);
 	INIT_LIST_HEAD(&con->out_sent);
 	INIT_DELAYED_WORK(&con->work, con_work);
+
+	set_bit(CLOSED, &con->state);
 }
 EXPORT_SYMBOL(ceph_con_init);

--
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Mark Nelson

On 05/30/2012 01:26 PM, Stefan Priebe wrote:

Hi Mark,

On 30.05.2012 16:56, Mark Nelson wrote:

On 05/30/2012 09:53 AM, Stefan Priebe wrote:

On 30.05.2012 16:49, Mark Nelson wrote:

You could try setting up a pool with a replication level of 1 and see
how that does. It will be faster in any event, but it would be
interesting to see how much faster.

is there an easier way than modifying the crush map?



something like:
ceph osd pool create POOL [pg_num [pgp_num]]
then:
ceph osd pool set POOL size VALUE


With pool size 1 the writes are constant around 112MB/s:
http://pastebin.com/raw.php?i=haDPNTfQ

So does it have something to do with the replication?

Stefan


Well, now that is interesting.  Replication is pretty network heavy.  In 
addition to the client transfers to the OSDs, you have the OSD nodes 
sending data to and receiving data from each other.  Based on these 
results it looks like you may be stalling while waiting for data to 
replicate, so the client stops sending new requests.  If you set the 
osd, filestore, and messenger debugging up to 20 you'll get a ton of 
info that may provide more clues.
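
For reference, turning those subsystems up usually looks something like 
this in ceph.conf ('ms' is the messenger subsystem; the [osd] section 
placement and the value 20, the most verbose level, are only an 
illustration, and the logs get very large):

[osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 20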


Otherwise, a while ago I started making a list of performance related 
settings and tests that we (Inktank) may want to check for customers.  
Note that this is a work in progress and the values may not be exactly 
right yet.  You could check and see if any of the networking settings 
have changed on your setup between 3.0 and 3.4:


http://ceph.com/wiki/Performance_analysis

Also there was a thread a while back where Jim Schutt saw problems that 
looked like disk performance issues due to tcp autotuning policy:


http://www.spinics.net/lists/ceph-devel/msg05049.html

That seemed to be more of an issue with lots of clients and OSDs per 
node, but I thought I'd mention it since some of the effects are similar.
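
If you want to rule the autotuning angle out quickly, the standard 
autotuning knobs (which may or may not be the exact ones discussed in 
that thread) can at least be inspected read-only with sysctl:

sysctl net.ipv4.tcp_moderate_rcvbuf net.ipv4.tcp_rmem net.ipv4.tcp_wmem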


Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-30 Thread Stefan Priebe

On 30.05.2012 20:47, Stefan Majer wrote:


 From my perspective, Mark's hints regarding blktrace end up in the same
summary as the iostat output gives: you see stalls that are not induced
by the disks by any means, and no other obvious hints as to where the
lag might come from. So if you want to know why kernels newer than 3.2
are slow for your workload, I would drill this down with git bisect.


OK, here are some tests regarding the kernel version - all done with XFS.

Starting with 3.2.0-rc1 it drops from 164MB/s (bonding) to 119MB/s but 
it never goes down to 0MB/s. 3.2.18 shows the same as 3.2-rc1.


Then with 3.3-rc1 I'm seeing even faster speeds (178MB/s) than with 3.0.X 
- so everything is fine again. So it seems 3.2.X had another bug that 
reduced the speed, and it was fixed in 3.3.


Beginning with 3.3-rc4 it gets bad, with drops to 0MB/s. So it should be 
a commit between 3.3-rc3 and 3.3-rc4. Sadly that is 370 commits. No 
idea where to start.
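
For what it's worth, bisecting that window is less painful than it 
sounds; 370 commits is roughly nine build/boot/benchmark rounds. A rough 
sketch, assuming the stock kernel release tags are in your tree:

git bisect start
git bisect bad v3.3-rc4        # first kernel known to show the 0MB/s stalls
git bisect good v3.3-rc3       # last kernel known to be good
# build and boot the commit git checks out, rerun the benchmark, then
# mark it with "git bisect good" or "git bisect bad"; repeat until git
# names the first bad commit, then finish with:
git bisect reset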


Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Question about libcephfs mounting mechanism

2012-05-30 Thread Nam Dang
Dear all,

I want to inquire about Ceph's internal mounting mechanism. I am using
the wip-java-cephfs API to access the Ceph file system directly.
I want to create several threads that access CephFS simultaneously and
independently, i.e. the mount in each thread is independent of the
others', and there is no shared underlying data structure. As far as I
know, java-ceph does not use a shared structure, but I'm not so sure
about the underlying code in libcephfs. I'm worried that it may be
similar to the Virtual File System layer, which multiple threads can
access concurrently but which internally shares a single mount point.
I hope somebody with experience with Ceph can help me answer this
question.

Thank you very much,

Best regards,
Nam Dang
Tokyo Institute of Technology
Tokyo, Japan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Question about libcephfs mounting mechanism

2012-05-30 Thread Sage Weil
On Thu, 31 May 2012, Nam Dang wrote:
 Dear all,
 
 I want to inquire about Ceph's internal mounting mechanism. I am using
 the wip-java-cephfs API to access the Ceph file system directly.
 I want to create several threads that access CephFS simultaneously and
 independently, i.e. the mount in each thread is independent of the
 others', and there is no shared underlying data structure. As far as I
 know, java-ceph does not use a shared structure, but I'm not so sure
 about the underlying code in libcephfs. I'm worried that it may be
 similar to the Virtual File System layer, which multiple threads can
 access concurrently but which internally shares a single mount point.
 I hope somebody with experience with Ceph can help me answer this
 question.

Each ceph_create() call instantiates a new instance of the client, and 
nothing is shared between clients.  This should get you the behavior 
you're after!
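
For illustration, here is a minimal sketch of what that looks like 
against the C API (the Java bindings wrap the same calls); the conf 
path and directory names are placeholders and error handling is kept 
to a minimum:

#include <pthread.h>
#include <cephfs/libcephfs.h>

/* Each thread creates, configures, mounts and tears down its own
 * client instance, so nothing is shared between the two mounts. */
static void *worker(void *path)
{
	struct ceph_mount_info *cmount;

	if (ceph_create(&cmount, NULL) < 0)	/* private client instance */
		return NULL;
	ceph_conf_read_file(cmount, "/etc/ceph/ceph.conf");	/* placeholder path */
	if (ceph_mount(cmount, "/") == 0)	/* independent mount */
		ceph_mkdir(cmount, (const char *)path, 0755);
	ceph_shutdown(cmount);	/* unmount and destroy this client */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, worker, "/dir-from-thread-1");
	pthread_create(&t2, NULL, worker, "/dir-from-thread-2");
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;	/* build with: gcc test.c -lcephfs -lpthread */
}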

Cheers-
sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html