Re: lttng enabled by default and qemu apparmor|selinux problems

2015-10-12 Thread HEWLETT, Paul (Paul)
If I can add my $0.02: we were unable to use the libradosstriper library on
RHEL6 because it uses the same initialisation tags as librados, and lttng does
not like that. We had no problems with the RHEL7 version of Ceph because lttng
is not enabled there. Please do not re-enable lttng in the RHEL7 and later
branches.

Regards
Paul




On 11/10/2015 18:06, "ceph-devel-ow...@vger.kernel.org on behalf of Alexandre
DERUMIER"  wrote:

>Hi,
>
>It seems that since this commit
>
>https://github.com/ceph/ceph/pull/4261/files
>
>lttng is enabled by default.
>
>But this gives an error with qemu when apparmor|selinux is enabled.
>
>That's why Ubuntu && Red Hat now disable it in their own packages.
>
>https://bugzilla.redhat.com/show_bug.cgi?id=1223319
>https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1432644
>
>In the Ubuntu Launchpad, Sage has made a reply:
>
>"
>Sage Weil (sage-newdream) wrote on 2015-04-02: #21
>FWIW, we are disabling the lttng support in the final hammer release to avoid
>this issue (until we come up with a better solution)."
>
>
>It seems that it's still enabled by default in ceph git and in the ceph.com
>packages.
>
>Is it still planned to disable it by default?
>
>
>

Fwd: [newstore (again)] how to disable double write WAL

2015-10-12 Thread David Casier

Hello everybody,
Is the fragment stored in rocksdb before being written to "/fragments"?
I separated "/db" and "/fragments", but during the bench everything is
written to "/db".

I changed the "newstore_sync_*" options without success.

Is there any way to write all metadata to "/db" and all data to
"/fragments"?


--




Re: Fwd: [newstore (again)] how to disable double write WAL

2015-10-12 Thread Sage Weil
On Mon, 12 Oct 2015, David Casier wrote:
> Hello everybody,
> Is the fragment stored in rocksdb before being written to "/fragments"?
> I separated "/db" and "/fragments", but during the bench everything is
> written to "/db". I changed the "newstore_sync_*" options without success.
> 
> Is there any way to write all metadata to "/db" and all data to "/fragments"?

You can set newstore_overlay_max = 0 to avoid most data landing in db/.  
But if you are overwriting an existing object, doing write-ahead logging 
is usually unavoidable because we need to make the update atomic (and the 
underlying posix fs doesn't provide that).  The wip-newstore-frags branch 
mitigates this somewhat for larger writes by limiting fragment size, but 
for small IOs this is pretty much always going to be the case.  For small 
IOs, though, putting things in db/ is generally better since we can 
combine many small IOs into a single (rocksdb) journal/wal write.  And 
often leave them there (via the 'overlay' behavior).
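
For completeness, a hedged way to flip that knob on a running cluster
(injectargs is generic to any config option; a persistent setting belongs
in ceph.conf under [osd]):

   ceph tell osd.* injectargs '--newstore_overlay_max 0'
   # or, persistently:
   #   [osd]
   #   newstore_overlay_max = 0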

sage



Re: Reply: [PATCH] rbd: prevent kernel stack blow up on rbd map

2015-10-12 Thread Ilya Dryomov
On Mon, Oct 12, 2015 at 4:22 AM, Caoxudong  wrote:
> By the way, do you think it's necessary that we add the clone-chain-length 
> limit in user-space code too?

librbd is different in a lot of ways and there isn't a clean separation
between the client part (i.e. what is essentially reimplemented in the
kernel) and the rest (management and maintenance parts, etc).  It's
certainly not necessary; whether it's desirable, I'm not sure.  Also, a
librbd limit, if we were to introduce one, would probably have to be
bigger.
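
For reference, a hedged shell sketch of measuring an existing image's chain
depth by walking the parent: field of rbd info (output layout assumed, not
guaranteed):

   img="rbd/child"; top="$img"; depth=0
   # follow parent: pool/image@snap links until we reach a base image
   while parent=$(rbd info "$img" 2>/dev/null | awk '/parent:/ {print $2}'); [ -n "$parent" ]; do
       img="${parent%@*}"        # drop the @snapshot suffix
       depth=$((depth + 1))
   done
   echo "clone chain depth of $top: $depth"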

Thanks,

Ilya


Re: Re: The questions of data collection and cache tiering in Ceph

2015-10-12 Thread 蔡毅
Greg,
Thank you for your timely reply; it is really helpful. I still have some
doubts.
In Ceph, besides monitoring pools, PGs, and objects, one can also acquire
statistics such as CPU, IOPS, and bandwidth. To acquire that information,
does Ceph need to call other tools, or are these functions implemented in
the source code? I ask because I could only find simple arithmetic (like
division) in the source code.
An object consists of two parts: data and attributes. Are they ultimately
stored in different places? I ask because I found some attribute
information in OMap.
In your reply you said, "any subsequent operations on that object will wait
until that durable op is readable before completing." So if I configure the
system to flush objects from the journal to disk every 15s, does that mean
I cannot read an object for up to 15s, because it has only been written to
the journal and not yet to the disk? Could that cause problems?
Thank you so much.
Yours,
Chay
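
For the first question, much of this surfaces at runtime through standard
commands; hedged examples (daemon IDs illustrative):

   ceph pg dump                   # per-PG object counts and stats
   ceph osd pool stats            # per-pool client IO (ops/sec and bandwidth)
   ceph daemon osd.0 perf dump    # per-OSD internal counters via the admin socket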







On 2015-10-09 02:34:25, "Gregory Farnum"  wrote:
>On Thu, Oct 8, 2015 at 9:09 AM, 蔡毅  wrote:
>>
>> Dear developers,
>>
>> Recently I ran into some trouble while reading Ceph's source code and
>> trying to understand the architecture. The details of my problems are as
>> follows.
>>
>> 1. Monitoring tools can collect a lot of data while Ceph runs. I wonder
>> what kind of data Ceph can provide (object data, PG data, or other data?).
>> Can Ceph provide per-object data (e.g. the number of times an object has
>> been read or written, the last time it was used, etc.), and if so, where
>> in the source code can I find these details? I really want to know what
>> monitoring data Ceph can provide and where it lives in the source code,
>> so that I can use it more efficiently. For example, I know Ceph can
>> provide the number of objects per PG and the read and write bandwidth,
>> but I couldn't find how these are computed in the source code.
>
>I'm not quite sure what you're asking here, but I think you'll want to
>look at the MPGStats.h message (in ceph/src/messages), and trace
>backwards through the OSD code (ceph/src/osd/) which creates them and
>then forwards through the monitor code (ceph/src/mon/OSDMonitor.cc)
>
>>
>> 2. From the official documents, Ceph provides cache tiering to improve
>> performance. But I couldn't find more details describing the cache
>> tiering, such as which algorithm the cache agent uses. Where in the
>> source code could I find these?
>
>The cache tiering is part of the OSD. Look at the TierAgentState.h
>file and the parts of ReplicatedPG.cc which reference it.
>
>>
>> 3. In the write process there are two responses to the client: the first
>> is from the journal, and the second occurs when the object is written to
>> the real disk. So when I write an object to Ceph using librbd, is the
>> write not finished until the second response occurs, and what do the
>> first and second responses mean for clients? When an object has been
>> written to the journal but not yet to the filestore (that is, not to
>> disk), can I read this object? If I can, where would I read it from?
>
>You get a response from the OSDs:
>1) when the write operation is durable.
>2) when the write operation is readable.
>
>The order these arrive in will depend on your OSD configuration (btrfs
>can send readable before durable; xfs always sends durable first;
>etc). If you get a "durable" response, any subsequent operations on
>that object will wait until that durable op is readable before
>completing.
>-Greg

ceph branch status

2015-10-12 Thread ceph branch robot
-- All Branches --

Adam C. Emerson 
2015-09-14 12:32:18 -0400   wip-cxx11time
2015-09-15 12:09:20 -0400   wip-cxx11concurrency

Adam Crume 
2014-12-01 20:45:58 -0800   wip-doc-rbd-replay

Alfredo Deza 
2015-03-23 16:39:48 -0400   wip-11212

Alfredo Deza 
2014-07-08 13:58:35 -0400   wip-8679
2014-09-04 13:58:14 -0400   wip-8366
2014-10-13 11:10:10 -0400   wip-9730

Ali Maredia 
2015-09-22 15:10:10 -0400   wip-cmake
2015-10-09 14:58:17 -0400   wip-10587-split-servers

Boris Ranto 
2015-09-04 15:19:11 +0200   wip-bash-completion

Casey Bodley 
2015-09-28 17:09:11 -0400   wip-cxx14-test
2015-09-29 15:18:17 -0400   wip-fio-objectstore

Dan Mick 
2013-07-16 23:00:06 -0700   wip-5634

Daniel Gryniewicz 
2015-10-05 09:28:40 -0400   wip-dang-cmake

Danny Al-Gaaf 
2015-04-23 16:32:00 +0200   wip-da-SCA-20150421
2015-04-23 17:18:57 +0200   wip-nosetests
2015-04-23 18:20:16 +0200   wip-unify-num_objects_degraded
2015-09-28 16:05:12 +0200   wip-da-SCA-20150910

David Zafman 
2014-08-29 10:41:23 -0700   wip-libcommon-rebase
2015-04-24 13:14:23 -0700   wip-cot-giant
2015-08-04 07:39:00 -0700   wip-12577-hammer
2015-09-28 11:33:11 -0700   wip-12983

Dongmao Zhang 
2014-11-14 19:14:34 +0800   thesues-master

Greg Farnum 
2015-04-29 21:44:11 -0700   wip-init-names
2015-07-16 09:28:24 -0700   hammer-12297
2015-10-01 22:46:38 -0700   greg-fs-testing
2015-10-02 13:00:59 -0700   greg-infernalis-lock-testing
2015-10-02 13:09:05 -0700   greg-infernalis-lock-testing-cacher
2015-10-07 00:45:24 -0700   greg-infernalis-fs

Greg Farnum 
2014-10-23 13:33:44 -0700   wip-forward-scrub

Guang G Yang 
2015-06-26 20:31:44 +   wip-ec-readall
2015-07-23 16:13:19 +   wip-12316

Guang Yang 
2014-08-08 10:41:12 +   wip-guangyy-pg-splitting
2014-09-25 00:47:46 +   wip-9008
2014-09-30 10:36:39 +   guangyy-wip-9614

Haomai Wang 
2014-07-27 13:37:49 +0800   wip-flush-set
2015-04-20 00:47:59 +0800   update-organization
2015-07-21 19:33:56 +0800   fio-objectstore
2015-08-26 09:57:27 +0800   wip-recovery-attr

Ilya Dryomov 
2014-09-05 16:15:10 +0400   wip-rbd-notify-errors

Ivo Jimenez 
2015-08-24 23:12:45 -0700   hammer-with-new-workunit-for-wip-12551

Jason Dillaman 
2015-07-31 13:55:23 -0400   wip-12383-next
2015-08-31 23:17:53 -0400   wip-12698
2015-09-01 10:17:02 -0400   wip-11287

Jenkins 
2015-09-30 12:59:03 -0700   rhcs-v0.94.3-ubuntu

Jenkins 
2014-07-29 05:24:39 -0700   wip-nhm-hang
2015-02-02 10:35:28 -0800   wip-sam-v0.92
2015-08-21 12:46:32 -0700   last
2015-08-21 12:46:32 -0700   loic-v9.0.3
2015-09-15 10:23:18 -0700   rhcs-v0.80.8
2015-09-21 16:48:32 -0700   rhcs-v0.94.1-ubuntu

Joao Eduardo Luis 
2014-09-10 09:39:23 +0100   wip-leveldb-get.dumpling

Joao Eduardo Luis 
2014-07-22 15:41:42 +0100   wip-leveldb-misc

Joao Eduardo Luis 
2014-09-02 17:19:52 +0100   wip-leveldb-get
2014-10-17 16:20:11 +0100   wip-paxos-fix
2014-10-21 21:32:46 +0100   wip-9675.dumpling
2015-07-27 21:56:42 +0100   wip-11470.hammer
2015-09-09 15:45:45 +0100   wip-11786.hammer

Joao Eduardo Luis 
2014-11-17 16:43:53 +   wip-mon-osdmap-cleanup
2014-12-15 16:18:56 +   wip-giant-mon-backports
2014-12-17 17:13:57 +   wip-mon-backports.firefly
2014-12-17 23:15:10 +   wip-mon-sync-fix.dumpling
2015-01-07 23:01:00 +   wip-mon-blackhole-mlog-0.87.7
2015-01-10 02:40:42 +   wip-dho-joao
2015-01-10 02:46:31 +   wip-mon-paxos-fix
2015-01-26 13:00:09 +   wip-mon-datahealth-fix
2015-02-04 22:36:14 +   wip-10643
2015-09-09 15:43:51 +0100   wip-11786.firefly

Joao Eduardo Luis 
2015-05-27 23:48:45 +0100   wip-mon-scrub
2015-05-29 12:21:43 +0100   wip-11545
2015-06-05 16:12:57 +0100   wip-10507
2015-06-16 14:34:11 +0100   wip-11470
2015-06-25 00:16:41 +0100   wip-10507-2
2015-07-14 16:52:35 +0100   wip-joao-testing
2015-09-08 09:48:41 +0100   wip-leveldb-hang

John 

Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-12 Thread Mark Nelson

Hi Guys,

Given all of the recent data on how different memory allocator 
configurations improve SimpleMessenger performance (and the effect of 
memory allocators and transparent hugepages on RSS memory usage), I 
thought I'd run some tests looking at how AsyncMessenger does in 
comparison.  We spoke about these a bit at the last performance meeting 
but here's the full write up.  The rough conclusion as of right now 
appears to be:


1) AsyncMessenger performance is not dependent on the memory allocator 
the way SimpleMessenger's is.


2) AsyncMessenger is faster than SimpleMessenger with TCMalloc + 32MB 
(ie default) thread cache.


3) AsyncMessenger is consistently faster than SimpleMessenger for 128K 
random reads.


4) AsyncMessenger is sometimes slower than SimpleMessenger when memory 
allocator optimizations are used.


5) AsyncMessenger currently uses far more RSS memory than SimpleMessenger.

Here's a link to the paper:

https://drive.google.com/file/d/0B2gTBZrkrnpZS1Q4VktjZkhrNHc/view
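
For anyone reproducing these runs, the messenger implementation is selected
per daemon in ceph.conf; a minimal sketch (assuming the ms_type option in
this era of the code):

   [global]
           ms_type = async    # "simple" (the default) selects SimpleMessenger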

Mark


Re: wip-addr

2015-10-12 Thread Sage Weil
On Mon, 12 Oct 2015, David Zafman wrote:
> I don't understand how encode/decode of entity_addr_t is changing without
> versioning in the encode/decode.  This means that this branch is changing the
> ceph-objectstore-tool export format if CEPH_FEATURE_MSG_ADDR2 is part of the
> features.  So we could bump super_header::super_ver if the export format must
> change.
> 
> Now that I look at it, I'm sure I can clear the watchers and old_watchers in
> object_info_t during export because that is dynamic information and it happens
> to include entity_addr_t.  I need to verify this, but that may be the only
> reason that the objectstore tool needs a valid features value to be passed
> there.

Ah, yeah... clearing watchers (perhaps optionally, though) sounds fine.  
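
For context, the export path under discussion is the one exercised by an
invocation along these lines (paths and pgid illustrative):

   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
       --journal-path /var/lib/ceph/osd/ceph-0/journal \
       --op export --pgid 1.0 --file /tmp/1.0.export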

sage

> 
> David
> 
> On 10/9/15 2:49 PM, Sage Weil wrote:
> > > 2.
> > > (about line 2067 in src/tools/ceph_objectstore_tool.cc)
> > > (use via ceph cmd?) tools - "object store tool".
> > > This has a way to serialize objects which includes a watch list
> > > which includes an address.  There should be an option here to say
> > > whether to include exported addresses.
> > I think it's safe to use defaults here.. what do you think, David?
> 
> 


Re: wip-addr

2015-10-12 Thread David Zafman


I don't understand how encode/decode of entity_addr_t is changing 
without versioning in the encode/decode.  This means that this branch is 
changing the ceph-objectstore-tool export format if 
CEPH_FEATURE_MSG_ADDR2 is part of the features.  So we could bump 
super_header::super_ver if the export format must change.


Now that I look at it, I'm sure I can clear the watchers and 
old_watchers in object_info_t during export because that is dynamic 
information and it happens to include entity_addr_t.  I need to verify 
this, but that may be the only reason that the objectstore tool needs a 
valid features value to be passed there.


David

On 10/9/15 2:49 PM, Sage Weil wrote:

> 2.
> (about line 2067 in src/tools/ceph_objectstore_tool.cc)
> (use via ceph cmd?) tools - "object store tool".
> This has a way to serialize objects which includes a watch list
> which includes an address.  There should be an option here to say
> whether to include exported addresses.

I think it's safe to use defaults here.. what do you think, David?




Re: Fwd: [newstore (again)] how to disable double write WAL

2015-10-12 Thread David Casier

Ok,
Great.

With these settings:
//
newstore_max_dir_size = 4096
newstore_sync_io = true
newstore_sync_transaction = true
newstore_sync_submit_transaction = true
newstore_sync_wal_apply = true
newstore_overlay_max = 0
//

And direct IO in the benchmark tool (fio)

I see that the HDD is 100% loaded and there is no transfer from /db to
/fragments after stopping the benchmark: great!

But when I launch a bench with random 256k blocks, I see random blocks
between 32k and 256k on the HDD. Any idea?

Throughput to the HDD is about 8 MB/s, when it could be higher with larger
blocks (~30 MB/s).

And 70 MB/s without fsync (hard drive cache disabled).
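
A hedged sketch of a bench along these lines, since the actual fio job
wasn't shared (the rbd engine, pool, and image name are assumptions):

   fio --name=bench --ioengine=rbd --pool=rbd --rbdname=test \
       --direct=1 --rw=randwrite --bs=256k --iodepth=16 --runtime=60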

Other questions:
newstore_sync_io -> true = fsync immediately, false = fsync later (thread
fsync_wq)?

newstore_sync_transaction -> true = sync in DB?
newstore_sync_submit_transaction -> if false then kv_queue (only if
newstore_sync_transaction = false)?

newstore_sync_wal_apply -> if false then WAL later (thread wal_wq)?

Is that right?

A way for cache with battery (sync DB and no sync data)?

Thanks for everything !

On 10/12/2015 03:01 PM, Sage Weil wrote:

On Mon, 12 Oct 2015, David Casier wrote:

Hello everybody,
Is the fragment stored in rocksdb before being written to "/fragments"?
I separated "/db" and "/fragments", but during the bench everything is
written to "/db".
I changed the "newstore_sync_*" options without success.

Is there any way to write all metadata to "/db" and all data to "/fragments"?

You can set newstore_overlay_max = 0 to avoid most data landing in db/.
But if you are overwriting an existing object, doing write-ahead logging
is usually unavoidable because we need to make the update atomic (and the
underlying posix fs doesn't provide that).  The wip-newstore-frags branch
mitigates this somewhat for larger writes by limiting fragment size, but
for small IOs this is pretty much always going to be the case.  For small
IOs, though, putting things in db/ is generally better since we can
combine many small IOs into a single (rocksdb) journal/wal write.  And
often leave them there (via the 'overlay' behavior).

sage




--


Regards,

David CASIER
DCConsulting SARL

4 Trait d'Union
77127 LIEUSAINT

Direct line: 01 75 98 53 85
Email: david.casier@aevoo.fr


Re: lttng enabled by default and qemu apparmor|selinux problems

2015-10-12 Thread Jason Dillaman
I have an open PR [1] to dynamically enable LTTng-UST via new config options.
This change will hopefully trickle down to older releases and will avoid the
SELinux / AppArmor issues in the default case (which led to downstream Ubuntu
and Fedora disabling LTTng-UST support). Anyone who wants to use LTTng-UST
(i.e. for generating RBD replay traces) can enable the support and adjust
their SELinux / AppArmor rules to accommodate.

[1] https://github.com/ceph/ceph/pull/6135
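
If it lands in that form, tracing would become an explicit opt-in, e.g. (the
option name below is an assumption based on the PR, not confirmed here):

   [client]
           rbd_tracing = true    # hypothetical opt-in; check the merged PR for the real name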

-- 

Jason Dillaman 


- Original Message -
> From: "Paul HEWLETT (Paul)" 
> To: "Alexandre DERUMIER" , "ceph-devel" 
> 
> Cc: "Sage Weil" 
> Sent: Monday, October 12, 2015 4:28:06 AM
> Subject: Re: lttng enabled by default and qemu apparmor|selinux problems
> 
> If I can add my $0.02: we were unable to use the libradosstriper library on
> RHEL6 because it uses the same initialisation tags as librados, and lttng
> does not like that. We had no problems with the RHEL7 version of Ceph
> because lttng is not enabled there. Please do not re-enable lttng in the
> RHEL7 and later branches.
> 
> Regards
> Paul
> 
> 
> 
> 
> On 11/10/2015 18:06, "ceph-devel-ow...@vger.kernel.org on behalf of Alexandre
> DERUMIER" <aderum...@odiso.com> wrote:
> 
> >Hi,
> >
> >It seems that since this commit
> >
> >https://github.com/ceph/ceph/pull/4261/files
> >
> >lttng is enabled by default.
> >
> >But this gives an error with qemu when apparmor|selinux is enabled.
> >
> >That's why Ubuntu && Red Hat now disable it in their own packages.
> >
> >https://bugzilla.redhat.com/show_bug.cgi?id=1223319
> >https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1432644
> >
> >In the Ubuntu Launchpad, Sage has made a reply:
> >
> >"
> >Sage Weil (sage-newdream) wrote on 2015-04-02:   #21
> >FWIW, we are disabling the lttng support in the final hammer release to
> >avoid this issue (until we come up with a better solution)."
> >
> >
> >It seems that it's still enabled by default in ceph git and in the
> >ceph.com packages.
> >
> >Is it still planned to disable it by default?
> >
> >
> >


Re: Fwd: [newstore (again)] how to disable double write WAL

2015-10-12 Thread Sage Weil
Hi David-

On Mon, 12 Oct 2015, David Casier wrote:
> Ok,
> Great.
> 
> With these settings:
> //
> newstore_max_dir_size = 4096
> newstore_sync_io = true
> newstore_sync_transaction = true
> newstore_sync_submit_transaction = true

Is this a hard disk?  Those settings probably don't make sense since it 
does every IO synchronously, blocking the submitting IO path...

> newstore_sync_wal_apply = true
> newstore_overlay_max = 0
> //
> 
> And direct IO in the benchmark tool (fio)
> 
> I see that the HDD is 100% loaded and there is no transfer from /db to
> /fragments after stopping the benchmark: great!
> 
> But when I launch a bench with random 256k blocks, I see random blocks
> between 32k and 256k on the HDD. Any idea?

Random IOs have to be write-ahead logged in rocksdb, which has its own IO 
pattern.  Since you made everything sync above I think it'll depend on 
how many osd threads get batched together at a time.. maybe.  Those 
settings aren't something I've really tested, and probably only make 
sense with very fast NVMe devices.

> Throughput to the HDD is about 8 MB/s, when it could be higher with larger
> blocks (~30 MB/s).
> And 70 MB/s without fsync (hard drive cache disabled).
> 
> Other questions:
> newstore_sync_io -> true = fsync immediately, false = fsync later (thread
> fsync_wq)?

yes

> newstore_sync_transaction -> true = sync in DB?

synchronously do the rocksdb commit too

> newstore_sync_submit_transaction -> if false then kv_queue (only if
> newstore_sync_transaction = false)?

yeah.. there is an annoying rocksdb behavior that makes an async 
transaction submit block if a sync one is in progress, so this queues them 
up and explicitly batches them.

> newstore_sync_wal_apply -> if false then WAL later (thread wal_wq)?

the txn commit completion threads can do the wal work synchronously.. this 
is only a good idea if it's doing aio (which it generally is).
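
Putting those answers together, a hedged sketch of the fully asynchronous
configuration they imply for a spinner (whether these match the branch
defaults should be verified):

   newstore_sync_io = false                  # fsync later, in the fsync_wq thread
   newstore_sync_transaction = false         # rocksdb commit done asynchronously
   newstore_sync_submit_transaction = false  # queue and batch via kv_queue
   newstore_sync_wal_apply = true            # WAL applied in completion context (aio)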

> Is that right?
> 
> A way for cache with battery (sync DB and no sync data)?

?
s

> 
> Thanks for everything !
> 
> On 10/12/2015 03:01 PM, Sage Weil wrote:
> > On Mon, 12 Oct 2015, David Casier wrote:
> > > Hello everybody,
> > > Is the fragment stored in rocksdb before being written to "/fragments"?
> > > I separated "/db" and "/fragments", but during the bench everything is
> > > written to "/db".
> > > I changed the "newstore_sync_*" options without success.
> > > 
> > > Is there any way to write all metadata to "/db" and all data to
> > > "/fragments"?
> > You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > But if you are overwriting an existing object, doing write-ahead logging
> > is usually unavoidable because we need to make the update atomic (and the
> > underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> > mitigates this somewhat for larger writes by limiting fragment size, but
> > for small IOs this is pretty much always going to be the case.  For small
> > IOs, though, putting things in db/ is generally better since we can
> > combine many small IOs into a single (rocksdb) journal/wal write.  And
> > often leave them there (via the 'overlay' behavior).
> > 
> > sage
> > 
> 
> 
> -- 
> 
> 
> Regards,
> 
> David CASIER
> DCConsulting SARL
> 
> 4 Trait d'Union
> 77127 LIEUSAINT
> 
> Direct line: 01 75 98 53 85
> Email: david.casier@aevoo.fr


Re: rgw and the next hammer release v0.94.4

2015-10-12 Thread Yehuda Sadeh-Weinraub
Yeah, it should be fine.

On Mon, Oct 12, 2015 at 3:56 PM, Loic Dachary  wrote:
> Hi,
>
> After today's private discussion and the merge of 
> https://github.com/ceph/ceph/pull/6161, I will assume the current hammer 
> branch (7f485ed5aa620fe982561663bf64356b7e2c38f2) is ready for QE to start 
> their own round of testing. If I misinterpreted what you wrote, please speak 
> up and I'll do what's needed ;-)
>
> Cheers
>
> On 02/10/2015 22:31, Loic Dachary wrote:
>> Hi Yehuda,
>>
>> The next hammer release as found at https://github.com/ceph/ceph/tree/hammer 
>> passed the rgw suite (http://tracker.ceph.com/issues/12701#note-58).
>> Do you think the hammer branch is ready for QE to start their own round of 
>> testing ?
>>
>> Cheers
>>
>> P.S. http://tracker.ceph.com/issues/12701#Release-information has direct 
>> links to the pull requests merged into hammer since v0.94.3 in case you need 
>> more context about one of them.
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>


RE: throttles

2015-10-12 Thread Somnath Roy


dump_historic_ops, slow requests

2015-10-12 Thread Deneau, Tom
I have a small ceph cluster (3 nodes, 5 osds each, journals all just partitions
on the spinner disks) and I have noticed that when I hit it with a bunch of
rados bench clients all doing writes of large (40M) objects with --no-cleanup,
the rados bench commands seem to finish OK but I often get health warnings like:

HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
3 ops are blocked > 32.768 sec on osd.9
1 ops are blocked > 32.768 sec on osd.10
2 osds have slow requests
After a couple of minutes, health goes to HEALTH_OK.

But if I go to the node containing osd.10, for example, and run dump_historic_ops,
I see lots of ops with durations around 20 sec but nothing over 32 sec.

The 20-sec or so ops are always  "ack+ondisk+write+known_if_redirected"
with type_data = "commit sent: apply or cleanup"
and the following are typical event timings (gaps are in milliseconds):

   initiated:                        14:06:58.205937
   reached_pg:                       14:07:01.823288, gap= 3617.351
   started:                          14:07:01.823359, gap=    0.071
   waiting for subops from 3:        14:07:01.855259, gap=   31.900
   commit_queued_for_journal_write:  14:07:03.132697, gap= 1277.438
   write_thread_in_journal_buffer:   14:07:03.143356, gap=   10.659
   journaled_completion_queued:      14:07:04.175863, gap= 1032.507
   op_commit:                        14:07:04.585040, gap=  409.177
   op_applied:                       14:07:04.589751, gap=    4.711
   sub_op_commit_rec from 3:         14:07:14.682925, gap=10093.174
   commit_sent:                      14:07:14.683081, gap=    0.156
   done:                             14:07:14.683119, gap=    0.038

Should I expect to see a historic op with duration greater than 32 sec?
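
One hedged thing to check here: dump_historic_ops keeps only a bounded,
aging window of ops, so the >32s ones may simply have been evicted before
the dump (option names from this era; defaults worth confirming):

   ceph daemon osd.10 dump_historic_ops
   ceph daemon osd.10 config show | grep osd_op_history
   # osd_op_history_size = 20        max ops retained
   # osd_op_history_duration = 600   max age of a retained op, in seconds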

-- Tom Deneau



throttles

2015-10-12 Thread Deneau, Tom
Looking at the perf counters on my osds, I see wait counts for the following
throttle-related perf counters (this is from trying to benchmark using
multiple rados bench client processes):

   throttle-filestore_bytes
   throttle-msgr_dispatch_throttler-client
   throttle-osd_client_bytes
   throttle-osd_client_messages

What are the config variables that would allow me to experiment with these 
throttle limits?
(When I look at the output from --admin-daemon osd.xx.asok config show, it's
not clear which items these correspond to).
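
For what it's worth, a hedged mapping from those counters to the options
that bound them (worth double-checking against src/common/config_opts.h):

   throttle-filestore_bytes                -> filestore_queue_max_bytes
   throttle-msgr_dispatch_throttler-client -> ms_dispatch_throttle_bytes
   throttle-osd_client_bytes               -> osd_client_message_size_cap
   throttle-osd_client_messages            -> osd_client_message_cap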

-- Tom Deneau




enable rbd on ec pool ?

2015-10-12 Thread Tomy Cheru
Is there a patch available to enable rbd over an EC pool?

Currently it's restricted:
2015-10-12 10:52:23.042085 7f4721ca1840 -1 librbd: error adding image to 
directory: (95) Operation not supported
rbd: create error: (95) Operation not supported
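
In the meantime, the usual workaround is to front the EC pool with a
replicated cache tier and point rbd at the base pool; a hedged sketch (pool
names and PG counts illustrative):

   ceph osd pool create ecpool 128 128 erasure
   ceph osd pool create cachepool 128
   ceph osd tier add ecpool cachepool
   ceph osd tier cache-mode cachepool writeback
   ceph osd tier set-overlay ecpool cachepool
   rbd create --pool ecpool --size 10240 testimg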

Thanks,
tomy







Re: lttng enabled by default and qemu apparmor|selinux problems

2015-10-12 Thread Alexandre DERUMIER
>>I have an open PR [1] to dynamically enable LTTng-UST via new config options

Great ,Thanks Jason !



- Original Message -
From: "Jason Dillaman" 
To: "Paul HEWLETT (Paul)" 
Cc: "aderumier" , "ceph-devel" 
, "Sage Weil" 
Sent: Monday, October 12, 2015 20:32:43
Subject: Re: lttng enabled by default and qemu apparmor|selinux problems

I have an open PR [1] to dynamically enable LTTng-UST via new config options.
This change will hopefully trickle down to older releases and will avoid the
SELinux / AppArmor issues in the default case (which led to downstream Ubuntu
and Fedora disabling LTTng-UST support). Anyone who wants to use LTTng-UST
(i.e. for generating RBD replay traces) can enable the support and adjust
their SELinux / AppArmor rules to accommodate.

[1] https://github.com/ceph/ceph/pull/6135 

-- 

Jason Dillaman 


- Original Message - 
> From: "Paul HEWLETT (Paul)"  
> To: "Alexandre DERUMIER" , "ceph-devel" 
>  
> Cc: "Sage Weil"  
> Sent: Monday, October 12, 2015 4:28:06 AM 
> Subject: Re: lttng enabled by default and qemu apparmor|selinux problems
> 
> If I can add my $0.02: we were unable to use the libradosstriper library on
> RHEL6 because it uses the same initialisation tags as librados, and lttng
> does not like that. We had no problems with the RHEL7 version of Ceph
> because lttng is not enabled there. Please do not re-enable lttng in the
> RHEL7 and later branches.
> 
> Regards 
> Paul 
> 
> 
> 
> 
> On 11/10/2015 18:06, "ceph-devel-ow...@vger.kernel.org on behalf of Alexandre
> DERUMIER" <aderum...@odiso.com> wrote:
> 
> >Hi,
> >
> >It seems that since this commit
> >
> >https://github.com/ceph/ceph/pull/4261/files
> >
> >lttng is enabled by default.
> >
> >But this gives an error with qemu when apparmor|selinux is enabled.
> >
> >That's why Ubuntu && Red Hat now disable it in their own packages.
> >
> >https://bugzilla.redhat.com/show_bug.cgi?id=1223319
> >https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1432644
> >
> >In the Ubuntu Launchpad, Sage has made a reply:
> >
> >"
> >Sage Weil (sage-newdream) wrote on 2015-04-02: #21
> >FWIW, we are disabling the lttng support in the final hammer release to
> >avoid this issue (until we come up with a better solution)."
> >
> >
> >It seems that it's still enabled by default in ceph git and in the
> >ceph.com packages.
> >
> >Is it still planned to disable it by default?
> > 
> > 


Re: rgw and the next hammer release v0.94.4

2015-10-12 Thread Loic Dachary
Hi,

After today's private discussion and the merge of 
https://github.com/ceph/ceph/pull/6161, I will assume the current hammer branch 
(7f485ed5aa620fe982561663bf64356b7e2c38f2) is ready for QE to start their own 
round of testing. If I misinterpreted what you wrote, please speak up and I'll 
do what's needed ;-)

Cheers

On 02/10/2015 22:31, Loic Dachary wrote:
> Hi Yehuda,
> 
> The next hammer release as found at https://github.com/ceph/ceph/tree/hammer 
> passed the rgw suite (http://tracker.ceph.com/issues/12701#note-58). 
> Do you think the hammer branch is ready for QE to start their own round of 
> testing ?
> 
> Cheers
> 
> P.S. http://tracker.ceph.com/issues/12701#Release-information has direct 
> links to the pull requests merged into hammer since v0.94.3 in case you need 
> more context about one of them.
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





hammer branch for v0.94.4 ready for QE

2015-10-12 Thread Loic Dachary
Hi Yuri,

The hammer branch for v0.94.4 as found at 
https://github.com/ceph/ceph/commits/hammer has been approved by Yehuda, Josh 
and Sam (there are no CephFS related commits according to Greg, hence his 
approval was not relevant) and is ready for QE. For the record, the head is 
https://github.com/ceph/ceph/commit/7f485ed5aa620fe982561663bf64356b7e2c38f2 
and the details of the tests run are at http://tracker.ceph.com/issues/12701.

This time around, instead of adding the table to the description, I propose you 
add it as a comment (which can be edited later on). It is easier because it's 
not overloaded with unrelated content. There also is the matter of the maximum 
size of the description field: there is a real risk of exceeding it and
truncating the result.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
















Re: [ceph-users] Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-12 Thread Haomai Wang
resend

On Tue, Oct 13, 2015 at 10:56 AM, Haomai Wang  wrote:
> COOL
>
> Interesting that the async messenger consumes more memory than simple; in my
> mind async should use less memory. I will take a look at this.
>
> On Tue, Oct 13, 2015 at 12:50 AM, Mark Nelson  wrote:
>>
>> Hi Guys,
>>
>> Given all of the recent data on how different memory allocator
>> configurations improve SimpleMessenger performance (and the effect of memory
>> allocators and transparent hugepages on RSS memory usage), I thought I'd run
>> some tests looking how AsyncMessenger does in comparison.  We spoke about
>> these a bit at the last performance meeting but here's the full write up.
>> The rough conclusion as of right now appears to be:
>>
>> 1) AsyncMessenger performance is not dependent on the memory allocator
>> the way SimpleMessenger's is.
>>
>> 2) AsyncMessenger is faster than SimpleMessenger with TCMalloc + 32MB (ie
>> default) thread cache.
>>
>> 3) AsyncMessenger is consistently faster than SimpleMessenger for 128K
>> random reads.
>>
>> 4) AsyncMessenger is sometimes slower than SimpleMessenger when memory
>> allocator optimizations are used.
>>
>> 5) AsyncMessenger currently uses far more RSS memory than SimpleMessenger.
>>
>> Here's a link to the paper:
>>
>> https://drive.google.com/file/d/0B2gTBZrkrnpZS1Q4VktjZkhrNHc/view
>>
>> Mark
>
>
>
>
> --
>
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat


Re: Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-12 Thread Gregory Farnum
On Mon, Oct 12, 2015 at 9:50 AM, Mark Nelson  wrote:
> Hi Guys,
>
> Given all of the recent data on how different memory allocator
> configurations improve SimpleMessenger performance (and the effect of memory
> allocators and transparent hugepages on RSS memory usage), I thought I'd run
> some tests looking at how AsyncMessenger does in comparison.  We spoke about
> these a bit at the last performance meeting but here's the full write up.
> The rough conclusion as of right now appears to be:
>
> 1) AsyncMessenger performance is not dependent on the memory allocator the
> way SimpleMessenger's is.
>
> 2) AsyncMessenger is faster than SimpleMessenger with TCMalloc + 32MB (ie
> default) thread cache.
>
> 3) AsyncMessenger is consistently faster than SimpleMessenger for 128K
> random reads.
>
> 4) AsyncMessenger is sometimes slower than SimpleMessenger when memory
> allocator optimizations are used.
>
> 5) AsyncMessenger currently uses far more RSS memory than SimpleMessenger.
>
> Here's a link to the paper:
>
> https://drive.google.com/file/d/0B2gTBZrkrnpZS1Q4VktjZkhrNHc/view

Can you clarify these tests a bit more? I can't make the number of
nodes, OSDs, and SSDs work out properly. Were the FIO jobs 256
concurrent ops per job, or in aggregate? Is there any more info that
might suggest why the 128KB rand-read (but not read nor write, and not
4k rand-read) was so asymmetrical?