date:20150908

Hi,

Paolo Bonzini from qemu team has finally applied my qemu jemalloc patch 
in his for-upstream branch

https://github.com/bonzini/qemu/releases/tag/for-upstream
https://github.com/bonzini/qemu/tree/for-upstream

So,It'll be in qemu master soon and ready for qemu 2.5


I have write some small benchmark results with librbd in the commit itself
https://github.com/bonzini/qemu/commit/efc1e9f08020cd460eb2204e3092b39e408e1fa9


you simply need to compile qemu with --enable-jemalloc, to enable jemmaloc 
support.

Regards,

Alexandre
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

Was that just driver issue?
If so, could we face same kind of issue on different distributed file systems.
I'm just asking.

I'm quite interested in:

 What kind of HBA you are using
 Which version of driver caused the issue

Does any Cepher have any comment on Mariusz's comment?

 Shinobu

- Original Message -
From: "Mariusz Gronczewski" 
To: "池信泽" 
Cc: "Shinobu Kinjo" , ceph-users@lists.ceph.com
Sent: Tuesday, September 8, 2015 7:09:32 PM
Subject: Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

For those interested:

Bug that caused ceph to go haywire was a emulex nic driver dropping
packets when making more than few hundred megabits (basically linear
change compared to load) which caused osds to flap constantly once
something gone wrong (high traffic, osd go down, ceph starts to
reallocationg stuff, which causes more traffic, more osds flap, etc)

upgrading kernel to 4.1.6 (was present at least in 4.0.1, and in c6
"distro" kernel) fixed that and it started to rebuild correctly

Lessons learned, buy Intel NICs...

On Mon, 7 Sep 2015 20:51:57 +0800, 池信泽  wrote:

> Yeh, There is bug which would use huge memory. It be triggered when osd
> down or add into cluster and do recovery/backfilling.
> 
> The patch https://github.com/ceph/ceph/pull/5656
> https://github.com/ceph/ceph/pull/5451 merged into master would fix it, and
> it would be backport.
> 
> I think ceph v0.93 or newer version maybe hit this bug.
> 
> 2015-09-07 20:42 GMT+08:00 Shinobu Kinjo :
> 
> > How heavy network traffic was?
> >
> > Have you tried to capture that traffic between cluster and public network
> > to see where such a bunch of traffic came from?
> >
> >  Shinobu
> >
> > - Original Message -
> > From: "Jan Schermer" 
> > To: "Mariusz Gronczewski" 
> > Cc: ceph-users@lists.ceph.com
> > Sent: Monday, September 7, 2015 9:17:04 PM
> > Subject: Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant
> >
> > Hmm, even network traffic went up.
> > Nothing in logs on the mons which started 9/4 ~6 AM?
> >
> > Jan
> >
> > > On 07 Sep 2015, at 14:11, Mariusz Gronczewski <
> > mariusz.gronczew...@efigence.com> wrote:
> > >
> > > On Mon, 7 Sep 2015 13:44:55 +0200, Jan Schermer  wrote:
> > >
> > >> Maybe some configuration change occured that now takes effect when you
> > start the OSD?
> > >> Not sure what could affect memory usage though - some ulimit values
> > maybe (stack size), number of OSD threads (compare the number from this OSD
> > to the rest of OSDs), fd cache size. Look in /proc and compare everything.
> > >> Also look in "ceph osd tree" - didn't someone touch it while you were
> > gone?
> > >>
> > >> Jan
> > >>
> > >
> > >> number of OSD threads (compare the number from this OSD to the rest of
> > > OSDs),
> > >
> > > it occured on all OSDs, and it looked like that
> > > http://imgur.com/IIMIyRG
> > >
> > > sadly I was on vacation so I didnt manage to catch it before ;/ but I'm
> > > sure there was no config change
> > >
> > >
> > >>> On 07 Sep 2015, at 13:40, Mariusz Gronczewski <
> > mariusz.gronczew...@efigence.com> wrote:
> > >>>
> > >>> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer 
> > wrote:
> > >>>
> >  Apart from bug causing this, this could be caused by failure of other
> > OSDs (even temporary) that starts backfills.
> > 
> >  1) something fails
> >  2) some PGs move to this OSD
> >  3) this OSD has to allocate memory for all the PGs
> >  4) whatever fails gets back up
> >  5) the memory is never released.
> > 
> >  A similiar scenario is possible if for example someone confuses "ceph
> > osd crush reweight" with "ceph osd reweight" (yes, this happened to me :-)).
> > 
> >  Did you try just restarting the OSD before you upgraded it?
> > >>>
> > >>> stopped, upgraded, started. it helped a bit ( <3GB per OSD) but beside
> > >>> that nothing changed. I've tried to wait till it stops eating CPU then
> > >>> restart it but it still eats >2GB of memory which means I can't start
> > >>> all 4 OSDs at same time ;/
> > >>>
> > >>> I've also added noin,nobackfill,norecover flags but that didnt help
> > >>>
> > >>> it is suprising for me because before all 4 OSDs total ate less than
> > >>> 2GBs of memory so I though I have enough headroom, and we did restart
> > >>> machines and removed/added os to test if recovery/rebalance goes fine
> > >>>
> > >>> it also does not have any external traffic at the moment
> > >>>
> > >>>
> > > On 07 Sep 2015, at 12:58, Mariusz Gronczewski <
> > mariusz.gronczew...@efigence.com> wrote:
> > >
> > > Hi,
> > >
> > > over a weekend (was on vacation so I didnt get exactly what happened)
> > > our OSDs started eating in excess of 6GB of RAM (well RSS), which
> > was a
> > > problem considering

Re: [ceph-users] osd daemon cpu threads

2015-09-08 Thread Gurvinder Singh

Thanks Jan for the reply.  It's good to know that Ceph can use extra cpus
for throughput. I am wondering if any one in the community has
used/experimented with Arm v8 2.5 GHz prosessors instead of Intel E5.
On Sep 8, 2015 12:28 PM, "Jan Schermer"  wrote:

> In terms of throughput yes - one OSD may have thousands of threads doing
> work so it will scale accross multiple clients.
> But in terms of latency you are still limited by a throughput of one core,
> so for database workloads or any type of synchronous or single-threaded IO
> more cores will be of no help.
>
> Jan
>
> > On 08 Sep 2015, at 10:50, Gurvinder Singh <
> gurvindersinghdah...@gmail.com> wrote:
> >
> > Hi,
> >
> > Just wondering if a Ceph OSD daemon supports multi threading and can get
> > benefit from multi core Intel/ARM processor. E.g. 12 disk server with 36
> > Intel or 48 ARM cores.
> >
> > Thanks,
> > Gurvinder
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-08 Thread Jan Schermer

YMMV, same story like SSD selection.
Intels have their own problems :-)

Jan

> On 08 Sep 2015, at 12:09, Mariusz Gronczewski 
>  wrote:
> 
> For those interested:
> 
> Bug that caused ceph to go haywire was a emulex nic driver dropping
> packets when making more than few hundred megabits (basically linear
> change compared to load) which caused osds to flap constantly once
> something gone wrong (high traffic, osd go down, ceph starts to
> reallocationg stuff, which causes more traffic, more osds flap, etc)
> 
> upgrading kernel to 4.1.6 (was present at least in 4.0.1, and in c6
> "distro" kernel) fixed that and it started to rebuild correctly
> 
> Lessons learned, buy Intel NICs...
> 
> On Mon, 7 Sep 2015 20:51:57 +0800, 池信泽  wrote:
> 
>> Yeh, There is bug which would use huge memory. It be triggered when osd
>> down or add into cluster and do recovery/backfilling.
>> 
>> The patch https://github.com/ceph/ceph/pull/5656
>> https://github.com/ceph/ceph/pull/5451 merged into master would fix it, and
>> it would be backport.
>> 
>> I think ceph v0.93 or newer version maybe hit this bug.
>> 
>> 2015-09-07 20:42 GMT+08:00 Shinobu Kinjo :
>> 
>>> How heavy network traffic was?
>>> 
>>> Have you tried to capture that traffic between cluster and public network
>>> to see where such a bunch of traffic came from?
>>> 
>>> Shinobu
>>> 
>>> - Original Message -
>>> From: "Jan Schermer" 
>>> To: "Mariusz Gronczewski" 
>>> Cc: ceph-users@lists.ceph.com
>>> Sent: Monday, September 7, 2015 9:17:04 PM
>>> Subject: Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant
>>> 
>>> Hmm, even network traffic went up.
>>> Nothing in logs on the mons which started 9/4 ~6 AM?
>>> 
>>> Jan
>>> 
 On 07 Sep 2015, at 14:11, Mariusz Gronczewski <
>>> mariusz.gronczew...@efigence.com> wrote:
 
 On Mon, 7 Sep 2015 13:44:55 +0200, Jan Schermer  wrote:
 
> Maybe some configuration change occured that now takes effect when you
>>> start the OSD?
> Not sure what could affect memory usage though - some ulimit values
>>> maybe (stack size), number of OSD threads (compare the number from this OSD
>>> to the rest of OSDs), fd cache size. Look in /proc and compare everything.
> Also look in "ceph osd tree" - didn't someone touch it while you were
>>> gone?
> 
> Jan
> 
 
> number of OSD threads (compare the number from this OSD to the rest of
 OSDs),
 
 it occured on all OSDs, and it looked like that
 http://imgur.com/IIMIyRG
 
 sadly I was on vacation so I didnt manage to catch it before ;/ but I'm
 sure there was no config change
 
 
>> On 07 Sep 2015, at 13:40, Mariusz Gronczewski <
>>> mariusz.gronczew...@efigence.com> wrote:
>> 
>> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer 
>>> wrote:
>> 
>>> Apart from bug causing this, this could be caused by failure of other
>>> OSDs (even temporary) that starts backfills.
>>> 
>>> 1) something fails
>>> 2) some PGs move to this OSD
>>> 3) this OSD has to allocate memory for all the PGs
>>> 4) whatever fails gets back up
>>> 5) the memory is never released.
>>> 
>>> A similiar scenario is possible if for example someone confuses "ceph
>>> osd crush reweight" with "ceph osd reweight" (yes, this happened to me :-)).
>>> 
>>> Did you try just restarting the OSD before you upgraded it?
>> 
>> stopped, upgraded, started. it helped a bit ( <3GB per OSD) but beside
>> that nothing changed. I've tried to wait till it stops eating CPU then
>> restart it but it still eats >2GB of memory which means I can't start
>> all 4 OSDs at same time ;/
>> 
>> I've also added noin,nobackfill,norecover flags but that didnt help
>> 
>> it is suprising for me because before all 4 OSDs total ate less than
>> 2GBs of memory so I though I have enough headroom, and we did restart
>> machines and removed/added os to test if recovery/rebalance goes fine
>> 
>> it also does not have any external traffic at the moment
>> 
>> 
 On 07 Sep 2015, at 12:58, Mariusz Gronczewski <
>>> mariusz.gronczew...@efigence.com> wrote:
 
 Hi,
 
 over a weekend (was on vacation so I didnt get exactly what happened)
 our OSDs started eating in excess of 6GB of RAM (well RSS), which
>>> was a
 problem considering that we had only 8GB of ram for 4 OSDs (about 700
 pgs per osd and about 70GB space used. So spam of coredumps and OOMs
 blocked the osds down to unusabiltity.
 
 I then upgraded one of OSDs to hammer which made it a bit better
>>> (~2GB
 per osd) but still much higher usage than before.
 
 any ideas what would be a reason for that ? logs are mostly full on

[ceph-users] ceph-users test

2015-09-08 Thread Shikejun

ceph-users test
-
本邮件及其附件含有杭州华三通信技术有限公司的保密信息，仅限于发送给上面地址中列出
的个人或群组。禁止任何其他人以任何形式使用（包括但不限于全部或部分地泄露、复制、
或散发）本邮件中的信息。如果您错收了本邮件，请您立即电话或邮件通知发件人并删除本
邮件！
This e-mail and its attachments contain confidential information from H3C, 
which is
intended only for the person or entity whose address is listed above. Any use 
of the
information contained herein in any way (including, but not limited to, total 
or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify 
the sender
by phone or email immediately and delete it!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] ceph-users test

2015-09-08 Thread Shikejun

[ceph-users]ceph-users test
-
本邮件及其附件含有杭州华三通信技术有限公司的保密信息，仅限于发送给上面地址中列出
的个人或群组。禁止任何其他人以任何形式使用（包括但不限于全部或部分地泄露、复制、
或散发）本邮件中的信息。如果您错收了本邮件，请您立即电话或邮件通知发件人并删除本
邮件！
This e-mail and its attachments contain confidential information from H3C, 
which is
intended only for the person or entity whose address is listed above. Any use 
of the
information contained herein in any way (including, but not limited to, total 
or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify 
the sender
by phone or email immediately and delete it!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] qemu jemalloc support soon in master (applied in paolo upstream branch)

That would be my life saver.
Thanks a lot!

> you simply need to compile qemu with --enable-jemalloc, to enable jemmaloc 
> support.

- Original Message -
From: "Alexandre DERUMIER" 
To: "ceph-users" , "ceph-devel" 

Sent: Tuesday, September 8, 2015 7:58:15 PM
Subject: [ceph-users] qemu jemalloc support soon in master (applied in paolo 
upstream branch)

Hi,

Paolo Bonzini from qemu team has finally applied my qemu jemalloc patch 
in his for-upstream branch

https://github.com/bonzini/qemu/releases/tag/for-upstream
https://github.com/bonzini/qemu/tree/for-upstream

So,It'll be in qemu master soon and ready for qemu 2.5


I have write some small benchmark results with librbd in the commit itself
https://github.com/bonzini/qemu/commit/efc1e9f08020cd460eb2204e3092b39e408e1fa9


you simply need to compile qemu with --enable-jemalloc, to enable jemmaloc 
support.

Regards,

Alexandre
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

> eat between 2 and 6 GB RAM

That is a bit huge difference, I think.

- Original Message -
From: "Mariusz Gronczewski" 
To: "Jan Schermer" 
Cc: ceph-users@lists.ceph.com
Sent: Tuesday, September 8, 2015 8:17:43 PM
Subject: Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

the worst thing that cluster was running (on light load tho) for about
6 months now and I already flashed firmware to those cards which made
problem "disappear" for small loads, so I wasnt even expecting problem
in that place. Sadly OSDs still eat between 2 and 6 GB RAM each but I
hope that will stop once recovery finishes.

On Tue, 8 Sep 2015 12:31:03 +0200, Jan Schermer
 wrote:

> YMMV, same story like SSD selection.
> Intels have their own problems :-)
> 
> Jan
> 
> > On 08 Sep 2015, at 12:09, Mariusz Gronczewski 
> >  wrote:
> > 
> > For those interested:
> > 
> > Bug that caused ceph to go haywire was a emulex nic driver dropping
> > packets when making more than few hundred megabits (basically linear
> > change compared to load) which caused osds to flap constantly once
> > something gone wrong (high traffic, osd go down, ceph starts to
> > reallocationg stuff, which causes more traffic, more osds flap, etc)
> > 
> > upgrading kernel to 4.1.6 (was present at least in 4.0.1, and in c6
> > "distro" kernel) fixed that and it started to rebuild correctly
> > 
> > Lessons learned, buy Intel NICs...
> > 
> > On Mon, 7 Sep 2015 20:51:57 +0800, 池信泽  wrote:
> > 
> >> Yeh, There is bug which would use huge memory. It be triggered when osd
> >> down or add into cluster and do recovery/backfilling.
> >> 
> >> The patch https://github.com/ceph/ceph/pull/5656
> >> https://github.com/ceph/ceph/pull/5451 merged into master would fix it, and
> >> it would be backport.
> >> 
> >> I think ceph v0.93 or newer version maybe hit this bug.
> >> 
> >> 2015-09-07 20:42 GMT+08:00 Shinobu Kinjo :
> >> 
> >>> How heavy network traffic was?
> >>> 
> >>> Have you tried to capture that traffic between cluster and public network
> >>> to see where such a bunch of traffic came from?
> >>> 
> >>> Shinobu
> >>> 
> >>> - Original Message -
> >>> From: "Jan Schermer" 
> >>> To: "Mariusz Gronczewski" 
> >>> Cc: ceph-users@lists.ceph.com
> >>> Sent: Monday, September 7, 2015 9:17:04 PM
> >>> Subject: Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant
> >>> 
> >>> Hmm, even network traffic went up.
> >>> Nothing in logs on the mons which started 9/4 ~6 AM?
> >>> 
> >>> Jan
> >>> 
>  On 07 Sep 2015, at 14:11, Mariusz Gronczewski <
> >>> mariusz.gronczew...@efigence.com> wrote:
>  
>  On Mon, 7 Sep 2015 13:44:55 +0200, Jan Schermer  wrote:
>  
> > Maybe some configuration change occured that now takes effect when you
> >>> start the OSD?
> > Not sure what could affect memory usage though - some ulimit values
> >>> maybe (stack size), number of OSD threads (compare the number from this 
> >>> OSD
> >>> to the rest of OSDs), fd cache size. Look in /proc and compare everything.
> > Also look in "ceph osd tree" - didn't someone touch it while you were
> >>> gone?
> > 
> > Jan
> > 
>  
> > number of OSD threads (compare the number from this OSD to the rest of
>  OSDs),
>  
>  it occured on all OSDs, and it looked like that
>  http://imgur.com/IIMIyRG
>  
>  sadly I was on vacation so I didnt manage to catch it before ;/ but I'm
>  sure there was no config change
>  
>  
> >> On 07 Sep 2015, at 13:40, Mariusz Gronczewski <
> >>> mariusz.gronczew...@efigence.com> wrote:
> >> 
> >> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer 
> >>> wrote:
> >> 
> >>> Apart from bug causing this, this could be caused by failure of other
> >>> OSDs (even temporary) that starts backfills.
> >>> 
> >>> 1) something fails
> >>> 2) some PGs move to this OSD
> >>> 3) this OSD has to allocate memory for all the PGs
> >>> 4) whatever fails gets back up
> >>> 5) the memory is never released.
> >>> 
> >>> A similiar scenario is possible if for example someone confuses "ceph
> >>> osd crush reweight" with "ceph osd reweight" (yes, this happened to me 
> >>> :-)).
> >>> 
> >>> Did you try just restarting the OSD before you upgraded it?
> >> 
> >> stopped, upgraded, started. it helped a bit ( <3GB per OSD) but beside
> >> that nothing changed. I've tried to wait till it stops eating CPU then
> >> restart it but it still eats >2GB of memory which means I can't start
> >> all 4 OSDs at same time ;/
> >> 
> >> I've also added noin,nobackfill,norecover flags but that didnt help
> >> 
> >> it is suprising for me because before all 4 OSDs total ate less than
> >>

Re: [ceph-users] osd daemon cpu threads

2015-09-08 Thread Jan Schermer

In terms of throughput yes - one OSD may have thousands of threads doing work 
so it will scale accross multiple clients.
But in terms of latency you are still limited by a throughput of one core, so 
for database workloads or any type of synchronous or single-threaded IO more 
cores will be of no help.

Jan

> On 08 Sep 2015, at 10:50, Gurvinder Singh  
> wrote:
> 
> Hi,
> 
> Just wondering if a Ceph OSD daemon supports multi threading and can get
> benefit from multi core Intel/ARM processor. E.g. 12 disk server with 36
> Intel or 48 ARM cores.
> 
> Thanks,
> Gurvinder
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Ceph Tuning + KV backend

2015-09-08 Thread Niels Jakob Darger


Hello,

Excuse my ignorance, I have just joined this list and started using Ceph 
(which looks very cool). On AWS I have set up a 5-way Ceph cluster (4 
vCPUs, 32G RAM, dedicated SSDs for system, osd and journal) with the 
Object Gateway. For the purpose of simplicity of the test all the nodes 
are identical and each node contains osd, mon and the radosgw.


I have run parallel inserts from all 5 nodes, I can insert about 
10-12000 objects per minute. The insert rate is relatively constant 
regardless of whether I run 1 insert process per node or 5, i.e. a total 
of 5 or 25.


These are just numbers, of course, and not meaningful without more 
context. But looking at the nodes I think the cluster could run faster - 
the CPUs are not doing much, there isn't much I/O wait - only about 50% 
utilisation and only on the SSDs storing the journals on two of the 
nodes (I've set the replication to 2), the other file systems are almost 
idle. The network is far from maxed out and the processes are not using 
much memory. I've tried increasing osd_op_threads to 5 or 10 but that 
didn't make much difference.


The co-location of all the daemons on all the nodes may not be ideal, 
but since there isn't much resource use or contention I don't think 
that's the problem.


So two questions:

1) Are there any good resources on tuning Ceph? There's quite a few 
posts out there testing and timing specific setups with RAID controller 
X and 12 disks of brand Y etc. but I'm more looking for general tuning 
guidelines - explaining the big picture.


2) What's the status of the keyvalue backend? The documentation on 
http://ceph.com/docs/master/rados/configuration/keyvaluestore-config-ref/ looks 
nice but I found it difficult to work out how to switch to the keyvalue 
backend, the Internet suggests "osd objectstore = keyvaluestore-dev", 
but that didn't seem to work so I checked out the source code and it 
looks like "osd objectstore = keyvaluestore" does it. However, it 
results in nasty things in the log file ("*** experimental feature 
'keyvaluestore' is not enabled *** This feature is marked as 
experimental ...") so perhaps it's too early to use the KV backend for 
production use?


Thanks & regards,
Jakob
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] jemalloc and transparent hugepage

Hi,
I have found an interesting article about jemalloc and transparent hugepages

https://www.digitalocean.com/company/blog/transparent-huge-pages-and-alternative-memory-allocators/


Could be great to see if disable transparent hugepage help to have lower 
jemalloc memory usage.


Regards,

Alexandre

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] ensuring write activity is finished

2015-09-08 Thread Deneau, Tom

When measuring read bandwidth using rados bench, I've been doing the
following:
   * write some objects using rados bench write --no-cleanup
   * drop caches on the osd nodes
   * use rados bench seq to read.

I've noticed that on the first rados bench seq immediately following the rados 
bench write,
there is often activity on the journal partitions which must be a carry over 
from the rados
bench write.

What is the preferred way to ensure that all write activity is finished before 
starting
to use rados bench seq?

-- Tom Deneau

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] OSD respawning -- FAILED assert(clone_size.count(clone))

2015-09-08 Thread David Zafman



Chris,

I was wondering if you still had /tmp/snap.out laying around, could you 
send it to me?   The way the dump to json code works if the "clones" is 
empty it doesn't show me what two other structures look like.


David

On 9/5/15 3:24 PM, Chris Taylor wrote:

# ceph-dencoder type SnapSet import /tmp/snap.out decode dump_json
{
"snap_context": {
"seq": 9197,
"snaps": [
9197
]
},
"head_exists": 1,
"clones": []
}


On 09/03/2015 04:48 PM, David Zafman wrote:


If you have ceph-dencoder installed or can build v0.94.3 to build the 
binary, you can dump the SnapSet for the problem object. Once you 
understand the removal procedure you could do the following to get a 
look at the SnapSet information.


Find the object from --op list with snapid -2 and cut and paste that 
json into the following command


Something like:
$ ceph-objectstore-tool --data-path xx --journal-path xx 
'["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 
' get-attr snapset > /tmp/snap.out


$ ceph-dencoder type SnapSet import /tmp/snap.out decode dump_json


{
"snap_context": {
"seq": 4,
"snaps": [
4,
3,
2,
1
]
},
"head_exists": 1,
"clones": [
{
"snap": 1,
"size": 1032,
"overlap": "[]"
},
{
"snap": 2,
"size": 452,
"overlap": "[]"
},
{
"snap": 3,
"size": 452,
"overlap": "[]"
},
{
"snap": 4,
"size": 452,
"overlap": "[]"
}
]
}

On 9/3/15 2:44 PM, David Zafman wrote:


Chris,

WARNING: Do this at your own risk.  You are deleting one of the 
snapshots of a specific portion of an rbd image.  I'm not sure how 
rbd will react.  Maybe you should repair the SnapSet instead of 
remove the inconsistency.   However, as far as I know there isn't a 
tool to it.


If you are able to build from Ceph source, I happen to have an 
enhancement to ceph-objectstore-tool to output the SnapSet.


---

The message preceding the assert is in the same thread so " 
rb.0.8c2990.238e1f29.8cc0/23ed//3" has the object name in 
it.  The 23ed is the RADOS clone/snap ID.


First, get a backup by export the pg using the 
ceph-objectstore-tool.  Specify a --file somewhere with enough of 
disk space.


$ ceph-objectstore-tool --data-path xx --journal-path xx 
--op export --pgid 3.f9 --file destination

Exporting 3.f9

Read 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

Export successful

Now you need the JSON of the object in question.  The 3rd line of 
output has the snapid 9197 which is 23ed in decimal.


$ ceph-objectstore-tool --data-path xx --journal-path xx 
--op list rb.0.8c2990.238e1f29.8cc0



["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9196,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid",9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9198,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 



To remove it, cut and paste your output line with snapid 9197 inside 
single quotes like this:


$ ceph-objectstore-tool --data-path xx --journal-path xx 
'["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]' 
remove



remove 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

To get all the OSDs to boot you'll have to do the remove on all OSDs 
that contain this PG and have an entry with snapid 9197 for this 
object.


David

On 9/3/15 11:29 AM, Chris Taylor wrote:

On 09/03/2015 10:20 AM, David Zafman wrote:


This crash is what happens if a clone is missing from SnapSet 
(internal data) for an object in the ObjectStore. If you had out 
of space issues, this could possibly have been caused by being 
able to rename or create files in a directory, but not being able 
to update SnapSet.


I've completely rewritten that logic so scrub doesn't crash, but 
it hasn't been in a release yet.  In the future scrub will just 
report an unexpected clone in the ObjectStore.


You'll need to find and remove the extraneous clone. Bump the 
"debug osd" to 20 so that you'll get the name of the object in the 
log.  Start an OSD and after it crashes examine the log. Then 
remove the extraneous object using ceph-objectstore-tool. You'll 
have to repeat this process if there are more of these.


David


I looked for an example of how to use the ceph-objectstore-tool 
aside from what was provided with "-h". I really don't

Re: [ceph-users] OSD respawning -- FAILED assert(clone_size.count(clone))

2015-09-08 Thread Chris Taylor


Attached is the snap.out

On 09/08/2015 01:47 PM, David Zafman wrote:


Chris,

I was wondering if you still had /tmp/snap.out laying around, could 
you send it to me?   The way the dump to json code works if the 
"clones" is empty it doesn't show me what two other structures look like.


David

On 9/5/15 3:24 PM, Chris Taylor wrote:

# ceph-dencoder type SnapSet import /tmp/snap.out decode dump_json
{
"snap_context": {
"seq": 9197,
"snaps": [
9197
]
},
"head_exists": 1,
"clones": []
}


On 09/03/2015 04:48 PM, David Zafman wrote:


If you have ceph-dencoder installed or can build v0.94.3 to build 
the binary, you can dump the SnapSet for the problem object. Once 
you understand the removal procedure you could do the following to 
get a look at the SnapSet information.


Find the object from --op list with snapid -2 and cut and paste that 
json into the following command


Something like:
$ ceph-objectstore-tool --data-path xx --journal-path xx 
'["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 
' get-attr snapset > /tmp/snap.out


$ ceph-dencoder type SnapSet import /tmp/snap.out decode dump_json


{
"snap_context": {
"seq": 4,
"snaps": [
4,
3,
2,
1
]
},
"head_exists": 1,
"clones": [
{
"snap": 1,
"size": 1032,
"overlap": "[]"
},
{
"snap": 2,
"size": 452,
"overlap": "[]"
},
{
"snap": 3,
"size": 452,
"overlap": "[]"
},
{
"snap": 4,
"size": 452,
"overlap": "[]"
}
]
}

On 9/3/15 2:44 PM, David Zafman wrote:


Chris,

WARNING: Do this at your own risk.  You are deleting one of the 
snapshots of a specific portion of an rbd image.  I'm not sure how 
rbd will react.  Maybe you should repair the SnapSet instead of 
remove the inconsistency.   However, as far as I know there isn't a 
tool to it.


If you are able to build from Ceph source, I happen to have an 
enhancement to ceph-objectstore-tool to output the SnapSet.


---

The message preceding the assert is in the same thread so " 
rb.0.8c2990.238e1f29.8cc0/23ed//3" has the object name in 
it.  The 23ed is the RADOS clone/snap ID.


First, get a backup by export the pg using the 
ceph-objectstore-tool.  Specify a --file somewhere with enough of 
disk space.


$ ceph-objectstore-tool --data-path xx --journal-path xx 
--op export --pgid 3.f9 --file destination

Exporting 3.f9

Read 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

Export successful

Now you need the JSON of the object in question.  The 3rd line of 
output has the snapid 9197 which is 23ed in decimal.


$ ceph-objectstore-tool --data-path xx --journal-path xx 
--op list rb.0.8c2990.238e1f29.8cc0



["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9196,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid",9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9198,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 



To remove it, cut and paste your output line with snapid 9197 
inside single quotes like this:


$ ceph-objectstore-tool --data-path xx --journal-path xx 
'["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]' 
remove



remove 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

To get all the OSDs to boot you'll have to do the remove on all 
OSDs that contain this PG and have an entry with snapid 9197 for 
this object.


David

On 9/3/15 11:29 AM, Chris Taylor wrote:

On 09/03/2015 10:20 AM, David Zafman wrote:


This crash is what happens if a clone is missing from SnapSet 
(internal data) for an object in the ObjectStore. If you had out 
of space issues, this could possibly have been caused by being 
able to rename or create files in a directory, but not being able 
to update SnapSet.


I've completely rewritten that logic so scrub doesn't crash, but 
it hasn't been in a release yet.  In the future scrub will just 
report an unexpected clone in the ObjectStore.


You'll need to find and remove the extraneous clone. Bump the 
"debug osd" to 20 so that you'll get the name of the object in 
the log.  Start an OSD and after it crashes examine the log. Then 
remove the extraneous object using ceph-objectstore-tool. You'll 
have to repeat this process if there are more of these.


David


I looked for an example of how to use the

[ceph-users] OSD crash

2015-09-08 Thread Alex Gorbachev

Hello,

We have run into an OSD crash this weekend with the following dump.  Please
advise what this could be.

Best regards,
Alex


2015-09-07 14:55:01.345638 7fae6c158700  0 -- 10.80.4.25:6830/2003934 >>
10.80.4.15:6813/5003974 pipe(0x1dd73000 sd=257 :6830 s=2 pgs=14271 cs=251
l=0 c=0x10d34580).fault with nothing to send, going to standby
2015-09-07 14:56:16.948998 7fae643e8700 -1 *** Caught signal (Segmentation
fault) **
 in thread 7fae643e8700

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: /usr/bin/ceph-osd() [0xacb3ba]
 2: (()+0x10340) [0x7faea044e340]
 3:
(tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
unsigned long, int)+0x103) [0x7faea067fac3]
 4: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*,
unsigned long)+0x1b) [0x7faea067fb7b]
 5: (operator delete(void*)+0x1f8) [0x7faea068ef68]
 6: (std::_Rb_tree >, std::_Select1st > >, std::less,
std::allocator > > >::_M_erase(std::_Rb_tree_node > >*)+0x58) [0xca2438]
 7: (std::_Rb_tree >, std::_Select1st > >, std::less,
std::allocator > > >::erase(int const&)+0xdf) [0xca252f]
 8: (Pipe::writer()+0x93c) [0xca097c]
 9: (Pipe::Writer::entry()+0xd) [0xca40dd]
 10: (()+0x8182) [0x7faea0446182]
 11: (clone()+0x6d) [0x7fae9e9b100d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

--- begin dump of recent events ---
-1> 2015-08-20 05:32:32.454940 7fae8e897700  0 --
10.80.4.25:6830/2003934 >> 10.80.4.15:6806/4003754 pipe(0x1992d000 sd=142
:6830 s=0 pgs=0 cs=0 l=0 c=0x12bf5700).accept connect_seq 816 vs existing
815 state standby
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-08 Thread Chad William Seys

Does 'ceph tell osd.* heap release' help with OSD RAM usage?

From
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-August/003932.html

Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

That's a good news.

Shinobu

- Original Message -
From: "Sage Weil" 
To: "Andras Pataki" 
Cc: ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
Sent: Wednesday, September 9, 2015 3:07:29 AM
Subject: Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

On Tue, 8 Sep 2015, Andras Pataki wrote:
> Hi Sam,
> 
> I saw that ceph 0.94.3 is out and it contains a resolution to the issue below 
> (http://tracker.ceph.com/issues/12577).  I installed it on our cluster, but 
> unfortunately it didn't resolve the issue.  Same as before, I have a couple 
> of inconsistent pg's, and run ceph pg repair on them - the OSD says:
> 
> 2015-09-08 11:21:53.930324 7f49c17ea700  0 log_channel(cluster) log [INF] : 
> 2.439 repair starts
> 2015-09-08 11:27:57.708394 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> repair 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
> 0xb3d78a6e != 0xa3944ad0
> 2015-09-08 11:28:32.359938 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> 2.439 repair 1 errors, 0 fixed
> 2015-09-08 11:28:32.364506 7f49c17ea700  0 log_channel(cluster) log [INF] : 
> 2.439 deep-scrub starts
> 2015-09-08 11:29:18.650876 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
> 0xb3d78a6e != 0xa3944ad0
> 2015-09-08 11:29:23.136109 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> 2.439 deep-scrub 1 errors
> 
> $ ceph tell osd.* version | grep version | sort | uniq -c
>  94 "version": "ceph version 0.94.3 
> (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)"
> 
> Could you have another look?

The fix was merged into master in 
6a949e10198a1787f2008b6c537b7060d191d236, after v0.94.3 was released.  It 
will be in v0.94.4.

Note that we had a bunch of similar errors on our internal lab cluster and 
this resolved them.  We installed the test build from gitbuilder, 
available at 
http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/hammer/ (or 
similar, adjust URL for your distro).

sage


> 
> Thanks,
> 
> Andras
> 
> 
> 
> From: Andras Pataki
> Sent: Monday, August 3, 2015 4:09 PM
> To: Samuel Just
> Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
> Subject: Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix
> 
> Done: http://tracker.ceph.com/issues/12577
> BTW, I¹m using the latest release 0.94.2 on all machines.
> 
> Andras
> 
> 
> On 8/3/15, 3:38 PM, "Samuel Just"  wrote:
> 
> >Hrm, that's certainly supposed to work.  Can you make a bug?  Be sure
> >to note what version you are running (output of ceph-osd -v).
> >-Sam
> >
> >On Mon, Aug 3, 2015 at 12:34 PM, Andras Pataki
> > wrote:
> >> Summary: I am having problems with inconsistent PG's that the 'ceph pg
> >> repair' command does not fix.  Below are the details.  Any help would be
> >> appreciated.
> >>
> >> # Find the inconsistent PG's
> >> ~# ceph pg dump | grep inconsistent
> >> dumped all in format plain
> >> 2.439 42080 00 017279507143 31033103 active+clean+inconsistent2015-08-03
> >> 14:49:17.29288477323'2250145 77480:890566 [78,54]78 [78,54]78
> >> 77323'22501452015-08-03 14:49:17.29253877323'2250145 2015-08-03
> >> 14:49:17.292538
> >> 2.8b9 40830 00 016669590823 30513051 active+clean+inconsistent2015-08-03
> >> 14:46:05.14006377323'2249886 77473:897325 [7,72]7 [7,72]7
> >> 77323'22498862015-08-03 14:22:47.83406377323'2249886 2015-08-03
> >> 14:22:47.834063
> >>
> >> # Look at the first one:
> >> ~# ceph pg deep-scrub 2.439
> >> instructing pg 2.439 on osd.78 to deep-scrub
> >>
> >> # The logs of osd.78 show:
> >> 2015-08-03 15:16:34.409738 7f09ec04a700  0 log_channel(cluster) log
> >>[INF] :
> >> 2.439 deep-scrub starts
> >> 2015-08-03 15:16:51.364229 7f09ec04a700 -1 log_channel(cluster) log
> >>[ERR] :
> >> deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data
> >>digest
> >> 0xb3d78a6e != 0xa3944ad0
> >> 2015-08-03 15:16:52.763977 7f09ec04a700 -1 log_channel(cluster) log
> >>[ERR] :
> >> 2.439 deep-scrub 1 errors
> >>
> >> # Finding the object in question:
> >> ~# find ~ceph/osd/ceph-78/current/2.439_head -name
> >>1022d93.0f0c* -ls
> >> 21510412310 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
> >>
> >>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0f0c__head_B029E439__2
> >> ~# md5sum
> >>
> >>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0f0c__head_B029E439__2
> >> 4e4523244deec051cfe53dd48489a5db
> >>
> >>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0f0c__head_B029E439__2
> >>
> >> # The object on the backup osd:
> >> ~# find ~ceph/osd/ceph-54/current/2.439_head -name
> >>1022d93.0f0c* -ls
> >> 6442614367 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
> >>
>

Re: [ceph-users] Cannot add/create new monitor on ceph v0.94.3

2015-09-08 Thread Chang, Fangzhe (Fangzhe)

Thanks for the answer.

NTP is running on both the existing monitor and the new monitor being installed.
I did run ceph-deploy in the same directory as I created the cluster. However, 
I need to tweak the options supplied to ceph-deploy a little bit since I was 
running it behind a corporate firewall.

I noticed the ceph-create-keys process is running on the background. When I ran 
it manually, I got the following results.

$ python /usr/sbin/ceph-create-keys --cluster ceph -i 
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'


-Original Message-
From: Brad Hubbard [mailto:bhubb...@redhat.com] 
Sent: Sunday, September 06, 2015 11:58 PM
To: Chang, Fangzhe (Fangzhe)
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cannot add/create new monitor on ceph v0.94.3

- Original Message -
> From: "Fangzhe Chang (Fangzhe)" 
> To: ceph-users@lists.ceph.com
> Sent: Saturday, 5 September, 2015 6:26:16 AM
> Subject: [ceph-users] Cannot add/create new monitor on ceph v0.94.3
> 
> 
> 
> Hi,
> 
> I’m trying to add a second monitor using ‘ceph-deploy mon new  hostname>’. However, the log file shows the following error:
> 
> 2015-09-04 16:13:54.863479 7f4cbc3f7700 0 cephx: verify_reply couldn't 
> decrypt with error: error decoding block for decryption
> 
> 2015-09-04 16:13:54.863491 7f4cbc3f7700 0 -- :6789/0 
> >> :6789/0 pipe(0x413 sd=12 :57954 s=1 pgs=0 
> cs=0 l=0 c=0x3f29600).failed verifying authorize reply

A couple of things to look at are verifying all your clocks are in sync (ntp 
helps here) and making sure you are running ceph-deploy in the directory you 
used to create the cluster.

> 
> 
> 
> Does anyone know how to resolve this?
> 
> Thanks
> 
> 
> 
> Fangzhe
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

Have you ever?

http://ceph.com/docs/master/rados/troubleshooting/memory-profiling/

Shinobu

- Original Message -
From: "Chad William Seys" 
To: "Mariusz Gronczewski" , "Shinobu Kinjo" 
, ceph-users@lists.ceph.com
Sent: Wednesday, September 9, 2015 6:14:15 AM
Subject: Re: Huge memory usage spike in OSD on hammer/giant

Does 'ceph tell osd.* heap release' help with OSD RAM usage?

From
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-August/003932.html

Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] how to improve ceph cluster capacity usage

On Tue, Sep 1, 2015 at 3:58 PM, huang jun  wrote:
> hi,all
>
> Recently, i did some experiments on OSD data distribution,
> we set up a cluster with 72 OSDs,all 2TB sata disk,
> and ceph version is v0.94.3 and linux kernel version is 3.18,
> and set "ceph osd crush tunables optimal".
> There are 3 pools:
> pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 512 pgp_num 512 last_change 302 stripe_width 0
> pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 4096 pgp_num 4096 last_change 832
> crash_replay_interval 45 stripe_width 0
> pool 2 'metadata' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 512 pgp_num 512 last_change 302
> stripe_width 0
>
> the osd pg num of each osd:
> pool  : 0  1  2  | SUM
> 
> osd.0   13 10518 | 136
> osd.1   17 11026 | 153
> osd.2   15 11420 | 149
> osd.3   11 10117 | 129
> osd.4   8  10617 | 131
> osd.5   12 10219 | 133
> osd.6   19 11429 | 162
> osd.7   16 11521 | 152
> osd.8   15 11725 | 157
> osd.9   13 11723 | 153
> osd.10  13 13316 | 162
> osd.11  14 10521 | 140
> osd.12  11 94 16 | 121
> osd.13  12 11021 | 143
> osd.14  20 11926 | 165
> osd.15  12 12519 | 156
> osd.16  15 12622 | 163
> osd.17  13 10919 | 141
> osd.18  8  11919 | 146
> osd.19  14 11419 | 147
> osd.20  17 11329 | 159
> osd.21  17 11127 | 155
> osd.22  13 12120 | 154
> osd.23  14 95 23 | 132
> osd.24  17 11026 | 153
> osd.25  13 13315 | 161
> osd.26  17 12424 | 165
> osd.27  16 11920 | 155
> osd.28  19 13430 | 183
> osd.29  13 12120 | 154
> osd.30  11 97 20 | 128
> osd.31  12 10918 | 139
> osd.32  10 11215 | 137
> osd.33  18 11428 | 160
> osd.34  19 11229 | 160
> osd.35  16 12132 | 169
> osd.36  13 11118 | 142
> osd.37  15 10722 | 144
> osd.38  21 12924 | 174
> osd.39  9  12117 | 147
> osd.40  11 10218 | 131
> osd.41  14 10119 | 134
> osd.42  16 11925 | 160
> osd.43  12 11813 | 143
> osd.44  17 11425 | 156
> osd.45  11 11415 | 140
> osd.46  12 10716 | 135
> osd.47  15 11123 | 149
> osd.48  14 11520 | 149
> osd.49  9  94 13 | 116
> osd.50  14 11718 | 149
> osd.51  13 11219 | 144
> osd.52  11 12622 | 159
> osd.53  12 12218 | 152
> osd.54  13 12120 | 154
> osd.55  17 11425 | 156
> osd.56  11 11818 | 147
> osd.57  22 13725 | 184
> osd.58  15 10522 | 142
> osd.59  13 12018 | 151
> osd.60  12 11019 | 141
> osd.61  21 11428 | 163
> osd.62  12 97 18 | 127
> osd.63  19 10931 | 159
> osd.64  10 13221 | 163
> osd.65  19 13721 | 177
> osd.66  22 10732 | 161
> osd.67  12 10720 | 139
> osd.68  14 10022 | 136
> osd.69  16 11024 | 150
> osd.70  9  10114 | 124
> osd.71  15 11224 | 151
>
> 
> SUM   : 1024   8192   1536   |
>
> We can found that, for poolid=1(data pool),
> osd.57 and osd.65 both have 137 PGs but osd.12 and osd.49 only have 94 PGs,
> which maybe cause data distribution imbanlance, and reduces the space
> utilization of the cluster.
>
> Use "crushtool -i crush.raw --test --show-mappings --rule 0 --num-rep
> 2 --min-x 1 --max-x %s"
> we tested different pool pg_num:
>
> Total PG num PG num stats
>  ---
> 4096 avg: 113.78 (avg stands for avg PG num of every osd)
> total: 8192  (total stands for total PG num, include replica PG)
> max: 139 +0.221680 (max stands for max PG num on OSD, +0.221680 stands
> for percent above avage PG num )
> min: 113 -0.226562 (min stands for min PG num on OSD, -0.226562 stands
> for ratio below avage PG num )
>
> 8192 avg: 227.56
> total: 16384
> max: 267 0.173340
> min: 226 -0.129883
>
> 16384 avg: 455.11
> total: 32768
> max: 502 0.103027
> min: 455 -0.127686
>
> 32768 avg: 910.22
> total: 65536
> max: 966 0.061279
> min: 910 -0.076050
>
> With bigger pg_num, the gap between the maximum and the minimum decreased.
> But it's unreasonable to set such large pg_num, which will increase
> OSD and MON load.
>
> Is there any way to get a more balanced PG distribution of the cluster?
> We tried "ceph osd reweight-by-pg 110 data" many times, but that can
> not resolve the problem.

The

Re: [ceph-users] How objects are reshuffled on addition of new OSD

On Tue, Sep 1, 2015 at 2:31 AM, Shesha Sreenivasamurthy  wrote:
> I had a question regarding how OSD locations are determined by CRUSH.
>
> From the CRUSH paper I gather that the replica locations of an object (A) is
> a vector (v) that is got by the function c(r,x) = (hash (x) + rp) mod m).

It is a hash function, but I don't think this is quite right. Objects
are hashed (quickly, using rjenkins or something) into a placement
group. The CRUSH function is then run on that placement group to
assign it to a vector of OSDs; this is pretty configurable and takes a
tree as input (with the choice of straw, list, etc types).

>
> Now when new OSDs are added, objects are shuffled to maintain uniform data
> distribution. What in the above equation changes so that only minimal
> movement is achieved. More specifically, if nothing in the above equation
> changes then all the objects again map to the same locations. If p is
> changed, then lots of object location can be changed. Therefore, how does
> CRUSH guarantees only minimal data movement.

Like I said, that's not the equation. It's more like you have three
doors to choose from at each of three levels, and when you add a new
door somewhere in the tree, you only move a little bit of the data
around.

>
> Followup question is, if there in an ongoing IO to an object, the primary
> replica is the one that will be getting updated. Does the re-shuffling in
> that case do not consider currently hot objects for movement ?

It definitely does not consider heat. Everything is based on the
object names (locators, more specifically, but they're generally the
same). Responsibility for maintaining the IO lives in layers above
CRUSH.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] A friend just gave you $10 to try DigitalOcean

2015-09-08 Thread DigitalOcean

Subject Line (for proofing, doesn't show in actual email)
 == Deploy a Server
For Free! ==

  Your friend ajmdfeipan has been using DigitalOcean – a cloud hosting
service designed just for developers – and thought you might want to give
it a shot. You can deploy a server in 55 seconds from our control panel or
use our simple API.

 Because you’ve been invited by a friend, we’d like to give you a $10
credit to try it out. Just use this link to create an account and you’ll
be credited automatically.

 Redeem Your Credit


  Happy Coding,
 DigitalOcean

© DigitalOcean 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Inconsistency in 'ceph df' stats

This comes up periodically on the mailing list; see eg
http://www.spinics.net/lists/ceph-users/msg15907.html

I'm not sure if your case fits within those odd parameters or not, but
I bet it does. :)
-Greg

On Mon, Aug 31, 2015 at 8:16 PM, Stillwell, Bryan
 wrote:
> On one of our staging ceph clusters (firefly 0.80.10) I've noticed that
> some
> of the statistics in the 'ceph df' output don't seem to match up.  For
> example
> in the output below the amount of raw used is 8,402G, which with triple
> replication would be 2,800.7G used (all the pools are triple replication).
> However, if you add up the numbers used by all the pools (424G + 2538G +
> 103G)
> you get 3,065G used (a difference of +264.3G).
>
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED
> 50275G 41873G8402G 16.71
> POOLS:
> NAME  ID USED  %USED MAX AVAIL OBJECTS
> data  0  0 013559G   0
> metadata  1  0 013559G   0
> rbd   2  0 013559G   0
> volumes   3   424G  0.8413559G  159651
> images4  2538G  5.0513559G  325198
> backups   5  0 013559G   0
> instances 6   103G  0.2113559G   25310
>
> The max avail amount doesn't line up either.  If you take 3 * 13,559G you
> get
> 40,677G available, but the global stat is 41,873G (a difference of 1,196G).
>
>
> On another staging cluster the numbers are closer to what I would expect.
> The
> amount of raw used is 7,037G, which with triple replication should be
> 2,345.7G.  However, adding up the amounts used by all the pools (102G +
> 1749G
> + 478G + 14G) is 2,343G (a difference of just -2.7G).
>
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED
> 50275G 43238G7037G 14.00
> POOLS:
> NAME  ID USED   %USED MAX AVAIL OBJECTS
> data  0   0 013657G   0
> metadata  1   0 013657G   0
> rbd   2   0 013657G   0
> volumes   3102G  0.2013657G   27215
> images4   1749G  3.4813657G  224259
> instances 5478G  0.9513657G   79221
> backups   6   0 013657G   0
> scbench   8  14704M  0.0313657G3677
>
> The max avail is a little further off.  Taking 3 * 13,657G you get 40,971G,
> but the global stat is 43,238G (a difference of 2,267G).
>
> My guess would have been that the global numbers would include some of the
> overhead involved which lines up with the second cluster, but the first
> cluster would have -264.3G of overhead which just doesn't make sense.  Any
> ideas where these stats might be getting off?
>
> Thanks,
> Bryan
>
>
> 
>
> This E-mail and any of its attachments may contain Time Warner Cable 
> proprietary information, which is privileged, confidential, or subject to 
> copyright belonging to Time Warner Cable. This E-mail is intended solely for 
> the use of the individual or entity to which it is addressed. If you are not 
> the intended recipient of this E-mail, you are hereby notified that any 
> dissemination, distribution, copying, or action taken in relation to the 
> contents of and attachments to this E-mail is strictly prohibited and may be 
> unlawful. If you have received this E-mail in error, please notify the sender 
> immediately and permanently delete the original and any copy of this E-mail 
> and any printout.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] rebalancing taking very long time

On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko  wrote:
> When I lose a disk OR replace a OSD in my POC ceph cluster, it takes a very
> long time to rebalance.  I should note that my cluster is slightly unique in
> that I am using cephfs(shouldn't matter?) and it currently contains about
> 310 million objects.
>
> The last time I replaced a disk/OSD was 2.5 days ago and it is still
> rebalancing.  This is on a cluster with no client load.
>
> The configurations is 5 hosts with 6 x 1TB 7200rpm SATA OSD's & 1 850 Pro
> SSD which contains the journals for said OSD's.  Thats means 30 OSD's in
> total.  System disk is on its own disk.  I'm also using a backend network
> with single Gb NIC.  THe rebalancing rate(objects/s) seems to be very slow
> when it is close to finishingsay <1% objects misplaced.
>
> It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
> with no load on the cluster.  Are my expectations off?

Possibly...Ceph basically needs to treat each object as a single IO.
If you're recovering from a failed disk then you've got to replicate
roughly 310 million * 3 / 30 = 31 million objects. If it's perfectly
balanced across 30 disks that get 80 IOPS that's 12916 seconds (~3.5
hours) worth of work just to read each file — and in reality it's
likely to take more than one IO to read the file, and then you have to
spend a bunch to write it as well.

>
> I'm not sure if my pg_num/pgp_num needs to be changed OR the rebalance time
> is dependent on the number of objects in the pool.  These are thoughts i've
> had but am not certain are relevant here.

Rebalance time is dependent on the number of objects in the pool. You
*might* see an improvement by increasing "osd max push objects" from
its default of 10...or you might not. That many small files isn't
something I've explored.
-Greg

>
> $ sudo ceph -v
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>
> $ sudo ceph -s
> [sudo] password for bababurko:
> cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>  health HEALTH_WARN
> 5 pgs backfilling
> 5 pgs stuck unclean
> recovery 3046506/676638611 objects misplaced (0.450%)
>  monmap e1: 3 mons at
> {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
> election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>  mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
>  osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
>   pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
> 18319 GB used, 9612 GB / 27931 GB avail
> 3046506/676638611 objects misplaced (0.450%)
> 2095 active+clean
>   12 active+clean+scrubbing+deep
>5 active+remapped+backfilling
> recovery io 2294 kB/s, 147 objects/s
>
> $ sudo rados df
> pool name KB  objects   clones degraded
> unfound   rdrd KB   wrwr KB
> cephfs_data   676756996233574670200
> 0  21368341676984208   7052266742
> cephfs_metadata42738  105843700
> 0 16130199  30718800215295996938   3811963908
> rbd0000
> 00000
>   total used 19209068780336805139
>   total avail10079469460
>   total space29288538240
>
> $ sudo ceph osd pool get cephfs_data pgp_num
> pg_num: 1024
> $ sudo ceph osd pool get cephfs_metadata pgp_num
> pg_num: 1024
>
>
> thanks,
> Bob
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] [Ceph-community] Ceph MeetUp Berlin Sept 28

2015-09-08 Thread Joao Eduardo Luis

This may see more traction in ceph-users and ceph-devel.

Most people don't usually subscribe to ceph-community.

Cheers!

  -Joao

On 09/08/2015 11:44 AM, Robert Sander wrote:
> Hi,
> 
> the next meetup in Berlin takes place on September 28 at 18:00 CEST.
> 
> Please RSVP at http://www.meetup.com/de/Ceph-Berlin/events/222906639/
> 
> Regards
> 
> 
> 
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] A few questions and remarks about cephx

On Sun, Sep 6, 2015 at 10:07 AM, Marin Bernard  wrote:
> Hi,
>
> I've just setup Ceph Hammer (latest version) on a single node (1 MON, 1
> MDS, 4 OSDs) for testing purposes. I used ceph-deploy. I only
> configured CephFS as I don't use RBD. My pool config is as follows:
>
> $ sudo ceph df
> GLOBAL:
> SIZE  AVAIL RAW USED %RAW USED
> 7428G 7258G 169G  2.29
> POOLS:
> NAMEID USED   %USED MAX AVAIL
>  OBJECTS
> cephfs_data 1168G  2.26 7209G
>  78691
> cephfs_metadata 2  41301k 0 7209G
>  2525
>
> Cluster is sane:
>
> $ sudo ceph status
> cluster 72aba9bb-20db-4f62-8d03-0a8a1019effa
>  health HEALTH_OK
>  monmap e1: 1 mons at {nice-srv-cosd-00=10.16.1.161:6789/0}
> election epoch 1, quorum 0 nice-srv-cosd-00
>  mdsmap e5: 1/1/1 up {0=nice-srv-cosd-00=up:active}
>  osdmap e71: 4 osds: 4 up, 4 in
>   pgmap v3723: 240 pgs, 2 pools, 167 GB data, 80969 objects
> 168 GB used, 7259 GB / 7428 GB avail
>  240 active+clean
>   client io 59391 kB/s wr, 29 op/s
>
> CephFS is mounted on a client node, which uses a dedicated cephx key
> 'client.mynode'. I've had a hard time trying to figure out which cephx
>  capabilities were required to give the node RW access to CephFS. I
> found documentation covering cephx capabilities for RBD, but not for
> CephFS. Did I miss something ? As of now, the 'client.mynode' key has
> the following capabilities, which seem sufficient:

CephFS is still not as well documented since nobody's building a
product on it yet.

>
> $ sudo ceph auth get client.mynode
> exported keyring for client.mynode
> [client.mynode]
> key = myBeautifulKey
> caps mds = "allow r"
> caps mon = "allow r"
> caps osd = "allow rw pool=cephfs_metadata, allow rw
> pool=cephfs_data"

The clients don't need to access the metadata pool at all; only the
MDSes need access to that.

>
>
> Here are a few questions and remarks I made for myself when dealing
> with cephx:
>
> 1. Are mds caps needed for CephFS clients? If so, do they need r or rw
> access ? Is it documented somewhere ?

I think this just needs an "allow" in all released versions, although
we're making the language more flexible for Infernalis. (At least, we
hope https://github.com/ceph/ceph/pull/5638/ merges for Infernalis!)
It may not be documented well, but it's at least at
http://ceph.com/docs/master/rados/operations/user-management/#authorization-capabilities

> 2. CephFS requires the clients to have rw access to multiple pools
> (data + metadata). I couldn't find the correct syntax to use with 'ceph
> auth caps' anywhere but on the ML archive (
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17058.html).
> I suggest to add some documentation for it on the main website. Or is
> it already there ?

Actually, clients just need to access whichever data pools they're
using. I thought we had documentation for multiple pools but I can't
find it; you should submit a bug! :)

>
>
> 3. I found 'ceph auth caps' syntax validation rather weak, as the
> command did not return an error in the case of an incorrect syntax. For
> instance, the following command did not raise an error whereas it is
> (probably) syntactically incorrect:
>
> $ sudo ceph auth caps client.mynode mon 'allow r' mds 'allow r' osd
> 'allow rw pool=cephfs_metadata,cephfs_data'
>
> I suppose the comma is considered as a part of a single pool name, thus
> resulting in:
>
> $ sudo ceph auth get client.mynode
> exported keyring for client.mynode
> [cl
> ient.mynode]
> key = myBeautifulKey
> caps mds = "allow r"
>
> caps mon = "allow r"
> caps osd = "allow rw
> pool=cephfs_metadata,cephfs_data"
>
> Is it expected behaviour? Are special chars allowed in pool names ?

We've waffled on doing validation or not (cap syntax is validated by
the daemons using it, not the monitors, and we want to keep it
flexible in case eg the monitors are still being upgraded but you're
using new-style syntax).

>
>
> 4. With the capabilities shown above, the client node was still able to
> mount CephFS and to make thousands of reads and writes without any
> error. However, since capabilities were incorrect, it only had rw
> access to the 'cephfs_metadata' pool, and no access at all to the
> 'cephfs_data' pool. As a consequence, files, folders, permissions,
> sizes and other metadata were written and retrieved correctly, but the
> actual data were lost in vacuum. Shouldn't such a strange situation
> raise an error on the client ?

If you use a new enough (hammer, maybe? otherwise Infernalis)
ceph-fuse it will raise an error. I'm not sure if it's in the kernel
client but if not it will be soon, but of course you're unlikely to be
using one that's new enough yet.
-Greg

>
>
> Thanks!
>
> Marin.
> ___
>

Re: [ceph-users] CephFS/Fuse : detect package upgrade to remount

On Fri, Sep 4, 2015 at 9:15 AM, Florent B  wrote:
> Hi everyone,
>
> I would like to know if there is a way on Debian to detect an upgrade of
> ceph-fuse package, that "needs" remouting CephFS.
>
> When I upgrade my systems, I do a "aptitude update && aptitude
> safe-upgrade".
>
> When ceph-fuse package is upgraded, it would be nice to remount all
> CephFS points,  I suppose.
>
> Does someone did this ?

I'm not sure how this could work. It'd be nice to smoothly upgrade for
users, but
1) We don't automatically restart the OSD or monitor daemons on
upgrade, because users want to control how many of their processes are
down at once (and the load spike of a rebooting OSD),
2) I'm not sure how you could safely/cleanly restart a process that's
serving a filesystem. It's not like we can force users to stop using
the cephfs mountpoint and then reopen all their files after we reboot.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] osds on 2 nodes vs. on one node

On Fri, Sep 4, 2015 at 12:24 AM, Deneau, Tom  wrote:
> After running some other experiments, I see now that the high single-node
> bandwidth only occurs when ceph-mon is also running on that same node.
> (In these small clusters I only had one ceph-mon running).
> If I compare to a single-node where ceph-mon is not running, I see
> basically identical performance to the two-node arrangement.
>
> So now my question is:  Is it expected that there would be such
> a large performance difference between using osds on a single node
> where ceph-mon is running vs. using osds on a single node where
> ceph-mon is not running?

No. There's clearly some kind of weird confound going on here.
Honestly my first thought (I haven't heard of anything like this
before) is that you might want to look at the power-saving profile of
your nodes. Maybe the extra load of the monitor is keeping the CPU
awake or something...
-Greg

>
> -- Tom
>
>> -Original Message-
>> From: Deneau, Tom
>> Sent: Thursday, September 03, 2015 10:39 AM
>> To: 'Christian Balzer'; ceph-users
>> Subject: RE: [ceph-users] osds on 2 nodes vs. on one node
>>
>> Rewording to remove confusion...
>>
>> Config 1: set up a cluster with 1 node with 6 OSDs Config 2: identical
>> hardware, set up a cluster with 2 nodes with 3 OSDs each
>>
>> In each case I do the following:
>>1) rados bench write --no-cleanup the same number of 4M size objects
>>2) drop caches on all osd nodes
>>3) rados bench seq  -t 4 to sequentially read the objects
>>   and record the read bandwidth
>>
>> Rados bench is running on a separate client, not on an OSD node.
>> The client has plenty of spare CPU power and the network and disk utilization
>> are not limiting factors.
>>
>> With Config 1, I see approximately 70% more sequential read bandwidth than
>> with Config 2.
>>
>> In both cases the primary OSDs of the objecgts appear evenly distributed
>> across OSDs.
>>
>> Yes, replication factor is 2 but since we are only measuring read
>> performance, I don't think that matters.
>>
>> Question is whether there is a ceph parameter that might be throttling the
>> 2 node configuation?
>>
>> -- Tom
>>
>> > -Original Message-
>> > From: Christian Balzer [mailto:ch...@gol.com]
>> > Sent: Wednesday, September 02, 2015 7:29 PM
>> > To: ceph-users
>> > Cc: Deneau, Tom
>> > Subject: Re: [ceph-users] osds on 2 nodes vs. on one node
>> >
>> >
>> > Hello,
>> >
>> > On Wed, 2 Sep 2015 22:38:12 + Deneau, Tom wrote:
>> >
>> > > In a small cluster I have 2 OSD nodes with identical hardware, each
>> > > with
>> > > 6 osds.
>> > >
>> > > * Configuration 1:  I shut down the osds on one node so I am using 6
>> > > OSDS on a single node
>> > >
>> > Shut down how?
>> > Just a "service blah stop" or actually removing them from the cluster
>> > aka CRUSH map?
>> >
>> > > * Configuration 2:  I shut down 3 osds on each node so now I have 6
>> > > total OSDS but 3 on each node.
>> > >
>> > Same as above.
>> > And in this case even more relevant, because just shutting down random
>> > OSDs on both nodes would result in massive recovery action at best and
>> > more likely a broken cluster.
>> >
>> > > I measure read performance using rados bench from a separate client node.
>> > Default parameters?
>> >
>> > > The client has plenty of spare CPU power and the network and disk
>> > > utilization are not limiting factors. In all cases, the pool type is
>> > > replicated so we're just reading from the primary.
>> > >
>> > Replicated as in size 2?
>> > We can guess/assume that from your cluster size, but w/o you telling
>> > us or giving us all the various config/crush outputs that is only a guess.
>> >
>> > > With Configuration 1, I see approximately 70% more bandwidth than
>> > > with configuration 2.
>> >
>> > Never mind that bandwidth is mostly irrelevant in real life, which
>> > bandwidth, read or write?
>> >
>> > > In general, any configuration where the osds span 2 nodes gets
>> > > poorer performance but in particular when the 2 nodes have equal
>> > > amounts of traffic.
>> > >
>> >
>> > Again, guessing from what you're actually doing this isn't particular
>> > surprising.
>> > Because with a single node, default rules and replication of 2 your
>> > OSDs never have to replicate anything when it comes to writes.
>> > Whereas with 2 nodes replication happens and takes more time (latency)
>> > and might also saturate your network (we have of course no idea how
>> > your cluster looks like).
>> >
>> > Christian
>> >
>> > > Is there any ceph parameter that might be throttling the cases where
>> > > osds span 2 nodes?
>> > >
>> > > -- Tom Deneau, AMD
>> > > ___
>> > > ceph-users mailing list
>> > > ceph-users@lists.ceph.com
>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >
>> >
>> >
>> > --
>> > Christian BalzerNetwork/Systems Engineer
>> > ch...@gol.com   Global OnLine Japan/Fusion

Re: [ceph-users] crash on rbd bench-write

2015-09-08 Thread Jason Dillaman

> The client version is what was installed by the ceph-deploy install
> ceph-client command. Via the debian-hammer repo. Per the quickstart doc.
> Are you saying I need to install a different client version somehow?

You listed the version as 0.80.10 which is a Ceph Firefly release -- Hammer is 
0.94.x.  Was this a clean install or did you attempt to perform an upgrade?

> I'm running the rbd command from the client, can you point me to how I
> might use the bench-write command correctly? Perhaps that option doesn't
> do what I thought. Happy to be corrected, still trying to grok how
> things should go together.

You would execute bench-write just as you did.  I am just saying there is no 
reason to map the rbd image via the kernel RBD driver (i.e. no need to run 'rbd 
map' prior to executing the bench-write command).

Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] CephFS and caching

On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson  wrote:
> I was wondering if anybody could give me some insight as to how CephFS does
> its caching - read-caching in particular.
>
> We are using CephFS with an EC pool on the backend with a replicated cache
> pool in front of it. We're seeing some very slow read times. Trying to
> compute an md5sum on a 15GB file twice in a row (so it should be in cache)
> takes the time from 23 minutes down to 17 minutes, but this is over a 10Gbps
> network and with a crap-ton of OSDs (over 300), so I would expect it to be
> down in the 2-3 minute range.

A single sequential read won't necessarily promote an object into the
cache pool (although if you're using Hammer I think it will), so you
want to check if it's actually getting promoted into the cache before
assuming that's happened.

>
> I'm just trying to figure out what we can do to increase the performance. I
> have over 300 TB of live data that I have to be careful with, though, so I
> have to have some level of caution.
>
> Is there some other caching we can do (client-side or server-side) that
> might give us a decent performance boost?

Which client are you using for this testing? Have you looked at the
readahead settings? That's usually the big one; if you're only asking
for 4KB at once then stuff is going to be slow no matter what (a
single IO takes at minimum about 2 milliseconds right now, although
the RADOS team is working to improve that).
-Greg

>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] jemalloc and transparent hugepage

I have done small benchmark with tcmalloc and jemalloc, transparent 
hugepage=always|never. 

for tcmalloc, they are no difference.
but for jemalloc, the difference is huge (around 25% lower with tp=never).

jemmaloc 4.6.0+tp=never vs tcmalloc use 10% more RSS memory

jemmaloc 4.0+tp=never almost use same RSS memory than tcmalloc !


I don't have monitored memory usage in recovery, but I think it should help too.




tcmalloc 2.1 tp=always
---
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  67746  120  1.0 1531220 671152 ?  Ssl  01:18   0:43 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  67764  144  1.0 1570256 711232 ?  Ssl  01:18   0:51 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f

root  68363  220  0.9 1522292 655888 ?  Ssl  01:19   0:46 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  68381  261  1.0 1563396 702500 ?  Ssl  01:19   0:55 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f

root  68963  228  1.0 1519240 666196 ?  Ssl  01:20   0:31 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  68981  268  1.0 1564452 694352 ?  Ssl  01:20   0:37 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f



tcmalloc 2.1  tp=never
-
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  69560  144  1.0 1544968 677584 ?  Ssl  01:21   0:20 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  69578  167  1.0 1568620 704456 ?  Ssl  01:21   0:23 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f


root  70156  164  0.9 1519680 649776 ?  Ssl  01:21   0:16 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  70174  214  1.0 1559772 692828 ?  Ssl  01:21   0:19 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f

root  70757  202  0.9 1520376 650572 ?  Ssl  01:22   0:20 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  70775  236  1.0 1560644 694088 ?  Ssl  01:22   0:23 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f



jemalloc 3.6 tp = always

USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  92005 46.1  1.4 2033864 967512 ?  Ssl  01:00   0:04 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  92027 45.5  1.4 2021624 963536 ?  Ssl  01:00   0:04 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f



root  92703  191  1.5 2138724 1002376 ? Ssl  01:02   1:16 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  92721  183  1.5 2126228 986448 ?  Ssl  01:02   1:13 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f


root  93366  258  1.4 2139052 984132 ?  Ssl  01:03   1:09 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  93384  250  1.5 2126244 990348 ?  Ssl  01:03   1:07 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f



jemalloc 3.6 tp = never
---
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  93990  238  1.1 2105812 762628 ?  Ssl  01:04   1:16 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root  94033  263  1.1 2118288 781768 ?  Ssl  01:04   1:18 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f


root  94656  266  1.1 2139096 781392 ?  Ssl  01:05   0:58 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  94674  257  1.1 2126316 760632 ?  Ssl  01:05   0:56 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f

root  95317  297  1.1 2135044 780532 ?  Ssl  01:06   0:35 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  95335  284  1.1 2112016 760972 ?  Ssl  01:06   0:34 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f



jemalloc 4.0 tp = always

USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root 100275  198  1.3 1784520 880288 ?  Ssl  01:14   0:45 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root 100320  239  1.1 1793184 760824 ?  Ssl  01:14   0:47 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f


root 100897  200  1.3 1765780 891256 ?  Ssl  01:15   0:50 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root 100942  245  1.1 1817436 746956 ?  Ssl  01:15   0:53 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f

root 101517  196  1.3 1769904 877132 ?  Ssl  01:16   0:33 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root 101562  258  1.1 1805172 746532 ?  Ssl  01:16   0:36 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f


jemalloc 4.0 tp = never
---
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  98362 87.8  1.0 1841748 678848 ?  Ssl  01:10   0:53 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root  98405 97.0  1.0 1846328 699620 ?  Ssl  01:10   0:56 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f



root  99018  233  1.0 1812580 698848 ?  Ssl  01:12   0:30 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  99036  226  1.0 1822344 677420 ?  Ssl  01:12   0:29 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f

root  99666  281  1.0 1814640 696420 ?  Ssl  01:13   0:33 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  99684  266  1.0 1835676

Re: [ceph-users] jemalloc and transparent hugepage

2015-09-08 Thread Mark Nelson

Excellent investigation Alexandre!  Have you noticed any performance 
difference with tp=never?


Mark

On 09/08/2015 06:33 PM, Alexandre DERUMIER wrote:

I have done small benchmark with tcmalloc and jemalloc, transparent 
hugepage=always|never.

for tcmalloc, they are no difference.
but for jemalloc, the difference is huge (around 25% lower with tp=never).

jemmaloc 4.6.0+tp=never vs tcmalloc use 10% more RSS memory

jemmaloc 4.0+tp=never almost use same RSS memory than tcmalloc !


I don't have monitored memory usage in recovery, but I think it should help too.




tcmalloc 2.1 tp=always
---
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  67746  120  1.0 1531220 671152 ?  Ssl  01:18   0:43 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  67764  144  1.0 1570256 711232 ?  Ssl  01:18   0:51 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f

root  68363  220  0.9 1522292 655888 ?  Ssl  01:19   0:46 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  68381  261  1.0 1563396 702500 ?  Ssl  01:19   0:55 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f

root  68963  228  1.0 1519240 666196 ?  Ssl  01:20   0:31 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  68981  268  1.0 1564452 694352 ?  Ssl  01:20   0:37 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f



tcmalloc 2.1  tp=never
-
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  69560  144  1.0 1544968 677584 ?  Ssl  01:21   0:20 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  69578  167  1.0 1568620 704456 ?  Ssl  01:21   0:23 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f


root  70156  164  0.9 1519680 649776 ?  Ssl  01:21   0:16 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  70174  214  1.0 1559772 692828 ?  Ssl  01:21   0:19 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f

root  70757  202  0.9 1520376 650572 ?  Ssl  01:22   0:20 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  70775  236  1.0 1560644 694088 ?  Ssl  01:22   0:23 
/usr/bin/ceph-osd --cluster=ceph -i 1 -f



jemalloc 3.6 tp = always

USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  92005 46.1  1.4 2033864 967512 ?  Ssl  01:00   0:04 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  92027 45.5  1.4 2021624 963536 ?  Ssl  01:00   0:04 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f



root  92703  191  1.5 2138724 1002376 ? Ssl  01:02   1:16 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  92721  183  1.5 2126228 986448 ?  Ssl  01:02   1:13 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f


root  93366  258  1.4 2139052 984132 ?  Ssl  01:03   1:09 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  93384  250  1.5 2126244 990348 ?  Ssl  01:03   1:07 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f



jemalloc 3.6 tp = never
---
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  93990  238  1.1 2105812 762628 ?  Ssl  01:04   1:16 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root  94033  263  1.1 2118288 781768 ?  Ssl  01:04   1:18 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f


root  94656  266  1.1 2139096 781392 ?  Ssl  01:05   0:58 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  94674  257  1.1 2126316 760632 ?  Ssl  01:05   0:56 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f

root  95317  297  1.1 2135044 780532 ?  Ssl  01:06   0:35 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  95335  284  1.1 2112016 760972 ?  Ssl  01:06   0:34 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f



jemalloc 4.0 tp = always

USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root 100275  198  1.3 1784520 880288 ?  Ssl  01:14   0:45 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root 100320  239  1.1 1793184 760824 ?  Ssl  01:14   0:47 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f


root 100897  200  1.3 1765780 891256 ?  Ssl  01:15   0:50 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root 100942  245  1.1 1817436 746956 ?  Ssl  01:15   0:53 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f

root 101517  196  1.3 1769904 877132 ?  Ssl  01:16   0:33 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root 101562  258  1.1 1805172 746532 ?  Ssl  01:16   0:36 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f


jemalloc 4.0 tp = never
---
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  98362 87.8  1.0 1841748 678848 ?  Ssl  01:10   0:53 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root  98405 97.0  1.0 1846328 699620 ?  Ssl  01:10   0:56 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f



root  99018  233  1.0 1812580 698848 ?  Ssl  01:12   0:30 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  99036  226  1.0 1822344 677420 ?  Ssl  01:12   0:29 
/usr/bin/ceph-osd

Re: [ceph-users] jemalloc and transparent hugepage

2015-09-08 Thread Mark Nelson

Also, for what it's worth, I did analysis during recovery (though not 
with different transparent hugepage settings).  You can see it on slide 
#13 here:


http://nhm.ceph.com/mark_nelson_ceph_tech_talk.odp

On 09/08/2015 06:49 PM, Mark Nelson wrote:

Excellent investigation Alexandre!  Have you noticed any performance
difference with tp=never?

Mark

On 09/08/2015 06:33 PM, Alexandre DERUMIER wrote:

I have done small benchmark with tcmalloc and jemalloc, transparent
hugepage=always|never.

for tcmalloc, they are no difference.
but for jemalloc, the difference is huge (around 25% lower with
tp=never).

jemmaloc 4.6.0+tp=never vs tcmalloc use 10% more RSS memory

jemmaloc 4.0+tp=never almost use same RSS memory than tcmalloc !


I don't have monitored memory usage in recovery, but I think it should
help too.




tcmalloc 2.1 tp=always
---
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  67746  120  1.0 1531220 671152 ?  Ssl  01:18   0:43
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  67764  144  1.0 1570256 711232 ?  Ssl  01:18   0:51
/usr/bin/ceph-osd --cluster=ceph -i 1 -f

root  68363  220  0.9 1522292 655888 ?  Ssl  01:19   0:46
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  68381  261  1.0 1563396 702500 ?  Ssl  01:19   0:55
/usr/bin/ceph-osd --cluster=ceph -i 1 -f

root  68963  228  1.0 1519240 666196 ?  Ssl  01:20   0:31
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  68981  268  1.0 1564452 694352 ?  Ssl  01:20   0:37
/usr/bin/ceph-osd --cluster=ceph -i 1 -f



tcmalloc 2.1  tp=never
-
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  69560  144  1.0 1544968 677584 ?  Ssl  01:21   0:20
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  69578  167  1.0 1568620 704456 ?  Ssl  01:21   0:23
/usr/bin/ceph-osd --cluster=ceph -i 1 -f


root  70156  164  0.9 1519680 649776 ?  Ssl  01:21   0:16
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  70174  214  1.0 1559772 692828 ?  Ssl  01:21   0:19
/usr/bin/ceph-osd --cluster=ceph -i 1 -f

root  70757  202  0.9 1520376 650572 ?  Ssl  01:22   0:20
/usr/bin/ceph-osd --cluster=ceph -i 0 -f
root  70775  236  1.0 1560644 694088 ?  Ssl  01:22   0:23
/usr/bin/ceph-osd --cluster=ceph -i 1 -f



jemalloc 3.6 tp = always

USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  92005 46.1  1.4 2033864 967512 ?  Ssl  01:00   0:04
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  92027 45.5  1.4 2021624 963536 ?  Ssl  01:00   0:04
/usr/bin/ceph-osd --cluster=ceph -i 4 -f



root  92703  191  1.5 2138724 1002376 ? Ssl  01:02   1:16
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  92721  183  1.5 2126228 986448 ?  Ssl  01:02   1:13
/usr/bin/ceph-osd --cluster=ceph -i 4 -f


root  93366  258  1.4 2139052 984132 ?  Ssl  01:03   1:09
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  93384  250  1.5 2126244 990348 ?  Ssl  01:03   1:07
/usr/bin/ceph-osd --cluster=ceph -i 4 -f



jemalloc 3.6 tp = never
---
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  93990  238  1.1 2105812 762628 ?  Ssl  01:04   1:16
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root  94033  263  1.1 2118288 781768 ?  Ssl  01:04   1:18
/usr/bin/ceph-osd --cluster=ceph -i 5 -f


root  94656  266  1.1 2139096 781392 ?  Ssl  01:05   0:58
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  94674  257  1.1 2126316 760632 ?  Ssl  01:05   0:56
/usr/bin/ceph-osd --cluster=ceph -i 4 -f

root  95317  297  1.1 2135044 780532 ?  Ssl  01:06   0:35
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  95335  284  1.1 2112016 760972 ?  Ssl  01:06   0:34
/usr/bin/ceph-osd --cluster=ceph -i 4 -f



jemalloc 4.0 tp = always

USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root 100275  198  1.3 1784520 880288 ?  Ssl  01:14   0:45
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root 100320  239  1.1 1793184 760824 ?  Ssl  01:14   0:47
/usr/bin/ceph-osd --cluster=ceph -i 5 -f


root 100897  200  1.3 1765780 891256 ?  Ssl  01:15   0:50
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root 100942  245  1.1 1817436 746956 ?  Ssl  01:15   0:53
/usr/bin/ceph-osd --cluster=ceph -i 5 -f

root 101517  196  1.3 1769904 877132 ?  Ssl  01:16   0:33
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root 101562  258  1.1 1805172 746532 ?  Ssl  01:16   0:36
/usr/bin/ceph-osd --cluster=ceph -i 5 -f


jemalloc 4.0 tp = never
---
USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

root  98362 87.8  1.0 1841748 678848 ?  Ssl  01:10   0:53
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root  98405 97.0  1.0 1846328 699620 ?  Ssl  01:10   0:56
/usr/bin/ceph-osd

[ceph-users] [Ceph-community] Ceph MeetUp Berlin Sept 28

2015-09-08 Thread Robert Sander

Hi,

the next meetup in Berlin takes place on September 28 at 18:00 CEST.

Please RSVP at http://www.meetup.com/de/Ceph-Berlin/events/222906639/

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin



signature.asc
Description: OpenPGP digital signature
___
Ceph-community mailing list
ceph-commun...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] maximum object size

2015-09-08 Thread HEWLETT, Paul (Paul)

Hi All

We have recently encountered a problem on Hammer (0.94.2) whereby we
cannot write objects > 2GB in size to the rados backend.
(NB not RadosGW, CephFS or RBD)

I found the following issue
https://wiki.ceph.com/Planning/Blueprints/Firefly/Object_striping_in_librad
os which seems to address this but no progress reported.

What are the implications of writing such large objects to RADOS? What
impact is expected on the XFS backend particularly regarding the size and
location of the journal?

Any prospect of progressing the issue reported in the enclosed link?

Interestingly I could not find anywhere in the ceph documentation that
describes the 2GB limitation. The implication of most of the website docs
is that there is no limit on objects stored in Ceph. The only hint is that
osd_max_write_size is a 32 bit signed integer.

If we use erasure coding will this reduce the impact? I.e. 4+1 EC will
only write 500MB to each OSD and then this value will be tested against
the chunk size instead of the total file size?

The relevant code in Ceph is:

src/FileJournal.cc:

  needed_space = ((int64_t)g_conf->osd_max_write_size) << 20;
  needed_space += (2 * sizeof(entry_header_t)) + get_top();
  if (header.max_size - header.start < needed_space) {
derr << "FileJournal::create: OSD journal is not large enough to hold "
<< "osd_max_write_size bytes!" << dendl;
ret = -ENOSPC;
goto free_buf;
  }

src/osd/OSD.cc:

// too big?
if (cct->_conf->osd_max_write_size &&
m->get_data_len() > cct->_conf->osd_max_write_size << 20) {
// journal can't hold commit!
 derr << "handle_op msg data len " << m->get_data_len()
 << " > osd_max_write_size " << (cct->_conf->osd_max_write_size << 20)
 << " on " << *m << dendl;
service.reply_op_error(op, -OSD_WRITETOOBIG);
return;
  }

Interestingly the code in OSD.cc looks like a bug - the max_write value
should be cast to an int64_t before shifting left 20 bits (which is done
correctly in FileJournal.cc). Otherwise overflow may occur and negative
values generated.


Any comments welcome - any help appreciated.

Regards
Paul


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-08 Thread Quentin Hartman

On Tue, Sep 8, 2015 at 9:05 AM, Mark Nelson  wrote:

> A list of hardware that is known to work well would be incredibly
>> valuable to people getting started. It doesn't have to be exhaustive,
>> nor does it have to provide all the guidance someone could want. A
>> simple "these things have worked for others" would be sufficient. If
>> nothing else, it will help people justify more expensive gear when their
>> approval people say "X seems just as good and is cheaper, why can't we
>> get that?".
>>
>
> So I have my opinions on different drives, but I think we do need to be
> really careful not to appear to endorse or pick on specific vendors. The
> more we can stick to high-level statements like:
>
> - Drives should have high write endurance
> - Drives should perform well with O_DSYNC writes
> - Drives should support power loss protection for data in motion
>
> The better I think.  Once those are established, I think it's reasonable
> to point out that certain drives meet (or do not meet) those criteria and
> get feedback from the community as to whether or not vendor's marketing
> actually reflects reality.  It'd also be really nice to see more
> information available like the actual hardware (capacitors, flash cells,
> etc) used in the drives.  I've had to show photos of the innards of
> specific drives to vendors to get them to give me accurate information
> regarding certain drive capabilities.  Having a database of such things
> available to the community would be really helpful.
>
>
That's probably a very good approach. I think it would be pretty simple to
avoid the appearance of endorsement if the data is presented correctly.

>
>> To that point, I think perhaps though something more important than a
>> list of known "good" hardware would be a list of known "bad" hardware,
>>
>
> I'm rather hesitant to do this unless it's been specifically confirmed by
> the vendor.  It's too easy to point fingers (see the recent kernel trim bug
> situation).

I disagree. I think that only comes into play if you claim to know why the
hardware has problems. In this case, if you simply state "people who have
used this drive have experienced a large number of seemingly premature
failures when using them as journals" that provides sufficient warning to
users, and if the vendor wants to engage the community and potentially pin
down why and help us find a way to make the device work or confirm that
it's just not suited, then that's on them. Samsung seems to be doing
exactly that. It would be great to have them help provide that level of
detail, but again, I don't think it's necessary. We're not saying
"ceph/redhat/$whatever says this hardware sucks" we're saying "The
community has found that using this hardware with ceph has exhibited these
negative behaviors...". At that point you're just relaying experiences and
collecting them in a central location. It's up to the reader to draw
conclusions from it.

But again, I think more important than either of these would be a
collection of use cases with actual journal write volumes that have
occurred in those use cases so that people can make more informed
purchasing decisions. The fact that my small openstack cluster created 3.6T
of writes per month on my journal drives (3 OSD each) is somewhat
mind-blowing. That's almost four times the amount of writes my best guess
estimates indicated we'd be doing. Clearly there's more going on than we
are used to paying attention to. Someone coming to ceph and seeing the cost
of DC-class SSDs versus consumer-class SSDs will almost certainly suffer
from some amount of sticker shock, and even if they don't their purchasing
approval people almost certainly will. This is especially true for people
in smaller organizations where SSDs are still somewhat exotic. And when
they come back with the "Why won't cheaper thing X be OK?" they need to
have sufficient information to answer that. Without a test environment to
generate data with, they will need to rely on the experiences of others,
and right now those experiences don't seem to be documented anywhere, and
if they are, they are not very discoverable.

QH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] [Problem] I cannot start the OSD daemon

2015-09-08 Thread Robert LeBlanc

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I would check that the /var/lib/ceph/osd/ceph-0/ is mounted and has
the file structure for Ceph.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Sep 7, 2015 at 2:16 AM, Aaron  wrote:
> Hi All,
>
> I cannot start the OSD daemon, and need your helps. Any advice is
> appreciated.
>
> When I deployed the ceph cluster through ceph-deploy, it worked. But after
> some days, all the OSD daemons were down, and I could not start them. Then I
> redeployed it some times, and it was still this case.
>
> The environment:
> OS: CentOS 6.6
> Ceph: hammer(0.92.2)
>
> When the OSD daemon was down, I started the OSD daemon on the OSD node, and
> the console outputted:
> [root@node1 ~]# service ceph start
> === osd.0 ===
> libust[2285/2285]: Warning: HOME environment variable not set. Disabling
> LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
> create-or-move updated item name 'osd.0' weight 0.11 at location
> {host=node1,root=default} to crush map
> Starting Ceph osd.0 on node1...
> libust[2329/2329]: Warning: HOME environment variable not set. Disabling
> LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
> starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0
> /var/lib/ceph/osd/ceph-0/journal
> [root@node1 ~]#
>
> And the log file of ceph, /var/log/ceph/ceph-osd.0.log, recorded:
> 2015-09-07 15:13:02.433546 7f74cea0d800  0 ceph version 0.94.2
> (5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd, pid 1045
> 2015-09-07 15:13:02.508300 7f74cea0d800  0
> filestore(/var/lib/ceph/osd/ceph-0) backend generic (magic 0xef53)
> 2015-09-07 15:13:02.511270 7f74cea0d800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP
> ioctl is supported and appears to work
> 2015-09-07 15:13:02.511293 7f74cea0d800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP
> ioctl is disabled via 'filestore fiemap' config option
>
> 2015-09-07 15:13:02.520826 7f74cea0d800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features:
> syscall(SYS_syncfs, fd) fully supported
> 2015-09-07 15:13:02.523890 7f74cea0d800  0
> filestore(/var/lib/ceph/osd/ceph-0) limited size xattrs
> 2015-09-07 15:13:02.527387 7f74cea0d800  0
> filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal mode:
> checkpoint is not enabled
> 2015-09-07 15:13:02.528870 7f74cea0d800 -1 journal FileJournal::_open:
> disabling aio for non-block journal.  Use journal_force_aio to force use of
> aio anyway
> 2015-09-07 15:13:02.528890 7f74cea0d800  1 2015-09-07 15:13:02.528890
> 7f74cea0d800  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 19:
> 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
>
> 2015-09-07 15:13:02.529881 7f74cea0d800 -1
> filestore(/var/lib/ceph/osd/ceph-0) could not find 3//head//3 in index: (2)
> No such file or directory
> 2015-09-07 15:13:02.530039 7f74cea0d800 -1
> filestore(/var/lib/ceph/osd/ceph-0) could not find 2//head//3 in index: (2)
> No such file or directory
> 2015-09-07 15:13:02.530273 7f74cea0d800 -1
> filestore(/var/lib/ceph/osd/ceph-0) could not find 1//head//3 in index: (2)
> No such file or directory
> 2015-09-07 15:13:02.530349 7f74cea0d800 -1
> filestore(/var/lib/ceph/osd/ceph-0) could not find 0//head//3 in index: (2)
> No such file or directory
>
> 2015-09-07 15:13:02.530487 7f74cea0d800 -1
> filestore(/var/lib/ceph/osd/ceph-0) could not find 7//head//3 in index: (2)
> No such file or directory
> 2015-09-07 15:13:02.530569 7f74cea0d800 -1
> filestore(/var/lib/ceph/osd/ceph-0) could not find 6//head//3 in index: (2)
> No such file or directory
> 2015-09-07 15:13:02.530770 7f74cea0d800  0
> filestore(/var/lib/ceph/osd/ceph-0)  error (39) Directory not empty not
> handled on operation 0x55d8ef6 (4434.0.1, or op 1, counting from 0)
> 2015-09-07 15:13:02.530798 7f74cea0d800  0
> filestore(/var/lib/ceph/osd/ceph-0) ENOTEMPTY suggests garbage data in osd
> data dir
>
> 2015-09-07 15:13:02.530801 7f74cea0d800  0
> filestore(/var/lib/ceph/osd/ceph-0)  transaction dump:
> {
> "ops": [
> {
> "op_num": 0,
> "op_name": "remove",
> "collection": "3.5_head",
> "oid": "5\/\/head\/\/3"
> },
> {
> "op_num": 1,
> "op_name": "rmcoll",
> "collection": "3.5_head"
> }
> ]
> }
>
>   "op_name": "remove",
> "collection": "3.5_head",
> "oid": "5\/\/head\/\/3"
> },
> {
> "op_num": 1,
> "op_name": "rmcoll",
> "collection": "3.5_head"
> }
> ]
> }
>
> 2015-09-07 15:13:02.533771 7f74cea0d800 -1 os/FileStore.cc: In function
> 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&,
> uint64_t, int, ThreadPool::TPHandle*)' thread 7f74cea0d800 time 2015-09-07
> 15:13:02.530848
>

Re: [ceph-users] jemalloc and transparent hugepage

I email you guys about using jemalloc.
There might be workaround to use it much more effectively.
I hope some of you saw my email...

Shinobu

- Original Message -
From: "Mark Nelson" 
To: "Alexandre DERUMIER" , "ceph-devel" 
, "ceph-users" 
Sent: Wednesday, September 9, 2015 8:52:35 AM
Subject: Re: [ceph-users] jemalloc and transparent hugepage

Also, for what it's worth, I did analysis during recovery (though not 
with different transparent hugepage settings).  You can see it on slide 
#13 here:

http://nhm.ceph.com/mark_nelson_ceph_tech_talk.odp

On 09/08/2015 06:49 PM, Mark Nelson wrote:
> Excellent investigation Alexandre!  Have you noticed any performance
> difference with tp=never?
>
> Mark
>
> On 09/08/2015 06:33 PM, Alexandre DERUMIER wrote:
>> I have done small benchmark with tcmalloc and jemalloc, transparent
>> hugepage=always|never.
>>
>> for tcmalloc, they are no difference.
>> but for jemalloc, the difference is huge (around 25% lower with
>> tp=never).
>>
>> jemmaloc 4.6.0+tp=never vs tcmalloc use 10% more RSS memory
>>
>> jemmaloc 4.0+tp=never almost use same RSS memory than tcmalloc !
>>
>>
>> I don't have monitored memory usage in recovery, but I think it should
>> help too.
>>
>>
>>
>>
>> tcmalloc 2.1 tp=always
>> ---
>> USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
>>
>> root  67746  120  1.0 1531220 671152 ?  Ssl  01:18   0:43
>> /usr/bin/ceph-osd --cluster=ceph -i 0 -f
>> root  67764  144  1.0 1570256 711232 ?  Ssl  01:18   0:51
>> /usr/bin/ceph-osd --cluster=ceph -i 1 -f
>>
>> root  68363  220  0.9 1522292 655888 ?  Ssl  01:19   0:46
>> /usr/bin/ceph-osd --cluster=ceph -i 0 -f
>> root  68381  261  1.0 1563396 702500 ?  Ssl  01:19   0:55
>> /usr/bin/ceph-osd --cluster=ceph -i 1 -f
>>
>> root  68963  228  1.0 1519240 666196 ?  Ssl  01:20   0:31
>> /usr/bin/ceph-osd --cluster=ceph -i 0 -f
>> root  68981  268  1.0 1564452 694352 ?  Ssl  01:20   0:37
>> /usr/bin/ceph-osd --cluster=ceph -i 1 -f
>>
>>
>>
>> tcmalloc 2.1  tp=never
>> -
>> USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
>>
>> root  69560  144  1.0 1544968 677584 ?  Ssl  01:21   0:20
>> /usr/bin/ceph-osd --cluster=ceph -i 0 -f
>> root  69578  167  1.0 1568620 704456 ?  Ssl  01:21   0:23
>> /usr/bin/ceph-osd --cluster=ceph -i 1 -f
>>
>>
>> root  70156  164  0.9 1519680 649776 ?  Ssl  01:21   0:16
>> /usr/bin/ceph-osd --cluster=ceph -i 0 -f
>> root  70174  214  1.0 1559772 692828 ?  Ssl  01:21   0:19
>> /usr/bin/ceph-osd --cluster=ceph -i 1 -f
>>
>> root  70757  202  0.9 1520376 650572 ?  Ssl  01:22   0:20
>> /usr/bin/ceph-osd --cluster=ceph -i 0 -f
>> root  70775  236  1.0 1560644 694088 ?  Ssl  01:22   0:23
>> /usr/bin/ceph-osd --cluster=ceph -i 1 -f
>>
>>
>>
>> jemalloc 3.6 tp = always
>> 
>> USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
>>
>> root  92005 46.1  1.4 2033864 967512 ?  Ssl  01:00   0:04
>> /usr/bin/ceph-osd --cluster=ceph -i 5 -f
>> root  92027 45.5  1.4 2021624 963536 ?  Ssl  01:00   0:04
>> /usr/bin/ceph-osd --cluster=ceph -i 4 -f
>>
>>
>>
>> root  92703  191  1.5 2138724 1002376 ? Ssl  01:02   1:16
>> /usr/bin/ceph-osd --cluster=ceph -i 5 -f
>> root  92721  183  1.5 2126228 986448 ?  Ssl  01:02   1:13
>> /usr/bin/ceph-osd --cluster=ceph -i 4 -f
>>
>>
>> root  93366  258  1.4 2139052 984132 ?  Ssl  01:03   1:09
>> /usr/bin/ceph-osd --cluster=ceph -i 5 -f
>> root  93384  250  1.5 2126244 990348 ?  Ssl  01:03   1:07
>> /usr/bin/ceph-osd --cluster=ceph -i 4 -f
>>
>>
>>
>> jemalloc 3.6 tp = never
>> ---
>> USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
>>
>> root  93990  238  1.1 2105812 762628 ?  Ssl  01:04   1:16
>> /usr/bin/ceph-osd --cluster=ceph -i 4 -f
>> root  94033  263  1.1 2118288 781768 ?  Ssl  01:04   1:18
>> /usr/bin/ceph-osd --cluster=ceph -i 5 -f
>>
>>
>> root  94656  266  1.1 2139096 781392 ?  Ssl  01:05   0:58
>> /usr/bin/ceph-osd --cluster=ceph -i 5 -f
>> root  94674  257  1.1 2126316 760632 ?  Ssl  01:05   0:56
>> /usr/bin/ceph-osd --cluster=ceph -i 4 -f
>>
>> root  95317  297  1.1 2135044 780532 ?  Ssl  01:06   0:35
>> /usr/bin/ceph-osd --cluster=ceph -i 5 -f
>> root  95335  284  1.1 2112016 760972 ?  Ssl  01:06   0:34
>> /usr/bin/ceph-osd --cluster=ceph -i 4 -f
>>
>>
>>
>> jemalloc 4.0 tp = always
>> 
>> USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
>>
>> root 100275  198  1.3 1784520 880288 ?  Ssl  01:14   0:45
>> /usr/bin/ceph-osd --cluster=ceph -i 4 -f
>> root 100320  239  1.1 1793184 760824 ?  Ssl  01:14   0:47
>> /usr/bin/ceph-osd --cluster=ceph -i

Re: [ceph-users] Cannot add/create new monitor on ceph v0.94.3

2015-09-08 Thread Brad Hubbard

I'd suggest starting the mon with debugging turned right up and taking
a good look at the output.

Cheers,
Brad

- Original Message -
> From: "Fangzhe Chang (Fangzhe)" 
> To: "Brad Hubbard" 
> Cc: ceph-users@lists.ceph.com
> Sent: Wednesday, 9 September, 2015 7:35:42 AM
> Subject: RE: [ceph-users] Cannot add/create new monitor on ceph v0.94.3
> 
> Thanks for the answer.
> 
> NTP is running on both the existing monitor and the new monitor being
> installed.
> I did run ceph-deploy in the same directory as I created the cluster.
> However, I need to tweak the options supplied to ceph-deploy a little bit
> since I was running it behind a corporate firewall.
> 
> I noticed the ceph-create-keys process is running on the background. When I
> ran it manually, I got the following results.
> 
> $ python /usr/sbin/ceph-create-keys --cluster ceph -i 
> INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
> INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
> INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
> 
> 
> -Original Message-
> From: Brad Hubbard [mailto:bhubb...@redhat.com]
> Sent: Sunday, September 06, 2015 11:58 PM
> To: Chang, Fangzhe (Fangzhe)
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Cannot add/create new monitor on ceph v0.94.3
> 
> - Original Message -
> > From: "Fangzhe Chang (Fangzhe)" 
> > To: ceph-users@lists.ceph.com
> > Sent: Saturday, 5 September, 2015 6:26:16 AM
> > Subject: [ceph-users] Cannot add/create new monitor on ceph v0.94.3
> > 
> > 
> > 
> > Hi,
> > 
> > I’m trying to add a second monitor using ‘ceph-deploy mon new  > hostname>’. However, the log file shows the following error:
> > 
> > 2015-09-04 16:13:54.863479 7f4cbc3f7700 0 cephx: verify_reply couldn't
> > decrypt with error: error decoding block for decryption
> > 
> > 2015-09-04 16:13:54.863491 7f4cbc3f7700 0 -- :6789/0
> > >> :6789/0 pipe(0x413 sd=12 :57954 s=1 pgs=0
> > cs=0 l=0 c=0x3f29600).failed verifying authorize reply
> 
> A couple of things to look at are verifying all your clocks are in sync (ntp
> helps here) and making sure you are running ceph-deploy in the directory you
> used to create the cluster.
> 
> > 
> > 
> > 
> > Does anyone know how to resolve this?
> > 
> > Thanks
> > 
> > 
> > 
> > Fangzhe
> > 
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Ceph Tuning + KV backend

2015-09-08 Thread Haomai Wang

On Wed, Sep 9, 2015 at 3:00 AM, Niels Jakob Darger  wrote:
> Hello,
>
> Excuse my ignorance, I have just joined this list and started using Ceph
> (which looks very cool). On AWS I have set up a 5-way Ceph cluster (4 vCPUs,
> 32G RAM, dedicated SSDs for system, osd and journal) with the Object
> Gateway. For the purpose of simplicity of the test all the nodes are
> identical and each node contains osd, mon and the radosgw.
>
> I have run parallel inserts from all 5 nodes, I can insert about 10-12000
> objects per minute. The insert rate is relatively constant regardless of
> whether I run 1 insert process per node or 5, i.e. a total of 5 or 25.
>
> These are just numbers, of course, and not meaningful without more context.
> But looking at the nodes I think the cluster could run faster - the CPUs are
> not doing much, there isn't much I/O wait - only about 50% utilisation and
> only on the SSDs storing the journals on two of the nodes (I've set the
> replication to 2), the other file systems are almost idle. The network is
> far from maxed out and the processes are not using much memory. I've tried
> increasing osd_op_threads to 5 or 10 but that didn't make much difference.
>
> The co-location of all the daemons on all the nodes may not be ideal, but
> since there isn't much resource use or contention I don't think that's the
> problem.
>
> So two questions:
>
> 1) Are there any good resources on tuning Ceph? There's quite a few posts
> out there testing and timing specific setups with RAID controller X and 12
> disks of brand Y etc. but I'm more looking for general tuning guidelines -
> explaining the big picture.
>
> 2) What's the status of the keyvalue backend? The documentation on
> http://ceph.com/docs/master/rados/configuration/keyvaluestore-config-ref/
> looks nice but I found it difficult to work out how to switch to the
> keyvalue backend, the Internet suggests "osd objectstore =
> keyvaluestore-dev", but that didn't seem to work so I checked out the source
> code and it looks like "osd objectstore = keyvaluestore" does it. However,
> it results in nasty things in the log file ("*** experimental feature
> 'keyvaluestore' is not enabled *** This feature is marked as experimental
> ...") so perhaps it's too early to use the KV backend for production use?
>

Hmm, we need to modify doc to swith to "keyvaluestore". Since this is
a experimental feature, it doesn't recommended to used for production
usage.

> Thanks & regards,
> Jakob
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] jemalloc and transparent hugepage

>>Have you noticed any performance difference with tp=never?

No difference.

I think hugepage could speedup big memory sets like 100-200GB, but for 1-2GB 
they are no noticable difference.






- Mail original -
De: "Mark Nelson" 
À: "aderumier" , "ceph-devel" 
, "ceph-users" 
Cc: "Somnath Roy" 
Envoyé: Mercredi 9 Septembre 2015 01:49:35
Objet: Re: [ceph-users] jemalloc and transparent hugepage

Excellent investigation Alexandre! Have you noticed any performance 
difference with tp=never? 

Mark 

On 09/08/2015 06:33 PM, Alexandre DERUMIER wrote: 
> I have done small benchmark with tcmalloc and jemalloc, transparent 
> hugepage=always|never. 
> 
> for tcmalloc, they are no difference. 
> but for jemalloc, the difference is huge (around 25% lower with tp=never). 
> 
> jemmaloc 4.6.0+tp=never vs tcmalloc use 10% more RSS memory 
> 
> jemmaloc 4.0+tp=never almost use same RSS memory than tcmalloc ! 
> 
> 
> I don't have monitored memory usage in recovery, but I think it should help 
> too. 
> 
> 
> 
> 
> tcmalloc 2.1 tp=always 
> --- 
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> 
> root 67746 120 1.0 1531220 671152 ? Ssl 01:18 0:43 /usr/bin/ceph-osd 
> --cluster=ceph -i 0 -f 
> root 67764 144 1.0 1570256 711232 ? Ssl 01:18 0:51 /usr/bin/ceph-osd 
> --cluster=ceph -i 1 -f 
> 
> root 68363 220 0.9 1522292 655888 ? Ssl 01:19 0:46 /usr/bin/ceph-osd 
> --cluster=ceph -i 0 -f 
> root 68381 261 1.0 1563396 702500 ? Ssl 01:19 0:55 /usr/bin/ceph-osd 
> --cluster=ceph -i 1 -f 
> 
> root 68963 228 1.0 1519240 666196 ? Ssl 01:20 0:31 /usr/bin/ceph-osd 
> --cluster=ceph -i 0 -f 
> root 68981 268 1.0 1564452 694352 ? Ssl 01:20 0:37 /usr/bin/ceph-osd 
> --cluster=ceph -i 1 -f 
> 
> 
> 
> tcmalloc 2.1 tp=never 
> - 
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> 
> root 69560 144 1.0 1544968 677584 ? Ssl 01:21 0:20 /usr/bin/ceph-osd 
> --cluster=ceph -i 0 -f 
> root 69578 167 1.0 1568620 704456 ? Ssl 01:21 0:23 /usr/bin/ceph-osd 
> --cluster=ceph -i 1 -f 
> 
> 
> root 70156 164 0.9 1519680 649776 ? Ssl 01:21 0:16 /usr/bin/ceph-osd 
> --cluster=ceph -i 0 -f 
> root 70174 214 1.0 1559772 692828 ? Ssl 01:21 0:19 /usr/bin/ceph-osd 
> --cluster=ceph -i 1 -f 
> 
> root 70757 202 0.9 1520376 650572 ? Ssl 01:22 0:20 /usr/bin/ceph-osd 
> --cluster=ceph -i 0 -f 
> root 70775 236 1.0 1560644 694088 ? Ssl 01:22 0:23 /usr/bin/ceph-osd 
> --cluster=ceph -i 1 -f 
> 
> 
> 
> jemalloc 3.6 tp = always 
>  
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> 
> root 92005 46.1 1.4 2033864 967512 ? Ssl 01:00 0:04 /usr/bin/ceph-osd 
> --cluster=ceph -i 5 -f 
> root 92027 45.5 1.4 2021624 963536 ? Ssl 01:00 0:04 /usr/bin/ceph-osd 
> --cluster=ceph -i 4 -f 
> 
> 
> 
> root 92703 191 1.5 2138724 1002376 ? Ssl 01:02 1:16 /usr/bin/ceph-osd 
> --cluster=ceph -i 5 -f 
> root 92721 183 1.5 2126228 986448 ? Ssl 01:02 1:13 /usr/bin/ceph-osd 
> --cluster=ceph -i 4 -f 
> 
> 
> root 93366 258 1.4 2139052 984132 ? Ssl 01:03 1:09 /usr/bin/ceph-osd 
> --cluster=ceph -i 5 -f 
> root 93384 250 1.5 2126244 990348 ? Ssl 01:03 1:07 /usr/bin/ceph-osd 
> --cluster=ceph -i 4 -f 
> 
> 
> 
> jemalloc 3.6 tp = never 
> --- 
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> 
> root 93990 238 1.1 2105812 762628 ? Ssl 01:04 1:16 /usr/bin/ceph-osd 
> --cluster=ceph -i 4 -f 
> root 94033 263 1.1 2118288 781768 ? Ssl 01:04 1:18 /usr/bin/ceph-osd 
> --cluster=ceph -i 5 -f 
> 
> 
> root 94656 266 1.1 2139096 781392 ? Ssl 01:05 0:58 /usr/bin/ceph-osd 
> --cluster=ceph -i 5 -f 
> root 94674 257 1.1 2126316 760632 ? Ssl 01:05 0:56 /usr/bin/ceph-osd 
> --cluster=ceph -i 4 -f 
> 
> root 95317 297 1.1 2135044 780532 ? Ssl 01:06 0:35 /usr/bin/ceph-osd 
> --cluster=ceph -i 5 -f 
> root 95335 284 1.1 2112016 760972 ? Ssl 01:06 0:34 /usr/bin/ceph-osd 
> --cluster=ceph -i 4 -f 
> 
> 
> 
> jemalloc 4.0 tp = always 
>  
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> 
> root 100275 198 1.3 1784520 880288 ? Ssl 01:14 0:45 /usr/bin/ceph-osd 
> --cluster=ceph -i 4 -f 
> root 100320 239 1.1 1793184 760824 ? Ssl 01:14 0:47 /usr/bin/ceph-osd 
> --cluster=ceph -i 5 -f 
> 
> 
> root 100897 200 1.3 1765780 891256 ? Ssl 01:15 0:50 /usr/bin/ceph-osd 
> --cluster=ceph -i 4 -f 
> root 100942 245 1.1 1817436 746956 ? Ssl 01:15 0:53 /usr/bin/ceph-osd 
> --cluster=ceph -i 5 -f 
> 
> root 101517 196 1.3 1769904 877132 ? Ssl 01:16 0:33 /usr/bin/ceph-osd 
> --cluster=ceph -i 4 -f 
> root 101562 258 1.1 1805172 746532 ? Ssl 01:16 0:36 /usr/bin/ceph-osd 
> --cluster=ceph -i 5 -f 
> 
> 
> jemalloc 4.0 tp = never 
> --- 
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> 
> root 98362 87.8 1.0 1841748 678848 ? Ssl 01:10 0:53 /usr/bin/ceph-osd 
> --cluster=ceph -i 4 -f 
> root 98405 97.0 1.0 1846328

Re: [ceph-users] jemalloc and transparent hugepage

2015-09-08 Thread Sage Weil

On Wed, 9 Sep 2015, Alexandre DERUMIER wrote:
> >>Have you noticed any performance difference with tp=never?
> 
> No difference.
> 
> I think hugepage could speedup big memory sets like 100-200GB, but for 
> 1-2GB they are no noticable difference.

Is this something we can set with mallctl[1] at startup?

sage

[1] http://www.canonware.com/download/jemalloc/jemalloc-latest/doc/jemalloc.html

> 
> 
> 
> 
> 
> 
> - Mail original -
> De: "Mark Nelson" 
> À: "aderumier" , "ceph-devel" 
> , "ceph-users" 
> Cc: "Somnath Roy" 
> Envoyé: Mercredi 9 Septembre 2015 01:49:35
> Objet: Re: [ceph-users] jemalloc and transparent hugepage
> 
> Excellent investigation Alexandre! Have you noticed any performance 
> difference with tp=never? 
> 
> Mark 
> 
> On 09/08/2015 06:33 PM, Alexandre DERUMIER wrote: 
> > I have done small benchmark with tcmalloc and jemalloc, transparent 
> > hugepage=always|never. 
> > 
> > for tcmalloc, they are no difference. 
> > but for jemalloc, the difference is huge (around 25% lower with tp=never). 
> > 
> > jemmaloc 4.6.0+tp=never vs tcmalloc use 10% more RSS memory 
> > 
> > jemmaloc 4.0+tp=never almost use same RSS memory than tcmalloc ! 
> > 
> > 
> > I don't have monitored memory usage in recovery, but I think it should help 
> > too. 
> > 
> > 
> > 
> > 
> > tcmalloc 2.1 tp=always 
> > --- 
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> > 
> > root 67746 120 1.0 1531220 671152 ? Ssl 01:18 0:43 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 67764 144 1.0 1570256 711232 ? Ssl 01:18 0:51 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > root 68363 220 0.9 1522292 655888 ? Ssl 01:19 0:46 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 68381 261 1.0 1563396 702500 ? Ssl 01:19 0:55 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > root 68963 228 1.0 1519240 666196 ? Ssl 01:20 0:31 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 68981 268 1.0 1564452 694352 ? Ssl 01:20 0:37 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > 
> > 
> > tcmalloc 2.1 tp=never 
> > - 
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> > 
> > root 69560 144 1.0 1544968 677584 ? Ssl 01:21 0:20 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 69578 167 1.0 1568620 704456 ? Ssl 01:21 0:23 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > 
> > root 70156 164 0.9 1519680 649776 ? Ssl 01:21 0:16 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 70174 214 1.0 1559772 692828 ? Ssl 01:21 0:19 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > root 70757 202 0.9 1520376 650572 ? Ssl 01:22 0:20 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 70775 236 1.0 1560644 694088 ? Ssl 01:22 0:23 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > 
> > 
> > jemalloc 3.6 tp = always 
> >  
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> > 
> > root 92005 46.1 1.4 2033864 967512 ? Ssl 01:00 0:04 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > root 92027 45.5 1.4 2021624 963536 ? Ssl 01:00 0:04 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > 
> > 
> > 
> > root 92703 191 1.5 2138724 1002376 ? Ssl 01:02 1:16 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > root 92721 183 1.5 2126228 986448 ? Ssl 01:02 1:13 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > 
> > 
> > root 93366 258 1.4 2139052 984132 ? Ssl 01:03 1:09 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > root 93384 250 1.5 2126244 990348 ? Ssl 01:03 1:07 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > 
> > 
> > 
> > jemalloc 3.6 tp = never 
> > --- 
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> > 
> > root 93990 238 1.1 2105812 762628 ? Ssl 01:04 1:16 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > root 94033 263 1.1 2118288 781768 ? Ssl 01:04 1:18 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > 
> > 
> > root 94656 266 1.1 2139096 781392 ? Ssl 01:05 0:58 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > root 94674 257 1.1 2126316 760632 ? Ssl 01:05 0:56 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > 
> > root 95317 297 1.1 2135044 780532 ? Ssl 01:06 0:35 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > root 95335 284 1.1 2112016 760972 ? Ssl 01:06 0:34 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > 
> > 
> > 
> > jemalloc 4.0 tp = always 
> >  
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> > 
> > root 100275 198 1.3 1784520 880288 ? Ssl 01:14 0:45 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > root 100320 239 1.1 1793184 760824 ? Ssl 01:14 0:47 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > 
> > 
> > root 100897 200 1.3 1765780 891256 ? Ssl 01:15 0:50 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > root 100942 245 1.1 1817436 746956 ? Ssl 01:15 0:53

Re: [ceph-users] Still have orphaned rgw shadow files, ceph 0.94.3

2015-09-08 Thread Ben Hines

FYI, over the past week I have deleted over 50 TB of data from my
cluster of these objects. Almost all were from buckets that no longer
exist, and the fix tool did not find them. Fortunately i don't need
the data from these old buckets so deleting all objects by prefix
worked great.

Anyone managing a large RGW cluster should periodically make sure that
the pool use matches expected values. (replication factor * sum of
size_kb_actual for each rgw bucket)

-Ben

On Mon, Aug 31, 2015 at 3:53 PM, Yehuda Sadeh-Weinraub
 wrote:
> The bucket index objects are most likely in the .rgw.buckets.index pool.
>
> Yehuda
>
> On Mon, Aug 31, 2015 at 3:27 PM, Ben Hines  wrote:
>> Good call, thanks!
>>
>> Is there any risk of also deleting parts of the bucket index? I'm not
>> sure what the objects for the index itself look like, or if they are
>> in the .rgw.buckets pool.
>>
>>
>> On Mon, Aug 31, 2015 at 3:23 PM, Yehuda Sadeh-Weinraub
>>  wrote:
>>> Make sure you use the underscore also, e.g., "default.8873277.32_".
>>> Otherwise you could potentially erase objects you did't intend to,
>>> like ones who start with "default.8873277.320" and such.
>>>
>>> On Mon, Aug 31, 2015 at 3:20 PM, Ben Hines  wrote:
 Ok. I'm not too familiar with the inner workings of RGW, but i would
 assume that for a bucket with these parameters:

"id": "default.8873277.32",
"marker": "default.8873277.32",

 Tha it would be the only bucket using the files that start with
 "default.8873277.32"

 default.8873277.32__shadow_.OkYjjANx6-qJOrjvdqdaHev-LHSvPhZ_15
 default.8873277.32__shadow_.a2qU3qodRf_E5b9pFTsKHHuX2RUC12g_2



 On Mon, Aug 31, 2015 at 2:51 PM, Yehuda Sadeh-Weinraub
  wrote:
> As long as you're 100% sure that the prefix is only being used for the
> specific bucket that was previously removed, then it is safe to remove
> these objects. But please do double check and make sure that there's
> no other bucket that matches this prefix somehow.
>
> Yehuda
>
> On Mon, Aug 31, 2015 at 2:42 PM, Ben Hines  wrote:
>> No input, eh? (or maybe TL,DR for everyone)
>>
>> Short version: Presuming the bucket index shows blank/empty, which it
>> does and is fine, would me manually deleting the rados objects with
>> the prefix matching the former bucket's ID cause any problems?
>>
>> thanks,
>>
>> -Ben
>>
>> On Fri, Aug 28, 2015 at 4:22 PM, Ben Hines  wrote:
>>> Ceph 0.93->94.2->94.3
>>>
>>> I noticed my pool used data amount is about twice the bucket used data 
>>> count.
>>>
>>> This bucket was emptied long ago. It has zero objects:
>>> "globalcache01",
>>> {
>>> "bucket": "globalcache01",
>>> "pool": ".rgw.buckets",
>>> "index_pool": ".rgw.buckets.index",
>>> "id": "default.8873277.32",
>>> "marker": "default.8873277.32",
>>> "owner": "...",
>>> "ver": "0#12348839",
>>> "master_ver": "0#0",
>>> "mtime": "2015-03-08 11:44:11.00",
>>> "max_marker": "0#",
>>> "usage": {
>>> "rgw.none": {
>>> "size_kb": 0,
>>> "size_kb_actual": 0,
>>> "num_objects": 0
>>> },
>>> "rgw.main": {
>>> "size_kb": 0,
>>> "size_kb_actual": 0,
>>> "num_objects": 0
>>> }
>>> },
>>> "bucket_quota": {
>>> "enabled": false,
>>> "max_size_kb": -1,
>>> "max_objects": -1
>>> }
>>> },
>>>
>>>
>>>
>>> bucket check shows nothing:
>>>
>>> 16:07:09 root@sm-cephrgw4 ~ $ radosgw-admin bucket check
>>> --bucket=globalcache01 --fix
>>> []
>>> 16:07:27 root@sm-cephrgw4 ~ $ radosgw-admin bucket check
>>> --check-head-obj-locator --bucket=globalcache01 --fix
>>> {
>>> "bucket": "globalcache01",
>>> "check_objects": [
>>> ]
>>> }
>>>
>>>
>>> However, i see a lot of data for it on an OSD (all shadow files with
>>> escaped underscores)
>>>
>>> [root@sm-cld-mtl-008 current]# find . -name default.8873277.32* -print
>>> ./12.161_head/DIR_1/DIR_6/DIR_9/DIR_E/default.8873277.32\u\ushadow\u.Tos2Ms8w2BiEG7YJAZeE6zrrc\uwcHPN\u1__head_D886E961__c
>>> ./12.161_head/DIR_1/DIR_6/DIR_9/DIR_E/DIR_1/default.8873277.32\u\ushadow\u.Aa86mlEMvpMhRaTDQKHZmcxAReFEo2J\u1__head_4A71E961__c
>>> ./12.161_head/DIR_1/DIR_6/DIR_9/DIR_E/DIR_5/default.8873277.32\u\ushadow\u.KCiWEa4YPVaYw2FPjqvpd9dKTRBu8BR\u17__head_00B5E961__c
>>>

Re: [ceph-users] How to observed civetweb.

2015-09-08 Thread Vickie ch

Thanks a lot!!
One more question.  I can understand use haproxy is a better way for
loadbalance.
And github say civetweb already support https.
But I found some documents mention that civetweb need haproxy for https.
Which one is true?



Best wishes,
Mika


2015-09-09 2:21 GMT+08:00 Kobi Laredo :

> Vickie,
>
> You can add:
> *access_log_file=/var/log/civetweb/access.log
> error_log_file=/var/log/civetweb/error.log*
>
> to *rgw frontends* in ceph.conf though these logs are thin on info
> (Source IP, date, and request)
>
> Check out
> https://github.com/civetweb/civetweb/blob/master/docs/UserManual.md for
> more civetweb configs you can inject through  *rgw frontends* config
> attribute in ceph.conf
>
> We are currently testing tuning civetweb's num_threads
> and request_timeout_ms to improve radosgw performance
>
> *Kobi Laredo*
> *Cloud Systems Engineer* | (*408) 409-KOBI*
>
> On Tue, Sep 8, 2015 at 8:20 AM, Yehuda Sadeh-Weinraub 
> wrote:
>
>> You can increase the civetweb logs by adding 'debug civetweb = 10' in
>> your ceph.conf. The output will go into the rgw logs.
>>
>> Yehuda
>>
>> On Tue, Sep 8, 2015 at 2:24 AM, Vickie ch  wrote:
>> > Dear cephers,
>> >Just upgrade radosgw from apache to civetweb.
>> > It's really simple to installed and used. But I can't find any
>> parameters or
>> > logs to adjust(or observe) civetweb. (Like apache log).  I'm really
>> confuse.
>> > Any ideas?
>> >
>> >
>> > Best wishes,
>> > Mika
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] maximum object size

2015-09-08 Thread Ilya Dryomov

On Tue, Sep 8, 2015 at 6:54 PM, HEWLETT, Paul (Paul)
 wrote:
> Hi All
>
> We have recently encountered a problem on Hammer (0.94.2) whereby we
> cannot write objects > 2GB in size to the rados backend.
> (NB not RadosGW, CephFS or RBD)
>
> I found the following issue
> https://wiki.ceph.com/Planning/Blueprints/Firefly/Object_striping_in_librad
> os which seems to address this but no progress reported.
>
> What are the implications of writing such large objects to RADOS? What
> impact is expected on the XFS backend particularly regarding the size and
> location of the journal?

Huge RADOS objects are a bad idea.  Think about the distribution of
objects across PGs (i.e. sets of OSDs) and how different OSDs will be
utilized - there is a bunch of reasons why OSDs reject objects larger
than 90M by default.

AFAIK libradosstriper was merged quite a while ago, before hammer
I think, so it should be usable - people at CERN are building other
things on top of it.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] maximum object size

2015-09-08 Thread Ilya Dryomov

On Tue, Sep 8, 2015 at 7:30 PM, HEWLETT, Paul (Paul)
 wrote:
> Hi Ilya
>
> Thanks for that - libradosstriper is what we need - any notes available on
> usage?

No, I'm afraid not.  include/radosstriper/libradosstriper.h and
libradosstriper.hpp should be enough to get you started - there is
a fair amount of detail in the comments.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] maximum object size

2015-09-08 Thread HEWLETT, Paul (Paul)

I found the description in the source code. Apparently one sets attributes
on the object to force striping.

Regards
Paul

On 08/09/2015 17:39, "Ilya Dryomov"  wrote:

>On Tue, Sep 8, 2015 at 7:30 PM, HEWLETT, Paul (Paul)
> wrote:
>> Hi Ilya
>>
>> Thanks for that - libradosstriper is what we need - any notes available
>>on
>> usage?
>
>No, I'm afraid not.  include/radosstriper/libradosstriper.h and
>libradosstriper.hpp should be enough to get you started - there is
>a fair amount of detail in the comments.
>
>Thanks,
>
>Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked

2015-09-08 Thread Lincoln Bryant

For whatever it’s worth, my problem has returned and is very similar to yours. 
Still trying to figure out what’s going on over here.

Performance is nice for a few seconds, then goes to 0. This is a similar setup 
to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc)

  384  16 29520 29504   307.287  1188 0.0492006  0.208259
  385  16 29813 29797   309.532  1172 0.0469708  0.206731
  386  16 30105 30089   311.756  1168 0.0375764  0.205189
  387  16 30401 30385   314.009  1184  0.036142  0.203791
  388  16 30695 30679   316.231  1176 0.0372316  0.202355
  389  16 30987 30971318.42  1168 0.0660476  0.200962
  390  16 31282 31266   320.628  1180 0.0358611  0.199548
  391  16 31568 31552   322.734  1144 0.0405166  0.198132
  392  16 31857 31841   324.859  1156 0.0360826  0.196679
  393  16 32090 32074   326.404   932 0.0416869   0.19549
  394  16 32205 32189   326.743   460 0.0251877  0.194896
  395  16 32302 32286   326.897   388 0.0280574  0.194395
  396  16 32348 32332   326.537   184 0.0256821  0.194157
  397  16 32385 32369   326.087   148 0.0254342  0.193965
  398  16 32424 32408   325.659   156 0.0263006  0.193763
  399  16 32445 32429   325.05484 0.0233839  0.193655
2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: 0.193655
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  400  16 32445 32429   324.241 0 -  0.193655
  401  16 32445 32429   323.433 0 -  0.193655
  402  16 32445 32429   322.628 0 -  0.193655
  403  16 32445 32429   321.828 0 -  0.193655
  404  16 32445 32429   321.031 0 -  0.193655
  405  16 32445 32429   320.238 0 -  0.193655
  406  16 32445 32429319.45 0 -  0.193655
  407  16 32445 32429   318.665 0 -  0.193655

needless to say, very strange.

—Lincoln


> On Sep 7, 2015, at 3:35 PM, Vickey Singh  wrote:
> 
> Adding ceph-users.
> 
> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh  
> wrote:
> 
> 
> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke  wrote:
> Hi Vickey,
> Thanks for your time in replying to my problem.
>  
> I had the same rados bench output after changing the motherboard of the 
> monitor node with the lowest IP...
> Due to the new mainboard, I assume the hw-clock was wrong during startup. 
> Ceph health show no errors, but all VMs aren't able to do IO (very high load 
> on the VMs - but no traffic).
> I stopped the mon, but this don't changed anything. I had to restart all 
> other mons to get IO again. After that I started the first mon also (with the 
> right time now) and all worked fine again...
> 
> Thanks i will try to restart all OSD / MONS and report back , if it solves my 
> problem 
> 
> Another posibility:
> Do you use journal on SSDs? Perhaps the SSDs can't write to garbage 
> collection?
> 
> No i don't have journals on SSD , they are on the same OSD disk. 
> 
> 
> 
> Udo
> 
> 
> On 07.09.2015 16:36, Vickey Singh wrote:
>> Dear Experts
>> 
>> Can someone please help me , why my cluster is not able write data.
>> 
>> See the below output  cur MB/S  is 0  and Avg MB/s is decreasing.
>> 
>> 
>> Ceph Hammer  0.94.2
>> CentOS 6 (3.10.69-1)
>> 
>> The Ceph status says OPS are blocked , i have tried checking , what all i 
>> know 
>> 
>> - System resources ( CPU , net, disk , memory )-- All normal 
>> - 10G network for public and cluster network  -- no saturation 
>> - Add disks are physically healthy 
>> - No messages in /var/log/messages OR dmesg
>> - Tried restarting OSD which are blocking operation , but no luck
>> - Tried writing through RBD  and Rados bench , both are giving same problemm
>> 
>> Please help me to fix this problem.
>> 
>> #  rados bench -p rbd 60 write
>>  Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 
>> objects
>>  Object prefix: benchmark_data_stor1_1791844
>>sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>  0   0 0 0 0 0 - 0
>>  1  16   125   109   435.873   436  0.022076 0.0697864
>>  2  16   139   123   245.94856  0.246578 0.0674407
>>  3  16   139   123   163.969 0 - 0.0674407
>>  4  16   139   123   122.978 0 - 0.0674407
>>  5  16   139   12398.383 0 - 0.0674407
>>  6  16   139   123   81.9865 0 - 0.0674407
>>  7  16

Re: [ceph-users] rebalancing taking very long time

2015-09-08 Thread Alphe Salas

I can say exactly the same I am using ceph sin 0.38 and I never get osd 
so laggy than with 0.94. rebalancing /rebuild algorithm is crap in 0.94 
serriously I have 2 osd serving 2 discs of 2TB and 4 GB of RAM osd takes 
1.6GB each !!! serriously ! that makes avanche snow.


Let me be straight and explain what changed.

in 0.38 you ALWAYS could stop the ceph cluster and then start it up it 
would evaluate if everyone is back if there is enough replicas then 
start rebuilding /rebalancing what needed of course like 10 minutes was 
necesary to bring up ceph cluster but then the rebuilding /rebalancing 
process was smooth.
With 0.94 first you have 2 osd too full at 95 % and 4 osd at 63% over 20 
osd. then you get a disc crash. so ceph starts automatically to rebuild 
and rebalance stuff. and there osd start to lag then to crash
you stop ceph cluster you change the drive restart the ceph cluster 
stops all rebuild process setting no-backfill, norecovey noscrub 
nodeep-scrub you rm the old osd create a new one wait for all osd
to be in and up and then starts rebuilding lag/rebalancing since it is 
automated not much a choice there.


And again all osd are stuck in enless lag/down/recovery intent cycle...

It is a pain serriously. 5 days after changing the faulty disc it is 
still locked in the lag/down/recovery cycle.


Sur it can be argued that my machines are really ressource limited and 
that I should buy 3 thousand dollar worth server at least. But intil 
0.72 that rebalancing /rebuilding process was working smoothly on the 
same hardware.


It seems to me that the rebalancing/rebuilding algorithm is more strict 
now than it was in the past. in the past only what really really needed 
to be rebuild or rebalance was rebalanced or rebuild.


I can still delete all and go back to 0.72... like I should buy a cray 
T-90 to not have anymore problems and have ceph run smoothly. But this 
will not help making ceph a better product.


for me ceph 0.94 is like windows vista...

Alphe Salas
I.T ingeneer

On 09/08/2015 10:20 AM, Gregory Farnum wrote:

On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko  wrote:

When I lose a disk OR replace a OSD in my POC ceph cluster, it takes a very
long time to rebalance.  I should note that my cluster is slightly unique in
that I am using cephfs(shouldn't matter?) and it currently contains about
310 million objects.

The last time I replaced a disk/OSD was 2.5 days ago and it is still
rebalancing.  This is on a cluster with no client load.

The configurations is 5 hosts with 6 x 1TB 7200rpm SATA OSD's & 1 850 Pro
SSD which contains the journals for said OSD's.  Thats means 30 OSD's in
total.  System disk is on its own disk.  I'm also using a backend network
with single Gb NIC.  THe rebalancing rate(objects/s) seems to be very slow
when it is close to finishingsay <1% objects misplaced.

It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
with no load on the cluster.  Are my expectations off?


Possibly...Ceph basically needs to treat each object as a single IO.
If you're recovering from a failed disk then you've got to replicate
roughly 310 million * 3 / 30 = 31 million objects. If it's perfectly
balanced across 30 disks that get 80 IOPS that's 12916 seconds (~3.5
hours) worth of work just to read each file — and in reality it's
likely to take more than one IO to read the file, and then you have to
spend a bunch to write it as well.



I'm not sure if my pg_num/pgp_num needs to be changed OR the rebalance time
is dependent on the number of objects in the pool.  These are thoughts i've
had but am not certain are relevant here.


Rebalance time is dependent on the number of objects in the pool. You
*might* see an improvement by increasing "osd max push objects" from
its default of 10...or you might not. That many small files isn't
something I've explored.
-Greg



$ sudo ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

$ sudo ceph -s
[sudo] password for bababurko:
 cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
  health HEALTH_WARN
 5 pgs backfilling
 5 pgs stuck unclean
 recovery 3046506/676638611 objects misplaced (0.450%)
  monmap e1: 3 mons at
{cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
 election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
  mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
  osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
   pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
 18319 GB used, 9612 GB / 27931 GB avail
 3046506/676638611 objects misplaced (0.450%)
 2095 active+clean
   12 active+clean+scrubbing+deep
5 active+remapped+backfilling
recovery io 2294 kB/s, 147 objects/s

$ sudo rados df
pool name KB  objects   clones degraded
unfound

Re: [ceph-users] How to observed civetweb.

2015-09-08 Thread Yehuda Sadeh-Weinraub

You can increase the civetweb logs by adding 'debug civetweb = 10' in
your ceph.conf. The output will go into the rgw logs.

Yehuda

On Tue, Sep 8, 2015 at 2:24 AM, Vickie ch  wrote:
> Dear cephers,
>Just upgrade radosgw from apache to civetweb.
> It's really simple to installed and used. But I can't find any parameters or
> logs to adjust(or observe) civetweb. (Like apache log).  I'm really confuse.
> Any ideas?
>
>
> Best wishes,
> Mika
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-08 Thread Mark Nelson

On 09/07/2015 11:34 AM, Quentin Hartman wrote:

fwiw, I am not confused about the various types of SSDs that Samsung
offers. I knew exactly what I was getting when I ordered them. Based on
their specs and my WAG on how much writing I would be doing they should
have lasted about 6 years. Turns out my estimates were wrong, but even
adjusting for actual use, I should have gotten about 18 months out of
these drives, but I have them dying now at 9 months, with about half of
their theoretical life left.

A list of hardware that is known to work well would be incredibly
valuable to people getting started. It doesn't have to be exhaustive,
nor does it have to provide all the guidance someone could want. A
simple "these things have worked for others" would be sufficient. If
nothing else, it will help people justify more expensive gear when their
approval people say "X seems just as good and is cheaper, why can't we
get that?".

So I have my opinions on different drives, but I think we do need to be
really careful not to appear to endorse or pick on specific vendors.
The more we can stick to high-level statements like:

- Drives should have high write endurance
- Drives should perform well with O_DSYNC writes
- Drives should support power loss protection for data in motion

The better I think. Once those are established, I think it's reasonable
to point out that certain drives meet (or do not meet) those criteria
and get feedback from the community as to whether or not vendor's
marketing actually reflects reality. It'd also be really nice to see
more information available like the actual hardware (capacitors, flash
cells, etc) used in the drives. I've had to show photos of the innards
of specific drives to vendors to get them to give me accurate
information regarding certain drive capabilities. Having a database of
such things available to the community would be really helpful.

To that point, I think perhaps though something more important than a
list of known "good" hardware would be a list of known "bad" hardware,

I'm rather hesitant to do this unless it's been specifically confirmed
by the vendor. It's too easy to point fingers (see the recent kernel
trim bug situation).

and perhaps some more experience about what kind of write volume people
should reasonably expect. Setting aside for a moment the early death
problem the recent Samsung drives clearly have (I wonder if it's a
side-effect of the "3D-NAND" tech?) I wouldn't have gotten them had my
estimates told me I'd only get 18 months out of them. That would have
also provided me the information I needed to justify the DC-class drives
that cost four times as much to those that approve purchases. Without
that critical piece of information, I'm left trying to justify thousands
of extra dollars with only "because they're better".

Also, I talked to a Samsung rep last week and he told me the DC 845 line
has been discontinued. The DC-class drives from Samsung are now model
pm863. They are theoretically on the market, but I've not been able to
find them in stock anywhere.

On Mon, Sep 7, 2015 at 4:22 AM, Jan Schermer > wrote:

It is not just a question of which SSD.
It's the combination of distribution (kernel version), disk
controller and firmware, SSD revision and firmware.

There are several ways to select hardware
1) the most traditional way where you build your BoM on a single
vendor - so you buy servers including SSDs and HBAs as a single unit
and then scream at the vendor when it doesn't work. I had a good
experience with vendors in this scenario.
2) based on Hardware Compatibility Lists - usually means you can't
use tha latest hardware. For example LSI doesn't list most SSDs as
compatible, or they only list really old firmware versions.
Unusable, nobody will really help you.
3) You get a sample and test it, and you hope you will get the same
hardware when you order in bulk later. We went this route and got
nothing but trouble when Kingston changed their SSDs completely
without changing their PN.

Would we recommend s3700/3710 for Ceph? Absolutely. But there are
still people who have trouble with them in combination with LSI
controllers.
Can we recommend Samsung 845 DC PRO then? I can say it worked nicely
with my hardware. But surely some people had trouble with it.

I "vote" against creating such a list because of all those reasons,
it could get someone in trouble.

Jan

On 07 Sep 2015, at 11:14, Andrija Panic > wrote:

There is

http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

On the other hand, I'm not sure if SSD vendors would be happy to
see their device listed performing total crap (for Journaling)
...but yes, I vote for having some oficial

Re: [ceph-users] CephFS/Fuse : detect package upgrade to remount

On Tue, Sep 8, 2015 at 2:33 PM, Florent B  wrote:
>
>
> On 09/08/2015 03:26 PM, Gregory Farnum wrote:
>> On Fri, Sep 4, 2015 at 9:15 AM, Florent B  wrote:
>>> Hi everyone,
>>>
>>> I would like to know if there is a way on Debian to detect an upgrade of
>>> ceph-fuse package, that "needs" remouting CephFS.
>>>
>>> When I upgrade my systems, I do a "aptitude update && aptitude
>>> safe-upgrade".
>>>
>>> When ceph-fuse package is upgraded, it would be nice to remount all
>>> CephFS points,  I suppose.
>>>
>>> Does someone did this ?
>> I'm not sure how this could work. It'd be nice to smoothly upgrade for
>> users, but
>> 1) We don't automatically restart the OSD or monitor daemons on
>> upgrade, because users want to control how many of their processes are
>> down at once (and the load spike of a rebooting OSD),
>> 2) I'm not sure how you could safely/cleanly restart a process that's
>> serving a filesystem. It's not like we can force users to stop using
>> the cephfs mountpoint and then reopen all their files after we reboot.
>> -Greg
>
> Hi Greg,
>
> I understand.
> It could be something like this : a command (or temp file) containing
> *running* version of CephFS (per mount point, or per system).
> Of course we can then get *installed* version of CephFS.
> And if different, umount point.

I guess I don't see how that helps compared to just remembering when
you upgraded the package and comparing that to the running time of the
ceph-fuse process.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

2015-09-08 Thread Andras Pataki

Hi Sam,

I saw that ceph 0.94.3 is out and it contains a resolution to the issue below 
(http://tracker.ceph.com/issues/12577).  I installed it on our cluster, but 
unfortunately it didn't resolve the issue.  Same as before, I have a couple of 
inconsistent pg's, and run ceph pg repair on them - the OSD says:

2015-09-08 11:21:53.930324 7f49c17ea700  0 log_channel(cluster) log [INF] : 
2.439 repair starts
2015-09-08 11:27:57.708394 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
repair 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
0xb3d78a6e != 0xa3944ad0
2015-09-08 11:28:32.359938 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
2.439 repair 1 errors, 0 fixed
2015-09-08 11:28:32.364506 7f49c17ea700  0 log_channel(cluster) log [INF] : 
2.439 deep-scrub starts
2015-09-08 11:29:18.650876 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
0xb3d78a6e != 0xa3944ad0
2015-09-08 11:29:23.136109 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
2.439 deep-scrub 1 errors

$ ceph tell osd.* version | grep version | sort | uniq -c
 94 "version": "ceph version 0.94.3 
(95cefea9fd9ab740263bf8bb4796fd864d9afe2b)"

Could you have another look?

Thanks,

Andras



From: Andras Pataki
Sent: Monday, August 3, 2015 4:09 PM
To: Samuel Just
Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

Done: http://tracker.ceph.com/issues/12577
BTW, I¹m using the latest release 0.94.2 on all machines.

Andras


On 8/3/15, 3:38 PM, "Samuel Just"  wrote:

>Hrm, that's certainly supposed to work.  Can you make a bug?  Be sure
>to note what version you are running (output of ceph-osd -v).
>-Sam
>
>On Mon, Aug 3, 2015 at 12:34 PM, Andras Pataki
> wrote:
>> Summary: I am having problems with inconsistent PG's that the 'ceph pg
>> repair' command does not fix.  Below are the details.  Any help would be
>> appreciated.
>>
>> # Find the inconsistent PG's
>> ~# ceph pg dump | grep inconsistent
>> dumped all in format plain
>> 2.439 42080 00 017279507143 31033103 active+clean+inconsistent2015-08-03
>> 14:49:17.29288477323'2250145 77480:890566 [78,54]78 [78,54]78
>> 77323'22501452015-08-03 14:49:17.29253877323'2250145 2015-08-03
>> 14:49:17.292538
>> 2.8b9 40830 00 016669590823 30513051 active+clean+inconsistent2015-08-03
>> 14:46:05.14006377323'2249886 77473:897325 [7,72]7 [7,72]7
>> 77323'22498862015-08-03 14:22:47.83406377323'2249886 2015-08-03
>> 14:22:47.834063
>>
>> # Look at the first one:
>> ~# ceph pg deep-scrub 2.439
>> instructing pg 2.439 on osd.78 to deep-scrub
>>
>> # The logs of osd.78 show:
>> 2015-08-03 15:16:34.409738 7f09ec04a700  0 log_channel(cluster) log
>>[INF] :
>> 2.439 deep-scrub starts
>> 2015-08-03 15:16:51.364229 7f09ec04a700 -1 log_channel(cluster) log
>>[ERR] :
>> deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data
>>digest
>> 0xb3d78a6e != 0xa3944ad0
>> 2015-08-03 15:16:52.763977 7f09ec04a700 -1 log_channel(cluster) log
>>[ERR] :
>> 2.439 deep-scrub 1 errors
>>
>> # Finding the object in question:
>> ~# find ~ceph/osd/ceph-78/current/2.439_head -name
>>1022d93.0f0c* -ls
>> 21510412310 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
>>
>>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>> ~# md5sum
>>
>>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>> 4e4523244deec051cfe53dd48489a5db
>>
>>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>>
>> # The object on the backup osd:
>> ~# find ~ceph/osd/ceph-54/current/2.439_head -name
>>1022d93.0f0c* -ls
>> 6442614367 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
>>
>>/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>> ~# md5sum
>>
>>/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>> 4e4523244deec051cfe53dd48489a5db
>>
>>/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>>
>> # They don't seem to be different.
>> # When I try repair:
>> ~# ceph pg repair 2.439
>> instructing pg 2.439 on osd.78 to repair
>>
>> # The osd.78 logs show:
>> 2015-08-03 15:19:21.775933 7f09ec04a700  0 log_channel(cluster) log
>>[INF] :
>> 2.439 repair starts
>> 2015-08-03 15:19:38.088673 7f09ec04a700 -1 log_channel(cluster) log
>>[ERR] :
>> repair 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest
>> 0xb3d78a6e != 0xa3944ad0
>> 2015-08-03 15:19:39.958019 7f09ec04a700 -1 log_channel(cluster) log
>>[ERR] :
>> 2.439 repair 1 errors, 0 fixed
>> 2015-08-03 15:19:39.962406 7f09ec04a700  0

Re: [ceph-users] maximum object size

2015-09-08 Thread Somnath Roy

I think the limit is 90 MB from OSD side, isn't it ?
If so, how are you able to write object till 1.99 GB ?
Am I missing anything ?

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
HEWLETT, Paul (Paul)
Sent: Tuesday, September 08, 2015 8:55 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] maximum object size

Hi All

We have recently encountered a problem on Hammer (0.94.2) whereby we cannot 
write objects > 2GB in size to the rados backend.
(NB not RadosGW, CephFS or RBD)

I found the following issue
https://wiki.ceph.com/Planning/Blueprints/Firefly/Object_striping_in_librad
os which seems to address this but no progress reported.

What are the implications of writing such large objects to RADOS? What impact 
is expected on the XFS backend particularly regarding the size and location of 
the journal?

Any prospect of progressing the issue reported in the enclosed link?

Interestingly I could not find anywhere in the ceph documentation that 
describes the 2GB limitation. The implication of most of the website docs is 
that there is no limit on objects stored in Ceph. The only hint is that 
osd_max_write_size is a 32 bit signed integer.

If we use erasure coding will this reduce the impact? I.e. 4+1 EC will only 
write 500MB to each OSD and then this value will be tested against the chunk 
size instead of the total file size?

The relevant code in Ceph is:

src/FileJournal.cc:

  needed_space = ((int64_t)g_conf->osd_max_write_size) << 20;
  needed_space += (2 * sizeof(entry_header_t)) + get_top();
  if (header.max_size - header.start < needed_space) {
derr << "FileJournal::create: OSD journal is not large enough to hold "
<< "osd_max_write_size bytes!" << dendl;
ret = -ENOSPC;
goto free_buf;
  }

src/osd/OSD.cc:

// too big?
if (cct->_conf->osd_max_write_size &&
m->get_data_len() > cct->_conf->osd_max_write_size << 20) {
// journal can't hold commit!
 derr << "handle_op msg data len " << m->get_data_len()
 << " > osd_max_write_size " << (cct->_conf->osd_max_write_size << 20)
 << " on " << *m << dendl;
service.reply_op_error(op, -OSD_WRITETOOBIG);
return;
  }

Interestingly the code in OSD.cc looks like a bug - the max_write value should 
be cast to an int64_t before shifting left 20 bits (which is done correctly in 
FileJournal.cc). Otherwise overflow may occur and negative values generated.


Any comments welcome - any help appreciated.

Regards
Paul


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

2015-09-08 Thread Sage Weil

On Tue, 8 Sep 2015, Andras Pataki wrote:
> Hi Sam,
> 
> I saw that ceph 0.94.3 is out and it contains a resolution to the issue below 
> (http://tracker.ceph.com/issues/12577).  I installed it on our cluster, but 
> unfortunately it didn't resolve the issue.  Same as before, I have a couple 
> of inconsistent pg's, and run ceph pg repair on them - the OSD says:
> 
> 2015-09-08 11:21:53.930324 7f49c17ea700  0 log_channel(cluster) log [INF] : 
> 2.439 repair starts
> 2015-09-08 11:27:57.708394 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> repair 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
> 0xb3d78a6e != 0xa3944ad0
> 2015-09-08 11:28:32.359938 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> 2.439 repair 1 errors, 0 fixed
> 2015-09-08 11:28:32.364506 7f49c17ea700  0 log_channel(cluster) log [INF] : 
> 2.439 deep-scrub starts
> 2015-09-08 11:29:18.650876 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
> 0xb3d78a6e != 0xa3944ad0
> 2015-09-08 11:29:23.136109 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> 2.439 deep-scrub 1 errors
> 
> $ ceph tell osd.* version | grep version | sort | uniq -c
>  94 "version": "ceph version 0.94.3 
> (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)"
> 
> Could you have another look?

The fix was merged into master in 
6a949e10198a1787f2008b6c537b7060d191d236, after v0.94.3 was released.  It 
will be in v0.94.4.

Note that we had a bunch of similar errors on our internal lab cluster and 
this resolved them.  We installed the test build from gitbuilder, 
available at 
http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/hammer/ (or 
similar, adjust URL for your distro).

sage


> 
> Thanks,
> 
> Andras
> 
> 
> 
> From: Andras Pataki
> Sent: Monday, August 3, 2015 4:09 PM
> To: Samuel Just
> Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
> Subject: Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix
> 
> Done: http://tracker.ceph.com/issues/12577
> BTW, I¹m using the latest release 0.94.2 on all machines.
> 
> Andras
> 
> 
> On 8/3/15, 3:38 PM, "Samuel Just"  wrote:
> 
> >Hrm, that's certainly supposed to work.  Can you make a bug?  Be sure
> >to note what version you are running (output of ceph-osd -v).
> >-Sam
> >
> >On Mon, Aug 3, 2015 at 12:34 PM, Andras Pataki
> > wrote:
> >> Summary: I am having problems with inconsistent PG's that the 'ceph pg
> >> repair' command does not fix.  Below are the details.  Any help would be
> >> appreciated.
> >>
> >> # Find the inconsistent PG's
> >> ~# ceph pg dump | grep inconsistent
> >> dumped all in format plain
> >> 2.439 42080 00 017279507143 31033103 active+clean+inconsistent2015-08-03
> >> 14:49:17.29288477323'2250145 77480:890566 [78,54]78 [78,54]78
> >> 77323'22501452015-08-03 14:49:17.29253877323'2250145 2015-08-03
> >> 14:49:17.292538
> >> 2.8b9 40830 00 016669590823 30513051 active+clean+inconsistent2015-08-03
> >> 14:46:05.14006377323'2249886 77473:897325 [7,72]7 [7,72]7
> >> 77323'22498862015-08-03 14:22:47.83406377323'2249886 2015-08-03
> >> 14:22:47.834063
> >>
> >> # Look at the first one:
> >> ~# ceph pg deep-scrub 2.439
> >> instructing pg 2.439 on osd.78 to deep-scrub
> >>
> >> # The logs of osd.78 show:
> >> 2015-08-03 15:16:34.409738 7f09ec04a700  0 log_channel(cluster) log
> >>[INF] :
> >> 2.439 deep-scrub starts
> >> 2015-08-03 15:16:51.364229 7f09ec04a700 -1 log_channel(cluster) log
> >>[ERR] :
> >> deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data
> >>digest
> >> 0xb3d78a6e != 0xa3944ad0
> >> 2015-08-03 15:16:52.763977 7f09ec04a700 -1 log_channel(cluster) log
> >>[ERR] :
> >> 2.439 deep-scrub 1 errors
> >>
> >> # Finding the object in question:
> >> ~# find ~ceph/osd/ceph-78/current/2.439_head -name
> >>1022d93.0f0c* -ls
> >> 21510412310 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
> >>
> >>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0f0c__head_B029E439__2
> >> ~# md5sum
> >>
> >>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0f0c__head_B029E439__2
> >> 4e4523244deec051cfe53dd48489a5db
> >>
> >>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0f0c__head_B029E439__2
> >>
> >> # The object on the backup osd:
> >> ~# find ~ceph/osd/ceph-54/current/2.439_head -name
> >>1022d93.0f0c* -ls
> >> 6442614367 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
> >>
> >>/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0f0c__head_B029E439__2
> >> ~# md5sum
> >>
> >>/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0f0c__head_B029E439__2
> >> 4e4523244deec051cfe53dd48489a5db
> >>
>

Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

2015-09-08 Thread Andras Pataki

Cool, thanks!

Andras


From: Sage Weil 
Sent: Tuesday, September 8, 2015 2:07 PM
To: Andras Pataki
Cc: Samuel Just; ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

On Tue, 8 Sep 2015, Andras Pataki wrote:
> Hi Sam,
>
> I saw that ceph 0.94.3 is out and it contains a resolution to the issue below 
> (http://tracker.ceph.com/issues/12577).  I installed it on our cluster, but 
> unfortunately it didn't resolve the issue.  Same as before, I have a couple 
> of inconsistent pg's, and run ceph pg repair on them - the OSD says:
>
> 2015-09-08 11:21:53.930324 7f49c17ea700  0 log_channel(cluster) log [INF] : 
> 2.439 repair starts
> 2015-09-08 11:27:57.708394 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> repair 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
> 0xb3d78a6e != 0xa3944ad0
> 2015-09-08 11:28:32.359938 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> 2.439 repair 1 errors, 0 fixed
> 2015-09-08 11:28:32.364506 7f49c17ea700  0 log_channel(cluster) log [INF] : 
> 2.439 deep-scrub starts
> 2015-09-08 11:29:18.650876 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
> 0xb3d78a6e != 0xa3944ad0
> 2015-09-08 11:29:23.136109 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> 2.439 deep-scrub 1 errors
>
> $ ceph tell osd.* version | grep version | sort | uniq -c
>  94 "version": "ceph version 0.94.3 
> (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)"
>
> Could you have another look?

The fix was merged into master in
6a949e10198a1787f2008b6c537b7060d191d236, after v0.94.3 was released.  It
will be in v0.94.4.

Note that we had a bunch of similar errors on our internal lab cluster and
this resolved them.  We installed the test build from gitbuilder,
available at
http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/hammer/ (or
similar, adjust URL for your distro).

sage


>
> Thanks,
>
> Andras
>
>
> 
> From: Andras Pataki
> Sent: Monday, August 3, 2015 4:09 PM
> To: Samuel Just
> Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
> Subject: Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix
>
> Done: http://tracker.ceph.com/issues/12577
> BTW, I¹m using the latest release 0.94.2 on all machines.
>
> Andras
>
>
> On 8/3/15, 3:38 PM, "Samuel Just"  wrote:
>
> >Hrm, that's certainly supposed to work.  Can you make a bug?  Be sure
> >to note what version you are running (output of ceph-osd -v).
> >-Sam
> >
> >On Mon, Aug 3, 2015 at 12:34 PM, Andras Pataki
> > wrote:
> >> Summary: I am having problems with inconsistent PG's that the 'ceph pg
> >> repair' command does not fix.  Below are the details.  Any help would be
> >> appreciated.
> >>
> >> # Find the inconsistent PG's
> >> ~# ceph pg dump | grep inconsistent
> >> dumped all in format plain
> >> 2.439 42080 00 017279507143 31033103 active+clean+inconsistent2015-08-03
> >> 14:49:17.29288477323'2250145 77480:890566 [78,54]78 [78,54]78
> >> 77323'22501452015-08-03 14:49:17.29253877323'2250145 2015-08-03
> >> 14:49:17.292538
> >> 2.8b9 40830 00 016669590823 30513051 active+clean+inconsistent2015-08-03
> >> 14:46:05.14006377323'2249886 77473:897325 [7,72]7 [7,72]7
> >> 77323'22498862015-08-03 14:22:47.83406377323'2249886 2015-08-03
> >> 14:22:47.834063
> >>
> >> # Look at the first one:
> >> ~# ceph pg deep-scrub 2.439
> >> instructing pg 2.439 on osd.78 to deep-scrub
> >>
> >> # The logs of osd.78 show:
> >> 2015-08-03 15:16:34.409738 7f09ec04a700  0 log_channel(cluster) log
> >>[INF] :
> >> 2.439 deep-scrub starts
> >> 2015-08-03 15:16:51.364229 7f09ec04a700 -1 log_channel(cluster) log
> >>[ERR] :
> >> deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data
> >>digest
> >> 0xb3d78a6e != 0xa3944ad0
> >> 2015-08-03 15:16:52.763977 7f09ec04a700 -1 log_channel(cluster) log
> >>[ERR] :
> >> 2.439 deep-scrub 1 errors
> >>
> >> # Finding the object in question:
> >> ~# find ~ceph/osd/ceph-78/current/2.439_head -name
> >>1022d93.0f0c* -ls
> >> 21510412310 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
> >>
> >>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0f0c__head_B029E439__2
> >> ~# md5sum
> >>
> >>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0f0c__head_B029E439__2
> >> 4e4523244deec051cfe53dd48489a5db
> >>
> >>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0f0c__head_B029E439__2
> >>
> >> # The object on the backup osd:
> >> ~# find ~ceph/osd/ceph-54/current/2.439_head -name
> >>1022d93.0f0c* -ls
> >> 6442614367 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
> >>
>

Re: [ceph-users] How to observed civetweb.

2015-09-08 Thread Kobi Laredo

Vickie,

You can add:
*access_log_file=/var/log/civetweb/access.log
error_log_file=/var/log/civetweb/error.log*

to *rgw frontends* in ceph.conf though these logs are thin on info (Source
IP, date, and request)

Check out
https://github.com/civetweb/civetweb/blob/master/docs/UserManual.md for
more civetweb configs you can inject through  *rgw frontends* config
attribute in ceph.conf

We are currently testing tuning civetweb's num_threads
and request_timeout_ms to improve radosgw performance

*Kobi Laredo*
*Cloud Systems Engineer* | (*408) 409-KOBI*

On Tue, Sep 8, 2015 at 8:20 AM, Yehuda Sadeh-Weinraub 
wrote:

> You can increase the civetweb logs by adding 'debug civetweb = 10' in
> your ceph.conf. The output will go into the rgw logs.
>
> Yehuda
>
> On Tue, Sep 8, 2015 at 2:24 AM, Vickie ch  wrote:
> > Dear cephers,
> >Just upgrade radosgw from apache to civetweb.
> > It's really simple to installed and used. But I can't find any
> parameters or
> > logs to adjust(or observe) civetweb. (Like apache log).  I'm really
> confuse.
> > Any ideas?
> >
> >
> > Best wishes,
> > Mika
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] jemalloc and transparent hugepage

>>Is this something we can set with mallctl[1] at startup? 

I don't think it's possible.

TP hugepage are managed by kernel, not jemalloc.

(but a simple "echo never > /sys/kernel/mm/transparent_hugepage/enabled" in 
init script is enough)

- Mail original -
De: "Sage Weil" 
À: "aderumier" 
Cc: "Mark Nelson" , "ceph-devel" 
, "ceph-users" , 
"Somnath Roy" 
Envoyé: Mercredi 9 Septembre 2015 04:07:59
Objet: Re: [ceph-users] jemalloc and transparent hugepage

On Wed, 9 Sep 2015, Alexandre DERUMIER wrote: 
> >>Have you noticed any performance difference with tp=never? 
> 
> No difference. 
> 
> I think hugepage could speedup big memory sets like 100-200GB, but for 
> 1-2GB they are no noticable difference. 

Is this something we can set with mallctl[1] at startup? 

sage 

[1] 
http://www.canonware.com/download/jemalloc/jemalloc-latest/doc/jemalloc.html 

> 
> 
> 
> 
> 
> 
> - Mail original - 
> De: "Mark Nelson"  
> À: "aderumier" , "ceph-devel" 
> , "ceph-users"  
> Cc: "Somnath Roy"  
> Envoyé: Mercredi 9 Septembre 2015 01:49:35 
> Objet: Re: [ceph-users] jemalloc and transparent hugepage 
> 
> Excellent investigation Alexandre! Have you noticed any performance 
> difference with tp=never? 
> 
> Mark 
> 
> On 09/08/2015 06:33 PM, Alexandre DERUMIER wrote: 
> > I have done small benchmark with tcmalloc and jemalloc, transparent 
> > hugepage=always|never. 
> > 
> > for tcmalloc, they are no difference. 
> > but for jemalloc, the difference is huge (around 25% lower with tp=never). 
> > 
> > jemmaloc 4.6.0+tp=never vs tcmalloc use 10% more RSS memory 
> > 
> > jemmaloc 4.0+tp=never almost use same RSS memory than tcmalloc ! 
> > 
> > 
> > I don't have monitored memory usage in recovery, but I think it should help 
> > too. 
> > 
> > 
> > 
> > 
> > tcmalloc 2.1 tp=always 
> > --- 
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> > 
> > root 67746 120 1.0 1531220 671152 ? Ssl 01:18 0:43 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 67764 144 1.0 1570256 711232 ? Ssl 01:18 0:51 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > root 68363 220 0.9 1522292 655888 ? Ssl 01:19 0:46 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 68381 261 1.0 1563396 702500 ? Ssl 01:19 0:55 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > root 68963 228 1.0 1519240 666196 ? Ssl 01:20 0:31 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 68981 268 1.0 1564452 694352 ? Ssl 01:20 0:37 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > 
> > 
> > tcmalloc 2.1 tp=never 
> > - 
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> > 
> > root 69560 144 1.0 1544968 677584 ? Ssl 01:21 0:20 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 69578 167 1.0 1568620 704456 ? Ssl 01:21 0:23 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > 
> > root 70156 164 0.9 1519680 649776 ? Ssl 01:21 0:16 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 70174 214 1.0 1559772 692828 ? Ssl 01:21 0:19 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > root 70757 202 0.9 1520376 650572 ? Ssl 01:22 0:20 /usr/bin/ceph-osd 
> > --cluster=ceph -i 0 -f 
> > root 70775 236 1.0 1560644 694088 ? Ssl 01:22 0:23 /usr/bin/ceph-osd 
> > --cluster=ceph -i 1 -f 
> > 
> > 
> > 
> > jemalloc 3.6 tp = always 
> >  
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> > 
> > root 92005 46.1 1.4 2033864 967512 ? Ssl 01:00 0:04 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > root 92027 45.5 1.4 2021624 963536 ? Ssl 01:00 0:04 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > 
> > 
> > 
> > root 92703 191 1.5 2138724 1002376 ? Ssl 01:02 1:16 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > root 92721 183 1.5 2126228 986448 ? Ssl 01:02 1:13 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > 
> > 
> > root 93366 258 1.4 2139052 984132 ? Ssl 01:03 1:09 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > root 93384 250 1.5 2126244 990348 ? Ssl 01:03 1:07 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > 
> > 
> > 
> > jemalloc 3.6 tp = never 
> > --- 
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> > 
> > root 93990 238 1.1 2105812 762628 ? Ssl 01:04 1:16 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > root 94033 263 1.1 2118288 781768 ? Ssl 01:04 1:18 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > 
> > 
> > root 94656 266 1.1 2139096 781392 ? Ssl 01:05 0:58 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > root 94674 257 1.1 2126316 760632 ? Ssl 01:05 0:56 /usr/bin/ceph-osd 
> > --cluster=ceph -i 4 -f 
> > 
> > root 95317 297 1.1 2135044 780532 ? Ssl 01:06 0:35 /usr/bin/ceph-osd 
> > --cluster=ceph -i 5 -f 
> > root

Re: [ceph-users] jemalloc and transparent hugepage