Re: [ceph-users] can cache-mode be set to readproxy for tier cache with ceph 0.94.9 ?

2016-12-13 Thread JiaJia Zhong
-- Original --
From:  "Shinobu Kinjo";
Date:  Wed, Dec 14, 2016 10:56 AM
To:  "JiaJia Zhong"; 
Cc:  "CEPH list"; "ukernel"; 
Subject:  Re: [ceph-users] can cache-mode be set to readproxy for tier cache with 
ceph 0.94.9 ?

 
> ps: When we first met this issue, restarting the mds could cure that. (but 
> that was ceph 0.94.1).

Is this still working?
I think it's hard to reproduce the issue; the cluster works well now.
Since you're using 0.94.9, the bug (#12551) you mentioned was fixed.

Can you do the following to check whether the object that appears to you as ZERO
size is actually there:
 # rados -p ${cache pool} ls
 # rados -p ${cache pool} get ${object} /tmp/file
 # ls -l /tmp/file
I did these with the ZERO file. Via rados get it is the original, intact file; it 
only looks wrong through cephfs.
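
For reference, a minimal sketch of how to map a cephfs file to its backing
RADOS object so both views can be compared directly (the mount point and path
below are made up; a file's first data object is named after its inode number
in hex):

 # ino=$(stat -c %i /mnt/cephfs/path/to/zero-file)
 # rados -p ${cache pool} get $(printf '%x.00000000' $ino) /tmp/obj
 # md5sum /tmp/obj /mnt/cephfs/path/to/zero-file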


-- Original --
From:  "Shinobu Kinjo";
Date:  Tue, Dec 13, 2016 06:21 PM
To:  "JiaJia Zhong";
Cc:  "CEPH list"; "ukernel";
Subject:  Re: [ceph-users] can cache-mode be set to readproxy for tier
cache with ceph 0.94.9 ?



On Tue, Dec 13, 2016 at 4:38 PM, JiaJia Zhong  wrote:
>
> hi cephers:
> we are using ceph hammer 0.94.9 (yes, it's not the latest, jewel),
> with some SSD OSDs for tiering. cache-mode is set to readproxy and 
> everything seems to work as expected,
> but when reading some small files from cephfs, we get 0 bytes.


Would you be able to share:

 #1 How small is the actual data?
 #2 Is the symptom reproducible with different data of the same size?
 #3 Can you share your ceph.conf (ceph --show-config)?

>
>
> I did some searching and found the link below,
> 
> http://ceph-users.ceph.narkive.com/g4wcB8ED/cephfs-with-cache-tiering-reading-files-are-filled-with-0s
> that's almost the same as what we are suffering from, except that the 
> cache-mode in the link is writeback; ours is readproxy.
>
> that bug should have been FIXED in 0.94.9 
> (http://tracker.ceph.com/issues/12551),
> but we can still encounter it occasionally :(
>
>environment:
>  - ceph: 0.94.9
>  - kernel client: 4.2.0-36-generic ( ubuntu 14.04 )
>  - any others needed ?
>
>Question:
>1.  does readproxy mode work on ceph 0.94.9? Only writeback and readonly 
> are mentioned in the hammer documentation.
>2.  has anyone on Jewel or Hammer met the same issue?
>
>
> looping in Yan, Zheng
>Quote from the link, for convenience.
>  """
> Hi,
>
> I am experiencing an issue with CephFS with cache tiering where the kernel
> clients are reading files filled entirely with 0s.
>
> The setup:
> ceph 0.94.3
> create cephfs_metadata replicated pool
> create cephfs_data replicated pool
> cephfs was created on the above two pools, populated with files, then:
> create cephfs_ssd_cache replicated pool,
> then adding the tiers:
> ceph osd tier add cephfs_data cephfs_ssd_cache
> ceph osd tier cache-mode cephfs_ssd_cache writeback
> ceph osd tier set-overlay cephfs_data cephfs_ssd_cache
>
> While the cephfs_ssd_cache pool is empty, multiple kernel clients on
> different hosts open the same file (the size of the file is small, <10k) at
> approximately the same time. A number of the clients from the OS level see
> the entire file being empty. I can do a rados -p {cache pool} ls for the
> list of files cached, and do a rados -p {cache pool} get {object} /tmp/file
> and see the complete contents of the file.
> I can repeat this by setting cache-mode to forward, rados -p {cache pool}
> cache-flush-evict-all, checking no more objects in cache with rados -p
> {cache pool} ls, resetting cache-mode to writeback with an empty pool, and
> doing the multiple same file opens.
>
> Has anyone seen this issue? It seems like what may be a race condition
> where the object is not yet completely loaded into the cache pool so the
> cache pool serves out an incomplete object.
> If anyone can shed some light or any suggestions to help debug this issue,
> that would be very helpful.
>
> Thanks,
> Arthur"""
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can cache-mode be set to readproxy for tier cache with ceph 0.94.9 ?

2016-12-13 Thread Shinobu Kinjo
> ps: When we first met this issue, restarting the mds could cure that. (but 
> that was ceph 0.94.1).

Is this still working?

Since you're using 0.94.9, the bug (#12551) you mentioned was fixed.

Can you do the following to check whether the object that appears to you as ZERO
size is actually there:
 # rados -p ${cache pool} ls
 # rados -p ${cache pool} get ${object} /tmp/file
 # ls -l /tmp/file
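
For completeness, a sketch only: rados can also report the object size
directly, which makes the zero-size check quicker (the pool names are
placeholders):
 # rados -p ${cache pool} stat ${object}
 # rados -p ${base pool} stat ${object}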

-- Original --
From:  "Shinobu Kinjo";
Date:  Tue, Dec 13, 2016 06:21 PM
To:  "JiaJia Zhong";
Cc:  "CEPH list"; "ukernel";
Subject:  Re: [ceph-users] can cache-mode be set to readproxy for tier
cache with ceph 0.94.9 ?



On Tue, Dec 13, 2016 at 4:38 PM, JiaJia Zhong  wrote:
>
> hi cephers:
> we are using ceph hammer 0.94.9 (yes, it's not the latest, jewel),
> with some SSD OSDs for tiering. cache-mode is set to readproxy and 
> everything seems to work as expected,
> but when reading some small files from cephfs, we get 0 bytes.


Would you be able to share:

 #1 How small is the actual data?
 #2 Is the symptom reproducible with different data of the same size?
 #3 Can you share your ceph.conf (ceph --show-config)?

>
>
> I did some searching and found the link below,
> 
> http://ceph-users.ceph.narkive.com/g4wcB8ED/cephfs-with-cache-tiering-reading-files-are-filled-with-0s
> that's almost the same as what we are suffering from, except that the 
> cache-mode in the link is writeback; ours is readproxy.
>
> that bug should have been FIXED in 0.94.9 
> (http://tracker.ceph.com/issues/12551),
> but we can still encounter it occasionally :(
>
>environment:
>  - ceph: 0.94.9
>  - kernel client: 4.2.0-36-generic ( ubuntu 14.04 )
>  - any others needed ?
>
>Question:
>1.  does readproxy mode work on ceph 0.94.9? Only writeback and readonly 
> are mentioned in the hammer documentation.
>2.  has anyone on Jewel or Hammer met the same issue?
>
>
> looping in Yan, Zheng
>Quote from the link, for convenience.
>  """
> Hi,
>
> I am experiencing an issue with CephFS with cache tiering where the kernel
> clients are reading files filled entirely with 0s.
>
> The setup:
> ceph 0.94.3
> create cephfs_metadata replicated pool
> create cephfs_data replicated pool
> cephfs was created on the above two pools, populated with files, then:
> create cephfs_ssd_cache replicated pool,
> then adding the tiers:
> ceph osd tier add cephfs_data cephfs_ssd_cache
> ceph osd tier cache-mode cephfs_ssd_cache writeback
> ceph osd tier set-overlay cephfs_data cephfs_ssd_cache
>
> While the cephfs_ssd_cache pool is empty, multiple kernel clients on
> different hosts open the same file (the size of the file is small, <10k) at
> approximately the same time. A number of the clients from the OS level see
> the entire file being empty. I can do a rados -p {cache pool} ls for the
> list of files cached, and do a rados -p {cache pool} get {object} /tmp/file
> and see the complete contents of the file.
> I can repeat this by setting cache-mode to forward, rados -p {cache pool}
> cache-flush-evict-all, checking no more objects in cache with rados -p
> {cache pool} ls, resetting cache-mode to writeback with an empty pool, and
> doing the multiple same file opens.
>
> Has anyone seen this issue? It seems like what may be a race condition
> where the object is not yet completely loaded into the cache pool so the
> cache pool serves out an incomplete object.
> If anyone can shed some light or any suggestions to help debug this issue,
> that would be very helpful.
>
> Thanks,
> Arthur"""
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to release Hammer osd RAM when compiled with jemalloc

2016-12-13 Thread Sage Weil
On Wed, 14 Dec 2016, Dong Wu wrote:
> Thanks for your response.
> 
> 2016-12-13 20:40 GMT+08:00 Sage Weil :
> > On Tue, 13 Dec 2016, Dong Wu wrote:
> >> Hi, all
> >>I have a cluster with nearly 1000 OSDs, and each OSD already
> >> occupies 2.5G of physical memory on average, which causes 90% memory
> >> usage on each host. When using tcmalloc, we can use "ceph tell osd.* release"
> >> to release unused memory, but in my cluster ceph is built with
> >> jemalloc, so we can't use "ceph tell osd.* release". Are there any methods
> >> to release some memory?
> >
> > We explicitly call into tcmalloc to release memory with that command, but
> > unless you've patched something in yourself there is no integration with
> > jemalloc's release API.
> 
> Are there any methods to see the detailed memory usage of an OSD? If we had a
> memory allocator recording detailed memory usage, would that be helpful?
> Is it on the schedule?

Kraken has a new mempool infrastructure and some of the OSD pieces have 
been moved into it, but only some.  There's quite a bit of opportunity to 
further categorize allocations to get better visibility here.

Barring that, your best bet is to use either tcmalloc's heap profiling or 
valgrind massif.  Both slow down execution by a lot (5-10x).  Massif has 
better detail, but is somewhat slower.
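
A rough sketch of both approaches; the heap commands only work against a
tcmalloc-linked build, and osd.0 is just an example id:

 # tcmalloc heap profiling via the admin socket
 ceph tell osd.0 heap start_profiler
 ceph tell osd.0 heap dump
 ceph tell osd.0 heap stats
 ceph tell osd.0 heap stop_profiler

 # valgrind massif (run the daemon in the foreground; much slower)
 valgrind --tool=massif /usr/bin/ceph-osd -f -i 0
 ms_print massif.out.<pid>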

sage


> 
> >
> >> another question:
> >> can I decrease the following config values, which are used for cached osdmaps,
> >> to lower the OSDs' memory?
> >>
> >> "mon_min_osdmap_epochs": "500"
> >> "osd_map_max_advance": "200",
> >> "osd_map_cache_size": "500",
> >> "osd_map_message_max": "100",
> >> "osd_map_share_max_epochs": "100"
> >
> > Yeah.  You should be fine with 500, 50, 100, 50, 50.
> 
> >
> > sage
> 
> Thanks.
> Regards.
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What happens if all replica OSDs journals are broken?

2016-12-13 Thread Christian Balzer

Hello,

On Wed, 14 Dec 2016 00:06:14 +0100 Kevin Olbrich wrote:

> Ok, thanks for your explanation!
> I read those warnings about size 2 + min_size 1 (we are using ZFS as RAID6,
> called zraid2) as OSDs.
>
This is similar to my RAID6 or RAID10 backed OSDs with regard to having
very resilient, extremely unlikely to fail OSDs.
As such, a Ceph replication of 2 with min_size 1 is a calculated risk,
acceptable for me, and perhaps others, in certain use cases.
This is also with very few (2-3) journals per SSD.

If:

1. Your journal SSDs are well trusted and monitored (Intel DC S36xx, 37xx)
2. Your failure domain represented by a journal SSD is small enough
(meaning that replicating the lost OSDs can be done quickly)

it may be an acceptable risk for you as well.

> Time to raise replication!
>
If you can afford that (money, space, latency), definitely go for it.
 
Christian
> Kevin
> 
> 2016-12-13 0:00 GMT+01:00 Christian Balzer :
> 
> > On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
> >
> > > Hi,
> > >
> > > just in case: What happens when all replica journal SSDs are broken at
> > once?
> > >
> > That would be bad, as in BAD.
> >
> > In theory you just "lost" all the associated OSDs and their data.
> >
> > In practice everything but the in-flight data at the time is still on
> > the actual OSDs (HDDs), but it's inconsistent and inaccessible as far as
> > Ceph is concerned.
> >
> > So with some trickery and an experienced data-recovery Ceph consultant you
> > _may_ get things running with limited data loss/corruption, but that's
> > speculation and may be wishful thinking on my part.
> >
> > Another data point in favour of deploying only well known/monitored/trusted SSDs
> > and having 3x replication.
> >
> > > The PGs most likely will be stuck inactive but as I read, the journals
> > just
> > > need to be replaced (http://ceph.com/planet/ceph-recover-osds-after-ssd-
> > > journal-failure/).
> > >
> > > Does this also work in this case?
> > >
> > Not really, no.
> >
> > The above works by having still a valid state and operational OSDs from
> > which the "broken" one can recover.
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v11.1.0 kraken candidate released

2016-12-13 Thread Darrell Enns
OK, thanks for the update Greg!

- Darrell
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v11.1.0 kraken candidate released

2016-12-13 Thread Gregory Farnum
On Tue, Dec 13, 2016 at 4:34 PM, Darrell Enns  wrote:
> Are CephFS snapshots still considered unstable/experimental in Kraken?

Sadly, yes. I had a solution but it didn't account for hard links.
When we decided we wanted to support those instead of ignoring them or
trying to set up boundary regions which disallow one or the other, I
had to go back to the drawing board. Working on a design but it's
taking a back burner to other things like stabilizing multi-MDS
features. :)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v11.1.0 kraken candidate released

2016-12-13 Thread Darrell Enns
Are CephFS snapshots still considered unstable/experimental in Kraken?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Revisiting: Many clients (X) failing to respond to cache pressure

2016-12-13 Thread Goncalo Borges

Hi John.

Comments in line.



Hi Ceph(FS)ers...

I am currently running in production the following environment:

- ceph/cephfs in 10.2.2.
- All infrastructure is in the same version (rados cluster, mons, mds and
cephfs clients).
- We mount cephfs using ceph-fuse.

Since yesterday we have had our cluster in warning state with the message
"mds0: Many clients (X) failing to respond to cache pressure". X has been
changing with time, from ~130 to ~70. I am able to correlate the appearance
of this message with bursts of jobs in our cluster.

This subject has been discussed in the mailing list a lot of times, and
normally, the recipe is to look for something wrong in the clients. So, I
have tried to look to clients first:

1) I've started to loop through all my clients, and run 'ceph --admin-daemon
/var/run/ceph/ceph-client.mount_user.asok status' to get the inodes_count
reported in each client.

$ cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g' | awk
'{s+=$1} END {print s}'
2407659

2) I've then compared with the number of inodes the mds had in its cache
(obtained by a perf dump)
  inode_max": 200 and "inodes": 2413826

3) I've tried to understand how many clients had a number of inodes higher
than 16384 (the default) and got

$ for i in `cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g'
`; do if [ $i -ge 16384 ]; then echo $i; fi; done | wc -l
27

4) My conclusion is that the bulk of the inodes is held by a couple of machines.
However, while the majority are running user jobs, others are not doing
anything at all. For example, an idle machine (which had no users logged in,
no jobs running, and updatedb does not search the cephfs filesystem) reported
more than > 30 inodes. To regain those inodes, I had to umount and
remount cephfs on that machine.

5) Based on my previous observations I suspect that there are still some
problems in the ceph-fuse client regarding recovering these inodes (or it
happens at a very slow rate).

Seems that way.  Can you come up with a reproducer for us, and/or
gather some client+mds debug logs where a client is failing to respond
to cache pressure?


I think I've nailed this down to a specific user workload. Every time 
this user runs, it leaves the client with a huge number of inodes, 
normally more than 10. The workload consists of generating a large 
number of analysis files spread over multiple directories. I am 
going to try to inject some debug parameters and see what we come up 
with. Will reply on this thread later on.




Also, what kernel is in use on the clients?  It's possible that the
issue is in FUSE itself (or the way that it responds to ceph-fuse's
attempts to ask for some inodes to be released).


All our clusters run SL6 because the CERN experiments' software is only 
certified for that OS flavour. Because of the SL6 restriction, to enable 
post-infernalis ceph clients on those machines, we have to recompile them 
as well as some of the dependencies they need which are not available 
in SL6. In summary, we recompile ceph-fuse 10.2.2 with gcc 4.8.4 against 
boost-1.53.0-25 and fuse-2.9.7. The kernel version on the clients is 
2.6.32-642.6.2.el6.x86_64


Thanks for the explanations about the mds memory usage. I am glad there 
is something on its way to enable more effective memory usage.


Cheers
Goncalo



However, I also do not completely understand what is happening on the server
side:

6) The current memory usage of my mds is the following:

   PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM  TIME+ COMMAND
17831 ceph  20   0 13.667g 0.012t  10048 S  37.5 40.2   1068:47 ceph-mds

The mds cache size is set to 200. Running 'ceph daemon mds. perf
dump', I get  "inode_max": 200 and "inodes": 2413826. Assuming 4k per
each inode one gets ~10G. So why it is taking much more than that?


7) I have been running cephfs for more than a year, and looking at ganglia,
the mds memory never decreases but always increases (even in cases when we
umount almost all the clients). Why does that happen?

Coincidentally someone posted about this on ceph-devel just yesterday.
The answer is that the MDS uses memory pools for allocation, and it
doesn't (currently) ever bother releasing memory back to the operating
system because it's doing its own cache size enforcement.  However,
when the cache size limits aren't being enforced (for example because
of clients failing to release caps) this becomes a problem.  There's a
patch for master (https://github.com/ceph/ceph/pull/12443)


8) I am running 2 mds, in active / standby-replay mode. The memory of the
standby-replay is much lower

   PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM  TIME+ COMMAND
   716 ceph  20   0 6149424 5.115g   8524 S   1.2 43.6  53:19.74 ceph-mds

If I trigger a restart on my active mds, the standby replay will start
acting as active, but will continue with the same amount of memory. Why the
second mds can become active, and do 

Re: [ceph-users] What happens if all replica OSDs journals are broken?

2016-12-13 Thread Kevin Olbrich
Ok, thanks for your explanation!
I read those warnings about size 2 + min_size 1 (we are using ZFS as RAID6,
called zraid2) as OSDs.
Time to raise replication!

Kevin

2016-12-13 0:00 GMT+01:00 Christian Balzer :

> On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
>
> > Hi,
> >
> > just in case: What happens when all replica journal SSDs are broken at
> once?
> >
> That would be bad, as in BAD.
>
> In theory you just "lost" all the associated OSDs and their data.
>
> In practice everything but the in-flight data at the time is still on
> the actual OSDs (HDDs), but it's inconsistent and inaccessible as far as
> Ceph is concerned.
>
> So with some trickery and an experienced data-recovery Ceph consultant you
> _may_ get things running with limited data loss/corruption, but that's
> speculation and may be wishful thinking on my part.
>
> Another data point in favour of deploying only well known/monitored/trusted SSDs
> and having 3x replication.
>
> > The PGs most likely will be stuck inactive but as I read, the journals
> just
> > need to be replaced (http://ceph.com/planet/ceph-recover-osds-after-ssd-
> > journal-failure/).
> >
> > Does this also work in this case?
> >
> Not really, no.
>
> The above works by having still a valid state and operational OSDs from
> which the "broken" one can recover.
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure Code question - state of LRC plugin?

2016-12-13 Thread McFarland, Bruce
I’m looking at performance and storage impacts of EC vs. Replication.
After an initial EC investigation LRC is an interesting option. Can anyone
tell me the state of the LRC plugin? Is it considered production ready in
the same sense that EC is production ready in Jewel? Or is the LRC plugin
considered an “experimental” plugin?
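
A minimal sketch of the kind of setup I have in mind; the profile name, the
k/m/l values and the PG counts are arbitrary examples:

 ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4 m=2 l=3 ruleset-failure-domain=host
 ceph osd pool create lrcpool 12 12 erasure LRCprofile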

Thanks,
Bruce

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs quotas reporting

2016-12-13 Thread Goncalo Borges
Hi Greg
Thanks for following it up.
We are aiming to upgrade to 10.2.5 in early January. Will let you know once 
that is done, and what we get as output.
Cheers
Goncalo


From: Gregory Farnum [gfar...@redhat.com]
Sent: 14 December 2016 06:59
To: Goncalo Borges
Cc: John Spray; ceph-us...@ceph.com
Subject: Re: [ceph-users] cephfs quotas reporting

On Mon, Dec 5, 2016 at 5:24 PM, Goncalo Borges
 wrote:
> Hi Greg, John...
>
> To John: Nothing is done in the background between two consecutive df 
> commands.
>
> I have opened the following tracker issue: 
> http://tracker.ceph.com/issues/18151
>
> (sorry, all the issue headers are empty apart from the title. I hit enter 
> before actually filling in all the appropriate headers, and I cannot edit all 
> those headers once the issue is created. I am sure you guys can do it)

Can you try this with 10.2.4 or 10.2.5? I dug up what I think the
problem is and went to reproduce and deal with it, but discovered that
the problem area of code changed between 10.2.2 and those releases. If
it's still an issue let me know and I'll dig into it a little more.
-Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs quotas reporting

2016-12-13 Thread Gregory Farnum
On Mon, Dec 5, 2016 at 5:24 PM, Goncalo Borges
 wrote:
> Hi Greg, John...
>
> To John: Nothing is done in the background between two consecutive df 
> commands.
>
> I have opened the following tracker issue: 
> http://tracker.ceph.com/issues/18151
>
> (sorry, all the issue headers are empty apart from the title. I hit enter 
> before actually filling in all the appropriate headers, and I cannot edit all 
> those headers once the issue is created. I am sure you guys can do it)

Can you try this with 10.2.4 or 10.2.5? I dug up what I think the
problem is and went to reproduce and deal with it, but discovered that
the problem area of code changed between 10.2.2 and those releases. If
it's still an issue let me know and I'll dig into it a little more.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance measurements CephFS vs. RBD

2016-12-13 Thread Ilya Dryomov
On Fri, Dec 9, 2016 at 9:42 PM, Gregory Farnum  wrote:
> On Fri, Dec 9, 2016 at 6:58 AM, plataleas  wrote:
>> Hi all
>>
>> We enabled CephFS on our Ceph Cluster consisting of:
>> - 3 Monitor servers
>> - 2 Metadata servers
>> - 24 OSD  (3 OSD / Server)
>> - Spinning disks, OSD Journal is on SSD
>> - Public and Cluster Network separated, all 1GB
>> - Release: Jewel 10.2.3
>>
>> With CephFS we reach roughly 1/3 of the write performance of RBD. There are
>> some other discussions about RBD outperforming CephFS on the mailing list.
>> However it would be interesting to have more figures about that topic.
>>
>> Writes on CephFS:
>>
>> # dd if=/dev/zero of=/data_cephfs/testfile.dd bs=50M count=1 oflag=direct
>> 1+0 records in
>> 1+0 records out
>> 52428800 bytes (52 MB) copied, 1.40136 s, 37.4 MB/s
>>
>> #dd if=/dev/zero of=/data_cephfs/testfile.dd bs=500M count=1 oflag=direct
>> 1+0 records in
>> 1+0 records out
>> 524288000 bytes (524 MB) copied, 13.9494 s, 37.6 MB/s
>>
>> # dd if=/dev/zero of=/data_cephfs/testfile.dd bs=1000M count=1 oflag=direct
>> 1+0 records in
>> 1+0 records out
>> 1048576000 bytes (1.0 GB) copied, 27.7233 s, 37.8 MB/s
>>
>> Writes on RBD
>>
>> # dd if=/dev/zero of=/data_rbd/testfile.dd bs=50M count=1 oflag=direct
>> 1+0 records in
>> 1+0 records out
>> 52428800 bytes (52 MB) copied, 0.558617 s, 93.9 MB/s
>>
>> # dd if=/dev/zero of=/data_rbd/testfile.dd bs=500M count=1 oflag=direct
>> 1+0 records in
>> 1+0 records out
>> 524288000 bytes (524 MB) copied, 3.70657 s, 141 MB/s
>>
>> # dd if=/dev/zero of=/data_rbd/testfile.dd bs=1000M count=1 oflag=direct
>> 1+0 records in
>> 1+0 records out
>> 1048576000 bytes (1.0 GB) copied, 7.75926 s, 135 MB/s
>>
>> Are these measurements reproducible by others ? Thanks for sharing your
>> experience!
>
> IIRC, the interfaces in use mean these are doing very different things
> despite the flag similarity. Direct IO on rbd is still making use of
> the RBD cache, but in CephFS it is going straight to the OSD (if
> you're using the kernel client; if you're on ceph-fuse the flags might
> get dropped on the kernel/FUSE barrier).

A small clarification: if you are using the rbd kernel client, all I/O
(including direct I/O) goes straight to the OSDs.  krbd block devices
are advertised as "write through".

Only librbd makes use of the rbd cache.
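
So for an apples-to-apples dd comparison, one option (a sketch, assuming the
RBD numbers above went through librbd) is to disable the librbd cache on the
client doing the test:

[client]
    rbd cache = false

or to repeat the RBD test against a krbd mapping (rbd map), which bypasses
that cache anyway.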

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Server crashes on high mount volume

2016-12-13 Thread Ken Dreyer
On Tue, Dec 13, 2016 at 6:45 AM, Diego Castro
 wrote:
> Thank you for the tip.
> Just found out the repo is empty, am i doing something wrong?
>
> http://mirror.centos.org/centos/7/cr/x86_64/Packages/
>

Sorry for the confusion. CentOS 7.3 shipped a few hours ago.

- Ken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Server crashes on high mount volume

2016-12-13 Thread Ilya Dryomov
On Tue, Dec 13, 2016 at 2:45 PM, Diego Castro
 wrote:
> Thank you for the tip.
> Just found out the repo is empty, am i doing something wrong?
>
> http://mirror.centos.org/centos/7/cr/x86_64/Packages/

The kernel in the OS repo seems new enough:

http://mirror.centos.org/centos/7/os/x86_64/Packages/kernel-3.10.0-514.el7.x86_64.rpm

The CR repo is expected to be empty most of the time, I think.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Server crashes on high mount volume

2016-12-13 Thread Diego Castro
Thank you for the tip.
Just found out the repo is empty, am i doing something wrong?

http://mirror.centos.org/centos/7/cr/x86_64/Packages/




---
Diego Castro / The CloudFather
GetupCloud.com - Eliminamos a Gravidade

2016-12-12 17:31 GMT-03:00 Ilya Dryomov :

> On Mon, Dec 12, 2016 at 9:16 PM, Diego Castro
>  wrote:
> > I didn't have a try, i'll let you know how did it goes..
>
> This should be fixed by commit [1] upstream and it was indeed
> backported to 7.3.
>
> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/
> linux.git/commit/?id=811c6688774613a78bfa020f64b570b73f6974c8
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v11.1.0 kraken candidate released

2016-12-13 Thread John Spray
On Tue, Dec 13, 2016 at 12:18 PM, Dietmar Rieder
 wrote:
> Hi John,
>
> Thanks for your answer.
> The mentioned modification of the pool validation would then allow
> CephFS to have the data pools on EC while keeping the metadata on a
> replicated pool, right?

I would expect so.

John

>
> Dietmar
>
> On 12/13/2016 12:35 PM, John Spray wrote:
>> On Tue, Dec 13, 2016 at 7:35 AM, Dietmar Rieder
>>  wrote:
>>> Hi,
>>>
>>> this is good news! Thanks.
>>>
>>> As far as I can see, RBD now (experimentally) supports EC data pools. Is
>>> this also true for CephFS? It is not stated in the announcement, so I wonder
>>> if and when EC pools are planned to be supported by CephFS.
>>
>> Nobody has worked on this so far.  For EC data pools, it should mainly
>> be a case of modifying the pool validation in MDSMonitor that
>> currently prevents assigning an EC pool.  I strongly suspect we'll get
>> around to this before Luminous.
>>
>> John
>>
>>> ~regards
>>>   Dietmar
>>>
>>> On 12/13/2016 03:28 AM, Abhishek L wrote:
 Hi everyone,

 This is the first release candidate for Kraken, the next stable
 release series. There have been major changes from jewel with many
 features being added. Please note the upgrade process from jewel,
 before upgrading.

 Major Changes from Jewel
 

 - *RADOS*:

   * The new *BlueStore* backend now has a stable disk format and is
 passing our failure and stress testing. Although the backend is
 still flagged as experimental, we encourage users to try it out
 for non-production clusters and non-critical data sets.
   * RADOS now has experimental support for *overwrites on
 erasure-coded* pools. Because the disk format and implementation
 are not yet finalized, there is a special pool option that must be
 enabled to test the new feature.  Enabling this option on a cluster
 will permanently bar that cluster from being upgraded to future
 versions.
   * We now default to the AsyncMessenger (``ms type = async``) instead
 of the legacy SimpleMessenger.  The most noticeable difference is
 that we now use a fixed sized thread pool for network connections
 (instead of two threads per socket with SimpleMessenger).
   * Some OSD failures are now detected almost immediately, whereas
 previously the heartbeat timeout (which defaults to 20 seconds)
 had to expire.  This prevents IO from blocking for an extended
 period for failures where the host remains up but the ceph-osd
 process is no longer running.
   * There is a new ``ceph-mgr`` daemon.  It is currently collocated with
 the monitors by default, and is not yet used for much, but the basic
 infrastructure is now in place.
   * The size of encoded OSDMaps has been reduced.
   * The OSDs now quiesce scrubbing when recovery or rebalancing is in 
 progress.

 - *RGW*:

   * RGW now supports a new zone type that can be used for metadata indexing
 via Elasticsearch.
   * RGW now supports the S3 multipart object copy-part API.
   * It is possible now to reshard an existing bucket. Note that bucket
 resharding currently requires that all IO (especially writes) to
 the specific bucket is quiesced.
   * RGW now supports data compression for objects.
   * Civetweb version has been upgraded to 1.8
   * The Swift static website API is now supported (S3 support has been 
 added
 previously).
   * S3 bucket lifecycle API has been added. Note that currently it only 
 supports
 object expiration.
   * Support for custom search filters has been added to the LDAP auth
 implementation.
   * Support for NFS version 3 has been added to the RGW NFS gateway.
   * A Python binding has been created for librgw.

 - *RBD*:

   * RBD now supports images stored in an *erasure-coded* RADOS pool
 using the new (experimental) overwrite support. Images must be
 created using the new rbd CLI "--data-pool " option to
 specify the EC pool where the backing data objects are
 stored. Attempting to create an image directly on an EC pool will
 not be successful since the image's backing metadata is only
 supported on a replicated pool.
   * The rbd-mirror daemon now supports replicating dynamic image
 feature updates and image metadata key/value pairs from the
 primary image to the non-primary image.
   * The number of image snapshots can be optionally restricted to a
 configurable maximum.
   * The rbd Python API now supports asynchronous IO operations.

 - *CephFS*:

   * libcephfs function definitions have been changed to enable proper
 uid/gid control.  The library 

Re: [ceph-users] v11.1.0 kraken candidate released

2016-12-13 Thread Dietmar Rieder
Hi John,

Thanks for your answer.
The mentioned modification of the pool validation would then allow
CephFS to have the data pools on EC while keeping the metadata on a
replicated pool, right?

Dietmar

On 12/13/2016 12:35 PM, John Spray wrote:
> On Tue, Dec 13, 2016 at 7:35 AM, Dietmar Rieder
>  wrote:
>> Hi,
>>
>> this is good news! Thanks.
>>
>> As far as I can see, RBD now (experimentally) supports EC data pools. Is
>> this also true for CephFS? It is not stated in the announcement, so I wonder
>> if and when EC pools are planned to be supported by CephFS.
> 
> Nobody has worked on this so far.  For EC data pools, it should mainly
> be a case of modifying the pool validation in MDSMonitor that
> currently prevents assigning an EC pool.  I strongly suspect we'll get
> around to this before Luminous.
> 
> John
> 
>> ~regards
>>   Dietmar
>>
>> On 12/13/2016 03:28 AM, Abhishek L wrote:
>>> Hi everyone,
>>>
>>> This is the first release candidate for Kraken, the next stable
>>> release series. There have been major changes from jewel with many
>>> features being added. Please note the upgrade process from jewel,
>>> before upgrading.
>>>
>>> Major Changes from Jewel
>>> 
>>>
>>> - *RADOS*:
>>>
>>>   * The new *BlueStore* backend now has a stable disk format and is
>>> passing our failure and stress testing. Although the backend is
>>> still flagged as experimental, we encourage users to try it out
>>> for non-production clusters and non-critical data sets.
>>>   * RADOS now has experimental support for *overwrites on
>>> erasure-coded* pools. Because the disk format and implementation
>>> are not yet finalized, there is a special pool option that must be
>>> enabled to test the new feature.  Enabling this option on a cluster
>>> will permanently bar that cluster from being upgraded to future
>>> versions.
>>>   * We now default to the AsyncMessenger (``ms type = async``) instead
>>> of the legacy SimpleMessenger.  The most noticeable difference is
>>> that we now use a fixed sized thread pool for network connections
>>> (instead of two threads per socket with SimpleMessenger).
>>>   * Some OSD failures are now detected almost immediately, whereas
>>> previously the heartbeat timeout (which defaults to 20 seconds)
>>> had to expire.  This prevents IO from blocking for an extended
>>> period for failures where the host remains up but the ceph-osd
>>> process is no longer running.
>>>   * There is a new ``ceph-mgr`` daemon.  It is currently collocated with
>>> the monitors by default, and is not yet used for much, but the basic
>>> infrastructure is now in place.
>>>   * The size of encoded OSDMaps has been reduced.
>>>   * The OSDs now quiesce scrubbing when recovery or rebalancing is in 
>>> progress.
>>>
>>> - *RGW*:
>>>
>>>   * RGW now supports a new zone type that can be used for metadata indexing
>>> via Elasticsearch.
>>>   * RGW now supports the S3 multipart object copy-part API.
>>>   * It is possible now to reshard an existing bucket. Note that bucket
>>> resharding currently requires that all IO (especially writes) to
>>> the specific bucket is quiesced.
>>>   * RGW now supports data compression for objects.
>>>   * Civetweb version has been upgraded to 1.8
>>>   * The Swift static website API is now supported (S3 support has been added
>>> previously).
>>>   * S3 bucket lifecycle API has been added. Note that currently it only 
>>> supports
>>> object expiration.
>>>   * Support for custom search filters has been added to the LDAP auth
>>> implementation.
>>>   * Support for NFS version 3 has been added to the RGW NFS gateway.
>>>   * A Python binding has been created for librgw.
>>>
>>> - *RBD*:
>>>
>>>   * RBD now supports images stored in an *erasure-coded* RADOS pool
>>> using the new (experimental) overwrite support. Images must be
>>> created using the new rbd CLI "--data-pool " option to
>>> specify the EC pool where the backing data objects are
>>> stored. Attempting to create an image directly on an EC pool will
>>> not be successful since the image's backing metadata is only
>>> supported on a replicated pool.
>>>   * The rbd-mirror daemon now supports replicating dynamic image
>>> feature updates and image metadata key/value pairs from the
>>> primary image to the non-primary image.
>>>   * The number of image snapshots can be optionally restricted to a
>>> configurable maximum.
>>>   * The rbd Python API now supports asynchronous IO operations.
>>>
>>> - *CephFS*:
>>>
>>>   * libcephfs function definitions have been changed to enable proper
>>> uid/gid control.  The library version has been increased to reflect the
>>> interface change.
>>>   * Standby replay MDS daemons now consume less memory on workloads
>>> doing deletions.
>>>   * Scrub now repairs backtrace, and populates `damage ls` with

Re: [ceph-users] How to release Hammer osd RAM when compiled with jemalloc

2016-12-13 Thread Sage Weil
On Tue, 13 Dec 2016, Dong Wu wrote:
> Hi, all
>I have a cluster with nearly 1000 OSDs, and each OSD already
> occupies 2.5G of physical memory on average, which causes 90% memory
> usage on each host. When using tcmalloc, we can use "ceph tell osd.* release"
> to release unused memory, but in my cluster ceph is built with
> jemalloc, so we can't use "ceph tell osd.* release". Are there any methods
> to release some memory?

We explicitly call into tcmalloc to release memory with that command, but 
unless you've patched something in yourself there is no integration with 
jemalloc's release API.

> another question:
> can I decrease the following config values, which are used for cached osdmaps,
> to lower the OSDs' memory?
> 
> "mon_min_osdmap_epochs": "500"
> "osd_map_max_advance": "200",
> "osd_map_cache_size": "500",
> "osd_map_message_max": "100",
> "osd_map_share_max_epochs": "100"

Yeah.  You should be fine with 500, 50, 100, 50, 50.
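
In ceph.conf terms that would look roughly like the following; a sketch to
adapt rather than a recommendation beyond the numbers above:

[mon]
    mon min osdmap epochs = 500

[osd]
    osd map max advance = 50
    osd map cache size = 100
    osd map message max = 50
    osd map share max epochs = 50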

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)

2016-12-13 Thread Francois Lafont
On 12/13/2016 12:42 PM, Francois Lafont wrote:

> But, _on_ _principle_, in the specific case of ceph (I know it's not the
> usual case of packages which provide daemons), I think it would be
> safer and more practical if the ceph packages didn't manage the restart of
> daemons.

And I say (even if I think it was relatively clear in my first post) that
*it was the case* before the 10.2.5 version, so I was surprised by this
change.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Revisiting: Many clients (X) failing to respond to cache pressure

2016-12-13 Thread John Spray
On Tue, Dec 13, 2016 at 6:03 AM, Goncalo Borges
 wrote:
> Hi Ceph(FS)ers...
>
> I am currently running in production the following environment:
>
> - ceph/cephfs in 10.2.2.
> - All infrastructure is in the same version (rados cluster, mons, mds and
> cephfs clients).
> - We mount cephfs using ceph-fuse.
>
> Since yesterday we have had our cluster in warning state with the message
> "mds0: Many clients (X) failing to respond to cache pressure". X has been
> changing with time, from ~130 to ~70. I am able to correlate the appearance
> of this message with bursts of jobs in our cluster.
>
> This subject has been discussed in the mailing list a lot of times, and
> normally, the recipe is to look for something wrong in the clients. So, I
> have tried to look to clients first:
>
> 1) I've started to loop through all my clients, and run 'ceph --admin-daemon
> /var/run/ceph/ceph-client.mount_user.asok status' to get the inodes_count
> reported in each client.
>
> $ cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g' | awk
> '{s+=$1} END {print s}'
> 2407659
>
> 2) I've then compared with the number of inodes the mds had in its cache
> (obtained by a perf dump)
>  inode_max": 200 and "inodes": 2413826
>
> 3) I've tried to understand how many clients had a number of inodes higher
> than 16384 (the default) and got
>
> $ for i in `cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g'
> `; do if [ $i -ge 16384 ]; then echo $i; fi; done | wc -l
> 27
>
> 4) My conclusion is that the bulk of the inodes is held by a couple of machines.
> However, while the majority are running user jobs, others are not doing
> anything at all. For example, an idle machine (which had no users logged in,
> no jobs running, and updatedb does not search the cephfs filesystem) reported
> more than > 30 inodes. To regain those inodes, I had to umount and
> remount cephfs on that machine.
>
> 5) Based on my previous observations I suspect that there are still some
> problems in the ceph-fuse client regarding recovering these inodes (or it
> happens at a very slow rate).

Seems that way.  Can you come up with a reproducer for us, and/or
gather some client+mds debug logs where a client is failing to respond
to cache pressure?
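
A sketch of how the debug levels can be raised at runtime for that; the asok
path matches the one you quoted above, and the mds name is a placeholder:

 # on a misbehaving ceph-fuse client
 ceph --admin-daemon /var/run/ceph/ceph-client.mount_user.asok config set debug_client 20
 # on the active MDS
 ceph daemon mds.<name> config set debug_mds 20
 ceph daemon mds.<name> config set debug_ms 1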

Also, what kernel is in use on the clients?  It's possible that the
issue is in FUSE itself (or the way that it responds to ceph-fuse's
attempts to ask for some inodes to be released).

> However, I also do not completely understand what is happening on the server
> side:
>
> 6) The current memory usage of my mds is the following:
>
>   PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM  TIME+ COMMAND
> 17831 ceph  20   0 13.667g 0.012t  10048 S  37.5 40.2   1068:47 ceph-mds
>
> The mds cache size is set to 200. Running 'ceph daemon mds. perf
> dump', I get  "inode_max": 200 and "inodes": 2413826. Assuming 4k per
> each inode one gets ~10G. So why it is taking much more than that?
>
>
> 7) I have been running cephfs for more than a year, and looking at ganglia,
> the mds memory never decreases but always increases (even in cases when we
> umount almost all the clients). Why does that happen?

Coincidentally someone posted about this on ceph-devel just yesterday.
The answer is that the MDS uses memory pools for allocation, and it
doesn't (currently) ever bother releasing memory back to the operating
system because it's doing its own cache size enforcement.  However,
when the cache size limits aren't being enforced (for example because
of clients failing to release caps) this becomes a problem.  There's a
patch for master (https://github.com/ceph/ceph/pull/12443)

>
> 8) I am running 2 mds, in active / standby-replay mode. The memory of the
> standby-replay is much lower
>
>   PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM  TIME+ COMMAND
>   716 ceph  20   0 6149424 5.115g   8524 S   1.2 43.6  53:19.74 ceph-mds
>
> If I trigger a restart on my active mds, the standby replay will start
> acting as active, but will continue with the same amount of memory. Why the
> second mds can become active, and do the same job but using much more
> memory?

Presumably this also makes sense once you know about the allocator in use.

> 9) Finally, I am sending an extract of 'ceph daemon mds. perf dump' from
> my active and standby mdses. What is exactly the meaning of inodes_pin_tail,
> inodes_expired and inodes_with_caps? Is the standby mds suppose to show the
> same numbers? They don't...

It's not really possible to explain these counters without a
substantial explanation of MDS internals, sorry.  I will say though
that there is absolutely no guarantee of performance counters on the
standby replay daemon matching those on the active daemon.

John

> Thanks in advance for your answers /  suggestions
>
> Cheers
>
> Goncalo
>
>
>
> active:
>
> "mds": {
> "request": 93941296,
> "reply": 93940671,
> "reply_latency": {

[ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)

2016-12-13 Thread Francois Lafont
Hi @all,

I have a little remark concerning at least the Trusty ceph packages (maybe
it concerns other distributions too, I don't know).

I'm pretty sure that before the 10.2.5 version, the restart of the daemons
wasn't managed during package upgrades, and with the 10.2.5 version it now
is. I explain below.

Personally, during a "ceph" upgrade, I prefer to manage the "ceph" daemons
_myself_. For instance, during a "ceph" upgrade of an Ubuntu Trusty OSD server,
I usually do something like this:


# I stop all the OSD daemons (here, it's an upstart command but it's
# an implementation detail, the idea is just "I stop all OSD"):
sudo stop ceph-osd-all

# And after that, I launch the "ceph" upgrade with something like that:
sudo apt-get update && sudo apt-get upgrade

# (*) Before the 10.2.5 version, the daemons weren't automatically
# restarted by the upgrade and personally, that was a _good_ thing
# for me. Now, with the 10.2.5 version, the daemons seem to be
# automatically restarted.

# Personally, after a "ceph" upgrade, I always prefer to launch a _reboot_
# of the server.
sudo reboot


So, now with the 10.2.5 version, in my process, the OSD daemons are stopped,
then automatically restarted by the upgrade and then stopped again
by the reboot. This is not an optimal process of course. ;)

I know perfectly well that there are workarounds to avoid an automatic restart
of the daemons during "ceph" upgrades (for instance, in the case of
Trusty, I could temporarily remove the files
/var/lib/ceph/osd/ceph-$id/upstart).
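
For reference, that workaround would look roughly like this (a sketch only;
whether it is sufficient depends on the packaging scripts):

 for d in /var/lib/ceph/osd/ceph-*; do sudo mv $d/upstart $d/upstart.disabled; done
 sudo apt-get update && sudo apt-get upgrade
 for d in /var/lib/ceph/osd/ceph-*; do sudo mv $d/upstart.disabled $d/upstart; done
 sudo reboot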

But, _on_ _principle_, in the specific case of ceph (I know it's not the
usual case of packages which provide daemons), I think it would be
safer and more practical if the ceph packages didn't manage the restart of
daemons.

What do you think about that? Maybe I'm wrong... ;)

François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v11.1.0 kraken candidate released

2016-12-13 Thread John Spray
On Tue, Dec 13, 2016 at 7:35 AM, Dietmar Rieder
 wrote:
> Hi,
>
> this is good news! Thanks.
>
> As far as I can see, RBD now (experimentally) supports EC data pools. Is
> this also true for CephFS? It is not stated in the announcement, so I wonder
> if and when EC pools are planned to be supported by CephFS.

Nobody has worked on this so far.  For EC data pools, it should mainly
be a case of modifying the pool validation in MDSMonitor that
currently prevents assigning an EC pool.  I strongly suspect we'll get
around to this before Luminous.

John

> ~regards
>   Dietmar
>
> On 12/13/2016 03:28 AM, Abhishek L wrote:
>> Hi everyone,
>>
>> This is the first release candidate for Kraken, the next stable
>> release series. There have been major changes from jewel with many
>> features being added. Please note the upgrade process from jewel,
>> before upgrading.
>>
>> Major Changes from Jewel
>> 
>>
>> - *RADOS*:
>>
>>   * The new *BlueStore* backend now has a stable disk format and is
>> passing our failure and stress testing. Although the backend is
>> still flagged as experimental, we encourage users to try it out
>> for non-production clusters and non-critical data sets.
>>   * RADOS now has experimental support for *overwrites on
>> erasure-coded* pools. Because the disk format and implementation
>> are not yet finalized, there is a special pool option that must be
>> enabled to test the new feature.  Enabling this option on a cluster
>> will permanently bar that cluster from being upgraded to future
>> versions.
>>   * We now default to the AsyncMessenger (``ms type = async``) instead
>> of the legacy SimpleMessenger.  The most noticeable difference is
>> that we now use a fixed sized thread pool for network connections
>> (instead of two threads per socket with SimpleMessenger).
>>   * Some OSD failures are now detected almost immediately, whereas
>> previously the heartbeat timeout (which defaults to 20 seconds)
>> had to expire.  This prevents IO from blocking for an extended
>> period for failures where the host remains up but the ceph-osd
>> process is no longer running.
>>   * There is a new ``ceph-mgr`` daemon.  It is currently collocated with
>> the monitors by default, and is not yet used for much, but the basic
>> infrastructure is now in place.
>>   * The size of encoded OSDMaps has been reduced.
>>   * The OSDs now quiesce scrubbing when recovery or rebalancing is in 
>> progress.
>>
>> - *RGW*:
>>
>>   * RGW now supports a new zone type that can be used for metadata indexing
>> via Elasticsearch.
>>   * RGW now supports the S3 multipart object copy-part API.
>>   * It is possible now to reshard an existing bucket. Note that bucket
>> resharding currently requires that all IO (especially writes) to
>> the specific bucket is quiesced.
>>   * RGW now supports data compression for objects.
>>   * Civetweb version has been upgraded to 1.8
>>   * The Swift static website API is now supported (S3 support has been added
>> previously).
>>   * S3 bucket lifecycle API has been added. Note that currently it only 
>> supports
>> object expiration.
>>   * Support for custom search filters has been added to the LDAP auth
>> implementation.
>>   * Support for NFS version 3 has been added to the RGW NFS gateway.
>>   * A Python binding has been created for librgw.
>>
>> - *RBD*:
>>
>>   * RBD now supports images stored in an *erasure-coded* RADOS pool
>> using the new (experimental) overwrite support. Images must be
>> created using the new rbd CLI "--data-pool " option to
>> specify the EC pool where the backing data objects are
>> stored. Attempting to create an image directly on an EC pool will
>> not be successful since the image's backing metadata is only
>> supported on a replicated pool.
>>   * The rbd-mirror daemon now supports replicating dynamic image
>> feature updates and image metadata key/value pairs from the
>> primary image to the non-primary image.
>>   * The number of image snapshots can be optionally restricted to a
>> configurable maximum.
>>   * The rbd Python API now supports asynchronous IO operations.
>>
>> - *CephFS*:
>>
>>   * libcephfs function definitions have been changed to enable proper
>> uid/gid control.  The library version has been increased to reflect the
>> interface change.
>>   * Standby replay MDS daemons now consume less memory on workloads
>> doing deletions.
>>   * Scrub now repairs backtrace, and populates `damage ls` with
>> discovered errors.
>>   * A new `pg_files` subcommand to `cephfs-data-scan` can identify
>> files affected by a damaged or lost RADOS PG.
>>   * The false-positive "failing to respond to cache pressure" warnings have
>> been fixed.
>>
>>
>> Upgrading from Jewel
>> 
>>
>> * All clusters must first be upgraded to Jewel 10.2.z 

Re: [ceph-users] can cache-mode be set to readproxy for tier cache with ceph 0.94.9 ?

2016-12-13 Thread JiaJia Zhong
shinjo, thanks for your help,


#1 How small is the actual data?

23K, 24K, 165K , I didn't record all of them.




#2 Is the symptom reproducible with different data of the same size?

No. We have some processes that create files; the 0-byte files became normal 
after they were overwritten by those processes.
It's hard to reproduce. 
The newer kernels have many cephfs patches, so I have even mounted cephfs with 
another client (kernel 4.9) to wait for the issue, hoping the newer kernel 
client can read the file correctly. Is this possible?
ps: When we first met this issue, restarting the mds could cure it (but that 
was ceph 0.94.1).




#3 Can you share your ceph.conf (ceph --show-config)?

some cache configs first.
$ ceph osd pool get data_cache hit_set_count
hit_set_count: 1
$ ceph osd pool get data_cache min_read_recency_for_promote
min_read_recency_for_promote: 0
$ ceph osd pool get data_cache target_max_bytes
target_max_bytes: 5000
$ ceph osd pool get data_cache target_max_objects
target_max_objects: 100
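
The remaining cache-tier tunables can be dumped the same way if they are
useful; a sketch:

$ ceph osd pool get data_cache hit_set_type
$ ceph osd pool get data_cache hit_set_period
$ ceph osd pool get data_cache cache_target_dirty_ratio
$ ceph osd pool get data_cache cache_target_full_ratio
$ ceph osd pool get data_cache cache_min_flush_age
$ ceph osd pool get data_cache cache_min_evict_age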



The whole ceph config is below.
Please ignore the net configs :)


#ceph --show-config
name = client.admin
cluster = ceph
debug_none = 0/5
debug_lockdep = 0/1
debug_context = 0/1
debug_crush = 1/1
debug_mds = 1/5
debug_mds_balancer = 1/5
debug_mds_locker = 1/5
debug_mds_log = 1/5
debug_mds_log_expire = 1/5
debug_mds_migrator = 1/5
debug_buffer = 0/1
debug_timer = 0/1
debug_filer = 0/1
debug_striper = 0/1
debug_objecter = 0/1
debug_rados = 0/5
debug_rbd = 0/5
debug_rbd_replay = 0/5
debug_journaler = 0/5
debug_objectcacher = 0/5
debug_client = 0/5
debug_osd = 0/5
debug_optracker = 0/5
debug_objclass = 0/5
debug_filestore = 1/3
debug_keyvaluestore = 1/3
debug_journal = 1/3
debug_ms = 0/5
debug_mon = 1/5
debug_monc = 0/10
debug_paxos = 1/5
debug_tp = 0/5
debug_auth = 1/5
debug_crypto = 1/5
debug_finisher = 1/1
debug_heartbeatmap = 1/5
debug_perfcounter = 1/5
debug_rgw = 1/5
debug_civetweb = 1/10
debug_javaclient = 1/5
debug_asok = 1/5
debug_throttle = 1/1
debug_refs = 0/0
debug_xio = 1/5
host = localhost
fsid = 477c0de9-fa96-4000-a87b-2f4ba4a15472
public_addr = :/0
cluster_addr = :/0
cluster_network = X
num_client = 1
monmap = 
mon_host =  XXX
lockdep = false
lockdep_force_backtrace = false
run_dir = /var/run/ceph
admin_socket = 
daemonize = false
pid_file = 
chdir = /
max_open_files = 0
restapi_log_level = 
restapi_base_url = 
fatal_signal_handlers = true
log_file = 
log_max_new = 1000
log_max_recent = 500
log_to_stderr = true
err_to_stderr = true
log_to_syslog = false
err_to_syslog = false
log_flush_on_exit = true
log_stop_at_utilization = 0.97
clog_to_monitors = default=true
clog_to_syslog = false
clog_to_syslog_level = info
clog_to_syslog_facility = default=daemon audit=local0
mon_cluster_log_to_syslog = default=false
mon_cluster_log_to_syslog_level = info
mon_cluster_log_to_syslog_facility = daemon
mon_cluster_log_file = default=/var/log/ceph/ceph.$channel.log 
cluster=/var/log/ceph/ceph.log
mon_cluster_log_file_level = info
enable_experimental_unrecoverable_data_corrupting_features = 
xio_trace_mempool = false
xio_trace_msgcnt = false
xio_trace_xcon = false
xio_queue_depth = 512
xio_mp_min = 128
xio_mp_max_64 = 65536
xio_mp_max_256 = 8192
xio_mp_max_1k = 8192
xio_mp_max_page = 4096
xio_mp_max_hint = 4096
xio_portal_threads = 2
key = 
keyfile = 
keyring = 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin
heartbeat_interval = 5
heartbeat_file = 
heartbeat_inject_failure = 0
perf = true
ms_type = simple
ms_tcp_nodelay = true
ms_tcp_rcvbuf = 0
ms_tcp_prefetch_max_size = 4096
ms_initial_backoff = 0.2
ms_max_backoff = 15
ms_crc_data = true
ms_crc_header = true
ms_die_on_bad_msg = false
ms_die_on_unhandled_msg = false
ms_die_on_old_message = false
ms_die_on_skipped_message = false
ms_dispatch_throttle_bytes = 104857600
ms_bind_ipv6 = false
ms_bind_port_min = 6800
ms_bind_port_max = 7300
ms_bind_retry_count = 3
ms_bind_retry_delay = 5
ms_rwthread_stack_bytes = 1048576
ms_tcp_read_timeout = 900
ms_pq_max_tokens_per_priority = 16777216
ms_pq_min_cost = 65536
ms_inject_socket_failures = 0
ms_inject_delay_type = 
ms_inject_delay_msg_type = 
ms_inject_delay_max = 1
ms_inject_delay_probability = 0
ms_inject_internal_delays = 0
ms_dump_on_send = false
ms_dump_corrupt_message_level = 1
ms_async_op_threads = 2
ms_async_set_affinity = true
ms_async_affinity_cores = 
inject_early_sigterm = false
mon_data = /var/lib/ceph/mon/ceph-admin
mon_initial_members = cephn1
mon_sync_fs_threshold = 5
mon_compact_on_start = false
mon_compact_on_bootstrap = false
mon_compact_on_trim = true
mon_osd_cache_size = 10
mon_tick_interval = 5
mon_subscribe_interval = 300
mon_delta_reset_interval = 10
mon_osd_laggy_halflife = 3600
mon_osd_laggy_weight = 0.3
mon_osd_adjust_heartbeat_grace = true
mon_osd_adjust_down_out_interval = true
mon_osd_auto_mark_in = false
mon_osd_auto_mark_auto_out_in = true
mon_osd_auto_mark_new_in = true
mon_osd_down_out_interval = 300

Re: [ceph-users] can cache-mode be set to readproxy for tier cache with ceph 0.94.9 ?

2016-12-13 Thread Shinobu Kinjo
On Tue, Dec 13, 2016 at 4:38 PM, JiaJia Zhong 
wrote:

> hi cephers:
> we are using ceph hammer 0.94.9 (yes, it's not the latest, jewel),
> with some SSD OSDs for tiering. cache-mode is set to readproxy and
> everything seems to work as expected,
> but when reading some small files from cephfs, we get 0 bytes.
>

Would you be able to share:

 #1 How small is the actual data?
 #2 Is the symptom reproducible with different data of the same size?
 #3 Can you share your ceph.conf (ceph --show-config)?


>
> I did some searching and found the link below,
> http://ceph-users.ceph.narkive.com/g4wcB8ED/cephfs-
> with-cache-tiering-reading-files-are-filled-with-0s
> that's almost the same as what we are suffering from, except that the
> cache-mode in the link is writeback; ours is readproxy.
>
> that bug should have been FIXED in 0.94.9 (http://tracker.ceph.com/
> issues/12551),
> but we can still encounter it occasionally :(
>
>environment:
>  - ceph: 0.94.9
>  - kernel client: 4.2.0-36-generic (ubuntu 14.04)
>  - any other details needed?
>
>Questions:
>1. Does readproxy mode work on ceph 0.94.9? Only writeback and readonly
> are mentioned in the Hammer documentation.
>2. Has anyone on Jewel or Hammer met the same issue?
>
>
> Looping in Yan, Zheng.
>Quoted from the link for convenience:
>  """
> Hi, 
>
> I am experiencing an issue with CephFS with cache tiering where the kernel
> clients are reading files filled entirely with 0s.
>
> The setup:
> ceph 0.94.3
> create cephfs_metadata replicated pool
> create cephfs_data replicated pool
> cephfs was created on the above two pools, populated with files, then:
> create cephfs_ssd_cache replicated pool,
> then adding the tiers:
> ceph osd tier add cephfs_data cephfs_ssd_cache
> ceph osd tier cache-mode cephfs_ssd_cache writeback
> ceph osd tier set-overlay cephfs_data cephfs_ssd_cache
>
> While the cephfs_ssd_cache pool is empty, multiple kernel clients on
> different hosts open the same file (the size of the file is small, <10k) at
> approximately the same time. A number of the clients from the OS level see
> the entire file being empty. I can do a rados -p {cache pool} ls for the
> list of files cached, and do a rados -p {cache pool} get {object} /tmp/file
> and see the complete contents of the file.
> I can repeat this by setting cache-mode to forward, rados -p {cache pool}
> cache-flush-evict-all, checking no more objects in cache with rados -p
> {cache pool} ls, resetting cache-mode to writeback with an empty pool, and
> doing the multiple same file opens.
>
> Has anyone seen this issue? It seems like a race condition where the
> object is not yet completely loaded into the cache pool, so the cache
> pool serves out an incomplete object.
> If anyone can shed some light or any suggestions to help debug this issue,
> that would be very helpful.
>
> Thanks,
> Arthur"""
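For reference, a minimal sketch of the flush/evict cycle described in the quoted report, adapted to the readproxy setup from this thread. Pool names are the ones used above; depending on the release, cache-mode changes may additionally ask for --yes-i-really-mean-it:

 put the tier into forward mode and empty it:
  # ceph osd tier cache-mode cephfs_ssd_cache forward
  # rados -p cephfs_ssd_cache cache-flush-evict-all
  # rados -p cephfs_ssd_cache ls
 switch to readproxy and retry:
  # ceph osd tier cache-mode cephfs_ssd_cache readproxy
 then open the same small file from several cephfs clients at once and compare
 what they see with what rados returns:
  # rados -p cephfs_ssd_cache get ${object} /tmp/obj
  # ls -l /tmp/obj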


Re: [ceph-users] What happens if all replica OSDs journals are broken?

2016-12-13 Thread Wojciech Kobryń
Hi,

Recently I lost 5 out of 12 OSD journals (2x SSD failure at the same time).
size=2, min_size=1. I know, it should rather be 3/2; I have plans to switch
to that as soon as possible.

Ceph started to throw many failures, so I removed those two SSDs and
recreated the OSD journals from scratch. In my case, all data on the main
OSDs was still there, and Ceph did the best it could to block writes to the
OSDs and keep the data consistent.
After re-creating all 5 journals on another HDD, recovery+backfill started
to work. After a couple of hours it reported 7 "unfound" objects (6 data
objects and 1 hitset in the cache tier). I found out which files were
affected and hoped not to lose important data. I then tried to revert the 6
unfound objects to their previous versions, but that was unsuccessful, so I
just deleted them. The biggest problem we found was the single hitset object
that we couldn't simply delete; instead we took another hitset object and
copied it onto the missing one. The cache tier then recognized this hitset
and invalidated it, which allowed the backfill+recovery to finish, and
finally the entire Ceph cluster went back to HEALTH_OK. Lastly I ran fsck
wherever those 6 unfound objects could have had an impact and, fortunately,
the lost blocks were not important and contained empty data, so the fsck
recovery was successful in all cases. That was a very stressful time :)
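For anyone hitting the same situation, a rough sketch of the commands involved in finding and clearing unfound objects. The PG id is only an example taken from ceph health detail, and mark_unfound_lost is destructive, so treat it as a last resort:

 # ceph health detail
 # ceph pg 3.2f list_missing
 # ceph pg 3.2f mark_unfound_lost revert
 # ceph pg 3.2f mark_unfound_lost delete
revert rolls an object back to a previous version if one exists; delete forgets it entirely.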

-- 
Wojtek

On Tue, 13 Dec 2016 at 00:01, Christian Balzer wrote:

> On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
>
> > Hi,
> >
> > just in case: What happens when all replica journal SSDs are broken at
> once?
> >
> That would be bad, as in BAD.
>
> In theory you just "lost" all the associated OSDs and their data.
>
> In practice everything but the in-flight data at the time is still on
> the actual OSDs (HDDs), but it is inconsistent and inaccessible as far as
> Ceph is concerned.
>
> So with some trickery and an experienced data-recovery Ceph consultant you
> _may_ get things running with limited data loss/corruption, but that's
> speculation and may be wishful thinking on my part.
>
> Another data point in favour of deploying only well-known, monitored and
> trusted SSDs and having 3x replication.
>
> > The PGs most likely will be stuck inactive but, as I read, the journals
> > just need to be replaced
> > (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).
> >
> > Does this also work in this case?
> >
> >
> > Does this also work in this case?
> >
> Not really, no.
>
> The above works because there are still operational OSDs with a valid
> state from which the "broken" one can recover.
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
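For the single-failed-journal case that the linked post covers, the usual sequence is roughly the sketch below. The OSD id and the upstart commands are examples for the Ubuntu systems mentioned in this thread; with a dead SSD the flush step is impossible, which is exactly the in-flight data Christian refers to:

 stop the OSD:
  # stop ceph-osd id=12
 flush the old journal (only possible while it is still readable):
  # ceph-osd -i 12 --flush-journal
 replace the SSD, repoint the journal symlink, create a fresh journal and restart:
  # ceph-osd -i 12 --mkjournal
  # start ceph-osd id=12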


Re: [ceph-users] Upgrading from Hammer

2016-12-13 Thread Wido den Hollander

> On 13 December 2016 at 9:05, Kees Meijs wrote:
> 
> 
> Hi guys,
> 
> In the past few months, I've read some posts about upgrading from
> Hammer. Maybe I've missed something, but I haven't really read anything
> about QEMU/KVM behaviour in this context.
> 
> At the moment, we're using:
> 
> > $ qemu-system-x86_64 --version
> > QEMU emulator version 2.3.0 (Debian 1:2.3+dfsg-5ubuntu9.4~cloud2),
> > Copyright (c) 2003-2008 Fabrice Bellard
> The Ubuntu package (originating from Canonical's cloud archive) is
> utilising:
> 
>   * librados2 - 0.94.8-0ubuntu0.15.10.1~cloud0
>   * librbd1 - 0.94.8-0ubuntu0.15.10.1~cloud0
> 
> I'm very curious whether someone out there is using a similar version
> with a Ceph cluster on Jewel. Anything to take into account?
> 

Why? The Ubuntu Cloud Archive is there to provide you with a newer Qemu on
an older Ubuntu system.

If you run Qemu under Ubuntu 16.04 and use the DEB packages directly from
Ceph, you should be fine.

Recent Qemu and recent Ceph :)

Wido

> Thanks in advance!
> 
> Best regards,
> Kees
> 
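A quick way to verify which librbd/librados a given Qemu binary actually loads, before and after such an upgrade. Binary and package names are as used on the Ubuntu systems discussed above:

 # ldd /usr/bin/qemu-system-x86_64 | grep -E 'librbd|librados'
 # dpkg -s librbd1 librados2 | grep -E '^(Package|Version)'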


[ceph-users] Upgrading from Hammer

2016-12-13 Thread Kees Meijs
Hi guys,

In the past few months, I've read some posts about upgrading from
Hammer. Maybe I've missed something, but I haven't really read anything
about QEMU/KVM behaviour in this context.

At the moment, we're using:

> $ qemu-system-x86_64 --version
> QEMU emulator version 2.3.0 (Debian 1:2.3+dfsg-5ubuntu9.4~cloud2),
> Copyright (c) 2003-2008 Fabrice Bellard
The Ubuntu package (originating from Canonical's cloud archive) is
utilising:

  * librados2 - 0.94.8-0ubuntu0.15.10.1~cloud0
  * librbd1 - 0.94.8-0ubuntu0.15.10.1~cloud0

I'm very curious whether someone out there is using a similar version
with a Ceph cluster on Jewel. Anything to take into account?

Thanks in advance!

Best regards,
Kees
