[ceph-users] Delete unused RBD volume takes to long.

2017-07-15 Thread Alvaro Soto
Hi,
has anyone experienced, or does anyone know, why deleting an RBD volume
takes so much longer than creating one?

My test was this:

   - Create a 1PB volume -> less than a minute
   - Delete the volume created -> like 2 days

The result was unexpected to me and I still don't know the reason. The
deletion was started right after the creation finished, so the volume was
never used.
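
(For reference, the test boiled down to something like this; the pool and
image names are placeholders, and --size is in MB on Hammer:)

rbd create --size 1073741824 rbd/bigvol   # 1 PB; creation only writes metadata
rbd rm rbd/bigvol                         # issues a delete for every possible
                                          # data object, hence the long runtime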

About the environment:

   - Ubuntu trusty 64bits
   - 5-server cluster
   - 24 Intel SSDs per server, 800GB each
   - Intel NVMe PCIe for journal
   - CEPH Hammer.
   - Replica 3

Hope that someone can tell me why this deletion takes so long.
Best.

-- 

ATTE. Alvaro Soto Escobar

--
Great people talk about ideas,
average people talk about things,
small people talk ... about other people.


[ceph-users] iSCSI production ready?

2017-07-15 Thread Alvaro Soto
Hi guys,
does anyone know in which release the iSCSI interface is going to be
production ready, if it isn't already?

I mean without the use of a gateway, i.e. a native endpoint connector to a
CEPH cluster.

Thanks in advance.
Best.

-- 

ATTE. Alvaro Soto Escobar

--
Great people talk about ideas,
average people talk about things,
small people talk ... about other people.


[ceph-users] some OSDs stuck down after 10.2.7 -> 10.2.9 update

2017-07-15 Thread Lincoln Bryant

Hi all,

After updating to 10.2.9, some of our SSD-based OSDs get put into "down" 
state and die as in [1].


After bringing these OSDs back up, they sit at 100% CPU utilization and 
never become up/in. From the log I see (from [2]):
heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had 
timed out after 1

before they ultimately crash.

Stracing them, I see them chewing on omaps for a while and then they 
seem to do nothing, but CPU utilization is still quite high.
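
(For reference, the stracing was along these lines; the pid is whichever
ceph-osd process is spinning:)

strace -f -T -e trace=open,read,write,pread64,pwrite64 -p <ceph-osd pid>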


I downgraded (inadvisable, I know) these OSDs to 10.2.7 and they come 
back happily.  I tried setting debug_osd = 20, debug_filestore = 20, 
debug_ms = 20, debug_auth = 20, debug_leveldb = 20 but it didn't seem 
like there was any additional information in the logs.
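
(For anyone reproducing this, a minimal sketch of those debug settings in
ceph.conf on the affected host; bumping them at runtime with
"ceph tell osd.N injectargs" also works while the daemon is still alive:)

[osd]
debug osd = 20
debug filestore = 20
debug ms = 20
debug auth = 20
debug leveldb = 20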


Does anyone have any clues how to debug this further? I'm a bit worried 
about running a mix of 10.2.7 and 10.2.9 OSDs in my pool.


For what it's worth, the SSD OSDs in this CRUSH root are serving CephFS 
metadata. Other OSDs (spinners in EC and replicated pools) are 
completely OK as far as I can tell. All hosts are EL7.


Thanks,
Lincoln

[1]
-8> 2017-07-15 13:21:51.959502 7f9d23a2a700  1 -- 192.170.226.253:0/2474101 <== osd.456 192.170.226.250:6807/3547149 1293  osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2  47+0+0 (584190599 0 0) 0x7f9dd6a93000 con 0x7f9dcf4d2300
-7> 2017-07-15 13:21:51.959578 7f9d2b26b700  1 -- 192.170.226.253:0/2474101 <== osd.461 192.170.226.255:6814/4575940 1295  osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2  47+0+0 (584190599 0 0) 0x7f9d9a1c9200 con 0x7f9dc38fff80
-6> 2017-07-15 13:21:51.959597 7f9d2b46d700  1 -- 192.170.226.253:0/2474101 <== osd.460 192.170.226.254:6851/2545858 1290  osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2  47+0+0 (584190599 0 0) 0x7f9d9a1c7600 con 0x7f9dc3900a00
-5> 2017-07-15 13:21:51.959612 7f9d1e14f700  1 -- 192.170.226.253:0/2474101 <== osd.434 192.170.226.242:6803/3058582 1293  osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2  47+0+0 (584190599 0 0) 0x7f9dc78c0800 con 0x7f9d7aebae80
-4> 2017-07-15 13:21:51.959650 7f9d19792700  1 -- 192.170.226.253:0/2474101 <== osd.437 192.170.226.245:6818/2299326 1277  osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2  47+0+0 (584190599 0 0) 0x7f9dc78c0200 con 0x7f9dd0c0ba80
-3> 2017-07-15 13:21:51.959666 7f9d5d940700  1 -- 192.170.226.253:0/2474101 <== osd.460 192.170.226.254:6849/2545858 1290  osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2  47+0+0 (584190599 0 0) 0x7f9d9a1c8200 con 0x7f9dc38ff500
-2> 2017-07-15 13:21:52.085120 7f9d659a2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9ce0504700' had timed out after 15
-1> 2017-07-15 13:21:52.085130 7f9d659a2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9ce0504700' had suicide timed out after 150
 0> 2017-07-15 13:21:52.108248 7f9d659a2700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f9d659a2700 time 2017-07-15 13:21:52.085137

common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f9d6bb0f4a5]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f9d6ba4e541]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f9d6ba4ed9e]
 4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7f9d6ba4f57c]
 5: (CephContextServiceThread::entry()+0x15b) [0x7f9d6bb2724b]
 6: (()+0x7dc5) [0x7f9d69a26dc5]
 7: (clone()+0x6d) [0x7f9d680b173d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.





[2]

2017-07-15 14:35:23.730434 7f1d98bde800  0 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0), process ceph-osd, pid 2559209
2017-07-15 14:35:23.731923 7f1d98bde800  0 pidfile_write: ignore empty --pid-file
2017-07-15 14:35:23.772858 7f1d98bde800  0 filestore(/var/lib/ceph/osd/ceph-459) backend xfs (magic 0x58465342)
2017-07-15 14:35:23.773367 7f1d98bde800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-07-15 14:35:23.773374 7f1d98bde800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-07-15 14:35:23.773393 7f1d98bde800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: splice is supported
2017-07-15 14:35:24.148987 7f1d98bde800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-07-15 14:35:24.149090 7f1d98bde800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-459)

Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB

2017-07-15 Thread Ruben Rodriguez


On 14/07/17 18:43, Ruben Rodriguez wrote:
> How to reproduce...

I'll provide more concise details on how to test this behavior:

Ceph config:

[client]
rbd readahead max bytes = 0 # we don't want forced readahead to fool us
rbd cache = true

Start a qemu vm, with a rbd image attached with virtio-scsi:

[The libvirt disk XML was stripped by the list archive. It attached the RBD
image via virtio-scsi with cache mode writethrough; a rough command-line
equivalent is sketched after the block device parameters below.]

Block device parameters, inside the vm:
NAME ALIGN  MIN-IO  OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE   RA WSAME
sdb      0 4194304 4194304     512     512    1 noop      128 4096    2G
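
(Since the archive ate the libvirt XML above, here is a rough command-line
equivalent, for illustration only; the pool/image name, auth id and ceph.conf
path are placeholders:)

# only the storage-related bits; the rest of the VM definition is omitted
qemu-system-x86_64 \
  -device virtio-scsi-pci,id=scsi0 \
  -drive file=rbd:rbd/test-vm:id=admin:conf=/etc/ceph/ceph.conf,format=raw,if=none,id=drive0,cache=writethrough \
  -device scsi-hd,drive=drive0,bus=scsi0.0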

Collect performance statistics from librbd, using command:

$ ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok perf dump

Note the values for:
- rd: number of read operations done by qemu
- rd_bytes: length of read requests done by qemu
- cache_ops_hit: read operations hitting the cache
- cache_ops_miss: read ops missing the cache
- data_read: data read from the cache
- op_r: number of objects sent by the OSD

Perform one small read, not at the beginning of the image (because udev
may have read it already), at a 4MB boundary:

dd if=/dev/sda ibs=512 count=1 skip=41943040 iflag=skip_bytes

Run the perf dump command again.
Do it again, advancing 5000 bytes (so as not to overlap with the previous read):

dd if=/dev/sda ibs=512 count=1 skip=41948040 iflag=skip_bytes

Run the perf dump command again.

If you compare the op_r values at each step, you should see a cache miss
each time, and an object read each time: the same object is fetched twice.
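
(A small sketch to automate the comparison; the admin socket path is a
placeholder and jq is assumed to be available:)

SOCK=/var/run/ceph/ceph-client.admin.12345.asok   # placeholder path
counters() {
  ceph --admin-daemon "$SOCK" perf dump |
    jq 'to_entries[] | select(.key | startswith("librbd")) | .value
        | {rd, rd_bytes, cache_ops_hit, cache_ops_miss, op_r}'
}
counters
dd if=/dev/sda ibs=512 count=1 skip=41943040 iflag=skip_bytes
counters
dd if=/dev/sda ibs=512 count=1 skip=41948040 iflag=skip_bytes
counters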

IMPACT:

Let's take a look at how the op_r value increases by doing some common
operations:

- Booting a vm: This operation needs (in my case) ~70MB to be read,
which include the kernel, initrd and all files read by systemd and
daemons, until a command prompt appears. Values read
"rd": 2524,
"rd_bytes": 69685248,
"cache_ops_hit": 228,
"cache_ops_miss": 2268,
"cache_bytes_hit": 90353664,
"cache_bytes_miss": 63902720,
"data_read": 69186560,
"op": 2295,
"op_r": 2279,
That is 2,279 objects being fetched from the OSD to read 69MB.

- Grepping inside the Linux source code (833MB) takes almost 3 minutes.
  Values increase to:
"rd": 65127,
"rd_bytes": 1081487360,
"cache_ops_hit": 228,
"cache_ops_miss": 64885,
"cache_bytes_hit": 90353664,
"cache_bytes_miss": 1075672064,
"data_read": 1080988672,
"op_r": 64896,
That is over 60,000 objects fetched to read <1GB, and *0* cache hits.
Properly optimized, this should take about 10 seconds and fetch ~700 objects.

Is my Qemu implementation completely broken? Or is this expected? Please
help!

-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org


Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB

2017-07-15 Thread Ruben Rodriguez


On 15/07/17 09:43, Nick Fisk wrote:
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Gregory Farnum
>> Sent: 15 July 2017 00:09
>> To: Ruben Rodriguez 
>> Cc: ceph-users 
>> Subject: Re: [ceph-users] RBD cache being filled up in small increases 
>> instead
>> of 4MB
>>
>> On Fri, Jul 14, 2017 at 3:43 PM, Ruben Rodriguez  wrote:
>>>
>>> I'm having an issue with small sequential reads (such as searching
>>> through source code files, etc), and I found that multiple small reads
>>> within a 4MB boundary would fetch the same object from the OSD
>>> multiple times, as it gets inserted into the RBD cache partially.
>>>
>>> How to reproduce: rbd image accessed from a Qemu vm using virtio-scsi,
>>> writethrough cache on. Monitor with perf dump on the rbd client. The
>>> image is filled up with zeroes in advance. Rbd readahead is off.
>>>
>>> 1 - Small read from a previously unread section of the disk:
>>> dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
>>> Notes: dd cannot read less than 512 bytes. The skip is arbitrary to
>>> avoid the beginning of the disk, which would have been read at boot.
>>>
>>> Expected outcomes: perf dump should show a +1 increase on values rd,
>>> cache_ops_miss and op_r. This happens correctly.
>>> It should show a 4194304 increase in data_read as a whole object is
>>> put into the cache. Instead it increases by 4096. (not sure why 4096, btw).
>>>
>>> 2 - Small read from less than 4MB distance (in the example, +5000b).
>>> dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes Expected
>>> outcomes: perf dump should show a +1 increase on cache_ops_hit.
>>> Instead cache_ops_miss increases.
>>> It should show a 4194304 increase in data_read as a whole object is
>>> put into the cache. Instead it increases by 4096.
>>> op_r should not increase. Instead it increases by one, indicating that
>>> the object was fetched again.
>>>
>>> My tests show that this could be causing a 6 to 20-fold performance
>>> loss in small sequential reads.
>>>
>>> Is it by design that the RBD cache only inserts the portion requested
>>> by the client instead of the whole last object fetched? Could it be a
>>> tunable in any of my layers (fs, block device, qemu, rbd...) that is
>>> preventing this?
>>
>> I don't know the exact readahead default values in that stack, but there's no
>> general reason to think RBD (or any Ceph component) will read a whole
>> object at a time. In this case, you're asking for 512 bytes and it appears to
>> have turned that into a 4KB read (probably the virtual block size in use?),
>> which seems pretty reasonable — if you were asking for 512 bytes out of
>> every 4MB and it was reading 4MB each time, you'd probably be wondering
>> why you were only getting 1/8192 the expected bandwidth. ;) -Greg
> 
> I think the general readahead logic might be a bit more advanced in the Linux 
> kernel vs using readahead from the librbd client.

Yes, the problems I'm having should be corrected by the vm kernel
issuing larger reads, but I'm failing to get that to happen.

> The kernel will watch how successful each readahead is and scale as
> necessary. You might want to try upping the read_ahead_kb for the block device
> in the VM. Something between 4MB and 32MB works well for RBDs, but make sure
> you have a 4.x kernel, as some fixes to the readahead max size were introduced
> and I'm not sure if they ever got backported.

I'm using kernels 4.4 and 4.8. I have readahead, min_io_size,
optimum_io_size and max_sectors_kb set to 4MB. It helps in some use
cases, like fio or dd tests, but not with real-world tests like cp,
grep or tar on a large pool of small files.
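
(For reference, a sketch of how those knobs get set; sdb is just an example
device, the sysfs values are in KB, and the I/O size hints are -device
properties on the qemu side, as far as I understand:)

# inside the VM: 4MB readahead and 4MB maximum request size
echo 4096 > /sys/block/sdb/queue/read_ahead_kb
echo 4096 > /sys/block/sdb/queue/max_sectors_kb
# on the host, to advertise 4MB I/O hints to the guest (values in bytes):
#   -device scsi-hd,drive=drive0,bus=scsi0.0,min_io_size=4194304,opt_io_size=4194304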

From all I can tell, optimal read performance would happen if the vm
kernel read in 4MB increments _every_ _time_. I can force that with an
ugly hack (putting the files inside a big formatted file, mounted as a
loop device), which gives a 20-fold performance gain. But that is just silly...

I documented that experiment on this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018924.html
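
(A rough sketch of that hack, for completeness; paths and sizes are made up:)

dd if=/dev/zero of=/mnt/rbd/container.img bs=1M count=10240   # one big file
mkfs.ext4 -F /mnt/rbd/container.img
mkdir -p /mnt/inner
mount -o loop /mnt/rbd/container.img /mnt/inner
# put the small files in /mnt/inner; reads of the container file now reach
# librbd as large sequential requests instead of one tiny read per file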

> Unless you tell the rbd client to not disable readahead after reading the 1st 
> x number of bytes (rbd readahead disable after bytes=0), it will stop reading 
> ahead and will only cache exactly what is requested by the client.

I realized that, so as a proof of concept I made some changes to the
readahead mechanism: I force it on, make it trigger on every read, and set
the maximum and minimum readahead size to 4MB. This way whole objects
get into the cache, and I get a 6-fold performance gain reading small files.

This is just a proof of concept, I don't advocate for this behavior to
be implemented by the readahead function. Ideally it should be up to the
client to issue the correct read sizes. But what if the client is
faulty? I think it could be useful to have the option to tell librbd to
cache whole objects.
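
(For what it's worth, the closest config-only approximation of that behavior,
without patching librbd, would be something like this; whether it helps
depends on the workload:)

[client]
rbd readahead trigger requests = 1       # start reading ahead on the first request
rbd readahead max bytes = 4194304        # read ahead up to a whole 4MB object
rbd readahead disable after bytes = 0    # never hand readahead off to the guest OS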

-- 

Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB

2017-07-15 Thread Ruben Rodriguez


On 15/07/17 15:33, Jason Dillaman wrote:
> On Sat, Jul 15, 2017 at 9:43 AM, Nick Fisk  wrote:
>> Unless you tell the rbd client to not disable readahead after reading the 
>> 1st x number of bytes (rbd readahead disable after bytes=0), it will stop 
>> reading ahead and will only cache exactly what is requested by the client.
> 
> The default is to disable librbd readahead caching after reading 50MB
> -- since we expect the OS to take over and do a much better job.

I understand having the expectation that the client would do the right
thing, but from all I can tell it is not the case. I've run out of ways
to try to make virtio-scsi (or any other driver) *always* read in 4MB
increments. "minimum_io_size" seems to be ignored.
BTW I just sent this patch to Qemu (and I'm open to any suggestions on
that side!): https://bugs.launchpad.net/qemu/+bug/1600563

But the expectation you mention still has a problem: if you only put in
the RBD cache what the OS specifically requested, the chances of that
data being requested twice are pretty low, since the OS page cache would
take care of it better than the RBD cache anyway. So why bother having a
read cache if it doesn't fetch anything extra?

Incidentally, if the RBD cache were to include the whole object instead
of just the requested portion, RBD readahead would be unnecessary.

-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org


Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-15 Thread Udo Lembke
Hi,

On 15.07.2017 16:01, Phil Schwarz wrote:
> Hi,
> ...
>
> While investigating, i wondered about my config :
> Question relative to /etc/hosts file :
> Should i use private_replication_LAN Ip or public ones ?
Use the private replication LAN IPs. And the pve-cluster should use another
network (separate NICs) if possible.
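
(I.e. something like this in /etc/hosts, with made-up addresses on the
replication network:)

10.10.10.11  node1
10.10.10.12  node2
10.10.10.13  node3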

Udo


Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB

2017-07-15 Thread Jason Dillaman
On Sat, Jul 15, 2017 at 9:43 AM, Nick Fisk  wrote:
> Unless you tell the rbd client to not disable readahead after reading the 1st 
> x number of bytes (rbd readahead disable after bytes=0), it will stop reading 
> ahead and will only cache exactly what is requested by the client.

The default is to disable librbd readahead caching after reading 50MB
-- since we expect the OS to take over and do a much better job.
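
(For reference, that threshold is this librbd option, whose default is 50MB
as far as I know:)

rbd readahead disable after bytes = 52428800   # 50MB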

-- 
Jason


[ceph-users] Re: Re: Re: No "snapset" attribute for clone object

2017-07-15 Thread 许雪寒
I debugged a little, and found that this might have something to do with the 
"cache evict" and "list_snaps" operations.

I debugged the "core" file of the process with gdb, and confirmed that the 
object that caused the segmentation fault is 
rbd_data.d18d71b948ac7.062e, just as the following logs indicates:

(gdb) f 4
#4  calc_snap_set_diff (cct=, snap_set=..., start=, end=, diff=0x7ffed23a4640, end_size=, 
end_exists=0x7ffed23a461f)
at librados/snap_set_diff.cc:41
41        a = r->snaps[0];
(gdb) p r
$1 = {cloneid = 22, snaps = std::vector of length 0, capacity 0, overlap = 
std::vector of length 2, capacity 2 = {{first = 0, second = 786432}, {first = 
1523712, second = 2670592}}, 
  size = 4194304}
(gdb) f 5
#5  0x7fa87a4359c4 in compute_diffs (diffs=0x7ffed23a4630, 
this=0x7fa88f196820) at librbd/DiffIterate.cc:130
130        _exists);
(gdb) p m_oid
$2 = "rbd_data.d18d71b948ac7.", '0' , "62e"

Then we checked the cache tier osd's log:

2017-07-14 18:27:11.122472 7f91a365f700 10 osd.58.objecter ms_dispatch 
0x7f91e2a9c140 osd_op_reply(2877166 rbd_data.d18d71b948ac7.062e 
[copy-get max 8388608] v0'0 uv47138 ondisk = 0) v7
2017-07-14 18:27:11.122514 7f91b395d700 10 osd.58 pg_epoch: 2160 pg[6.38b( v 
2160'164121 (2133'161077,2160'164121] local-les=1977 n=81 ec=279 les/c/f 
1977/1977/0 1975/1976/789) [58,46,35] r=0 lpr=1976 crt=2160'164119 lcod 
2160'164120 mlcod 2160'164120 active+clean] process_copy_chunk 
6:d1d35c73:::rbd_data.d18d71b948ac7.062e:16 tid 2877166 (0) Success
2017-07-14 18:27:11.129590 7f91b395d700 10 osd.58.objecter _op_submit oid 
rbd_data.d18d71b948ac7.062e '@8' '@8' [assert-version 
v47138,copy-get max 8388608] tid 2877168 osd.0
2017-07-14 18:27:11.129602 7f91b395d700  1 -- 10.142.121.179:0/24945 --> 
10.142.121.142:6824/6246 -- osd_op(osd.58.789:2877168 8.ce3acb8b 
rbd_data.d18d71b948ac7.062e [assert-version v47138,copy-get max 
8388608] snapc 0=[] 
ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e2160) v7 -- ?+0 0x7f91ee305180 con 0x7f921046e880
2017-07-14 18:27:11.133206 7f91a365f700  1 -- 10.142.121.179:0/24945 <== osd.0 
10.142.121.142:6824/6246 149  osd_op_reply(2877168 
rbd_data.d18d71b948ac7.062e [assert-version v47138,copy-get max 
8388608] v0'0 uv47138 ondisk = 0) v7  201+0+119 (2793013310 0 570526743) 
0x7f91fb306680 con 0x7f921046e880
2017-07-14 18:27:11.133220 7f91a365f700 10 osd.58.objecter ms_dispatch 
0x7f91e2a9c140 osd_op_reply(2877168 rbd_data.d18d71b948ac7.062e 
[assert-version v47138,copy-get max 8388608] v0'0 uv47138 ondisk = 0) v7
2017-07-14 18:27:11.133264 7f91b395d700 10 osd.58 pg_epoch: 2160 pg[6.38b( v 
2160'164121 (2133'161077,2160'164121] local-les=1977 n=81 ec=279 les/c/f 
1977/1977/0 1975/1976/789) [58,46,35] r=0 lpr=1976 crt=2160'164120 lcod 
2160'164120 mlcod 2160'164120 active+clean] process_copy_chunk 
6:d1d35c73:::rbd_data.d18d71b948ac7.062e:16 tid 2877168 (0) Success
2017-07-14 18:27:11.133475 7f91b395d700 10 osd.58 pg_epoch: 2160 pg[6.38b( v 
2160'164121 (2133'161077,2160'164121] local-les=1977 n=81 ec=279 les/c/f 
1977/1977/0 1975/1976/789) [58,46,35] r=0 lpr=1976 crt=2160'164120 lcod 
2160'164120 mlcod 2160'164120 active+clean] finish_promote 
6:d1d35c73:::rbd_data.d18d71b948ac7.062e:16 r=0 uv47138
2017-07-14 18:27:11.133495 7f91b395d700 20 osd.58 pg_epoch: 2160 pg[6.38b( v 
2160'164121 (2133'161077,2160'164121] local-les=1977 n=81 ec=279 les/c/f 
1977/1977/0 1975/1976/789) [58,46,35] r=0 lpr=1976 crt=2160'164120 lcod 
2160'164120 mlcod 2160'164120 active+clean] simple_opc_create 
6:d1d35c73:::rbd_data.d18d71b948ac7.062e:16
2017-07-14 18:27:11.133529 7f91b395d700 20 osd.58 pg_epoch: 2160 pg[6.38b( v 
2160'164121 (2133'161077,2160'164121] local-les=1977 n=81 ec=279 les/c/f 
1977/1977/0 1975/1976/789) [58,46,35] r=0 lpr=1976 crt=2160'164120 lcod 
2160'164120 mlcod 2160'164120 active+clean] finish_ctx 
6:d1d35c73:::rbd_data.d18d71b948ac7.062e:16 0x7f91eb158000 op 
promote
2017-07-14 18:27:11.133612 7f91b395d700  7 osd.58 pg_epoch: 2160 pg[6.38b( v 
2160'164121 (2133'161077,2160'164121] local-les=1977 n=82 ec=279 les/c/f 
1977/1977/0 1975/1976/789) [58,46,35] r=0 lpr=1976 crt=2160'164120 lcod 
2160'164120 mlcod 2160'164120 active+clean] issue_repop rep_tid 29670336 o 
6:d1d35c73:::rbd_data.d18d71b948ac7.062e:16
2017-07-14 18:27:11.133722 7f91b395d700  1 -- 10.143.208.51:6802/3024945 --> 
10.143.208.16:6819/4176877 -- osd_repop(osd.58.0:29670336 6.38b 
6:d1d35c73:::rbd_data.d18d71b948ac7.062e:16 v 2160'164122) v1 -- 
?+676 0x7f91e84e6600 con 0x7f91f58c2580
2017-07-14 18:27:11.133770 7f91b395d700  1 -- 10.143.208.51:6802/3024945 --> 
10.143.208.50:6800/2039335 -- osd_repop(osd.58.0:29670336 6.38b 
6:d1d35c73:::rbd_data.d18d71b948ac7.062e:16 v 2160'164122) v1 -- 
?+676 0x7f91e84e8a00 con 0x7f91ebf70280
2017-07-14 

[ceph-users] When are bugs available in the rpm repository

2017-07-15 Thread Marc Roos
 
When are fixes for bugs like this one (http://tracker.ceph.com/issues/20563)
available in the rpm repository
(https://download.ceph.com/rpm-luminous/el7/x86_64/)?

I can't quite tell from this page:
http://docs.ceph.com/docs/master/releases/. Maybe the availability of rpm
updates could be mentioned there specifically.
Or maybe a date could be added to the release notes pages
(http://docs.ceph.com/docs/master/release-notes/#v12-0-3-luminous-dev)?
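
(For what it's worth, a quick way to see what has actually landed in the repo;
this assumes the ceph repo is already configured in yum:)

yum --showduplicates list ceph
# or poke the repository index directly:
curl -s https://download.ceph.com/rpm-luminous/el7/x86_64/ | grep -o 'ceph-[0-9][^"<]*\.rpm' | sort -u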








[ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-15 Thread Phil Schwarz

Hi,

short version:
I broke my cluster!

Long version, with context:
I have a 4-node Proxmox cluster.
The nodes are all Proxmox 5.05 + Ceph Luminous with filestore:
- 3 mon+OSD
- 1 LXC+OSD

It was working fine.
I added a fifth node (Proxmox+Ceph) today and broke everything.

Though every node can ping each other, the web GUI is full of red
crossed-out nodes. No LXC container is shown, though they are up and alive.

However, every other Proxmox node is manageable through the web GUI.

In the logs, I have tons of the same message on 2 out of 3 mons:

" failed to decode message of type 80 v6: buffer::malformed_input: void 
pg_history_t::decode(ceph::buffer::list::iterator&) unknown encoding 
version > 7"


Thanks for your answers.
Best regards

While investigating, I wondered about my config.
A question relative to the /etc/hosts file:
should I use the private replication LAN IPs or the public ones?


Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB

2017-07-15 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Gregory Farnum
> Sent: 15 July 2017 00:09
> To: Ruben Rodriguez 
> Cc: ceph-users 
> Subject: Re: [ceph-users] RBD cache being filled up in small increases instead
> of 4MB
> 
> On Fri, Jul 14, 2017 at 3:43 PM, Ruben Rodriguez  wrote:
> >
> > I'm having an issue with small sequential reads (such as searching
> > through source code files, etc), and I found that multiple small reads
> > within a 4MB boundary would fetch the same object from the OSD
> > multiple times, as it gets inserted into the RBD cache partially.
> >
> > How to reproduce: rbd image accessed from a Qemu vm using virtio-scsi,
> > writethrough cache on. Monitor with perf dump on the rbd client. The
> > image is filled up with zeroes in advance. Rbd readahead is off.
> >
> > 1 - Small read from a previously unread section of the disk:
> > dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
> > Notes: dd cannot read less than 512 bytes. The skip is arbitrary to
> > avoid the beginning of the disk, which would have been read at boot.
> >
> > Expected outcomes: perf dump should show a +1 increase on values rd,
> > cache_ops_miss and op_r. This happens correctly.
> > It should show a 4194304 increase in data_read as a whole object is
> > put into the cache. Instead it increases by 4096. (not sure why 4096, btw).
> >
> > 2 - Small read from less than 4MB distance (in the example, +5000b).
> > dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes Expected
> > outcomes: perf dump should show a +1 increase on cache_ops_hit.
> > Instead cache_ops_miss increases.
> > It should show a 4194304 increase in data_read as a whole object is
> > put into the cache. Instead it increases by 4096.
> > op_r should not increase. Instead it increases by one, indicating that
> > the object was fetched again.
> >
> > My tests show that this could be causing a 6 to 20-fold performance
> > loss in small sequential reads.
> >
> > Is it by design that the RBD cache only inserts the portion requested
> > by the client instead of the whole last object fetched? Could it be a
> > tunable in any of my layers (fs, block device, qemu, rbd...) that is
> > preventing this?
> 
> I don't know the exact readahead default values in that stack, but there's no
> general reason to think RBD (or any Ceph component) will read a whole
> object at a time. In this case, you're asking for 512 bytes and it appears to
> have turned that into a 4KB read (probably the virtual block size in use?),
> which seems pretty reasonable — if you were asking for 512 bytes out of
> every 4MB and it was reading 4MB each time, you'd probably be wondering
> why you were only getting 1/8192 the expected bandwidth. ;) -Greg

I think the general readahead logic might be a bit more advanced in the Linux
kernel vs using readahead from the librbd client. The kernel will watch how
successful each readahead is and scale as necessary. You might want to try
upping the read_ahead_kb for the block device in the VM. Something between 4MB
and 32MB works well for RBDs, but make sure you have a 4.x kernel, as some fixes
to the readahead max size were introduced and I'm not sure if they ever got
backported.

Unless you tell the rbd client to not disable readahead after reading the 1st x 
number of bytes (rbd readahead disable after bytes=0), it will stop reading 
ahead and will only cache exactly what is requested by the client.

> 
> >
> > Regards,
> > --
> > Ruben Rodriguez | Senior Systems Administrator, Free Software
> > Foundation GPG Key: 05EF 1D2F FE61 747D 1FC8  27C3 7FAC 7D26 472F
> 4409
> > https://fsf.org | https://gnu.org
> >
> >
