Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-23 Thread Erik Logtenberg
Thanks!

Just so I understand correctly, the btrfs snapshots are mainly useful if
the journals are on the same disk as the osd, right? Is it indeed safe
to turn them off if the journals are on a separate ssd?

Kind regards,

Erik.


On 22-06-15 20:18, Krzysztof Nowicki wrote:
 On Mon, 22.06.2015 at 20:09, Lionel Bouton
 lionel-subscript...@bouton.name wrote:
 
 On 06/22/15 17:21, Erik Logtenberg wrote:
  I have the journals on a separate disk too. How do you disable the
  snapshotting on the OSD?
 http://ceph.com/docs/master/rados/configuration/filestore-config-ref/ :
 
 filestore btrfs snap = false
 
 Once this is done and verified working (after a restart of the OSD) make
 sure to remove the now unnecessary snapshots (snap_xxx) from the osd
 filesystem as failing to do so will cause an increase of occupied space
 over time (old and unneeded versions of objects will remain stored).
 This can be done by running 'sudo btrfs subvolume delete
 /var/lib/ceph/osd/ceph-xx/snap_yy'. To verify that the option change is
 effective you can observe the 'snap_xxx' directories - after disabling
 snapshotting their revision number should not increase any more.
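 
 A minimal sketch of that cleanup, assuming OSD id 5 and the default data
 path (adjust to your cluster):
 
 # list what is still there
 btrfs subvolume list /var/lib/ceph/osd/ceph-5 | grep snap_
 # remove the leftovers once the OSD runs with snapshots disabled
 for snap in /var/lib/ceph/osd/ceph-5/snap_*; do
     sudo btrfs subvolume delete "$snap"
 done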
 


Re: [ceph-users] What do internal_safe_to_start_threads and leveldb_compression do?

2015-06-02 Thread Erik Logtenberg
 What does this do?

 - leveldb_compression: false (default: true)
 - leveldb_block/cache/write_buffer_size (all bigger than default)
 
 I take it you're running these commands on a monitor (from I think the
 Dumpling timeframe, or maybe even Firefly)? These are hitting specific
 settings in LevelDB which we tune differently for the monitor and OSD,
 but which were shared config options in older releases. They have
 their own settings in newer code.
 -Greg
 

You are correct. I started out with Firefly and gradually upgraded the
cluster as new releases came out. I am on Hammer (0.94.1) now.

The current settings are different from the default. Does this mean
that the settings are still Firefly-like and should be changed to the
new defaults, or does it mean that the reported defaults are still
Firefly-like while the settings are actually Hammer-style ;) and thus
already right?
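
A quick way to compare what a given daemon is actually running with (a
sketch; the daemon name and admin socket are assumptions for my setup):

ceph daemon mon.ceph-01 config get leveldb_compression
ceph daemon mon.ceph-01 config get leveldb_write_buffer_size
ceph daemon mon.ceph-01 config diff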

Thanks,

Erik.



[ceph-users] What do internal_safe_to_start_threads and leveldb_compression do?

2015-06-01 Thread Erik Logtenberg
Hi,

I ran a config diff, like this:

ceph --admin-daemon (...).asok config diff

There are the obvious things like the fsid and IP-ranges, but two
settings stand out:

- internal_safe_to_start_threads: true (default: false)

What does this do?

- leveldb_compression: false (default: true)
- leveldb_block/cache/write_buffer_size (all bigger than default)

Any idea what these settings do and if it is a good idea to have them
differ from the default?
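
For context, if these had been set deliberately I'd expect them in
ceph.conf, roughly like this (values made up, just to illustrate where
such settings come from):

[mon]
  leveldb compression = false
  leveldb block size = 65536
  leveldb cache size = 536870912
  leveldb write buffer size = 33554432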

Thanks,

Erik.


[ceph-users] Mount options nodcache and nofsc

2015-05-21 Thread Erik Logtenberg
Hi,

Can anyone explain what the mount options nodcache and nofsc are for,
and especially why you would want to turn these options on/off (what are
the pros and cons either way?)
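
For reference, both are client-side options of the kernel cephfs mount;
a hedged example of how they would appear on a mount line (monitor
address and credentials are placeholders):

mount -t ceph ceph-01:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,nodcache,nofsc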

Thanks,

Erik.


[ceph-users] Interesting re-shuffling of pg's after adding new osd

2015-05-17 Thread Erik Logtenberg
Hi,

Two days ago I added a new osd to one of my ceph machines, because one
of the existing osd's got rather full. There was quite a difference in
disk space usage between osd's, but I understand this is kind of just
how ceph works. It spreads data over osd's but not perfectly even.

Now check out the graph of free disk space. You can clearly see the new
4TB osd added and how it starts to fill up. It's also quite visible that
some existing osd's profit more than others.
And not only is data put onto the new osd, but data is also exchanged
between existing osd's. This is also why it takes so incredibly long to
fill the new osd up: ceph is spending most of its time shuffling data
around instead of moving it to the new osd.

Anyway, what is especially troubling is that the osd that was already
lowest on disk space is actually filling up even more during this
process (!)
What's causing that and how can I get ceph to do the reasonable thing?

All crush weights are identical.
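
For what it's worth, two commands that help watch and correct the spread
(ceph osd df needs Hammer or later; the threshold below is just an
example to experiment with):

ceph osd df                           # per-OSD utilisation and variance
ceph osd reweight-by-utilization 120  # nudge OSDs that are >120% of average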

Thanks,

Erik.


Re: [ceph-users] CephFS unexplained writes

2015-05-07 Thread Erik Logtenberg
It does sound contradictory: why would read operations in cephfs result
in writes to disk? But they do. I upgraded to Hammer last week and I am
still seeing this.

The setup is as follows:

EC-pool on hdd's for data
replicated pool on ssd's for data-cache
replicated pool on ssd's for meta-data

Now whenever I start doing heavy reads on cephfs, I see intense bursts
of write operations on the hdd's. The reads I'm doing are things like
reading a large file (streaming a video), or running a big rsync job
with --dry-run (so it just checks meta-data). No clue why that would
have any effect on the hdd's, but it does.

Now, to further figure out what's going on, I tried using lsof, atop,
iotop, but those tools don't provide the necessary information. In lsof
I just see a whole bunch of files opened at any time, but it doesn't
change much during these tests.
In atop and iotop I can clearly see that the hdd's are doing a lot of
writes when I'm reading in cephfs, but those tools can't tell me what
those writes are.

So I tried strace, which can trace file operations and attach to running
processes.
# strace -f -e trace=file -p 5076
This gave me an idea of what was going on. 5076 is the process id of the
osd for one of the hdd's. I saw mostly stat's and open's, but those are
all reads, not writes. Of course btrfs can cause writes when doing reads
(atime), but I have the osd mounted with noatime.
The only write operations that I saw a lot of are these:

[pid  5350]
getxattr(/var/lib/ceph/osd/ceph-10/current/4.1es1_head/DIR_E/DIR_1/DIR_D/DIR_3,
user.cephos.phash.contents, \1Q\0\0\0\0\0\0\0\0\0\0\0\4\0\0, 1024) = 17
[pid  5350]
setxattr(/var/lib/ceph/osd/ceph-10/current/4.1es1_head/DIR_E/DIR_1/DIR_D/DIR_3,
user.cephos.phash.contents, \1R\0\0\0\0\0\0\0\0\0\0\0\4\0\0, 17, 0) = 0
[pid  5350]
removexattr(/var/lib/ceph/osd/ceph-10/current/4.1es1_head/DIR_E/DIR_1/DIR_D/DIR_3,
user.cephos.phash.contents@1) = -1 ENODATA (No data available)

So it appears that the osd's aren't writing actual data to disk, but
metadata in the form of xattr's. Can anyone explain what this setting
and removing of xattr's could be for?
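
For anyone who wants to look at those xattrs directly on the filestore
directories, getfattr dumps them (a sketch; the path is the one from the
strace output above):

getfattr -d -m 'user.cephos' \
    /var/lib/ceph/osd/ceph-10/current/4.1es1_head/DIR_E/DIR_1/DIR_D/DIR_3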

Kind regards,

Erik.


On 03/16/2015 10:44 PM, Gregory Farnum wrote:
 The information you're giving sounds a little contradictory, but my
 guess is that you're seeing the impacts of object promotion and
 flushing. You can sample the operations the OSDs are doing at any
 given time by running ops_in_progress (or similar, I forget exact
 phrasing) command on the OSD admin socket. I'm not sure if rados df
 is going to report cache movement activity or not.
 
 That though would mostly be written to the SSDs, not the hard drives —
 although the hard drives could still get metadata updates written when
 objects are flushed. What data exactly are you seeing that's leading
 you to believe writes are happening against these drives? What is the
 exact CephFS and cache pool configuration?
 -Greg
 
 On Mon, Mar 16, 2015 at 2:36 PM, Erik Logtenberg e...@logtenberg.eu wrote:
 Hi,

 I forgot to mention: while I am seeing these writes in iotop and
 /proc/diskstats for the hdd's, I am -not- seeing any writes in rados
 df for the pool residing on these disks. There is only one pool active
 on the hdd's and according to rados df it is getting zero writes when
 I'm just reading big files from cephfs.

 So apparently the osd's are doing some non-trivial amount of writing on
 their own behalf. What could it be?

 Thanks,

 Erik.


 On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
 Hi,

 I am getting relatively bad performance from cephfs. I use a replicated
 cache pool on ssd in front of an erasure coded pool on rotating media.

 When reading big files (streaming video), I see a lot of disk i/o,
 especially writes. I have no clue what could cause these writes. The
 writes are going to the hdd's and they stop when I stop reading.

 I mounted everything with noatime and nodiratime so it shouldn't be
 that. On a related note, the Cephfs metadata is stored on ssd too, so
 metadata-related changes shouldn't hit the hdd's anyway I think.

 Any thoughts? How can I get more information about what ceph is doing?
 Using iotop I only see that the osd processes are busy but it doesn't
 give many hints as to what they are doing.

 Thanks,

 Erik.


[ceph-users] CephFS unexplained writes

2015-03-16 Thread Erik Logtenberg
Hi,

I am getting relatively bad performance from cephfs. I use a replicated
cache pool on ssd in front of an erasure coded pool on rotating media.

When reading big files (streaming video), I see a lot of disk i/o,
especially writes. I have no clue what could cause these writes. The
writes are going to the hdd's and they stop when I stop reading.

I mounted everything with noatime and nodiratime so it shouldn't be
that. On a related note, the Cephfs metadata is stored on ssd too, so
metadata-related changes shouldn't hit the hdd's anyway I think.

Any thoughts? How can I get more information about what ceph is doing?
Using iotop I only see that the osd processes are busy but it doesn't
give many hints as to what they are doing.

Thanks,

Erik.


Re: [ceph-users] CephFS unexplained writes

2015-03-16 Thread Erik Logtenberg
Hi,

I forgot to mention: while I am seeing these writes in iotop and
/proc/diskstats for the hdd's, I am -not- seeing any writes in rados
df for the pool residing on these disks. There is only one pool active
on the hdd's and according to rados df it is getting zero writes when
I'm just reading big files from cephfs.

So apparently the osd's are doing some non-trivial amount of writing on
their own behalf. What could it be?

Thanks,

Erik.


On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
 Hi,
 
 I am getting relatively bad performance from cephfs. I use a replicated
 cache pool on ssd in front of an erasure coded pool on rotating media.
 
 When reading big files (streaming video), I see a lot of disk i/o,
 especially writes. I have no clue what could cause these writes. The
 writes are going to the hdd's and they stop when I stop reading.
 
 I mounted everything with noatime and nodiratime so it shouldn't be
 that. On a related note, the Cephfs metadata is stored on ssd too, so
 metadata-related changes shouldn't hit the hdd's anyway I think.
 
 Any thoughts? How can I get more information about what ceph is doing?
 Using iotop I only see that the osd processes are busy but it doesn't
 give many hints as to what they are doing.
 
 Thanks,
 
 Erik.


Re: [ceph-users] Crush Map and SSD Pools

2014-12-30 Thread Erik Logtenberg
Hi Lindsay,

Actually you just set up two entries for each host in your crush map. One
for hdd's and one for ssd's. My osd's look like this:

# id    weight  type name               up/down reweight
-6      1.8     root ssd
-7      0.45        host ceph-01-ssd
0       0.45            osd.0   up      1
-8      0.45        host ceph-02-ssd
3       0.45            osd.3   up      1
-9      0.45        host ceph-03-ssd
8       0.45            osd.8   up      1
-10     0.45        host ceph-04-ssd
11      0.45            osd.11  up      1
-1      29.12   root default
-2      7.28        host ceph-01
1       3.64            osd.1   up      1
2       3.64            osd.2   up      1
-3      7.28        host ceph-02
5       3.64            osd.5   up      1
4       3.64            osd.4   up      1
-4      7.28        host ceph-03
6       3.64            osd.6   up      1
7       3.64            osd.7   up      1
-5      7.28        host ceph-04
10      3.64            osd.10  up      1
9       3.64            osd.9   up      1

As you can see, I have four hosts: ceph-01 ... ceph-04, but eight host
entries. This works great.
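
One way to create such extra host buckets and place the SSD OSDs under
them (a sketch; bucket names and weights are examples, repeat per host
and per OSD):

ceph osd crush add-bucket ssd root
ceph osd crush add-bucket ceph-01-ssd host
ceph osd crush move ceph-01-ssd root=ssd
ceph osd crush set osd.0 0.45 root=ssd host=ceph-01-ssd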

Regards,

Erik.


On 30-12-14 15:13, Lindsay Mathieson wrote:
 I looked at the section for setting up different pools with different
 OSD's (e.g SSD Pool):
 
 http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds
 
 And it seems to make the assumption that the ssd's and platters all live
 on separate hosts.
 
 Not the case at all for my setup and I imagine for most people I have
 ssd's mixed with platters on the same hosts.
 
 In that case should one have the root buckets referencing buckets not
 based on hosts, e.g, something like this:
 
 # devices
 # Platters
 device 0 osd.0
 device 1 osd.1
 
 # SSD
 device 2 osd.2
 device 3 osd.3
 
 host vnb {
     id -2   # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0  # rjenkins1
     item osd.0 weight 1.000
     item osd.2 weight 1.000
 }
 
 host vng {
     id -3   # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0  # rjenkins1
     item osd.1 weight 1.000
     item osd.3 weight 1.000
 }
 
 row disk-platter {
     alg straw
     hash 0  # rjenkins1
     item osd.0 weight 1.000
     item osd.1 weight 1.000
 }
 
 row disk-ssd {
     alg straw
     hash 0  # rjenkins1
     item osd.2 weight 1.000
     item osd.3 weight 1.000
 }
 
 root default {
     id -1   # do not change unnecessarily
     # weight 2.000
     alg straw
     hash 0  # rjenkins1
     item disk-platter weight 2.000
 }
 
 root ssd {
     id -4
     alg straw
     hash 0
     item disk-ssd weight 2.000
 }
 
 # rules
 rule replicated_ruleset {
     ruleset 0
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type host
     step emit
 }
 
 rule ssd {
     ruleset 1
     type replicated
     min_size 0
     max_size 4
     step take ssd
     step chooseleaf firstn 0 type host
     step emit
 }
 
 --
 Lindsay
 
 
 


Re: [ceph-users] Cache tiers flushing logic

2014-12-30 Thread Erik Logtenberg
 
 Hi Erik,
 
 I have tiering working on a couple test clusters.  It seems to be
 working with Ceph v0.90 when I set:
 
 ceph osd pool set POOL  hit_set_type bloom
 ceph osd pool set POOL  hit_set_count 1
 ceph osd pool set POOL  hit_set_period 3600
 ceph osd pool set POOL  cache_target_dirty_ratio .5
 ceph osd pool set POOL  cache_target_full_ratio .9
 
 Eric
 

Hi Eric,

You say it seems to be working. My setup also seems to be working, in
the sense that the pools can be written to and read from. However the
cache flushing doesn't work as expected.
Do you mean that all objects in your cache are flushed during idle time?

Thanks,

Erik.


Re: [ceph-users] Crush Map and SSD Pools

2014-12-30 Thread Erik Logtenberg
No, bucket names in crush map are completely arbitrary. In fact, crush
doesn't really know what a host is. It is just a bucket, like rack
or datacenter. But they could be called cat and mouse just as well.

The only reason to use host names is for human readability.

You can then use crush rules to make sure that for instance two copies
of some object are not on the same host or in the same rack or not
in the same whatever bucket you like. This way you can define your
failure domains in correspondence with your physical layout.
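
For example, a rule that keeps all copies under the ssd root but never
puts two of them in the same host bucket looks something like this
(crushmap syntax, names assumed):

rule ssd {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take ssd
    step chooseleaf firstn 0 type host
    step emit
}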

Kind regards,

Erik.




On 12/30/2014 10:18 PM, Lindsay Mathieson wrote:
 On Tue, 30 Dec 2014 04:18:07 PM Erik Logtenberg wrote:
 As you can see, I have four hosts: ceph-01 ... ceph-04, but eight
 host entries. This works great.
 
 
 you have - host ceph-01 - host ceph-01-ssd
 
 Don't the host names have to match the real host names?
 
 
 


Re: [ceph-users] Crush Map and SSD Pools

2014-12-30 Thread Erik Logtenberg
If you want to be able to start your osd's with /etc/init.d/ceph init
script, then you better make sure that /etc/ceph/ceph.conf does link
the osd's to the actual hostname :)

Check out this snippet from my ceph.conf:


[osd.0]
host = ceph-01
osd crush location = host=ceph-01-ssd root=ssd

[osd.1]
host = ceph-01

[osd.2]
host = ceph-01


You see all osd's are linked to the right hostname. But the ssd osd is
then explicitly set to go into the right crush location too.

Kind regards,

Erik.



On 12/30/2014 11:11 PM, Lindsay Mathieson wrote:
 On Tue, 30 Dec 2014 10:38:14 PM Erik Logtenberg wrote:
 No, bucket names in crush map are completely arbitrary. In fact,
 crush doesn't really know what a host is. It is just a bucket,
 like rack or datacenter. But they could be called cat and
 mouse just as well.
 
 Hmmm, I tried that earlier and ran into problems with
 starting/stopping the osd - but maybe I screwed something else up.
 Will give it another go.
 
 
 


[ceph-users] Online converting of pool type

2014-12-23 Thread Erik Logtenberg
Hi,

Every now and then someone asks if it's possible to convert a pool to a
different type (replicated vs erasure / change the amount of pg's /
etc), but this is not supported. The advised approach is usually to just
create a new pool and somehow copy all data manually to this new pool,
removing the old pool afterwards. This is both unpractical and very time
consuming.

Recently I saw someone on this list suggest that the cache tiering
feature may actually be used to achieve some form of online converting
of pool types. Today I ran some tests and I would like to share my results.

I started out with a pool test-A, created an rbd image in the pool,
mapped it, created a filesystem in the rbd image, mounted the fs and
placed some test files in the fs. Just to have some objects in the
test-A pool.

I then added a test-B pool and transferred the data using cache tiering
as follows:

Step 0: We have a test-A pool and it contains data, some of which is in use.
# rados -p test-A df
test-A  -   9941   110
  0   0  324 2404   57 4717

Step 1: Create new pool test-B
# ceph osd pool create test-B 32
pool 'test-B' created

Step 2: Make pool test-A a cache pool for test-B.
# ceph osd tier add test-B test-A --force-nonempty
# ceph osd tier cache-mode test-A forward

Step 3: Move data from test-A to test-B (this potentially takes long)
# rados -p test-A cache-flush-evict-all
This step will move all data except the objects that are in active use,
so we are left with some remaining data on test-A pool.

Step 4: Move also the remaining data. This is the only step that doesn't
work online.
Step 4a: Disconnect all clients
# rbd unmap /dev/rbd/test-A/test-rbd   (in my case)
Step 4b: Move remaining objects
# rados -p test-A cache-flush-evict-all
# rados -p test-A ls  (should now be empty)

Step 5: Remove test-A as cache pool
# ceph osd tier remove test-B test-A

Step 6: Clients are allowed to connect with test-B pool (we are back in
online mode)
# rbd map test-B/test-rbd  (in my case)

Step 7: Remove the now empty pool test-A
# ceph osd pool delete test-A test-A --yes-i-really-really-mean-it


This worked smoothly. In my first try I actually used more steps, by creating


Re: [ceph-users] Online converting of pool type

2014-12-23 Thread Erik Logtenberg
Whoops, I accidently sent my mail before it was finished. Anyway I have
some more testing to do, especially with converting between
erasure/replicated pools. But it looks promising.

Thanks,

Erik.


On 23-12-14 16:57, Erik Logtenberg wrote:
 Hi,
 
 Every now and then someone asks if it's possible to convert a pool to a
 different type (replicated vs erasure / change the amount of pg's /
 etc), but this is not supported. The advised approach is usually to just
 create a new pool and somehow copy all data manually to this new pool,
 removing the old pool afterwards. This is both unpractical and very time
 consuming.
 
 Recently I saw someone on this list suggest that the cache tiering
 feature may actually be used to achieve some form of online converting
 of pool types. Today I ran some tests and I would like to share my results.
 
 I started out with a pool test-A, created an rbd image in the pool,
 mapped it, created a filesystem in the rbd image, mounted the fs and
 placed some test files in the fs. Just to have some objects in the
 test-A pool.
 
 I then added a test-B pool and transferred the data using cache tiering
 as follows:
 
 Step 0: We have a test-A pool and it contains data, some of which is in use.
 # rados -p test-A df
 test-A  -   9941   110
   0   0  324 2404   57 4717
 
 Step 1: Create new pool test-B
 # ceph osd pool create test-B 32
 pool 'test-B' created
 
 Step 2: Make pool test-A a cache pool for test-B.
 # ceph osd tier add test-B test-A --force-nonempty
 # ceph osd tier cache-mode test-A forward
 
 Step 3: Move data from test-A to test-B (this potentially takes long)
 # rados -p test-A cache-flush-evict-all
 This step will move all data except the objects that are in active use,
 so we are left with some remaining data on test-A pool.
 
 Step 4: Move also the remaining data. This is the only step that doesn't
 work online.
 Step 4a: Disconnect all clients
 # rbd unmap /dev/rbd/test-A/test-rbd   (in my case)
 Step 4b: Move remaining objects
 # rados -p test-A cache-flush-evict-all
 # rados -p test-A ls  (should now be empty)
 
 Step 5: Remove test-A as cache pool
 # ceph osd tier remove test-B test-A
 
 Step 6: Clients are allowed to connect with test-B pool (we are back in
 online mode)
 # rbd map test-B/test-rbd  (in my case)
 
 Step 7: Remove the now empty pool test-A
 # ceph osd pool delete test-A test-A --yes-i-really-really-mean-it
 
 
 This worked smoothly. In my first try I actually used more steps, by creating


[ceph-users] Tip of the week: don't use Intel 530 SSD's for journals

2014-11-25 Thread Erik Logtenberg
If you are like me, you have the journals for your OSD's with rotating
media stored separately on an SSD. If you are even more like me, you
happen to use Intel 530 SSD's in some of your hosts. If so, please do
check your S.M.A.R.T. statistics regularly, because these SSD's really
can't cope with Ceph.

Check out the media-wear graphs for the two Intel 530's in my cluster.
As soon as those declining lines get down to 30% or so, they need to be
replaced. That means less than half a year between purchase and
end-of-life :(

Tip of the week, keep an eye on those statistics, don't let a failing
SSD surprise you.
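
A quick way to keep an eye on it by hand (assuming smartmontools is
installed; the attribute name below is what Intel SSDs report, other
vendors use different names):

sudo smartctl -A /dev/sdX | grep -i media_wearout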

Erik.


[ceph-users] How to mount cephfs from fstab

2014-11-24 Thread Erik Logtenberg
Hi,

I would like to mount a cephfs share from fstab, but it doesn't
completely work.

First of all, I followed the documentation [1], which resulted in the
following line in fstab:

ceph-01:6789:/ /mnt/cephfs/ ceph
name=testhost,secretfile=/root/testhost.key,noacl 0 2

Yes, this works when I manually try mount /mnt/cephfs, but it does
give me the following error/warning:

mount: error writing /etc/mtab: Invalid argument

Now, even though this error doesn't influence the mounting itself, it
does prohibit my machine from booting right. Apparently Fedora/systemd
doesn't like this error when going through fstab, so booting is not
possible.

The mtab issue can easily be worked around, by calling mount manually
and using the -n (--no-mtab) argument, like this:

mount -t ceph -n ceph-01:6789:/ /mnt/cephfs/ -o
name=testhost,secretfile=/root/testhost.key,noacl

However, I can't find a way to put that -n option in /etc/fstab itself
(since it's not a -o option). Currently, I have the noauto setting in
fstab, so it doesn't get mounted on boot at all. Then I have to manually
log in and say mount /mnt/cephfs to explicitly mount the share. Far
from ideal.

So, how do my fellow cephfs-users do this?
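
One possible alternative is to skip fstab and let systemd handle it with
a native mount unit; a sketch, untested, and the unit file name has to
match the mount point path. Whether this sidesteps the mtab error would
need testing:

# /etc/systemd/system/mnt-cephfs.mount
[Unit]
Description=CephFS mount
After=network-online.target

[Mount]
What=ceph-01:6789:/
Where=/mnt/cephfs
Type=ceph
Options=name=testhost,secretfile=/root/testhost.key,noacl

[Install]
WantedBy=multi-user.target

followed by "systemctl enable mnt-cephfs.mount".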

Thanks,

Erik.

[1] http://ceph.com/docs/giant/cephfs/fstab/


[ceph-users] How to add/remove/move an MDS?

2014-11-19 Thread Erik Logtenberg
Hi,

I noticed that the docs [1] on adding and removing an MDS are not yet
written...

[1] https://ceph.com/docs/master/rados/deployment/ceph-deploy-mds/

I would like to do exactly that, however. I have an MDS on one machine,
but I'd like a faster machine to take over instead. In fact, it would be
great to make an active/standby configuration (at least as long as
multimaster is not supported). How to do this?

By the way, this is a manually deployed cluster, no ceph-deploy used.
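
For a manually deployed cluster the rough sequence to bring up an extra
MDS (which then acts as a standby for the existing one) looks something
like this; a sketch only, with the id, paths and auth caps as assumptions,
so check the docs for your release:

mkdir -p /var/lib/ceph/mds/ceph-mds2
ceph auth get-or-create mds.mds2 mon 'allow profile mds' osd 'allow rwx' mds 'allow' \
    -o /var/lib/ceph/mds/ceph-mds2/keyring
ceph-mds -i mds2

As long as only one MDS is active, any additional one registers as a
standby and takes over when the active MDS fails; removing an MDS is then
mostly a matter of stopping the daemon and deleting its key with
"ceph auth del mds.<id>".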

Thanks,

Erik.


Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier

2014-11-17 Thread Erik Logtenberg
I think I might be running into the same issue. I'm using Giant though.
A lot of slow writes. My thoughts went to: the OSD's get too much work
to do (commodity hardware), so I'll have to do some performance tuning
to limit parallelism a bit. And indeed, limiting the amount of threads
for different tasks reduced some of the load, but I keep getting slow
writes very often, especially if the load is coming from CephFS (which
is the only thing I use a cache tier for).

To answer your question: no, it's not yet production, and it's not
suited for production currently either.

In my case the slow writes keep stacking up, until OSD's commit suicide,
and then the recovery process adds even further to the load of the
remaining OSD's, causing a chain reaction in which other OSD's also kill
themselves.

Non-optimal performance could in my case be acceptable for
semi-production, but stability is essential. So I hope these issues can
be fixed.

Kind regards,

Erik.


On 17-11-14 17:45, Laurent GUERBY wrote:
 Hi,
 
 Just a follow-up on this issue, we're probably hitting:
 
 http://tracker.ceph.com/issues/9285
 
 We had the issue a few weeks ago with replicated SSD pool in front of
 rotational pool and turned off cache tiering. 
 
 Yesterday we made a new test and activating cache tiering on a single
 erasure pool threw the whole ceph cluster performance to the floor
 (including non cached non erasure coded pools) with frequent slow
 write in the logs. Removing cache tiering was enough to go back to
 normal performance.
 
 I assume no one use cache tiering on 0.80.7 in production clusters?
 
 Sincerely,
 
 Laurent
 
 On Sunday 09 November 2014 at 00:24 +0100, Loic Dachary wrote:

 On 09/11/2014 00:03, Gregory Farnum wrote:
 It's all about the disk accesses. What's the slow part when you dump 
 historic and in-progress ops?

 This is what I see on g1 (6% iowait)

 root@g1:~# ceph daemon osd.0 dump_ops_in_flight
 { num_ops: 0,
   ops: []}

 root@g1:~# ceph daemon osd.0 dump_ops_in_flight
 { num_ops: 1,
   ops: [
 { description: osd_op(client.4407100.0:11030174 
 rb.0.410809.238e1f29.1038 [set-alloc-hint object_size 4194304 
 write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write e15613),
   received_at: 2014-11-09 00:14:17.385256,
   age: 0.538802,
   duration: 0.011955,
   type_data: [
 waiting for sub ops,
 { client: client.4407100,
   tid: 11030174},
 [
 { time: 2014-11-09 00:14:17.385393,
   event: waiting_for_osdmap},
 { time: 2014-11-09 00:14:17.385563,
   event: reached_pg},
 { time: 2014-11-09 00:14:17.385793,
   event: started},
 { time: 2014-11-09 00:14:17.385807,
   event: started},
 { time: 2014-11-09 00:14:17.385875,
   event: waiting for subops from 1,10},
 { time: 2014-11-09 00:14:17.386201,
   event: commit_queued_for_journal_write},
 { time: 2014-11-09 00:14:17.386336,
   event: write_thread_in_journal_buffer},
 { time: 2014-11-09 00:14:17.396293,
   event: journaled_completion_queued},
 { time: 2014-11-09 00:14:17.396332,
   event: op_commit},
 { time: 2014-11-09 00:14:17.396678,
   event: op_applied},
 { time: 2014-11-09 00:14:17.397211,
   event: sub_op_commit_rec}]]}]}

 and it looks ok. When I go to n7 which has 20% iowait, I see a much larger 
 output http://pastebin.com/DPxsaf6z which includes a number of event: 
 waiting_for_osdmap.

 I'm not sure what to make of this and it would certainly be better if n7 had 
 a lower iowait. Also when I ceph -w I see a new pgmap is created every 
 second which is also not a good sign.

 2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 3889 
 B/s rd, 2125 kB/s wr, 237 op/s
 2014-11-09 00:22:48.143412 mon.0 [INF] pgmap v4389614: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1586 
 kB/s wr, 204 op/s
 2014-11-09 00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 
 kB/s wr, 88 op/s
 2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 412 
 kB/s wr, 130 op/s
 2014-11-09 00:22:51.281294 mon.0 [INF] pgmap v4389617: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1195 
 kB/s wr, 167 op/s
 2014-11-09 00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 
 active+clean; 2580 GB data, 

Re: [ceph-users] Cache tiering and cephfs

2014-11-16 Thread Erik Logtenberg
I know that it is possible to run CephFS with a cache tier on the data
pool in Giant, because that's what I do. However when I configured it, I
was on the previous release. When I upgraded to Giant, everything just
kept working.

By the way, when I set it up, I used the following commands:

ceph osd pool create cephfs-data 192 192 erasure
ceph osd pool create cephfs-metadata 192 192 replicated ssd
ceph osd pool create cephfs-data-cache 192 192 replicated ssd
ceph osd pool set cephfs-data-cache crush_ruleset 1
ceph osd pool set cephfs-metadata crush_ruleset 1
ceph osd tier add cephfs-data cephfs-data-cache
ceph osd tier cache-mode cephfs-data-cache writeback
ceph osd tier set-overlay cephfs-data cephfs-data-cache
ceph osd dump
ceph mds newfs 5 6 --yes-i-really-mean-it

So actually I didn't add a cache tier to an existing CephFS, but first
made the pools and added CephFS directly after. In my case, the ssd
pool is ssd-backed (obviously), while the default pool is on rotating
media; the crush_ruleset 1 is meant to place both the cache pool and the
metadata pool on the ssd's.

Erik.


On 11/16/2014 08:01 PM, Scott Laird wrote:
 Is it possible to add a cache tier to cephfs's data pool in giant?
 
 I'm getting a error:
 
 $ ceph osd tier set-overlay data data-cache
 
 Error EBUSY: pool 'data' is in use by CephFS via its tier
 
 
 From what I can see in the code, that comes from
 OSDMonitor::_check_remove_tier; I don't understand why set-overlay needs
 to call _check_remove_tier.  A quick look makes it look like set-overlay
 will always fail once MDS has been set up.  Is this a bug, or am I doing
 something wrong?
 
 
 Scott
 
 
 


Re: [ceph-users] OSD commits suicide

2014-11-15 Thread Erik Logtenberg
Hi,

Thanks for the tip, I applied these configuration settings and it does
lower the load during rebuilding a bit. Are there settings like these
that also tune Ceph down a bit during regular operations? The slow
requests, timeouts and OSD suicides are killing me.
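
For reference, the kind of steady-state knobs that exist alongside the
recovery ones quoted below (a sketch; the values are assumptions to
experiment with, not recommendations):

[osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1
  # steady-state concurrency; defaults differ per release
  osd op threads = 2
  osd disk threads = 1
  filestore op threads = 2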

If I allow the cluster to regain consciousness and stay idle a bit, it
all seems to settle down nicely, but as soon as I apply some load it
immediately starts to overstress and complain like crazy.

I'm also seeing this behaviour: http://tracker.ceph.com/issues/9844
This was reported by Dmitry Smirnov 26 days ago, but the report has no
response yet. Any ideas?

In my experience, OSD's are quite unstable in Giant and very easily
stressed, causing chain effects, further worsening the issues. It would
be nice to know if this is also noticed by other users?

Thanks,

Erik.


On 11/10/2014 08:40 PM, Craig Lewis wrote:
 Have you tuned any of the recovery or backfill parameters?  My ceph.conf
 has:
 [osd]
   osd max backfills = 1
   osd recovery max active = 1
   osd recovery op priority = 1
 
 Still, if it's running for a few hours, then failing, it sounds like
 there might be something else at play.  OSDs use a lot of RAM during
 recovery.  How much RAM and how many OSDs do you have in these nodes? 
 What does memory usage look like after a fresh restart, and what does it
 look like when the problems start?  Even better if you know what it
 looks like 5 minutes before the problems start.
 
 Is there anything interesting in the kernel logs?  OOM killers, or
 memory deadlocks?
 
 
 
 On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg e...@logtenberg.eu wrote:
 
 Hi,
 
 I have some OSD's that keep committing suicide. My cluster has ~1.3M
 misplaced objects, and it can't really recover, because OSD's keep
 failing before recovering finishes. The load on the hosts is quite high,
 but the cluster currently has no other tasks than just the
 backfilling/recovering.
 
 I attached the logfile from a failed OSD. It shows the suicide, the
 recent events and also me starting the OSD again after some time.
 
 It'll keep running for a couple of hours and then fail again, for the
 same reason.
 
 I noticed a lot of timeouts. Apparently ceph stresses the hosts to the
 limit with the recovery tasks, so much that they timeout and can't
 finish that task. I don't understand why. Can I somehow throttle ceph a
 bit so that it doesn't keep overrunning itself? I kinda feel like it
 should chill out a bit and simply recover one step at a time instead of
 full force and then fail.
 
 Thanks,
 
 Erik.
 


[ceph-users] OSD always tries to remove non-existent btrfs snapshot

2014-11-15 Thread Erik Logtenberg
Hi,

Every time I start any OSD, it always logs that it tried to remove two
btrfs snapshots but failed:

2014-11-15 22:31:08.251600 7f1730f71700 -1
filestore(/var/lib/ceph/osd/ceph-5) unable to destroy snap
'snap_3020746' got (2) No such file or directory

2014-11-15 22:31:09.661161 7f1730f71700 -1
filestore(/var/lib/ceph/osd/ceph-5) unable to destroy snap
'snap_3020758' got (2) No such file or directory

These three snapshots do exist:
drwxr-xr-x. 1 root root 3126 15 nov 22:16 snap_3020728
drwxr-xr-x. 1 root root 3126 15 nov 22:16 snap_3020736
drwxr-xr-x. 1 root root 3126 15 nov 22:16 snap_3020746

So the first one it tries to remove does exist but somehow removing it
fails.

The second one it wants to remove does not exist, so of course that
fails too.

And the two remaining snapshots that do exist are not mentioned in the
log file.
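
To compare what the OSD complains about with what btrfs actually has,
listing the subvolumes is the quickest check (a sketch; the OSD path is
the one from the log above):

btrfs subvolume list /var/lib/ceph/osd/ceph-5 | grep snap_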

The OSD always continues to boot successfully, so it doesn't appear to
hurt. Nonetheless, it doesn't instill much confidence in its internal
administration either... ;)

Anyone else seeing this?

Thanks,

Erik.


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-12 Thread Erik Logtenberg
I have no experience with the DELL SAS controller, but usually the
advantage of using a simple controller (instead of a RAID card) is that
you can use full SMART directly.

$ sudo smartctl -a /dev/sda

=== START OF INFORMATION SECTION ===
Device Model: INTEL SSDSA2BW300G3H
Serial Number:PEPR2381003E300EGN

Personally, I make sure that I know which serial number drive is in
which bay, so I can easily tell which drive I'm talking about.

So you can use SMART both to notice (pre)failing disks -and- to
physically identify them.

The same smartctl command also returns the health status like so:

233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age   Always       -       0

This specific SSD has 99% media lifetime left, so it's in the green. But
it will continue to gradually degrade, and at some point it'll hit a
percentage where I'd like to replace it. To keep an eye on the speed of
decay, I'm graphing those SMART values in Cacti. That way I can somewhat
predict how long a disk will last, especially SSD's which die very
gradually.

Erik.


On 12-11-14 14:43, JF Le Fillâtre wrote:
 
 Hi,
 
 May or may not work depending on your JBOD and the way it's identified
 and set up by the LSI card and the kernel:
 
 cat /sys/block/sdX/../../../../sas_device/end_device-*/bay_identifier
 
 The weird path and the wildcards are due to the way the sysfs is set up.
 
 That works with a Dell R520, 6GB HBA SAS cards and Dell MD1200s, running
 CentOS release 6.5.
 
 Note that you can make your life easier by writing an udev script that
 will create a symlink with a sane identifier for each of your external
 disks. If you match along the lines of
 
 KERNEL=="sd*[a-z]", KERNELS=="end_device-*:*:*"
 
 then you'll just have to cat /sys/class/sas_device/${1}/bay_identifier
 in a script (with $1 being the $id of udev after that match, so the
 string end_device-X:Y:Z) to obtain the bay ID.
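 
 Something along these lines, as a sketch (rule file name, script path
 and symlink naming are assumptions):
 
 # /etc/udev/rules.d/60-persistent-bay.rules
 KERNEL=="sd*[a-z]", KERNELS=="end_device-*:*:*", \
     PROGRAM="/usr/local/bin/bay-id.sh $id", SYMLINK+="disk/by-bay/bay%c"
 
 # /usr/local/bin/bay-id.sh
 #!/bin/sh
 # $1 is the end_device-X:Y:Z name matched by KERNELS above
 cat "/sys/class/sas_device/$1/bay_identifier"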
 
 Thanks,
 JF
 
 
 
 On 12/11/14 14:05, SCHAER Frederic wrote:
 Hi,

  

 I’m used to RAID software giving me the failing disks & slots, and most
 often blinking the disks on the disk bays.

 I recently installed a  DELL “6GB HBA SAS” JBOD card, said to be an LSI
 2008 one, and I now have to identify 3 pre-failed disks (so says
 S.M.A.R.T) .

  

 Since this is an LSI, I thought I’d use MegaCli to identify the disks
 slot, but MegaCli does not see the HBA card.

 Then I found the LSI “sas2ircu” utility, but again, this one fails at
 giving me the disk slots (it finds the disks, serials and others, but
 slot is always 0)

 Because of this, I’m going to head over to the disk bay and unplug the
 disk which I think corresponds to the alphabetical order in linux, and
 see if it’s the correct one…. But even if this is correct this time, it
 might not be next time.

  

 But this makes me wonder : how do you guys, Ceph users, manage your
 disks if you really have JBOD servers ?

 I can’t imagine having to guess slots that each time, and I can’t
 imagine neither creating serial number stickers for every single disk I
 could have to manage …

 Is there any specific advice reguarding JBOD cards people should (not)
 use in their systems ?

 Any magical way to “blink” a drive in linux ?

  

 Thanks & regards





Re: [ceph-users] E-Mail netiquette

2014-11-09 Thread Erik Logtenberg
Oops, my apologies if the 3MB logfile that I sent to this list
yesterday was annoying to anybody. I didn't realize that the
combination of low bandwidth / high mobile tariffs and an email client
that automatically downloads all attachments was still a thing.
Apparently it is.
Next time I'll upload a large-ish attachment somewhere else and put a
link in the mail.

Thanks,

Erik.


On 11/09/2014 09:04 PM, Manfred Hollstein wrote:
 Hi there,
 
 as we've recently seen some pretty bad examples, is it possible to agree
 on some topics for posting on a list reaching quite a few subscribers?
 
 Logs should be uploaded to some server; if you still believe they must
 be sent to the list, compress them first. Some people are on mobile
 tariffs where the monthly quota is quite pressing, and uncompressed logs
 are very unlikely to help with that...
 
 While we're at it, I know HTML e-mails to be mostly preferred by
 managers, in fact they don't really add any real benefit on a mostly
 technical mailing list, hence text-only format should work.
 
 The level of detail and information is really great on this list, so I'd
 really appreciate if some people here hold on and think about something
 before hitting the Send button ;)
 
 I hope you don't feel offended, that clearly wasn't my intention!
 
 Keep up the good work.
 
 Cheers.
 
 l8er
 manfred
 
 
 


[ceph-users] MDS slow, logging rdlock failures

2014-11-07 Thread Erik Logtenberg
Hi,

My MDS is very slow, and it logs stuff like this:

2014-11-07 23:38:41.154939 7f8180a31700  0 log_channel(default) log
[WRN] : 2 slow requests, 1 included below; oldest blocked for 
187.777061 secs
2014-11-07 23:38:41.154956 7f8180a31700  0 log_channel(default) log
[WRN] : slow request 121.322570 seconds old, received at 2014-11-07
23:36:39.832336: client_request(client.7071:115 getattr pAsLsXsFs
#102bdbe 2014-11-07 23:36:39.00) currently failed to rdlock, waiting

Any idea what that rdlock is and what might cause it to fail?

This is on Giant by the way.
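
If your build exposes the op tracker on the MDS admin socket (I'm not
sure in which release that landed, so treat this as an assumption), it
shows which requests are blocked and on what:

ceph daemon mds.<id> dump_ops_in_flight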

Thanks,

Erik.


[ceph-users] Bug in Fedora package ceph-0.87-1

2014-11-05 Thread Erik Logtenberg
Hi,

There is a small bug in the Fedora package for ceph-0.87. Two days ago,
Boris Ranto built the first 0.87 package, for Fedora 22 (rawhide) [1].

[1] http://koji.fedoraproject.org/koji/buildinfo?buildID=589731

This build was a succes, so I took that package and built it for Fedora
20 (which is the current version), however that failed. The thing is,
apparently in Fedora the rbd-replay-prep binary is not built, but there
is a specific check for Fedora 20, which nontheless tries to include it
in the file list:

%if (0%{?fedora} == 20 || 0%{?rhel} == 6)
%{_mandir}/man8/rbd-replay-prep.8*
%{_bindir}/rbd-replay-prep
%endif

I really have no idea what this is supposed to accomplish. That binary
is not built, and that manpage is even explicitly deleted for that reason:

# do not package man page for binary that is not built
rm -f $RPM_BUILD_ROOT%{_mandir}/man8/rbd-replay-prep.8*

So it is quite obvious that in Fedora those two files do not exist. In
all other versions of Fedora than 20 this is no problem, but
specifically in Fedora 20 they are explicitly included in the file list,
which causes packaging to fail (obviously).

The easiest way to fix this is simply remove those four lines from %if
to %endif, so that this package behaves the same for Fedora 20 as for
all other versions of Fedora.

I haven't tested a build for rhel version 6, but that man page gets
deleted anyway on all rpm platforms, so this must fail for rhel 6 too.

Thanks,

Erik.


[ceph-users] Negative amount of objects degraded

2014-10-30 Thread Erik Logtenberg
Hi,

Yesterday I removed two OSD's, to replace them with new disks. Ceph was
not able to completely reach all active+clean state, but some degraded
objects remain. However, the amount of degraded objects is negative
(-82), see below:

2014-10-30 13:31:32.862083 mon.0 [INF] pgmap v209175: 768 pgs: 761
active+clean, 7 active+remapped; 1644 GB data, 2524 GB used, 17210 GB /
19755 GB avail; 2799 B/s wr, 1 op/s; -82/1439391 objects degraded (-0.006%)

According to rados df, the -82 degraded objects are part of the
cephfs-data-cache pool, which is an SSD-backed replicated pool, that
functions as a cache pool for an HDD-backed erasure coded pool for cephfs.

The cache should be empty, because I isseud rados
cache-flush-evict-all-command, and rados -p cephfs-data-cache ls
indeed shows zero objects in this pool.

rados df however does show 192 objects for this pool, with just 35KB
used and -82 degraded:

pool name          category    KB  objects  clones  degraded  unfound    rd   rd KB       wr       wr KB
cephfs-data-cache  -           35      192       0       -82        0  1119  348800  1198371  1703673493

Please advice...

Thanks,

Erik.


Re: [ceph-users] Negative amount of objects degraded

2014-10-30 Thread Erik Logtenberg
 Yesterday I removed two OSD's, to replace them with new disks. Ceph was
 not able to completely reach all active+clean state, but some degraded
 objects remain. However, the amount of degraded objects is negative
 (-82), see below:

 
 So why didn't it reach that state?

Well, I dunno, I was hoping this list would know why? I simply sat there
waiting for the process to complete and it didn't.

 Could you query those PGs and see why they are remapped?
 
 $ ceph pg pg id query

I queried one of the PG's, see below for the output. Can you tell why
they are remapped?

# ceph pg 5.af query
{ state: active+remapped,
  epoch: 1105,
  up: [
14,
11],
  acting: [
14,
11,
12],
  actingbackfill: [
11,
12,
14],
  info: { pgid: 5.af,
  last_update: 533'1976,
  last_complete: 533'1976,
  log_tail: 0'0,
  last_user_version: 1976,
  last_backfill: MAX,
  purged_snaps: [],
  history: { epoch_created: 197,
  last_epoch_started: 772,
  last_epoch_clean: 772,
  last_epoch_split: 0,
  same_up_since: 721,
  same_interval_since: 771,
  same_primary_since: 721,
  last_scrub: 533'1976,
  last_scrub_stamp: 2014-10-29 00:09:56.095703,
  last_deep_scrub: 533'1976,
  last_deep_scrub_stamp: 2014-10-27 00:09:48.770622,
  last_clean_scrub_stamp: 2014-10-29 00:09:56.095703},
  stats: { version: 533'1976,
  reported_seq: 2846,
  reported_epoch: 1105,
  state: active+remapped,
  last_fresh: 2014-10-30 01:55:27.177249,
  last_change: 2014-10-29 23:07:40.579020,
  last_active: 2014-10-30 01:55:27.177249,
  last_clean: 2014-10-26 21:49:13.064622,
  last_became_active: 0.00,
  last_unstale: 2014-10-30 01:55:27.177249,
  mapping_epoch: 766,
  log_start: 0'0,
  ondisk_log_start: 0'0,
  created: 197,
  last_epoch_clean: 772,
  parent: 0.0,
  parent_split_bits: 0,
  last_scrub: 533'1976,
  last_scrub_stamp: 2014-10-29 00:09:56.095703,
  last_deep_scrub: 533'1976,
  last_deep_scrub_stamp: 2014-10-27 00:09:48.770622,
  last_clean_scrub_stamp: 2014-10-29 00:09:56.095703,
  log_size: 1976,
  ondisk_log_size: 1976,
  stats_invalid: 0,
  stat_sum: { num_bytes: 4194304,
  num_objects: 13,
  num_object_clones: 0,
  num_object_copies: 39,
  num_objects_missing_on_primary: 0,
  num_objects_degraded: 0,
  num_objects_unfound: 0,
  num_objects_dirty: 13,
  num_whiteouts: 0,
  num_read: 36,
  num_read_kb: 40,
  num_write: 2047,
  num_write_kb: 17259,
  num_scrub_errors: 0,
  num_shallow_scrub_errors: 0,
  num_deep_scrub_errors: 0,
  num_objects_recovered: 26,
  num_bytes_recovered: 8388608,
  num_keys_recovered: 222,
  num_objects_omap: 12,
  num_objects_hit_set_archive: 0},
  stat_cat_sum: {},
  up: [
14,
11],
  acting: [
14,
11,
12],
  up_primary: 14,
  acting_primary: 14},
  empty: 0,
  dne: 0,
  incomplete: 0,
  last_epoch_started: 772,
  hit_set_history: { current_last_update: 0'0,
  current_last_stamp: 0.00,
  current_info: { begin: 0.00,
  end: 0.00,
  version: 0'0},
  history: []}},
  peer_info: [
{ peer: 11,
  pgid: 5.af,
  last_update: 533'1976,
  last_complete: 533'1976,
  log_tail: 0'0,
  last_user_version: 1976,
  last_backfill: MAX,
  purged_snaps: [],
  history: { epoch_created: 197,
  last_epoch_started: 772,
  last_epoch_clean: 772,
  last_epoch_split: 0,
  same_up_since: 721,
  same_interval_since: 771,
  same_primary_since: 721,
  last_scrub: 533'1976,
  last_scrub_stamp: 2014-10-29 00:09:56.095703,
  last_deep_scrub: 533'1976,
  last_deep_scrub_stamp: 2014-10-27 00:09:48.770622,
  last_clean_scrub_stamp: 2014-10-29 00:09:56.095703},
  stats: { version: 533'1976,
  reported_seq: 2430,
  reported_epoch: 723,
  state: remapped+peering,
  last_fresh: 2014-10-29 23:03:18.847590,
  last_change: 2014-10-29 23:03:17.673820,
  last_active: 2014-10-29 22:41:29.551558,
  last_clean: 2014-10-26 21:49:13.064622,
  last_became_active: 0.00,
  last_unstale: 2014-10-29 23:03:18.847590,
  mapping_epoch: 766,

Re: [ceph-users] Negative amount of objects degraded

2014-10-30 Thread Erik Logtenberg
Thanks for pointing that out. Unfortunately, those tickets contain only
a description of the problem, but no solution or workaround. One was
opened 8 months ago and the other more than a year ago. No love since.

Is there any way I can get my cluster back in a healthy state?

Thanks,

Erik.


On 10/30/2014 05:13 PM, John Spray wrote:
 There are a couple of open tickets about bogus (negative) stats on PGs:
 http://tracker.ceph.com/issues/5884
 http://tracker.ceph.com/issues/7737
 
 Cheers,
 John
 
 On Thu, Oct 30, 2014 at 12:38 PM, Erik Logtenberg e...@logtenberg.eu wrote:
 Hi,

 Yesterday I removed two OSD's, to replace them with new disks. Ceph was
 not able to completely reach all active+clean state, but some degraded
 objects remain. However, the amount of degraded objects is negative
 (-82), see below:

 2014-10-30 13:31:32.862083 mon.0 [INF] pgmap v209175: 768 pgs: 761
 active+clean, 7 active+remapped; 1644 GB data, 2524 GB used, 17210 GB /
 19755 GB avail; 2799 B/s wr, 1 op/s; -82/1439391 objects degraded (-0.006%)

 According to rados df, the -82 degraded objects are part of the
 cephfs-data-cache pool, which is an SSD-backed replicated pool, that
 functions as a cache pool for an HDD-backed erasure coded pool for cephfs.

 The cache should be empty, because I isseud rados
 cache-flush-evict-all-command, and rados -p cephfs-data-cache ls
 indeed shows zero objects in this pool.

 rados df however does show 192 objects for this pool, with just 35KB
 used and -82 degraded:

 pool name          category    KB  objects  clones  degraded  unfound    rd   rd KB       wr       wr KB
 cephfs-data-cache  -           35      192       0       -82        0  1119  348800  1198371  1703673493

 Please advice...

 Thanks,

 Erik.


Re: [ceph-users] Ceph RPM spec removes /etc/ceph

2014-10-22 Thread Erik Logtenberg
I would like to add that removing log files (/var/log/ceph is also
removed on uninstall) is also a bad thing.

My suggestion would be to simply drop the whole %postun trigger, since
it does only these two very questionable things.

Thanks,

Erik.


On 10/22/2014 09:16 PM, Dmitry Borodaenko wrote:
 Current version of RPM spec for Ceph removes the whole /etc/ceph
 directory on uninstall:
 https://github.com/ceph/ceph/blob/master/ceph.spec.in#L557-L562
 
 I don't think contents of /etc/ceph is disposable and should be
 silently discarded like that.
 


Re: [ceph-users] Reweight a host

2014-10-20 Thread Erik Logtenberg
I don't think so, check this out:

# id    weight  type name               up/down reweight
-6      3.05    root ssd
-7      0.04999     host ceph-01-ssd
11      0.04999         osd.11  up      1
-8      1           host ceph-02-ssd
12      0.04999         osd.12  up      1
-9      1           host ceph-03-ssd
13      0.03999         osd.13  up      1
-10     1           host ceph-04-ssd
14      0.03999         osd.14  up      1

As you can see, only host ceph-01-ssd has the same weight as its osd,
the other three hosts have weight 1 which is different from their
associated osd.

If the weight of the host -should- be the sum of all osd weights on this
host, then my question becomes: how do I make that so for the three
hosts where this is currently not the case?
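
One way that works is the decompile/edit/recompile cycle (a sketch; file
names are arbitrary):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit the 'item ceph-02-ssd weight ...' lines inside the relevant root bucket
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new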

Thanks,

Erik.


On 20-10-14 03:55, Lei Dong wrote:
 According to my understanding, the weight of a host is the sum of all osd
 weights on this host. So you just reweight any osd on this host, the
 weight of this host is reweighed.
 
 Thanks
 LeiDong
 
 On 10/20/14, 7:11 AM, Erik Logtenberg e...@logtenberg.eu wrote:
 
 Hi,

 Simple question: how do I reweight a host in crushmap?

 I can use ceph osd crush reweight to reweight an osd, but I would like
 to change the weight of a host instead.

 I tried exporting the crushmap, but I noticed that the weights of all
 hosts are commented out, like so:

# weight 5.460

 And they are not the same values as seen in ceph osd tree.

 So how do I keep everything as it currently it, but simply change one
 single weight of one single host?

 Thanks,

 Erik.


[ceph-users] Reweight a host

2014-10-19 Thread Erik Logtenberg
Hi,

Simple question: how do I reweight a host in crushmap?

I can use ceph osd crush reweight to reweight an osd, but I would like
to change the weight of a host instead.

I tried exporting the crushmap, but I noticed that the weights of all
hosts are commented out, like so:

# weight 5.460

And they are not the same values as seen in ceph osd tree.

So how do I keep everything as it currently it, but simply change one
single weight of one single host?

Thanks,

Erik.


[ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Erik Logtenberg
Hi,

With EC pools in Ceph you are free to choose any K and M parameters you
like. The documentation explains what K and M do, so far so good.

Now, there are certain combinations of K and M that appear to have more
or less the same result. Do any of these combinations have pro's and
con's that I should consider and/or are there best practices for
choosing the right K/M-parameters?

For instance, if I choose K = 3 and M = 2, then pg's in this pool will
use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
this configuration.

Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not
so much different from the first configuration. Also there is the same
40% overhead.

One rather obvious difference between the two configurations is that the
latter requires a cluster with at least 10 OSD's to make sense. But
let's say we have such a cluster, which of the two configurations would
be recommended, and why?
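
For reference, the parameters in question are set when the erasure code
profile and pool are created, e.g. something like this (profile and pool
names, pg counts and the failure domain are placeholders):

ceph osd erasure-code-profile set ec32profile k=3 m=2 ruleset-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec32profile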

Thanks,

Erik.


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Erik Logtenberg
 Now, there are certain combinations of K and M that appear to have more
 or less the same result. Do any of these combinations have pro's and
 con's that I should consider and/or are there best practices for
 choosing the right K/M-parameters?


 Loic might have a better answer, but I think that the more segments (K)
 you have, the heavier recovery. You have to contact more OSDs to
 reconstruct the whole object so that involves more disks doing seeks.

 I heard somebody from Fujitsu say that he thought 8/3 was best for most
 situations. That wasn't with Ceph though, but with a different system
 which implemented Erasure Coding.
 
 Performance is definitely lower with more segments in Ceph.  I kind of
 gravitate toward 4/2 or 6/2, though that's just my own preference.

This is indeed the kind of pro's and con's I was thinking about.
Performance-wise, I would expect differences, but I can think of both
positive and negative effects of bigger values for K.

For instance, yes recovery takes more OSD's with bigger values of K, but
it seems to me that there are also fewer or smaller items to recover.
Also read-performance generally appears to benefit from having a bigger
cluster (more parallellism), so I can imagine that bigger values of K
also provide an increase in read-performance.

Mark says more segments hurt performance though; are you referring just
to rebuild performance or also to basic operational performance (read/write)?

 For instance, if I choose K = 3 and M = 2, then pg's in this pool will
 use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
 this configuration.

 Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
 use 10 OSD's and sustain the loss of 4 OSD's, which is statistically
 not
 so much different from the first configuration. Also there is the same
 40% overhead.

 Although I don't have numbers in mind, I think the odds of losing two
 OSDs simultaneously are a lot smaller than the odds of losing four OSDs
 simultaneously. Or am I misunderstanding you when you write
 statistically not so much different from the first configuration ?


 Losing two smaller than losing four? Is that correct or did you mean
 it the other way around?

 I'd say that losing four OSDs simultaneously is less likely to happen
 than two simultaneously.
 
 This is true, though the more disks you spread your objects across, the
 higher likelihood that any given object will be affected by a lost OSD.
  The extreme case being that every object is spread across every OSD and
 losing any given OSD affects all objects.  I suppose the severity
 depends on the relative fraction of your erasure coding parameters
 relative to the total number of OSDs.  I think this is perhaps what Erik
 was getting at.

I haven't done the actual calculations, but given some % chance of disk
failure, I would assume that losing x out of y disks has roughly the
same chance as losing 2*x out of 2*y disks over the same period.

That's also why you generally want to limit RAID5 arrays to maybe 6
disks or so and move to RAID6 for bigger arrays. For arrays bigger than
20 disks you would usually split those into separate arrays, just to
keep the (parity disks / total disks) fraction high enough.

With regard to data safety I would guess that 3+2 and 6+4 are roughly
equal, although the behaviour of 6+4 is probably easier to predict
because bigger numbers make your calculations less dependent on
individual deviations in reliability.

Do you guys feel this argument is valid?
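
To make the comparison a bit more concrete, here is a toy calculation. It
assumes independent failures with some made-up probability p per OSD during
the repair window, and it ignores correlated failures as well as the longer
repair of wider stripes, so take it with a grain of salt:

#!/usr/bin/python
# Probability of data loss: more than M of the K+M chunks failing before
# the repair completes, under the independence assumption above.
from math import factorial

def choose(n, r):
    return factorial(n) / (factorial(r) * factorial(n - r))

def p_loss(k, m, p):
    n = k + m
    return sum(choose(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(m + 1, n + 1))

p = 0.001  # made-up per-OSD failure probability within one repair window
for k, m in [(3, 2), (6, 4)]:
    print "K=%d M=%d: P(data loss) ~ %.3g" % (k, m, p_loss(k, m, p))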

Erik.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Erik Logtenberg

 I haven't done the actual calculations, but given some % chance of disk
 failure, I would assume that losing x out of y disks has roughly the
 same chance as losing 2*x out of 2*y disks over the same period.

 That's also why you generally want to limit RAID5 arrays to maybe 6
 disks or so and move to RAID6 for bigger arrays. For arrays bigger than
 20 disks you would usually split those into separate arrays, just to
 keep the (parity disks / total disks) fraction high enough.

 With regard to data safety I would guess that 3+2 and 6+4 are roughly
 equal, although the behaviour of 6+4 is probably easier to predict
 because bigger numbers makes your calculations less dependent on
 individual deviations in reliability.

 Do you guys feel this argument is valid?
 
 Here is how I reason about it, roughly:
 
 If the probability of losing a disk is 0.1%, the probability of losing two 
 disks simultaneously (i.e. before the failure can be recovered) would be 
 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
 becomes 0.0001% 
 
 Accurately calculating the reliability of the system as a whole is a lot more 
 complex (see 
 https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/ 
 for more information).
 
 Cheers

Okay, I see that in your calculation, you leave the total amount of
disks completely out of the equation. The link you provided is very
useful indeed and does some actual calculations. Interestingly, the
example in the details page [1] uses k=32 and m=32 for a total of 64 blocks.
Those are much bigger values than Mark Nelson mentioned earlier. Is
that example merely meant to demonstrate the theoretical advantages, or
would you actually recommend using those numbers in practice?
Let's assume that we have at least 64 OSD's available, would you
recommend k=32 and m=32?

[1]
https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/Technical_details_on_the_model

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fw: external monitoring tools for processes

2014-08-11 Thread Erik Logtenberg
Hi,

Be sure to check this out:

http://ceph.com/community/ceph-calamari-goes-open-source/

Erik.


On 11-08-14 08:50, Irek Fasikhov wrote:
 Hi.
 
 I use ZABBIX with the following script:
 [ceph@ceph08 ~]$ cat /etc/zabbix/external/ceph 
 #!/usr/bin/python
 
 import sys
 import os
 import commands
 import json
 import datetime
 import time
 
 # Check arguments: without any arguments there is nothing to do.
 if len(sys.argv) == 1:
     print "You will need arguments!"
     sys.exit(1)
 
 # Build the Zabbix low-level discovery JSON, e.g. {"data":[{"{#OSD}":"12"},...]}
 def generate(data, type):
     JSON = "{\"data\":["
     for js in range(len(data)):
         JSON += "{\"{#" + type + "}\":\"" + data[js] + "\"},"
     return JSON[:-1] + "]}"
 
 if sys.argv[1] == "osd":
     if len(sys.argv) == 2:
         # Discovery mode: derive the local OSD ids from the mounted OSD filesystems.
         splits = commands.getoutput('df | grep osd | awk {\'print $6\'} | sed \'s/[^0-9]//g\' | sed \':a;N;$!ba;s/\\n/,/g\'').split(",")
         print generate(splits, "OSD")
     else:
         ID = sys.argv[2]
         LEVEL = sys.argv[3]
         PERF = sys.argv[4]
         CACHEFILE = "/tmp/zabbix.ceph.osd" + ID + ".cache"
         CACHETTL = 5
 
         TIME = int(round(float(datetime.datetime.now().strftime("%s"))))
 
         ## CACHE FOR OPTIMIZATION PERFORMANCE #
         if os.path.isfile(CACHEFILE):
             CACHETIME = int(round(os.stat(CACHEFILE).st_mtime))
         else:
             CACHETIME = 0
         if TIME - CACHETIME > CACHETTL:
             if os.system('sudo ceph --admin-daemon /var/run/ceph/ceph-osd.' + ID + '.asok perfcounters_dump > ' + CACHEFILE) > 0:
                 sys.exit(1)
 
         json_data = open(CACHEFILE)
         data = json.load(json_data)
         json_data.close()
         ## PARSING
         if LEVEL in data:
             if PERF in data[LEVEL]:
                 try:
                     # Latency-style counters are {"sum": ..., "avgcount": ...}; plain counters are numbers.
                     key = data[LEVEL][PERF].has_key("sum")
                     print (data[LEVEL][PERF]["sum"]) / (data[LEVEL][PERF]["avgcount"])
                 except AttributeError:
                     print data[LEVEL][PERF]
 
 and zabbix templates:
 https://dl.dropboxusercontent.com/u/575018/zbx_export_templates.xml
 
 
 
 2014-08-11 7:42 GMT+04:00 pragya jain prag_2...@yahoo.co.in:
 
 please somebody reply my question.
 
 
 On Saturday, 9 August 2014 3:34 PM, pragya jain prag_2...@yahoo.co.in wrote:
 
 
 
 hi all,
 
 can somebody suggest me some external monitoring tools which can
 monitor whether the processes in ceph, such as, heartbeating,
 data scrubbing, authentication, backfilling, recovering etc. are
 working properly or not.
 
 Regards
 Pragya Jain
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 
 -- 
 Best regards, Irek Fasikhov (Фасихов Ирек Нургаязович)
 Mob.: +79229045757
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph.com centos7 repository ?

2014-07-10 Thread Erik Logtenberg
Hi,

RHEL7 repository works just as well. CentOS 7 is effectively a copy of
RHEL7 anyway. Packages for CentOS 7 wouldn't actually be any different.

Erik.

On 07/10/2014 06:14 AM, Alexandre DERUMIER wrote:
 Hi,
 
 I would like to know if a centos7 repository will be available soon?
 
 Or can I use current rhel7 for the moment ?
 
 http://ceph.com/rpm-firefly/rhel7/x86_64/
 
 
 
 Cheers,
 
 Alexandre
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Temporary degradation when adding OSD's

2014-07-10 Thread Erik Logtenberg

 Yeah, Ceph will never voluntarily reduce the redundancy. I believe
 splitting the degraded state into separate wrongly placed and
 degraded (reduced redundancy) states is currently on the menu for
 the Giant release, but it's not been done yet.

That would greatly improve the accuracy of ceph's status reports.

Does ceph currently know about the difference of these states well
enough to be smart with prioritizing? Specifically, if I add an OSD and
ceph starts moving data around, but during that time another OSD fails;
is ceph smart enough to quickly prioritize reduplicating the lost copies
before continuing to move data around (that was still perfectly duplicated)?
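
In the meantime, this is roughly how I keep an eye on what is actually
happening during such a rebalance. It is only a sketch: it assumes the JSON
output of "ceph pg dump" contains a pg_stats list with a "state" string per
PG, and the exact state names obviously differ per release:

#!/usr/bin/python
# Count placement groups per state and split "just being moved" from
# "actually degraded".
import json
import subprocess
from collections import Counter

dump = json.loads(subprocess.check_output(
    ["ceph", "pg", "dump", "--format", "json"]))

states = Counter(pg["state"] for pg in dump["pg_stats"])
for state, count in states.most_common():
    print "%6d  %s" % (count, state)

degraded = sum(c for s, c in states.items() if "degraded" in s)
moving = sum(c for s, c in states.items()
             if ("remapped" in s or "backfill" in s) and "degraded" not in s)
print "degraded:", degraded, "- moving but still fully replicated:", moving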

Thanks,

Erik.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Temporary degradation when adding OSD's

2014-07-07 Thread Erik Logtenberg
Hi,

If you add an OSD to an existing cluster, ceph will move some existing
data around so the new OSD gets its respective share of usage right away.

Now I noticed that during this moving around, ceph reports the relevant
PG's as degraded. I can more or less understand the logic here: if a
piece of data is supposed to be in a certain place (the new OSD), but it
is not yet there, it's degraded.

However I would hope that the movement of data is executed in such a way
that first a new copy is made on the new OSD and only after successfully
doing that, one of the existing copies is removed. If so, there is never
actually any degradation of that PG.

More to the point, if I have a PG replicated over three OSD's: 1, 2 and
3; now I add an OSD 4, and ceph decides to move the copy of OSD 3 to the
new OSD 4; if it turns out that ceph can't read the copies on OSD 1 and
2 due to some disk error, I would assume that ceph would still use the
copy that exists on OSD 3 to populate the copy on OSD 4. Is that indeed
the case?


I have a very similar question about removing an OSD. You can tell ceph
to mark an OSD as out before physically removing it. The OSD is still
up but ceph will no longer assign PG's to it, and will make new copies
of the PG's that are on this OSD to other OSD's.
Now again ceph will report degradation, even though the out OSD is
still up, so the existing copies are not actually lost. Does ceph use
the OSD that is marked out as a source for making the new copies on
other OSD's?
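
To check that last point myself I use something like the sketch below: after
marking an OSD out, it lists how many PGs still have that OSD in their
acting set even though it is no longer in the up set, i.e. PGs that should
still be able to read from it while the new copies are being made. It
assumes the pg_stats entries of "ceph pg dump --format json" carry "up" and
"acting" lists:

#!/usr/bin/python
import json
import subprocess
import sys

osd = int(sys.argv[1])  # id of the OSD that was marked out

dump = json.loads(subprocess.check_output(
    ["ceph", "pg", "dump", "--format", "json"]))

still_acting = [pg["pgid"] for pg in dump["pg_stats"]
                if osd in pg["acting"] and osd not in pg["up"]]
print "PGs still acting on osd.%d while mapped away from it: %d" % (
    osd, len(still_acting))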

Thanks,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about placing different pools on different osds

2014-07-06 Thread Erik Logtenberg
Hi,

I have some osd's on hdd's and some on ssd's, just like the example in
these docs:

http://ceph.com/docs/firefly/rados/operations/crush-map/

Now I'd like to place an erasure encoded pool on the hdd's and a
replicated (cache) pool on the ssd's. In order to do that, I have to
split the crush maps into two roots, according to the docs.

The way the docs describe it, I have to create two roots, and then
separate server-nodes for ssd's and hdd's, then the osd's under the
right server node.
My hdd's and ssd's are mixed within hosts though: I have four physical
hosts each with a couple of hdd's and an ssd. So there is not a
hdd-server and a ssd-server, like in the docs. Or do I have to create
two server-nodes per host? It appears to me that not all rules will
still work the same way in that case.

Regards,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Permissions spontaneously changing in cephfs

2014-07-01 Thread Erik Logtenberg
Hi Zheng,

Yes, it was mounted implicitly with acl's enabled. I disabled it by
adding noacl to the mount command, and now the behaviour is correct!
No more changing permissions.

So it appears to be related to acl's indeed, even though I didn't
actually set any acl's. Simply mounting with acl's enabled was enough to
cause the issue apparently.

So, do you have enough information to possibly fix it, or is there any
way that I can provide additional information?

Thanks,

Erik.


On 06/30/2014 05:13 AM, Yan, Zheng wrote:
 On Mon, Jun 30, 2014 at 4:25 AM, Erik Logtenberg e...@logtenberg.eu wrote:
 Hi Zheng,

 Okay, so on host1 I did:

 # echo "module ceph +p" > /sys/kernel/debug/dynamic_debug/control
 # mkdir hoi
 mkdir: cannot create directory 'hoi': File exists
 # mkdir hoi2
 # ls -al
 drwxr-xr-x  1 root root 0 29 Jun 22:12 hoi
 drwxr-xr-x  1 root root 0 29 Jun 22:16 hoi2
 # dmesg > /host1.log
 
 Did you have Posix ACL enabled? A bug in Posix ACL support code can
 cause this issue.
 
 Regards
 Yan, Zheng
 
 

 On host2 I did:

 # echo "module ceph +p" > /sys/kernel/debug/dynamic_debug/control
 # ls -al
 drwxrwxrwx  1 root root 0 29 Jun 22:12 hoi
 drwxrwxrwx  1 root root 0 29 Jun 22:16 hoi2
 # dmesg > /host2.log

 Please find attached both host1.log and host2.log

 Thanks,

 Erik.


 On 06/20/2014 08:04 AM, Yan, Zheng wrote:
 On Fri, Jun 20, 2014 at 6:13 AM, Erik Logtenberg e...@logtenberg.eu wrote:
 Hi Zheng,

 Additionally, I notice that as long as I don't do anything with that
 directory, the permissions stay wrong.

 Previously I noticed that the permissions eventually got right by
 themselves, but I don't know what triggered it.

 Also, the permission problem is not just with the first ever created
 directory, it happens to files too:

 [host1 bla]# touch hoi
 [host1 bla]# ls -al
 -rw-r--r-- 1 root root 0 20 jun 00:05 hoi

 [host2 bla]# ls -al
 -rw-rw-rw- 1 root root 0 20 jun 00:05 hoi

 Notice the additional group and world writable flags. It works the other
 way round too:

 [host2 bla]# touch hoi2
 [host2 bla]# ls -al
 -rw-r--r-- 1 root root 0 20 jun 00:09 hoi2

 [host1 bla]# ls -al
 -rw-rw-rw- 1 root root 0 20 jun 00:09 hoi2

 However now after a couple of seconds I re-check on host2, and the
 permissons have changed there as well:

 [host2 bla]# ls -al
 -rw-rw-rw- 1 root root 0 20 jun 00:09 hoi2

 So now it's group and world writable on both hosts.

 I can't reproduce this locally. Please enable dynamic debugging for
 ceph (echo "module ceph +p" > /sys/kernel/debug/dynamic_debug/control)
 and send kernel log to me.

 Regards
 Yan, Zheng


 Kind regards,

 Erik.


 On 06/19/2014 11:37 PM, Erik Logtenberg wrote:
 I am using the kernel client.

 kernel: 3.14.4-100.fc19.x86_64
 ceph: ceph-0.80.1-0.fc19.x86_64

 Actually, I seem to be able to reproduce it quite reliably. I just reset
 my cephfs (fiddling with erasure coded pools which was no success), so
 just for kicks tried again with creating a directory. Exactly the same
 results.

 Kind regards,

 Erik.



 On 06/16/2014 02:32 PM, Yan, Zheng wrote:
 were you using ceph-fuse or kernel client? ceph version and kernel
 version? how reliably you can reproduce this problem?

 Regards
 Yan, Zheng

 On Sun, Jun 15, 2014 at 4:42 AM, Erik Logtenberg e...@logtenberg.eu 
 wrote:
 Hi,

 So... I wrote some files into that directory to test performance, and
 now I notice that both hosts see the permissions the right way, like
 they were when I first created the directory.

 What is going on here? ..

 Erik.


 On 06/14/2014 10:32 PM, Erik Logtenberg wrote:
 Hi,

 I ran into a weird issue with cephfs today. I create a directory like 
 this:

 # mkdir bla
 # ls -al
 drwxr-xr-x  1 root root0 14 jun 22:22 bla

 Now on another host, with the same cephfs mounted, I see different
 permissions:

 # ls -al
 drwxrwxrwx 1 root root 0 14 jun 22:22 bla

 Weird, huh?

 Back to host #1, I unmount cephfs and mount it again. Now it sees the
 same (changed) permissions as I saw on the second host:

 # ls -al
 drwxrwxrwx  1 root root0 14 jun 22:22 bla

 So... what happened to the original permissions and why did they 
 change?

 Thanks,

 Erik.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.82 released

2014-06-30 Thread Erik Logtenberg
Ah, okay I missed that.

So, what distributions/versions are supported then? I see that the FC20
part of the ceph repository (http://ceph.com/rpm/fc20/x86_64) doesn't
contain ceph itself, so I am assuming you'd have to use the ceph package
from FC20 itself, however they are still at 0.80.1:

ceph-0.80.1-2.fc20.x86_64.rpm

There are also el6 packages at http://ceph.com/rpm/el6/ but they are at
0.80.1 as well. The same seems to be true for the RHEL7 packages, so I
am a bit at a loss...

Thanks,

Erik.




On 06/30/2014 03:02 PM, Alfredo Deza wrote:
 Erik, I don't think we are building for FC19 anymore.
 
 There are some dependencies that could not be met for Ceph in FC19 so we
 decided to stop trying to get builds out for that.
 
 On Sun, Jun 29, 2014 at 2:52 PM, Erik Logtenberg e...@logtenberg.eu wrote:
 Nice work! When will the new rpm's be released on
 http://ceph.com/rpm/fc19/x86_64/ ?

 Thanks,

 Erik.


 On 06/27/2014 10:55 PM, Sage Weil wrote:
 This is the second post-firefly development release.  It includes a range
 of bug fixes and some usability improvements.  There are some MDS
 debugging and diagnostic tools, an improved 'ceph df', and some OSD
 backend refactoring and cleanup.

 Notable Changes
 ---

 * ceph-brag: add tox tests (Alfredo Deza)
 * common: perfcounters now use atomics and go faster (Sage Weil)
 * doc: CRUSH updates (John Wilkins)
 * doc: osd primary affinity (John Wilkins)
 * doc: pool quotas (John Wilkins)
 * doc: pre-flight doc improvements (Kevin Dalley)
 * doc: switch to an unencumbered font (Ross Turk)
 * doc: update openstack docs (Josh Durgin)
 * fix hppa arch build (Dmitry Smirnov)
 * init-ceph: continue starting other daemons on crush or mount failure
   (#8343, Sage Weil)
 * keyvaluestore: fix hint crash (#8381, Haomai Wang)
 * libcephfs-java: build against older JNI headers (Greg Farnum)
 * librados: fix rados_pool_list bounds checks (Sage Weil)
 * mds: cephfs-journal-tool (John Spray)
 * mds: improve Journaler on-disk format (John Spray)
 * mds, libcephfs: use client timestamp for mtime/ctime (Sage Weil)
 * mds: misc encoding improvements (John Spray)
 * mds: misc fixes for multi-mds (Yan, Zheng)
 * mds: OPTracker integration, dump_ops_in_flight (Greg Farnum)
 * misc cleanup (Christophe Courtaut)
 * mon: fix default replication pool ruleset choice (#8373, John Spray)
 * mon: fix set cache_target_full_ratio (#8440, Geoffrey Hartz)
 * mon: include per-pool 'max avail' in df output (Sage Weil)
 * mon: prevent EC pools from being used with cephfs (Joao Eduardo Luis)
 * mon: restore original weight when auto-marked out OSDs restart (Sage
   Weil)
 * mon: use msg header tid for MMonGetVersionReply (Ilya Dryomov)
 * osd: fix bogus assert during OSD shutdown (Sage Weil)
 * osd: fix clone deletion case (#8334, Sam Just)
 * osd: fix filestore removal corner case (#8332, Sam Just)
 * osd: fix hang waiting for osdmap (#8338, Greg Farnum)
 * osd: fix interval check corner case during peering (#8104, Sam Just)
 * osd: fix journal-less operation (Sage Weil)
 * osd: include backend information in metadata reported to mon (Sage Weil)
 * rest-api: fix help (Ailing Zhang)
 * rgw: check entity permission for put_metadata (#8428, Yehuda Sadeh)

 Getting Ceph
 

 * Git at git://github.com/ceph/ceph.git
 * Tarball at http://ceph.com/download/ceph-0.82.tar.gz
 * For packages, see http://ceph.com/docs/master/install/get-packages
 * For ceph-deploy, see 
 http://ceph.com/docs/master/install/install-ceph-deploy

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.82 released

2014-06-29 Thread Erik Logtenberg
Nice work! When will the new rpm's be released on
http://ceph.com/rpm/fc19/x86_64/ ?

Thanks,

Erik.


On 06/27/2014 10:55 PM, Sage Weil wrote:
 This is the second post-firefly development release.  It includes a range 
 of bug fixes and some usability improvements.  There are some MDS 
 debugging and diagnostic tools, an improved 'ceph df', and some OSD 
 backend refactoring and cleanup.
 
 Notable Changes
 ---
 
 * ceph-brag: add tox tests (Alfredo Deza)
 * common: perfcounters now use atomics and go faster (Sage Weil)
 * doc: CRUSH updates (John Wilkins)
 * doc: osd primary affinity (John Wilkins)
 * doc: pool quotas (John Wilkins)
 * doc: pre-flight doc improvements (Kevin Dalley)
 * doc: switch to an unencumbered font (Ross Turk)
 * doc: update openstack docs (Josh Durgin)
 * fix hppa arch build (Dmitry Smirnov)
 * init-ceph: continue starting other daemons on crush or mount failure 
   (#8343, Sage Weil)
 * keyvaluestore: fix hint crash (#8381, Haomai Wang)
 * libcephfs-java: build against older JNI headers (Greg Farnum)
 * librados: fix rados_pool_list bounds checks (Sage Weil)
 * mds: cephfs-journal-tool (John Spray)
 * mds: improve Journaler on-disk format (John Spray)
 * mds, libcephfs: use client timestamp for mtime/ctime (Sage Weil)
 * mds: misc encoding improvements (John Spray)
 * mds: misc fixes for multi-mds (Yan, Zheng)
 * mds: OPTracker integration, dump_ops_in_flight (Greg Farnum)
 * misc cleanup (Christophe Courtaut)
 * mon: fix default replication pool ruleset choice (#8373, John Spray)
 * mon: fix set cache_target_full_ratio (#8440, Geoffrey Hartz)
 * mon: include per-pool 'max avail' in df output (Sage Weil)
 * mon: prevent EC pools from being used with cephfs (Joao Eduardo Luis)
 * mon: restore original weight when auto-marked out OSDs restart (Sage 
   Weil)
 * mon: use msg header tid for MMonGetVersionReply (Ilya Dryomov)
 * osd: fix bogus assert during OSD shutdown (Sage Weil)
 * osd: fix clone deletion case (#8334, Sam Just)
 * osd: fix filestore removal corner case (#8332, Sam Just)
 * osd: fix hang waiting for osdmap (#8338, Greg Farnum)
 * osd: fix interval check corner case during peering (#8104, Sam Just)
 * osd: fix journal-less operation (Sage Weil)
 * osd: include backend information in metadata reported to mon (Sage Weil)
 * rest-api: fix help (Ailing Zhang)
 * rgw: check entity permission for put_metadata (#8428, Yehuda Sadeh)
 
 Getting Ceph
 
 
 * Git at git://github.com/ceph/ceph.git
 * Tarball at http://ceph.com/download/ceph-0.82.tar.gz
 * For packages, see http://ceph.com/docs/master/install/get-packages
 * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure coded pool suitable for MDS?

2014-06-19 Thread Erik Logtenberg
Hi,

Are erasure coded pools suitable for use with MDS?

I tried to give it a go by creating two new pools like so:

# ceph osd pool create ecdata 128 128 erasure
# ceph osd pool create ecmetadata 128 128 erasure

Then looked up their id's:

# ceph osd lspools
..., 6 ecdata,7 ecmetadata

# ceph mds newfs 7 6 --yes-i-really-mean-it

But then when I start MDS, it crashes horribly. I did notice that MDS
created a couple of objects in the ecmetadata pool:

# rados ls -p ecmetadata
mds0_sessionmap
mds0_inotable
1..inode
200.
mds_anchortable
mds_snaptable
100..inode

However it crashes immediately after. I started mds manually to try and
see what's up:

# ceph-mds -i 0 -d

This spews out so much information that I saved it in a logfile, added
as an attachment.

Kind regards,

Erik.
2014-06-19 22:07:34.492328 7f3572f6e7c0  0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mds, pid 2943
starting mds.0 at :/0
2014-06-19 22:07:35.793309 7f356dd88700  1 mds.-1.0 handle_mds_map standby
2014-06-19 22:07:35.876689 7f356dd88700  1 mds.0.15 handle_mds_map i am now mds.0.15
2014-06-19 22:07:35.876695 7f356dd88700  1 mds.0.15 handle_mds_map state change up:standby -- up:creating
2014-06-19 22:07:35.876931 7f356dd88700  0 mds.0.cache creating system inode with ino:1
2014-06-19 22:07:35.877204 7f356dd88700  0 mds.0.cache creating system inode with ino:100
2014-06-19 22:07:35.877209 7f356dd88700  0 mds.0.cache creating system inode with ino:600
2014-06-19 22:07:35.877369 7f356dd88700  0 mds.0.cache creating system inode with ino:601
2014-06-19 22:07:35.877455 7f356dd88700  0 mds.0.cache creating system inode with ino:602
2014-06-19 22:07:35.877519 7f356dd88700  0 mds.0.cache creating system inode with ino:603
2014-06-19 22:07:35.877566 7f356dd88700  0 mds.0.cache creating system inode with ino:604
2014-06-19 22:07:35.877606 7f356dd88700  0 mds.0.cache creating system inode with ino:605
2014-06-19 22:07:35.877683 7f356dd88700  0 mds.0.cache creating system inode with ino:606
2014-06-19 22:07:35.877723 7f356dd88700  0 mds.0.cache creating system inode with ino:607
2014-06-19 22:07:35.877780 7f356dd88700  0 mds.0.cache creating system inode with ino:608
2014-06-19 22:07:35.877819 7f356dd88700  0 mds.0.cache creating system inode with ino:609
2014-06-19 22:07:35.877858 7f356dd88700  0 mds.0.cache creating system inode with ino:200
mds/CDir.cc: In function 'virtual void C_Dir_Committed::finish(int)' thread 7f356dd88700 time 2014-06-19 22:07:35.881337
mds/CDir.cc: 1809: FAILED assert(r == 0)
 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: ceph-mds() [0x75c6f1]
 2: (Context::complete(int)+0x9) [0x56cff9]
 3: (C_Gather::sub_finish(Context*, int)+0x1f7) [0x56e9a7]
 4: (C_Gather::C_GatherSub::finish(int)+0x12) [0x56eab2]
 5: (Context::complete(int)+0x9) [0x56cff9]
 6: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf4e) [0x7d26ee]
 7: (MDS::handle_core_message(Message*)+0xb1f) [0x58e5ef]
 8: (MDS::_dispatch(Message*)+0x32) [0x58e7f2]
 9: (MDS::ms_dispatch(Message*)+0xa3) [0x5901d3]
 10: (DispatchQueue::entry()+0x57a) [0x99d9da]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x8be63d]
 12: (()+0x7c53) [0x7f3572366c53]
 13: (clone()+0x6d) [0x7f3571257dbd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this.
2014-06-19 22:07:35.883239 7f356dd88700 -1 mds/CDir.cc: In function 'virtual void C_Dir_Committed::finish(int)' thread 7f356dd88700 time 2014-06-19 22:07:35.881337
mds/CDir.cc: 1809: FAILED assert(r == 0)

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: ceph-mds() [0x75c6f1]
 2: (Context::complete(int)+0x9) [0x56cff9]
 3: (C_Gather::sub_finish(Context*, int)+0x1f7) [0x56e9a7]
 4: (C_Gather::C_GatherSub::finish(int)+0x12) [0x56eab2]
 5: (Context::complete(int)+0x9) [0x56cff9]
 6: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf4e) [0x7d26ee]
 7: (MDS::handle_core_message(Message*)+0xb1f) [0x58e5ef]
 8: (MDS::_dispatch(Message*)+0x32) [0x58e7f2]
 9: (MDS::ms_dispatch(Message*)+0xa3) [0x5901d3]
 10: (DispatchQueue::entry()+0x57a) [0x99d9da]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x8be63d]
 12: (()+0x7c53) [0x7f3572366c53]
 13: (clone()+0x6d) [0x7f3571257dbd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this.

--- begin dump of recent events ---
  -144 2014-06-19 22:07:34.489920 7f3572f6e7c0  5 asok(0x1e0) register_command perfcounters_dump hook 0x1dc8010
  -143 2014-06-19 22:07:34.489992 7f3572f6e7c0  5 asok(0x1e0) register_command 1 hook 0x1dc8010
  -142 2014-06-19 22:07:34.490003 7f3572f6e7c0  5 asok(0x1e0) register_command perf dump hook 0x1dc8010
  -141 2014-06-19 22:07:34.490015 7f3572f6e7c0  5 asok(0x1e0) register_command perfcounters_schema hook 0x1dc8010
  -140 2014-06-19 22:07:34.490027 7f3572f6e7c0  5 asok(0x1e0) register_command 2 hook 0x1dc8010
  -139 2014-06-19 22:07:34.490035 7f3572f6e7c0  5 asok(0x1e0) 

Re: [ceph-users] Errors setting up erasure coded pool

2014-06-15 Thread Erik Logtenberg
Hi Loic,

Yes, I upgraded the cluster, it is definitely not new. I ran into
several other issues as well, for instance legacy tunables that had to
be changed.

These are the directory contents you asked for:

# ls -al /usr/lib64/ceph/erasure-code
total 776
drwxr-xr-x. 2 root root   4096 12 Jun 21:36 .
drwxr-xr-x. 3 root root   4096 12 Jun 21:36 ..
lrwxrwxrwx. 1 root root     22 12 Jun 21:36 libec_example.so -> libec_example.so.0.0.0
lrwxrwxrwx. 1 root root     22 12 Jun 21:36 libec_example.so.0 -> libec_example.so.0.0.0
-rwxr-xr-x. 1 root root 381240 14 May 17:44 libec_example.so.0.0.0
lrwxrwxrwx. 1 root root     33 12 Jun 21:36 libec_fail_to_initialize.so -> libec_fail_to_initialize.so.0.0.0
lrwxrwxrwx. 1 root root     33 12 Jun 21:36 libec_fail_to_initialize.so.0 -> libec_fail_to_initialize.so.0.0.0
-rwxr-xr-x. 1 root root   6032 14 May 17:44 libec_fail_to_initialize.so.0.0.0
lrwxrwxrwx. 1 root root     31 12 Jun 21:36 libec_fail_to_register.so -> libec_fail_to_register.so.0.0.0
lrwxrwxrwx. 1 root root     31 12 Jun 21:36 libec_fail_to_register.so.0 -> libec_fail_to_register.so.0.0.0
-rwxr-xr-x. 1 root root   6032 14 May 17:44 libec_fail_to_register.so.0.0.0
lrwxrwxrwx. 1 root root     20 12 Jun 21:36 libec_hangs.so -> libec_hangs.so.0.0.0
lrwxrwxrwx. 1 root root     20 12 Jun 21:36 libec_hangs.so.0 -> libec_hangs.so.0.0.0
-rwxr-xr-x. 1 root root   6024 14 May 17:44 libec_hangs.so.0.0.0
lrwxrwxrwx. 1 root root     23 12 Jun 21:36 libec_jerasure.so -> libec_jerasure.so.2.0.0
lrwxrwxrwx. 1 root root     23 12 Jun 21:36 libec_jerasure.so.2 -> libec_jerasure.so.2.0.0
-rwxr-xr-x. 1 root root 364696 14 May 17:44 libec_jerasure.so.2.0.0
lrwxrwxrwx. 1 root root     34 12 Jun 21:36 libec_missing_entry_point.so -> libec_missing_entry_point.so.0.0.0
lrwxrwxrwx. 1 root root     34 12 Jun 21:36 libec_missing_entry_point.so.0 -> libec_missing_entry_point.so.0.0.0
-rwxr-xr-x. 1 root root   5944 14 May 17:44 libec_missing_entry_point.so.0.0.0

Kind regards,

Erik.


On 06/15/2014 09:47 AM, Loic Dachary wrote:
 Hi Erik,
 
 Did you upgrade the cluster or is it a new cluster ? Could you
 please ls -l /usr/lib64/ceph/erasure-code ? If you're connected on
 irc.oftc.net#ceph today feel free to ping me ( loicd ).
 
 Cheers
 
 On 14/06/2014 23:25, Erik Logtenberg wrote:
 Hi,
 
 I'm trying to set up an erasure coded pool, as described in the
 Ceph docs:
 
 http://ceph.com/docs/firefly/dev/erasure-coded-pool/
 
 Unfortunately, creating a pool like that gives me the following
 error:
 
 # ceph osd pool create ecpool 12 12 erasure Error EINVAL: cannot
 determine the erasure code plugin because there is no 'plugin'
 entry in the erasure_code_profile {}failed to load plugin using
 profile default
 
 I noticed that the default profile appears to be empty:
 
 # ceph osd erasure-code-profile get default #
 
 I tried setting the default profile right, but I'm not allowed:
 
 # ceph osd erasure-code-profile set default plugin=jerasure 
 technique=reed_sol_van k=2 m=1 Error EPERM: will not override
 erasure code profile default
 
 So I set a new profile with the same configuration:
 
 # ceph osd erasure-code-profile set k2m1 plugin=jerasure 
 technique=reed_sol_van k=2 m=1 # ceph osd erasure-code-profile
 get k2m1 directory=/usr/lib64/ceph/erasure-code k=2 m=1 
 plugin=jerasure technique=reed_sol_van
 
 So, that worked :)
 
 However, I still can't use it to create a pool:
 
 # ceph osd pool create test123 100 100 erasure k2m1 Error EIO:
 failed to load plugin using profile k2m1
 
 I double checked that the directory is correct, and it is:
 
 # ls -al /usr/lib64/ceph/erasure-code/libec_jerasure.so 
 lrwxrwxrwx. 1 root root 23 12 jun 21:36 
 /usr/lib64/ceph/erasure-code/libec_jerasure.so -
 libec_jerasure.so.2.0.0
 
 It does contain the jerasure library, so now I'm at a loss. What
 am I doing wrong? By the way, this is ceph-0.80.1.
 
 Thanks,
 
 Erik. ___ ceph-users
 mailing list ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Errors setting up erasure coded pool

2014-06-15 Thread Erik Logtenberg
Hi Loic,

I hope you can sort of read the log snippet below.

2014-06-15 13:20:22.593085 7effee3f0700  0 mon.0@0(leader) e1
handle_command mon_command({pool_type: erasure, prefix: osd
pool create, pg_num: 100, erasure_code_profile: k2m1,
pgp_num: 100, pool: test123} v 0) v1
2014-06-15 13:20:22.593192 7effee3f0700 20 is_capable service=osd
command=osd pool create read write on cap allow *
2014-06-15 13:20:22.593197 7effee3f0700 20  allow so far , doing grant
allow *
2014-06-15 13:20:22.593208 7effee3f0700 20  allow all
2014-06-15 13:20:22.593209 7effee3f0700 10 mon.0@0(leader) e1
_allowed_command capable
2014-06-15 13:20:22.593212 7effee3f0700  1 mon.0@0(leader).paxos(paxos
active c 17561..18177) is_readable now=2014-06-15 13:20:22.593212
lease_expire=0.00 has v0 lc 18177
2014-06-15 13:20:22.593216 7effee3f0700 10 mon.0@0(leader).osd e281
preprocess_query mon_command({pool_type: erasure, prefix: osd
pool create, pg_num: 100, erasure_code_profile: k2m1,
pgp_num: 100, pool: test123} v 0) v1 from client.5815
192.168.1.15:0/1002158
2014-06-15 13:20:22.593544 7effee3f0700  7 mon.0@0(leader).osd e281
prepare_update mon_command({pool_type: erasure, prefix: osd
pool create, pg_num: 100, erasure_code_profile: k2m1,
pgp_num: 100, pool: test123} v 0) v1 from client.5815
192.168.1.15:0/1002158
2014-06-15 13:20:22.593664 7effee3f0700  1 mon.0@0(leader).osd e281
implicitly use ruleset named after the pool: test123
2014-06-15 13:20:22.593801 7effee3f0700 -1
ErasureCodePluginSelectJerasure: load
dlopen(/usr/lib64/ceph/erasure-code/libec_jerasure_sse3.so):
/usr/lib64/ceph/erasure-code/libec_jerasure_sse3.so: cannot open
shared object file: No such file or directory
2014-06-15 13:20:22.601143 7effee3f0700 10 mon.0@0(leader) e1
ms_handle_reset 0x1d2ca40 192.168.1.15:0/1002158

It's quite obvious what goes wrong here: ceph can't find the file:
 /usr/lib64/ceph/erasure-code/libec_jerasure_sse3.so

And that's correct. I only have:
 libec_jerasure.so.2.0.0

and two symlinks to this library, called:
 libec_jerasure.so -> libec_jerasure.so.2.0.0
 libec_jerasure.so.2 -> libec_jerasure.so.2.0.0

I don't know why it's looking for an sse3 optimized version (?) of
this library and/or why that library is missing. My cpu does have the
sse, sse2 and ssse3 flags by the way.
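
For what it is worth, this is the quick check I did on that machine: a small
sketch that just compares the CPU flags from /proc/cpuinfo (sse3 shows up
there as "pni") with the jerasure plugin flavours actually present in the
plugin directory. The paths are as on my system, so adjust as needed:

#!/usr/bin/python
import glob

flags = set()
for line in open("/proc/cpuinfo"):
    if line.startswith("flags"):
        flags.update(line.split(":", 1)[1].split())
        break

print "sse3 (pni):", "pni" in flags
print "ssse3     :", "ssse3" in flags
print "sse4      :", bool(flags & set(["sse4_1", "sse4_2"]))

for path in sorted(glob.glob("/usr/lib64/ceph/erasure-code/libec_jerasure*")):
    print path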

Kind regards,

Erik.

P.S. How do I reset the verbosity of the logging? ;)


On 06/15/2014 10:12 AM, Loic Dachary wrote:
 Hi,
 
 It would also help if you could add
 
 ceph tell mon.'*' injectargs '--debug-mon 20'
 
 check that it is set (replace mon.a with mon.)
 
 ceph daemon mon.a config get debug_mon { debug_mon: 20\/20}
 
 and check what the log have after you get
 
 ceph osd pool create test123 100 100 erasure k2m1 Error EIO: failed
 to load plugin using profile k2m1
 
 I think you are facing two different problems.
 
 A) the empty default profile is probably a combination of upgrading
 an existing cluster (in which case maybe the default profile is not
 created) and the implicit creation of an empty profile (which
 should not happen and has been fixed in
 http://tracker.ceph.com/issues/8599 but not yet released).
 
 B) the jerasure plugin fails to load for some reason and the mon
 logs should tell us why
 
 Cheers
 
 On 15/06/2014 09:47, Loic Dachary wrote:
 Hi Erik,
 
 Did you upgrade the cluster or is it a new cluster ? Could you
 please ls -l /usr/lib64/ceph/erasure-code ? If you're connected
 on irc.oftc.net#ceph today feel free to ping me ( loicd ).
 
 Cheers
 
 On 14/06/2014 23:25, Erik Logtenberg wrote:
 Hi,
 
 I'm trying to set up an erasure coded pool, as described in the
 Ceph docs:
 
 http://ceph.com/docs/firefly/dev/erasure-coded-pool/
 
 Unfortunately, creating a pool like that gives me the following
 error:
 
 # ceph osd pool create ecpool 12 12 erasure Error EINVAL:
 cannot determine the erasure code plugin because there is no
 'plugin' entry in the erasure_code_profile {}failed to load
 plugin using profile default
 
 I noticed that the default profile appears to be empty:
 
 # ceph osd erasure-code-profile get default #
 
 I tried setting the default profile right, but I'm not
 allowed:
 
 # ceph osd erasure-code-profile set default plugin=jerasure 
 technique=reed_sol_van k=2 m=1 Error EPERM: will not override
 erasure code profile default
 
 So I set a new profile with the same configuration:
 
 # ceph osd erasure-code-profile set k2m1 plugin=jerasure 
 technique=reed_sol_van k=2 m=1 # ceph osd erasure-code-profile
 get k2m1 directory=/usr/lib64/ceph/erasure-code k=2 m=1 
 plugin=jerasure technique=reed_sol_van
 
 So, that worked :)
 
 However, I still can't use it to create a pool:
 
 # ceph osd pool create test123 100 100 erasure k2m1 Error EIO:
 failed to load plugin using profile k2m1
 
 I double checked that the directory is correct, and it is:
 
 # ls -al /usr/lib64/ceph/erasure-code/libec_jerasure.so 
 lrwxrwxrwx. 1 root root 23 12 jun 21:36 
 /usr/lib64/ceph/erasure-code/libec_jerasure.so -
 libec_jerasure.so.2.0.0
 
 It does

[ceph-users] Permissions spontaneously changing in cephfs

2014-06-14 Thread Erik Logtenberg
Hi,

I ran into a weird issue with cephfs today. I create a directory like this:

# mkdir bla
# ls -al
drwxr-xr-x  1 root root0 14 jun 22:22 bla

Now on another host, with the same cephfs mounted, I see different
permissions:

# ls -al
drwxrwxrwx 1 root root 0 14 jun 22:22 bla

Weird, huh?

Back to host #1, I unmount cephfs and mount it again. Now it sees the
same (changed) permissions as I saw on the second host:

# ls -al
drwxrwxrwx  1 root root0 14 jun 22:22 bla

So... what happened to the original permissions and why did they change?

Thanks,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Permissions spontaneously changing in cephfs

2014-06-14 Thread Erik Logtenberg
Hi,

So... I wrote some files into that directory to test performance, and
now I notice that both hosts see the permissions the right way, like
they were when I first created the directory.

What is going on here? ..

Erik.


On 06/14/2014 10:32 PM, Erik Logtenberg wrote:
 Hi,
 
 I ran into a weird issue with cephfs today. I create a directory like this:
 
 # mkdir bla
 # ls -al
 drwxr-xr-x  1 root root0 14 jun 22:22 bla
 
 Now on another host, with the same cephfs mounted, I see different
 permissions:
 
 # ls -al
 drwxrwxrwx 1 root root 0 14 jun 22:22 bla
 
 Weird, huh?
 
 Back to host #1, I unmount cephfs and mount it again. Now it sees the
 same (changed) permissions as I saw on the second host:
 
 # ls -al
 drwxrwxrwx  1 root root0 14 jun 22:22 bla
 
 So... what happened to the original permissions and why did they change?
 
 Thanks,
 
 Erik.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Errors setting up erasure coded pool

2014-06-14 Thread Erik Logtenberg
Hi,

I'm trying to set up an erasure coded pool, as described in the Ceph docs:

http://ceph.com/docs/firefly/dev/erasure-coded-pool/

Unfortunately, creating a pool like that gives me the following error:

# ceph osd pool create ecpool 12 12 erasure
Error EINVAL: cannot determine the erasure code plugin because there is
no 'plugin' entry in the erasure_code_profile {}failed to load plugin
using profile default

I noticed that the default profile appears to be empty:

# ceph osd erasure-code-profile get default
#

I tried setting the default profile right, but I'm not allowed:

# ceph osd erasure-code-profile set default plugin=jerasure
technique=reed_sol_van k=2 m=1
Error EPERM: will not override erasure code profile default

So I set a new profile with the same configuration:

# ceph osd erasure-code-profile set k2m1 plugin=jerasure
technique=reed_sol_van k=2 m=1
# ceph osd erasure-code-profile get k2m1
directory=/usr/lib64/ceph/erasure-code
k=2
m=1
plugin=jerasure
technique=reed_sol_van

So, that worked :)

However, I still can't use it to create a pool:

# ceph osd pool create test123 100 100 erasure k2m1
Error EIO: failed to load plugin using profile k2m1

I double checked that the directory is correct, and it is:

# ls -al /usr/lib64/ceph/erasure-code/libec_jerasure.so
lrwxrwxrwx. 1 root root 23 12 Jun 21:36 /usr/lib64/ceph/erasure-code/libec_jerasure.so -> libec_jerasure.so.2.0.0

It does contain the jerasure library, so now I'm at a loss. What am I
doing wrong? By the way, this is ceph-0.80.1.

Thanks,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Recommended way to use Ceph as storage for file server

2014-06-02 Thread Erik Logtenberg
Hi,

In march 2013 Greg wrote an excellent blog posting regarding the (then)
current status of MDS/CephFS and the plans for going forward with
development.

http://ceph.com/dev-notes/cephfs-mds-status-discussion/

Since then, I understand progress has been slow, and Greg confirmed that
he didn't want to commit to any release date yet, when I asked him for
an update earlier this year.
CephFS appears to be a more or less working product, does receive
stability fixes every now and then, but I don't think Inktank would call
it production ready.

So my question is: I would like to use Ceph as a storage for files, as a
fileserver or at least as a backend to my fileserver. What is the
recommended way to do this?

A more or less obvious alternative for CephFS would be to simply create
a huge RBD and have a separate file server (running NFS / Samba /
whatever) use that block device as backend. Just put a regular FS on top
of the RBD and use it that way.
Clients wouldn't really have any of the real performance and resilience
benefits that Ceph could offer though, because the (single machine?)
file server is now the bottleneck.

Any advice / best practice would be greatly appreciated. Any real-world
experience with current CephFS as well.

Kind regards,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Idle OSD's keep using a lot of CPU

2013-08-01 Thread Erik Logtenberg
Hi,

I think the high CPU usage was due to the system time not being right. I
activated ntp and it had to do quite a big adjustment, and after that the
high CPU usage was gone.

Anyway, I immediately ran into another issue. I ran a simple benchmark:
# rados bench --pool benchmark 300 write --no-cleanup

During the benchmark, one of my osd's went down. I checked the logs and
apparently there was no hardware failure (the disk is still nicely
mounted and the osd is still running), but the logfile fills up rapidly
with these messages:

2013-08-02 00:03:40.014982 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >> 192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36884 s=2 pgs=86874 cs=173547 l=0).fault, initiating reconnect
2013-08-02 00:03:40.016682 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >> 192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36885 s=2 pgs=86875 cs=173549 l=0).fault, initiating reconnect
2013-08-02 00:03:40.019241 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >> 192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36886 s=2 pgs=86876 cs=173551 l=0).fault, initiating reconnect

What could be wrong here?

Kind regards,

Erik.



On 08/01/2013 08:00 AM, Dan Mick wrote:
 Logging might well help.
 
 http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
 
 
 
 On 07/31/2013 03:51 PM, Erik Logtenberg wrote:
 Hi,

 I just added a second node to my ceph test platform. The first node has
 a mon and three osd's, the second node only has three osd's. Adding the
 osd's was pretty painless, and ceph distributed the data from the first
 node evenly over both nodes so everything seems to be fine. The monitor
 also thinks everything is fine:

 2013-08-01 00:41:12.719640 mon.0 [INF] pgmap v1283: 292 pgs: 292
 active+clean; 9264 MB data, 24826 MB used, 5541 GB / 5578 GB avail

 Unfortunately, the three osd's on the second node keep eating a lot of
 cpu, while there is no activity whatsoever:

PID USER  VIRTRESSHR S  %CPU %MEM TIME+ COMMAND
 21272 root  441440  34632   7848 S  61.8  0.9   4:08.62 ceph-osd
 21145 root  439852  29316   8360 S  60.4  0.7   4:04.31 ceph-osd
 21036 root  443828  31324   8336 S  60.1  0.8   4:07.55 ceph-osd

 Any idea why that is and how I can even ask an osd what it's doing?
 There is no corresponding hdd activity, it's only cpu and hardly any
 memory usage.

 Also the monitor on the first node is doing the same thing:

PID USERVIRTRESSHR S  %CPU  %MEM TIME+ COMMAND
 12825 root186900  23492   5540 S 141.1 0.590   9:47.64 ceph-mon

 I tried stopping the three osd's: that makes the monitor calm down, but
 after restarting the osd's, the monitor resumes its cpu usage. I also
 tried stopping the monitor, which makes the three osd's calm down, but
 once again they will start eating cpu again as soon as the monitor is
 back online.

 In the mean time, the first three osd's, the ones on the same machine as
 the monitor, don't behave like this at all. Currently as there is no
 activity, they are just idling on low cpu usage, as expected.

 Kind regards,

 Erik.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Idle OSD's keep using a lot of CPU

2013-07-31 Thread Erik Logtenberg
Hi,

I just added a second node to my ceph test platform. The first node has
a mon and three osd's, the second node only has three osd's. Adding the
osd's was pretty painless, and ceph distributed the data from the first
node evenly over both nodes so everything seems to be fine. The monitor
also thinks everything is fine:

2013-08-01 00:41:12.719640 mon.0 [INF] pgmap v1283: 292 pgs: 292
active+clean; 9264 MB data, 24826 MB used, 5541 GB / 5578 GB avail

Unfortunately, the three osd's on the second node keep eating a lot of
cpu, while there is no activity whatsoever:

  PID USER      VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21272 root  441440  34632   7848 S  61.8  0.9   4:08.62 ceph-osd
21145 root  439852  29316   8360 S  60.4  0.7   4:04.31 ceph-osd
21036 root  443828  31324   8336 S  60.1  0.8   4:07.55 ceph-osd

Any idea why that is and how I can even ask an osd what it's doing?
There is no corresponding hdd activity, it's only cpu and hardly any
memory usage.

Also the monitor on the first node is doing the same thing:

  PID USER      VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
12825 root    186900  23492   5540 S 141.1 0.590   9:47.64 ceph-mon

I tried stopping the three osd's: that makes the monitor calm down, but
after restarting the osd's, the monitor resumes its cpu usage. I also
tried stopping the monitor, which makes the three osd's calm down, but
once again they will start eating cpu again as soon as the monitor is
back online.

In the mean time, the first three osd's, the ones on the same machine as
the monitor, don't behave like this at all. Currently as there is no
activity, they are just idling on low cpu usage, as expected.

Kind regards,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Small fix for ceph.spec

2013-07-30 Thread Erik Logtenberg
Hi,

Fedora, in this case Fedora 19, x86_64.

Kind regards,

Erik.


On 07/30/2013 09:29 AM, Danny Al-Gaaf wrote:
 Hi,
 
 I think this is a bug in the packaging of the leveldb package in this case,
 since the spec-file already sets dependencies on leveldb-devel.
 
 leveldb depends on snappy, therefore the leveldb package should set a
 dependency on snappy-devel for leveldb-devel (check the SUSE spec file
 for leveldb:
 https://build.opensuse.org/package/view_file/home:dalgaaf:ceph:extra/leveldb/leveldb.spec?expand=1).
 This way the RPM build process will pick up the correct packages needed
 to build ceph.
 
 Which distro do you use?
 
 Danny
 
 On 30.07.2013 01:33, Patrick McGarry wrote:
 -- Forwarded message --
 From: Erik Logtenberg e...@logtenberg.eu
 Date: Mon, Jul 29, 2013 at 7:07 PM
 Subject: [ceph-users] Small fix for ceph.spec
 To: ceph-users@lists.ceph.com


 Hi,

 The spec file used for building rpm's misses a build time dependency on
 snappy-devel. Please see attached patch to fix.

 Kind regards,

 Erik.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Small fix for ceph.spec

2013-07-30 Thread Erik Logtenberg
Hi,

I will report the issue there as well. Please note that Ceph seems to
support Fedora 17, even though that release is considered end-of-life by
Fedora. This issue with the leveldb package cannot be fixed for Fedora
17, only for 18 and 19.
So if Ceph wants to continue supporting Fedora 17, adding this
workaround seems to be the only way to get this (rather minor) bug fixed.

Kind regards,

Erik.


On 07/30/2013 09:56 AM, Danny Al-Gaaf wrote:
 Hi,
 
 then the Fedora package is broken. If you check the spec file of:
 
 http://dl.fedoraproject.org/pub/fedora/linux/updates/19/SRPMS/leveldb-1.12.0-3.fc19.src.rpm
 
 
 You can see the spec-file sets a:
 
 BuildRequires:  snappy-devel
 
 But not the corresponding Requires: snappy-devel for the devel package.
 
 You should report this issue to your distribution, it needs to be fixed
 there instead of adding a workaround to the ceph spec.
 
 Regards,
 
 Danny
 
 On 30.07.2013 09:42, Erik Logtenberg wrote:
 Hi,

 Fedora, in this case Fedora 19, x86_64.

 Kind regards,

 Erik.


 On 07/30/2013 09:29 AM, Danny Al-Gaaf wrote:
 Hi,

 I think this is a bug in packaging of the leveldb package in this case
 since the spec-file already sets dependencies on on leveldb-devel.

 leveldb depends on snappy, therefore the leveldb package should set a
 dependency on snappy-devel for leveldb-devel (check the SUSE spec file
 for leveldb:
 https://build.opensuse.org/package/view_file/home:dalgaaf:ceph:extra/leveldb/leveldb.spec?expand=1).
 This way the RPM build process will pick up the correct packages needed
 to build ceph.

 Which distro do you use?

 Danny

 On 30.07.2013 01:33, Patrick McGarry wrote:
 -- Forwarded message --
 From: Erik Logtenberg e...@logtenberg.eu
 Date: Mon, Jul 29, 2013 at 7:07 PM
 Subject: [ceph-users] Small fix for ceph.spec
 To: ceph-users@lists.ceph.com


 Hi,

 The spec file used for building rpm's misses a build time dependency on
 snappy-devel. Please see attached patch to fix.

 Kind regards,

 Erik.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.66 released

2013-07-19 Thread Erik Logtenberg
  * osd: pg log (re)writes are not vastly more efficient (faster peering) 
(Sam Just)

Do you really mean "are not"? I'd think "are now" would make sense (?)

-  Erik.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com