Re: [ceph-users] CEPH I/O Performance with OpenStack

2015-01-27 Thread Robert van Leeuwen
 I have two ceph nodes with the following specifications 
 2x CEPH - OSD - 2 Replication factor 
 Model : SuperMicro X8DT3 
 CPU : Dual intel E5620 
 RAM : 32G 
 HDD : 2x 480GB SSD RAID-1 ( OS and Journal ) 
  22x 4TB SATA RAID-10 ( OSD )

 3x Controllers - CEPH Monitor
 Model : ProLiant DL180 G6
 CPU : Dual intel E5620
 RAM : 24G


 If it's a hardware issue, please help me find an answer to the following 
 5 questions.

4 TB spinners do not give a lot of IOPS, about 100 random IOPS per disk.
In total it would be just 1100 IOPS: 44 disks times 100 IOPS, divided by 2 for 
RAID-10 and divided by 2 again for the replication factor. 
There might be a bit of caching on the RAID controller and SSD journal, but 
worst case you will get just 1100 IOPS.
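As a quick sketch of that back-of-the-envelope math (the per-disk number is a rule of thumb, not a measurement):

  # worst-case random IOPS for the current layout
  disks=44; iops_per_disk=100; raid_write_penalty=2; replicas=2
  echo $(( disks * iops_per_disk / raid_write_penalty / replicas ))   # -> 1100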

 I need around 20TB of storage; a SuperMicro SC846TQ can hold 24 hard disks. 
 I may attach 24x 960G SSD - NO RAID - with 3x SuperMicro servers - 
 replication factor 3.

Or is it better to scale out and put smaller disks in more servers, such as 
(HP DL380p G8 / 2x Intel Xeon E5-2650) which can hold 12 hard disks,
 and attach 12x 960G SSD - NO RAID - 6x OSD nodes - replication factor 3.

An OSD for an SSD can easily eat a whole CPU core, so 24 SSDs would be too much.
More, smaller nodes also have the upside of a smaller impact when a node breaks.
You could also look at the Supermicro 2U twin chassis with 2 servers with 12 
disks in 2U.
Note that you will not get near the theoretical native performance of those 
combined SSDs (10+ IOPS) but performance will be good nonetheless.
There have been a few threads about that here before, so look back in the mail 
archives to find out more.

 2. I'm using Mirantis/Fuel 5 for provisioning and deployment of nodes 
 When I attach the new Ceph OSD nodes to the environment, will the data be 
 replicated automatically 
 from my current old SuperMicro OSD nodes to the new servers after the 
 deployment completes?
Don't know the specifics of Fuel and how it manages the crush map.
Some of the data will end up there but not a copy of all data unless you 
specify the new servers as a new failure domain in the crush map.
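If you want to check or adjust that by hand, a rough sketch (the bucket and host names below are just examples, not what Fuel creates):

  # see how hosts are currently grouped in the CRUSH map
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # example: put a new host under its own rack bucket so it forms a separate failure domain
  ceph osd crush add-bucket rack2 rack
  ceph osd crush move rack2 root=default
  ceph osd crush move newnode1 rack=rack2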

 3. I will use 2x 960G SSD RAID 1 for OS 
 Is it recommended to put the SSD journal as a separate partition on the 
 same disk as the OS?
If you run with SSDs only I would put the journals together with the data SSDs.
It makes a lot of sense to have them on separate SSDs when your data disks are 
spinners 
(because of the speed difference and the bad random IOPS performance of spinners).

 4. Is it safe to remove the OLD ceph nodes, while I'm currently using a 
 replication factor of 2, after adding the new hardware nodes?
It is probably not safe to just turn them off (as mentioned above, it depends on 
the crush map failure domain layout).
The safe way would be to follow the documentation on how to remove an OSD: 
http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
This will make sure the data is re-located before the OSD is removed.
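Roughly, the documented sequence looks like this (osd.12 is just a placeholder id):

  ceph osd out 12                 # start migrating PGs off the OSD
  ceph -w                         # wait until the cluster is back to active+clean
  # stop the ceph-osd daemon on its node, then remove it from CRUSH, auth and the OSD map
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm 12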

 5. Do I need RAID 1 for the journal disk? And if not, what will happen if 
 one of the journal disks fails?
No, it is not required. Both have trade-offs.
The OSDs whose journals are on that disk will become unavailable when it fails.
RAID1 will make replacement a bit easier in case of a single SSD failure, but is 
useless if the 2 SSDs fail at the same time (e.g. due to wear).
JBOD will reduce the write load and wear, plus it has less impact when it does 
fail.

 6. Should I use a RAID level for the drives on OSD nodes? Or is it better to 
 go without RAID?
Without RAID usually makes for better performance. Benchmark your specific 
workload to be sure.
In general I would go for 3 replicas and no RAID.

Cheers,
Robert van Leeuwen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Total number PGs using multiple pools

2015-01-27 Thread Luis Periquito
Although the documentation is not great, and open to interpretation, there
is a PG calculator here: http://ceph.com/pgcalc/.
With it you should be able to simulate your use case and generate numbers
based on your scenario.
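The rule of thumb behind the calculator is roughly the following (a sketch, not the exact tool logic):

  # target ~100 PGs per OSD across all pools, divided by the replica count,
  # then round up to the next power of two
  osds=10; target_per_osd=100; replicas=3
  echo $(( osds * target_per_osd / replicas ))   # -> 333, round up to e.g. 512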

On Mon, Jan 26, 2015 at 8:00 PM, Italo Santos okd...@gmail.com wrote:

  Thanks for your answer.

 But what I’d like to understand is whether these numbers are per-pool or
 per-cluster? If this number is per cluster, I’ll plan at
 cluster deploy time how many pools I’d like to have on that cluster and their
 replicas.

 Regards.

 *Italo Santos*
 http://italosantos.com.br/

 On Saturday, January 17, 2015 at 07:04, lidc...@redhat.com wrote:

  Here are a few values commonly used:

- Less than 5 OSDs set pg_num to 128
- Between 5 and 10 OSDs set pg_num to 512
- Between 10 and 50 OSDs set pg_num to 4096
- If you have more than 50 OSDs, you need to understand the tradeoffs
and how to calculate the pg_num value by yourself

 But I think 10 OSDs is too small for a RADOS cluster.


 *From:* Italo Santos okd...@gmail.com
 *Date:* 2015-01-17 05:00
 *To:* ceph-users ceph-users@lists.ceph.com
 *Subject:* [ceph-users] Total number PGs using multiple pools
 Hello,

 In the placement groups documentation
 http://ceph.com/docs/giant/rados/operations/placement-groups/ we have
 the message below:

 “*When using multiple data pools for storing objects, you need to ensure
 that you balance the number of placement groups per pool with the number of
 placement groups per OSD so that you arrive at a reasonable total number of
 placement groups that provides reasonably low variance per OSD without
 taxing system resources or making the peering process too slow.*”

 This means that, if I have a cluster with 10 OSDs and 3 pools with size = 3,
 each pool can have only ~111 PGs?

 Ex.: (100 * 10 OSDs) / 3 replicas = 333 PGs / 3 pools = 111 PGs per pool

 I don't know if my reasoning is right… I'd be glad for any help.

 Regards.

 *Italo Santos*
 http://italosantos.com.br/



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph File System Question

2015-01-27 Thread John Spray
Raj,

The note is still valid, but the filesystem is getting more stable all the
time.  Some people are using it, especially in an active/passive
configuration with a single active MDS.  If you do choose to do some
testing, use the most recent stable release of Ceph and the most recent
linux kernel you can.

Thanks,
John


On Mon, Jan 26, 2015 at 11:25 PM, Jeripotula, Shashiraj 
shashiraj.jeripot...@verizon.com wrote:

 Hi All,



 We are planning to use Ceph File System in our data center.



 I was reading the ceph documentation and they do not recommend this for
 production



 Is this still valid?



 Please advise.



 Thanks



 Raj



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Testing

2015-01-27 Thread Jeripotula, Shashiraj
Hi  All,

Is there good documentation on Ceph testing?

I have the following setup done, but I am not able to find a good document to start 
doing the tests.




Please advise.

Thanks

Raj

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs modification time

2015-01-27 Thread Christopher Armstrong
Hey folks,

Any update on this fix getting merged? We suspect other crashes based on
this bug.

Thanks,

Chris

On Tue, Jan 13, 2015 at 7:09 AM, Gregory Farnum g...@gregs42.com wrote:

 Awesome, thanks for the bug report and the fix, guys. :)
 -Greg

 On Mon, Jan 12, 2015 at 11:18 PM, 严正 z...@redhat.com wrote:
  I tracked down the bug. Please try the attached patch
 
  Regards
  Yan, Zheng
 
 
 
 
  On 13 Jan 2015, at 07:40, Gregory Farnum g...@gregs42.com wrote:
 
  Zheng, this looks like a kernel client issue to me, or else something
  funny is going on with the cap flushing and the timestamps (note how
  the reading client's ctime is set to an even second, while the mtime
  is ~.63 seconds later and matches what the writing client sees). Any
  ideas?
  -Greg
 
  On Mon, Jan 12, 2015 at 12:19 PM, Lorieri lori...@gmail.com wrote:
  Hi Gregory,
 
 
  $ uname -a
  Linux coreos2 3.17.7+ #2 SMP Tue Jan 6 08:22:04 UTC 2015 x86_64
  Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz GenuineIntel GNU/Linux
 
 
  Kernel Client, using  `mount -t ceph ...`
 
 
  core@coreos2 /var/run/systemd/system $ modinfo ceph
  filename:   /lib/modules/3.17.7+/kernel/fs/ceph/ceph.ko
  license:GPL
  description:Ceph filesystem for Linux
  author: Patience Warnick patie...@newdream.net
  author: Yehuda Sadeh yeh...@hq.newdream.net
  author: Sage Weil s...@newdream.net
  alias:  fs-ceph
  depends:libceph
  intree: Y
  vermagic:   3.17.7+ SMP mod_unload
  signer: Magrathea: Glacier signing key
  sig_key:
 D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2
  sig_hashalgo:   sha256
 
  core@coreos2 /var/run/systemd/system $ modinfo libceph
  filename:   /lib/modules/3.17.7+/kernel/net/ceph/libceph.ko
  license:GPL
  description:Ceph filesystem for Linux
  author: Patience Warnick patie...@newdream.net
  author: Yehuda Sadeh yeh...@hq.newdream.net
  author: Sage Weil s...@newdream.net
  depends:libcrc32c
  intree: Y
  vermagic:   3.17.7+ SMP mod_unload
  signer: Magrathea: Glacier signing key
  sig_key:
 D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2
  sig_hashalgo:   sha256
 
 
 
  ceph is installed in ubuntu containers (same kernel):
 
  $ dpkg -l |grep ceph
 
  ii  ceph 0.87-1trusty
  amd64distributed storage and file system
  ii  ceph-common  0.87-1trusty
  amd64common utilities to mount and interact with a ceph
  storage cluster
  ii  ceph-fs-common   0.87-1trusty
  amd64common utilities to mount and interact with a ceph file
  system
  ii  ceph-fuse0.87-1trusty
  amd64FUSE-based client for the Ceph distributed file system
  ii  ceph-mds 0.87-1trusty
  amd64metadata server for the ceph distributed file system
  ii  libcephfs1   0.87-1trusty
  amd64Ceph distributed file system client library
  ii  python-ceph  0.87-1trusty
  amd64Python libraries for the Ceph distributed filesystem
 
 
 
  Reproducing the error:
 
  at machine 1:
  core@coreos1 /var/lib/deis/store/logs $ > test.log
  core@coreos1 /var/lib/deis/store/logs $ echo 1 > test.log
  core@coreos1 /var/lib/deis/store/logs $ stat test.log
   File: 'test.log'
   Size: 2 Blocks: 1  IO Block: 4194304 regular file
  Device: 0h/0d Inode: 1099511629882  Links: 1
  Access: (0644/-rw-r--r--)  Uid: (  500/core)   Gid: (  500/
 core)
  Access: 2015-01-12 20:05:03.0 +
  Modify: 2015-01-12 20:06:09.637234229 +
  Change: 2015-01-12 20:06:09.637234229 +
  Birth: -
 
  at machine 2:
  core@coreos2 /var/lib/deis/store/logs $ stat test.log
   File: 'test.log'
   Size: 2 Blocks: 1  IO Block: 4194304 regular file
  Device: 0h/0d Inode: 1099511629882  Links: 1
  Access: (0644/-rw-r--r--)  Uid: (  500/core)   Gid: (  500/
 core)
  Access: 2015-01-12 20:05:03.0 +
  Modify: 2015-01-12 20:06:09.637234229 +
  Change: 2015-01-12 20:06:09.0 +
  Birth: -
 
 
  The change time is not updated, which makes some tail libraries not show new
  content until you force the change time to be updated, for example by running
  touch on the file.
  Some tools freeze and trigger other issues in the system.
 
 
  Tests, all in the machine #2:
 
  FAILED - https://github.com/ActiveState/tail
  FAILED - /usr/bin/tail of a Google docker image running debian wheezy
  PASSED - /usr/bin/tail of a ubuntu 14.04 docker image
  PASSED - /usr/bin/tail of the coreos release 494.5.0
 
 
  Tests in machine #1 (same machine that is writing the file) all tests
 pass.
 
 
 
  On Mon, Jan 12, 2015 at 5:14 PM, Gregory Farnum g...@gregs42.com
 wrote:
  What versions of all the Ceph pieces are you using? (Kernel
  client/ceph-fuse, MDS, etc)
 
  Can you provide more details on exactly what the program is doing on
  which 

Re: [ceph-users] Appending to a rados object with feedback

2015-01-27 Thread Kim Vandry

Hi Greg,

Thanks for your feedback.

On 2015-01-27 15:38, Gregory Farnum wrote:

On Mon, Jan 26, 2015 at 6:47 PM, Kim Vandry van...@tzone.org wrote:

By the way, I have a question about the class. Following the example in
cls_hello.cc's method record_hello, our method calls cls_cxx_stat() and yet is
declared CLS_METHOD_WR, not CLS_METHOD_RD|CLS_METHOD_WR. Is stat-ing an
object not considered reading it? How come the method does not need the
CLS_METHOD_RD flag? I tried including that flag to see what would happen but
then my method was unable to create new objects, which we want to support
with the same meaning as appending to a 0-size object. It seems that in that
case Ceph asserts that the object exists before calling the method.


Mmmm, this actually might be an issue. Write ops don't always force an
object into a readable state before being processed, so you could read
out-of-date status in some cases. :/


I see. I'll change this, then.


I don't have the exact API calls to hand, but librados exposes
versions on op completion and you can assert the version when
submitting ops, too. Did you check that out?


That sounds like exactly what we should be using. I see 
rados_get_last_version() for reading the version, which I missed before. 
Unfortunately, I can't find how to assert the version when submitting an 
op. I'm looking at src/include/rados/librados.h in git. Maybe you or 
someone else can help me find it once you have the API docs at hand.


-kv
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Appending to a rados object with feedback

2015-01-27 Thread Kim Vandry

On 2015-01-27 17:06, Kim Vandry wrote:

Unfortunately, I can't find how to assert the version when submitting an
op. I'm looking at src/include/rados/librados.h in git. Maybe you or
someone else can help me find it once you have the API docs at hand.


Ah, never mind, I see that assert_version is found in the C++ version. I 
will see if I can change our client to use C++ instead of C. I'll worry 
about the Python client (which also can't call into C++) later.


-kv
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph as a primary storage for owncloud

2015-01-27 Thread Simone Spinelli
Dear all,

we would like to use Ceph as a primary (object) storage for ownCloud.
Did anyone already do this? I mean: is that actually possible, or am I
wrong?
As I understood it, I have to use radosGW in its Swift flavor, but what about
the S3 flavor?
I cannot find anything official, hence my question.
Do you have any advice, or can you point me to some kind of
documentation/how-to?

I know that maybe this is not the right place for these questions, but I
also asked ownCloud's community... in the meantime...

Every answer is appreciated!

Thanks

Simone

-- 
Simone Spinelli simone.spine...@unipi.it
Università di Pisa
Direzione ICT - Servizi di Rete
PGP KEY http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0xDBDA383DEA2F1F96


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph and btrfs - disable copy-on-write?

2015-01-27 Thread Christopher Armstrong
When starting an OSD in a Docker container (so the volume is btrfs), we see
the following output:

2015-01-24 16:48:30.511813 7f9f3d066900  0 ceph version 0.87
(c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-osd, pid 1
2015-01-24 16:48:30.522509 7f9f3d066900  0
filestore(/var/lib/ceph/osd/ceph-0) backend btrfs (magic 0x9123683e)
2015-01-24 16:48:30.535455 7f9f3d066900  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features:
FIEMAP ioctl is supported and appears to work
2015-01-24 16:48:30.535519 7f9f3d066900  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-01-24 16:48:30.628612 7f9f3d066900  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2015-01-24 16:48:30.628960 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature:
CLONE_RANGE ioctl is supported
2015-01-24 16:48:30.629211 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: failed
to create simple subvolume test_subvol: (17) File exists
2015-01-24 16:48:30.629509 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature:
SNAP_CREATE is supported
2015-01-24 16:48:30.630487 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature:
SNAP_DESTROY failed: (1) Operation not permitted
2015-01-24 16:48:30.630763 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: snaps
enabled, but no SNAP_DESTROY ioctl; DISABLING
2015-01-24 16:48:30.631744 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature:
START_SYNC is supported (transid 67)
2015-01-24 16:48:30.639763 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature:
WAIT_SYNC is supported
2015-01-24 16:48:30.639914 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature:
removing old async_snap_test
2015-01-24 16:48:30.640178 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: failed
to remove old async_snap_test: (1) Operation not permitted
2015-01-24 16:48:30.641138 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature:
SNAP_CREATE_V2 is supported
2015-01-24 16:48:30.641387 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature:
SNAP_DESTROY failed: (1) Operation not permitted
2015-01-24 16:48:30.641528 7f9f3d066900  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: failed
to remove test_subvol: (1) Operation not permitted
2015-01-24 16:48:30.651029 7f9f3d066900  0
filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2015-01-24 16:48:30.651282 7f9f3d066900 -1 journal FileJournal::_open:
disabling aio for non-block journal.  Use journal_force_aio to force
use of aio anyway
2015-01-24 16:48:30.652322 7f9f3d066900  1 journal _open
/var/lib/ceph/osd/ceph-0/journal fd 19: 5368709120 bytes, block size
4096 bytes, directio = 1, aio = 0
2015-01-24 16:48:30.652945 7f9f3d066900  1 journal _open
/var/lib/ceph/osd/ceph-0/journal fd 19: 5368709120 bytes, block size
4096 bytes, directio = 1, aio = 0
2015-01-24 16:48:30.654462 7f9f3d066900  1 journal close
/var/lib/ceph/osd/ceph-0/journal


We're considering disabling copy-on-write for the directory to improve
write performance. Are there any recommendations for or against this?
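For context, what we have in mind is the usual per-directory nodatacow flag, something like the sketch below (the +C attribute only affects files created after it is set, so it would have to go on an empty OSD directory):

  # disable copy-on-write for new files under the OSD data directory
  chattr +C /var/lib/ceph/osd/ceph-0
  lsattr -d /var/lib/ceph/osd/ceph-0    # should now show the 'C' attribute
  # alternatively the whole filesystem could be mounted with -o nodatacow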


Thanks!


Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consumer Grade SSD Clusters

2015-01-27 Thread Quenten Grasso
Hi Nick,

Agreed, I see your point: basically once you're past the 150 TBW or whatever 
that number may be, you're effectively just waiting for failure, but aren't we 
anyway?

I guess it depends on your use case at the end of the day. I wonder what the 
likes of Amazon, Rackspace etc. are doing in the way of SSDs; either they are 
buying them so cheap per GB due to the volume, or they are possibly using 
consumer grade SSDs.

Hmm.. using consumer grade SSDs may be an interesting option if you have 
decent monitoring and alerting; using SMART you should be able to still see how 
much spare flash you have available.
As suggested by Wido, using multiple brands would help remove the possible 
cascading failure effect, which I guess we all should be doing anyway on our 
spinners.
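For example, something as simple as this gives a rough view of remaining endurance (attribute names vary by vendor; these are just typical ones):

  smartctl -A /dev/sda | egrep -i 'Wear_Leveling_Count|Media_Wearout_Indicator|Total_LBAs_Written'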

I guess we have to decide whether it is worth the extra effort in the long run vs 
running enterprise SSDs.

Regards,
Quenten Grasso

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Saturday, 24 January 2015 7:33 PM
To: Quenten Grasso; ceph-users@lists.ceph.com
Subject: RE: Consumer Grade SSD Clusters

Hi Quenten,

There is no real answer to your question. It really depends on how busy your 
storage will be and particularly if it is mainly reads or writes.

I wouldn't pay too much attention to that SSD endurance test. Whilst it's great 
to know that they have a lot more headroom than their official specs, you run 
the risk of having a spectacular multiple disk failure if you intend to run 
them all that high. You can probably guarantee that as 1 SSD starts to fail, the 
increase in workload to re-balance the cluster will cause failures on the rest.

I guess it really comes down to how important the availability of your data is. 
Whilst an average PC user might balk at the price of paying 4 times more per GB 
for an S3700 SSD, in the enterprise world they are still comparatively cheap.

The other thing you need to be aware of is that most consumer SSDs don't have 
power loss protection; again, if you are mainly doing reads and cost is more 
important than availability, there may be an argument to use them.

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Quenten Grasso
Sent: 24 January 2015 09:13
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Consumer Grade SSD Clusters

Hi Everyone,

Just wondering if anyone has had any experience in using consumer grade SSDs 
for a Ceph cluster?

I came across this article 
http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte/3

They have been testing different SSDs' write endurance and they have been able 
to write 1PB+ to a Samsung 840 Pro 256GB which is only rated at 150TBW, 
and of course other SSDs have failed well before 1PBW, so definitely worth a 
read.

So I've been thinking about using consumer grade SSDs for OSDs and enterprise 
SSDs for journals.

The reasoning is that enterprise SSDs are a lot faster at journaling than consumer 
grade drives, plus this would effectively halve the overall write requirements on 
the consumer grade disks.

This could also be a cost effective alternative to using enterprise SSDs as 
OSDs; however, it seems that if you're happy to use 2x replication it's a pretty good 
cost saving, whereas with 3x replication not so much.

Cheers,
Quenten Grasso



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] chattr +i not working with cephfs

2015-01-27 Thread Eric Eastman
Should chattr +i  work with cephfs?

Using ceph v0.91 and a 3.18 kernel on the CephFS client, I tried this:

# mount | grep ceph
172.16.30.10:/ on /cephfs/test01 type ceph (name=cephfs,key=client.cephfs)
# echo 1 > /cephfs/test01/test.1
# ls -l /cephfs/test01/test.1
-rw-r--r-- 1 root root 2 Jan 27 19:09 /cephfs/test01/test.1
# chattr +i /cephfs/test01/test.1
chattr: Inappropriate ioctl for device while reading flags on
/cephfs/test01/test.1

I also tried it using the FUSE interface:

# ceph-fuse -m 172.16.30.10 /cephfs/fuse01/
ceph-fuse[5326]: starting ceph client
2015-01-27 19:54:59.002563 7f6f8fbcb7c0 -1 init, newargv = 0x2ec2be0
newargc=11
ceph-fuse[5326]: starting fuse
# mount | grep ceph
ceph-fuse on /cephfs/fuse01 type fuse.ceph-fuse
(rw,nosuid,nodev,allow_other,default_permissions)
# echo 1 > /cephfs/fuse01/test02.dat
# chattr +i /cephfs/fuse01/test02.dat
chattr: Invalid argument while reading flags on /cephfs/fuse01/test02.dat

Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH I/O Performance with OpenStack

2015-01-27 Thread Ramy Allam
Thanks Robert for your response. I'm considering giving SAS 600G 15K a try
before moving to SSD. It should give ~175 IOPS per disk.

Do you think the performance will be better if I go with the following
setup?
4x OSD nodes
2x SSD - RAID 1 for OS and Journal
10x 600G SAS 15K - NO Raid
Two Replication.

Regarding the IOPS calculation you did for the 4TB disks: please clarify, is the
1100 IOPS for one node, and is the cluster IOPS = $number_of_nodes x
$IOPS_per_node?

If this formula is correct, that being said, the cluster on the 4TB disks - my
current setup - should give 2200 IOPS in total, and the new SAS setup should
give 3500 IOPS?

Please correct me if I understand this wrong.

Thanks in advance,

On Tue, Jan 27, 2015 at 3:30 PM, Robert van Leeuwen 
robert.vanleeu...@spilgames.com wrote:

  I have two ceph nodes with the following specifications
  2x CEPH - OSD - 2 Replication factor
  Model : SuperMicro X8DT3
  CPU : Dual intel E5620
  RAM : 32G
  HDD : 2x 480GB SSD RAID-1 ( OS and Journal )
   22x 4TB SATA RAID-10 ( OSD )
 
  3x Controllers - CEPH Monitor
  Model : ProLiant DL180 G6
  CPU : Dual intel E5620
  RAM : 24G
 
 
  If it's a hardware issue please help finding out an answer for the
 following 5 questions.

 4 TB spinners do not give a lot of IOPS, about 100 random IOPS per disk.
 In total it would just be 1100 IOPS: 44 disk times 100 IOPS divide by 2
 for RAID and divide by 2 for replication factor.
 There might be a bit of caching on the RAID controller and SSD journal but
 worst case you will get just 1100 IOPS.

  I need around 20TB storage, SuperMicro SC846TQ can get 24 hardisk.
  I may attach 24x 960G SSD - NO Raid - with 3x SuperMicro servers -
 replication factor 3.
 
 Or it's better to scale-out and put smaller disks on many servers such (
 HP DL380pG8/2x Intel Xeon E5-2650 ) which can hold 12 hardisk
  And Attach 12x 960G SSD - NO Raid - 6x OSD nodes - replication factor 3.

 An OSD for a SSD can easily eat a whole CPU core so 24 SSDs would be to
 much.
 More smaller nodes also have the upside off smaller impact when a node
 breaks.
 You could also look at the Supermicro  2u twin chassis with 2 servers with
 12 disks in 2u.
 Note that you will not get near to theoretical native performance of those
 combined SSDs (10+ IOPS) but performance will be good none the less.
 There have been a few threads about that here before so look back in the
 mail threads to find out more.

  2. I'm using Mirantis/Fuel 5 for provisioning and deployment of nodes
  When i attach the new ceph osd nodes to the environment, Will the data
 be replicated automatically
  from my current old SuperMicro OSD nodes to the new servers after the
 deployment complete ?
 Don't know the specifics of Fuel and how it manages the crush map.
 Some of the data will end up there but not a copy of all data unless you
 specify the new servers as a new failure domain in the crush map.

  3. I will use 2x 960G SSD RAID 1 for OS
  Is it recommended put the SSD journal disk as a separate partition on
 the same disk of OS ?
 If you run with SSDs only I would put the journals together with the data
 SSDs.
 It makes a lot of sense to have them on seperate SSDs when your data disks
 are spinners.
 (because of the speed difference and bad random IOPS performance of
 spinners.)

  4. Is it safe to remove the OLD ceph nodes while i'm currently using 2
 replication factors after adding the new hardware nodes ?
 It is probably not safe to just turn them off (as mentioned above it
 depend on the crush map failure domain layout)
 The safe way would be to follow the documentation on how to remove an OSD:
 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
 This will make sure the data is re-located before the OSD is removed.

  5. Do i need RAID 1 for the journal hardisk ? and if not, What will
 happen if one of the journal HDD's failed ?
 No, it is not required. Both have trade-offs.
 Disks that are behind the journal will become unavailable when it
 happens.
 RAID1 will be a bit easier to replace in case of a single SSD failure but
 is useless if the 2 SSDs fail at the same time (e.g. due to wear).
 JBOD will reduce the write load and wear plus it has less impact when it
 does fail.

  6. Should i use RAID Level for the drivers on OSD nodes ? or it's better
 to go without RAID ?
 Without RAID usually makes for better performance. Benchmark your specific
 workload to be sure.
 In general I would go for 3 replica's and no RAID.

 Cheers,
 Robert van Leeuwen


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to do maintenance without falling out of service?

2015-01-27 Thread J David
On Wed, Jan 21, 2015 at 5:53 PM, Gregory Farnum g...@gregs42.com wrote:
 Depending on how you configured things it's possible that the min_size
 is also set to 2, which would be bad for your purposes (it should be
 at 1).

This was exactly the problem.  Setting min_size=1 (which I believe
used to be the default, looks like it changed almost exactly when we
set this cluster up) got things back on track for us.

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH I/O Performance with OpenStack

2015-01-27 Thread Robert van Leeuwen
 Thanks Robert for your response. I'm considering giving SAS 600G 15K a try 
 before moving to SSD. It should give ~175 IOPS per disk.

 Do you think the performance will be better if i goes with the following 
 setup ?
 4x OSD nodes
 2x SSD - RAID 1 for OS and Journal
 10x 600G SAS 15K - NO Raid
 Two Replication.

 According to the IOPS calculation you did for the 4TB. Please clarify is 1100 
 IOPS will be for the one node and the cluster IOPS =$number_of_nodes x 
 $IOPS_per_node ?
 If this formula is correct, That's being said the cluster on the 4TB - my 
 current setup should give in total 2200 IOPS and the new SAS setup should 
 give 3500 IOPS ?

 Please correct me if i understand this wrong.

No, the current setup is 1100 IOPS in total!
You have 44 disks each doing 100 IOPS = 4400 IOPS
You have RAID10 which effectively halves the write speed = 2200 IOPS
You have a replication factor of 2 in Ceph which halves it again = 1100 IOPS

I would not be a fan of a replication factor of 2 with NO RAID. The chance that 2 
disks in the cluster fail at the same time is significant, and you will lose 
data.
Replication of 3 would be the absolute minimum.

For the suggested setup that would be:
40 * 175 = 7000
Replication factor of 3, divide by three = ~2300 IOPS

So you effectively double the amount of writes you can do.
Note that this is the total cluster performance.
You will not get this from a single instance, since the data would need to 
be written spread exactly across the cluster.
In my experience it is good enough for some low-write instances, but not for 
write-intensive applications like MySQL.


Cheers,
Robert van Leeuwen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help:mount error

2015-01-27 Thread 于泓海

Hi Wang:


I have created the pool and fs before.




--
 
At 2015-01-28 14:54:33, 王亚洲 breb...@163.com wrote:

hi:
Did you run the "ceph fs new" command? I encountered the same issue without running 
"ceph fs new". 






At 2015-01-28 14:48:09, 于泓海 foxconn-...@163.com wrote:

Hi:


I have completed the installation of ceph cluster,and the ceph health is ok:


cluster 15ee68b9-eb3c-4a49-8a99-e5de64449910
 health HEALTH_OK
 monmap e1: 1 mons at {ceph01=10.194.203.251:6789/0}, election epoch 1, 
quorum 0 ceph01
 mdsmap e2: 0/0/1 up
 osdmap e16: 2 osds: 2 up, 2 in
  pgmap v729: 92 pgs, 4 pools, 136 MB data, 46 objects
23632 MB used, 31172 MB / 54805 MB avail
  92 active+clean


But when i mount from client,the error is: mount error 5 = Input/output error.
I have tried lots of ways,for ex:disable selinux,update kernel... 
Could anyone help me to resolve it? Thanks!




Jason



--










___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help:mount error

2015-01-27 Thread 于泓海
Hi:


I have completed the installation of the ceph cluster, and the ceph health is OK:


cluster 15ee68b9-eb3c-4a49-8a99-e5de64449910
 health HEALTH_OK
 monmap e1: 1 mons at {ceph01=10.194.203.251:6789/0}, election epoch 1, 
quorum 0 ceph01
 mdsmap e2: 0/0/1 up
 osdmap e16: 2 osds: 2 up, 2 in
  pgmap v729: 92 pgs, 4 pools, 136 MB data, 46 objects
23632 MB used, 31172 MB / 54805 MB avail
  92 active+clean


But when I mount from the client, the error is: mount error 5 = Input/output error.
I have tried lots of things, for example: disabling selinux, updating the kernel... 
Could anyone help me to resolve it? Thanks!




Jason



--




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help:mount error

2015-01-27 Thread 王亚洲
hi:
Did you run the "ceph fs new" command? I encountered the same issue without running 
"ceph fs new". 
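For reference, the minimal sequence before a CephFS mount will work is roughly the following (pool names are examples, and an MDS daemon must also be running):

  ceph osd pool create cephfs_data 128
  ceph osd pool create cephfs_metadata 128
  ceph fs new cephfs cephfs_metadata cephfs_data
  ceph mds stat          # should report an MDS as up:active before mounting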






At 2015-01-28 14:48:09, 于泓海 foxconn-...@163.com wrote:

Hi:


I have completed the installation of ceph cluster,and the ceph health is ok:


cluster 15ee68b9-eb3c-4a49-8a99-e5de64449910
 health HEALTH_OK
 monmap e1: 1 mons at {ceph01=10.194.203.251:6789/0}, election epoch 1, 
quorum 0 ceph01
 mdsmap e2: 0/0/1 up
 osdmap e16: 2 osds: 2 up, 2 in
  pgmap v729: 92 pgs, 4 pools, 136 MB data, 46 objects
23632 MB used, 31172 MB / 54805 MB avail
  92 active+clean


But when i mount from client,the error is: mount error 5 = Input/output error.
I have tried lots of ways,for ex:disable selinux,update kernel... 
Could anyone help me to resolve it? Thanks!




Jason



--







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD over cache tier over EC pool: rbd rm doesn't remove objects

2015-01-27 Thread Sage Weil
On Tue, 27 Jan 2015, Irek Fasikhov wrote:
 Hi,All.
 Indeed, there is a problem. After removing 1 TB of data, the space on the cluster is not
 reclaimed. Is this expected behaviour or a bug? And how long will it take to be
 cleaned up?

Your subject says cache tier but I don't see it in the 'ceph df' output 
below.  The cache tiers will store 'whiteout' objects that cache object 
non-existence, which could be delaying some deletions.  You can wrangle the 
cluster into flushing those with

 ceph osd pool set cachepool cache_target_dirty_ratio .05

(though you'll probably want to change it back to the default .4 later).
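In practice that is just (a sketch, with 'cachepool' standing in for the real cache pool name):

  ceph osd pool set cachepool cache_target_dirty_ratio .05
  # watch the flush drain, e.g. with:  watch 'ceph df'
  ceph osd pool set cachepool cache_target_dirty_ratio .4   # restore the default afterwards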

If there's no cache tier involved, there may be another problem.  What 
version is this?  Firefly?

sage

 
 Sat Sep 20 2014 at 8:19:24 AM, Mikaël Cluseau mclus...@isi.nc:
   Hi all,
 
   I have weird behaviour on my firefly test + convenience
   storage cluster. It consists of 2 nodes with a light imbalance
   in available space:
 
   # id    weight    type name    up/down    reweight
   -1    14.58    root default
   -2    8.19        host store-1
   1    2.73            osd.1    up    1   
   0    2.73            osd.0    up    1   
   5    2.73            osd.5    up    1   
   -3    6.39        host store-2
   2    2.73            osd.2    up    1   
   3    2.73            osd.3    up    1   
   4    0.93            osd.4    up    1   
 
   I used to store ~8TB of rbd volumes, coming to a near-full
    state. There were some annoying stuck misplaced PGs so I began
    to remove 4.5TB of data; the weird thing is: the space hasn't
    been reclaimed on the OSDs, they stayed stuck around 84% usage.
    I tried to move PGs around and it happens that the space is
    correctly reclaimed if I take an OSD out, let it empty its XFS
    volume and then take it back in again.
 
    I'm currently applying this to each OSD in turn, but I thought it
    could be worth telling about this. The current ceph df output
   is:
 
   GLOBAL:
       SIZE   AVAIL RAW USED %RAW USED
       12103G 5311G 6792G    56.12
   POOLS:
       NAME ID USED   %USED OBJECTS
       data 0  0  0 0  
       metadata 1  0  0 0  
       rbd  2  444G   3.67  117333 
   [...]
       archives-ec  14 3628G  29.98 928902 
       archives 15 37518M 0.30  273167
 
   Before just moving data, AVAIL was around 3TB.
 
   I finished the process with the OSDs on store-1, who show the
   following space usage now:
 
   /dev/sdb1 2.8T  1.4T  1.4T  50%
   /var/lib/ceph/osd/ceph-0
   /dev/sdc1 2.8T  1.3T  1.5T  46%
   /var/lib/ceph/osd/ceph-1
   /dev/sdd1 2.8T  1.3T  1.5T  48%
   /var/lib/ceph/osd/ceph-5
 
   I'm currently fixing OSD 2, 3 will be the last one to be fixed.
   The df on store-2 shows the following:
 
   /dev/sdb1   2.8T  1.9T  855G  70%
   /var/lib/ceph/osd/ceph-2
   /dev/sdc1   2.8T  2.4T  417G  86%
   /var/lib/ceph/osd/ceph-3
   /dev/sdd1   932G  481G  451G  52%
   /var/lib/ceph/osd/ceph-4
 
   OSD 2 was at 84% 3h ago, and OSD 3 was ~75%.
 
    During rbd rm (which took a bit more than 3 days), the ceph log was
    showing things like this:
 
   2014-09-03 16:17:38.831640 mon.0 192.168.1.71:6789/0 417194 :
   [INF] pgmap v14953987: 3196 pgs: 2882 active+clean, 314
   active+remapped; 7647 GB data, 11067 GB used, 3828 GB / 14896 GB
   avail; 0 B/s rd, 6778 kB/s wr, 18 op/s; -5/5757286 objects
   degraded (-0.000%)
   [...]
   2014-09-05 03:09:59.895507 mon.0 192.168.1.71:6789/0 513976 :
   [INF] pgmap v15050766: 3196 pgs: 2882 active+clean, 314
   active+remapped; 6010 GB data, 11156 GB used, 3740 GB / 14896 GB
   avail; 0 B/s rd, 0 B/s wr, 8 op/s; -388631/5247320 objects
   degraded (-7.406%)
   [...]
   2014-09-06 03:56:50.008109 mon.0 192.168.1.71:6789/0 580816 :
   [INF] pgmap v15117604: 3196 pgs: 2882 active+clean, 314
   active+remapped; 4865 GB data, 11207 GB used, 3689 GB / 14896 GB
   avail; 0 B/s rd, 6117 kB/s wr, 22 op/s; -706519/3699415 objects
   degraded (-19.098%)
   2014-09-06 03:56:44.476903 osd.0 192.168.1.71:6805/11793 729 :
   [WRN] 1 slow requests, 1 included below; oldest blocked for 
   30.058434 secs
   2014-09-06 03:56:44.476909 osd.0 192.168.1.71:6805/11793 730 :
   [WRN] slow request 30.058434 seconds old, received at 2014-09-06
   03:56:14.418429: osd_op(client.19843278.0:46081
   rb.0.c7fd7f.238e1f29.b3fa [delete] 15.b8fb7551
   ack+ondisk+write e38950) v4 currently waiting for blocked object
   2014-09-06 03:56:49.477785 osd.0 192.168.1.71:6805/11793 

Re: [ceph-users] slow read-performance inside the vm

2015-01-27 Thread Patrik Plank
Hello again,



Thanks all for the very helpful advice.



Now I have reinstalled my ceph cluster. 

Three nodes with ceph version 0.80.7 and an OSD for every single disk. The 
journals are stored on an SSD.

 
My ceph.conf

  

[global]
fsid = bceade34-3c54-4a35-a759-7af631a19df7
mon_initial_members = ceph01
mon_host = 10.0.0.20,10.0.0.21,10.0.0.22
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.0.0.0/24
cluster_network = 10.0.1.0/24
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 4096
osd_pool_default_pgp_num = 4096
filestore_max_sync_interval = 30

 


ceph osd tree

 


 -1 6.76 root default
 -2 2.44   host ceph01
  0 0.55     osd.0  up 1
  3 0.27     osd.3  up 1
  4 0.27     osd.4  up 1
  5 0.27     osd.5  up 1
  6 0.27     osd.6  up 1
  7 0.27     osd.7  up 1
  1 0.27     osd.1  up 1
  2 0.27     osd.2  up 1
 -3 2.16   host ceph02
  9 0.27     osd.9  up 1
 11 0.27     osd.11 up 1
 12 0.27     osd.12 up 1
 13 0.27     osd.13 up 1
 14 0.27     osd.14 up 1
 15 0.27     osd.15 up 1
  8 0.27     osd.8  up 1
 10 0.27     osd.10 up 1
 -4 2.16   host ceph03
 17 0.27     osd.17 up 1
 18 0.27     osd.18 up 1
 19 0.27     osd.19 up 1
 20 0.27     osd.20 up 1
 21 0.27     osd.21 up 1
 22 0.27     osd.22 up 1
 23 0.27     osd.23 up 1
 16 0.27     osd.16 up 1

 


rados bench -p kvm 50 write --no-cleanup 

 
Total time run:         50.494855
Total writes made:      1180
Write size:             4194304
Bandwidth (MB/sec):     93.475
Stddev Bandwidth:       16.3955
Max bandwidth (MB/sec): 112
Min bandwidth (MB/sec): 0
Average Latency:        0.684571
Stddev Latency:         0.216088
Max latency:            1.86831
Min latency:            0.234673



rados bench -p kvm 50 seq 



Total time run:       15.009855
Total reads made:     1180
Read size:            4194304
Bandwidth (MB/sec):   314.460
Average Latency:      0.20296
Max latency:          1.06341
Min latency:          0.02983

 


I am really happy; these values above are enough for my small number of VMs. 
Inside the VMs I now get 80 MB/s write and 130 MB/s read, with write-cache 
enabled.

But there is one little problem. 

Are there some tuning parameters for small files?

For 4kB to 50kB files the cluster is very slow. 



thank you

best regards



 
-Original message-
From: Lindsay Mathieson lindsay.mathie...@gmail.com
Sent: Friday 9th January 2015 0:59
To: ceph-users@lists.ceph.com
Cc: Patrik Plank pat...@plank.me
Subject: Re: [ceph-users] slow read-performance inside the vm


On Thu, 8 Jan 2015 05:36:43 PM Patrik Plank wrote:

Hi Patrick, just a beginner myself, but have been through a similar process 
recently :)

 With these values above, I get a write performance of 90Mb/s and read
 performance of 29Mb/s, inside the VM. (Windows 2008/R2 with virtio driver
 and writeback-cache enabled) Are these values normal with my configuration
 and hardware? - 

They do seem *very* odd. Your write performance is pretty good, your read 
performance is abysmal - with a similar setup, with 3 OSD's slower than yours 
I was getting 200 MB/s reads. 

Maybe your network setup is dodgy? Jumbo frames can be tricky. Have you run 
iperf between the nodes?

What are you using for benchmark testing on the windows guest?

Also, probably more useful to turn writeback caching off for benchmarking, the 
cache will totally obscure the real performance.

How is the VM mounted? rbd driver?

 The read-performance seems slow. Would the
 read-performance be better if I ran an OSD for every single disk?

I think so - in general, the more OSDs the better. Also, having 8 HDs in RAID0 
is a recipe for disaster: you'll lose the entire OSD if one of those disks 
fails.

I'd be creating an OSD for each HD (8 per node), with a 5-10GB SSD partition 
per OSD for journal. Tedious, but should make a big difference to reads and 
writes.
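A sketch of what that looks like per disk, assuming ceph-deploy (host, disk and partition names are examples; ceph-disk prepare is the other common route):

  # one OSD per spinner, journal on a dedicated partition of the SSD
  ceph-deploy osd create node1:sdb:/dev/sda1
  ceph-deploy osd create node1:sdc:/dev/sda2
  # ...and so on for each of the 8 data disks, one SSD partition per journal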

Might be worth while trying
[global]
  filestore max sync interval = 30

as well.

-- 
Lindsay



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CEPH I/O Performance with OpenStack

2015-01-27 Thread Ramy Allam
Hello,

I'm using ceph-0.80.7 with Mirantis OpenStack IceHouse - RBD for nova
ephemeral disk and glance.

I have two ceph nodes with the following specifications
2x CEPH - OSD - 2 Replication factor
Model : SuperMicro X8DT3
CPU : Dual intel E5620
RAM : 32G
HDD : 2x 480GB SSD RAID-1 ( OS and Journal )
  22x 4TB SATA RAID-10 ( OSD )

3x Controllers - CEPH Monitor
Model : ProLiant DL180 G6
CPU : Dual intel E5620
RAM : 24G

*Network
Public : 1G NIC ( eth0 ) - Juniper 2200-48
Storage,Admin,Management - 10G NIC ( eth1 ) - Arista 7050T-36 (32x 10GE
UTP, 4x 10GE SFP+)

*I'm getting very poor Ceph performance and high I/O load with writes/reads,
and when a light or deep scrub is running the load on the VMs goes crazy.
ceph.conf tuning didn't help.

[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = xx.xx.xx.xx xx.xx.xx.xx xx.xx.xx.xx
mon_initial_members = node-xx node-xx node-xx
fsid =
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 50
public_network = xx.xx.xx.xx
osd_journal_size = 10
auth_supported = cephx
osd_pool_default_pgp_num = 50
osd_pool_default_flag_hashpspool = true
osd_mkfs_type = xfs
cluster_network = xx.xx.xx.xx
mon_clock_drift_allowed = 2

[osd]

osd_op_threads=16
osd_disk_threads=4
osd_disk_thread_ioprio_priority=7
osd_disk_thread_ioprio_class=idle

filestore op threads=8
filestore_queue_max_ops=10
filestore_queue_committing_max_ops=10
filestore_queue_max_bytes=1073741824
filestore_queue_committing_max_bytes=1073741824
filestore_max_sync_interval=10
filestore_fd_cache_size=20240
filestore_flusher=false
filestore_flush_min=0
filestore_sync_flush=true

journal_dio=true
journal_aio=true
journal_max_write_bytes=1073741824
journal_max_write_entries=5
journal_queue_max_bytes=1073741824
journal_queue_max_ops=10

ms_dispatch_throttle_bytes=1073741824
objecter_infilght_op_bytes=1073741824
objecter_inflight_ops=1638400

osd_recovery_threads = 16
#osd_recovery_max_active = 2
#osd_recovery_max_chunk = 8388608
#osd_recovery_op_priority = 2
#osd_max_backfills = 1

[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_cache_size = 20 GiB
rbd_cache_max_dirty = 16 GiB
rbd_cache_target_dirty = 512 MiB


*Results inside CentOS6 64bit VM :
[root@vm ~]# dd if=/dev/zero of=./largefile bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 17.3417 s, 61.9 MB/s

[root@vm ~]# rm -rf /tmp/test  spew -i 50 -v -d --write -r -b 4096 10M
/tmp/test
Iteration:1Total runtime: 00:00:00
WTR:27753.91 KiB/s   Transfer time: 00:00:00IOPS: 6938.48

Iteration:2Total runtime: 00:00:00
WTR:29649.53 KiB/s   Transfer time: 00:00:00IOPS: 7412.38

Iteration:3Total runtime: 00:00:01
WTR:30897.44 KiB/s   Transfer time: 00:00:00IOPS: 7724.36

Iteration:4Total runtime: 00:00:02
WTR: 7474.93 KiB/s   Transfer time: 00:00:01IOPS: 1868.73

Iteration:5Total runtime: 00:00:02
WTR:24810.11 KiB/s   Transfer time: 00:00:00IOPS: 6202.53

Iteration:6Total runtime: 00:00:03
WTR:28534.01 KiB/s   Transfer time: 00:00:00IOPS: 7133.50

Iteration:7Total runtime: 00:00:03
WTR:27687.95 KiB/s   Transfer time: 00:00:00IOPS: 6921.99

Iteration:8Total runtime: 00:00:03
WTR:29195.91 KiB/s   Transfer time: 00:00:00IOPS: 7298.98

Iteration:9Total runtime: 00:00:04
WTR:28315.53 KiB/s   Transfer time: 00:00:00IOPS: 7078.88

Iteration:   10Total runtime: 00:00:04
WTR:27971.42 KiB/s   Transfer time: 00:00:00IOPS: 6992.85

Iteration:   11Total runtime: 00:00:04
WTR:29873.39 KiB/s   Transfer time: 00:00:00IOPS: 7468.35

Iteration:   12Total runtime: 00:00:05
WTR:32364.30 KiB/s   Transfer time: 00:00:00IOPS: 8091.08

Iteration:   13Total runtime: 00:00:05
WTR:32619.98 KiB/s   Transfer time: 00:00:00IOPS: 8155.00

Iteration:   14Total runtime: 00:00:06
WTR:18714.54 KiB/s   Transfer time: 00:00:00IOPS: 4678.64

Iteration:   15Total runtime: 00:00:06
WTR:17070.37 KiB/s   Transfer time: 00:00:00IOPS: 4267.59

Iteration:   16Total runtime: 00:00:07
WTR:22403.23 KiB/s   Transfer time: 00:00:00IOPS: 5600.81

Iteration:   17Total runtime: 00:00:07
WTR:16076.39 KiB/s   Transfer time: 00:00:00IOPS: 4019.10

Iteration:   18Total runtime: 00:00:08
WTR:26219.77 KiB/s   Transfer time: 00:00:00IOPS: 6554.94

Iteration:   19Total runtime: 00:00:08
WTR:29054.01 KiB/s   Transfer time: 00:00:00IOPS: 7263.50

Iteration:   20Total runtime: 00:00:08
WTR:27210.02 KiB/s   Transfer time: 00:00:00IOPS: 6802.50

Iteration:   21Total runtime: 00:00:09
WTR:28502.72 KiB/s   Transfer 

Re: [ceph-users] Ceph File System Question

2015-01-27 Thread Aaron Ten Clay
On Tue, Jan 27, 2015 at 6:13 AM, John Spray john.sp...@redhat.com wrote:

 Raj,

 The note is still valid, but the filesystem is getting more stable all the
 time.  Some people are using it, especially in an active/passive
 configuration with a single active MDS.  If you do choose to do some
 testing, use the most recent stable release of Ceph and the most recent
 linux kernel you can.

 Thanks,
 John

 For what it's worth, I've had much more stability with the FUSE client
than the kernel client. I know there have been lots of bugfixes in the
kernel client recently, though, so I'd be interested in hearing how that
works for you if you try it.

I run a single MDS, and haven't had any issues at all since switching to
the FUSE client :)

-Aaron
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read-performance inside the vm

2015-01-27 Thread Udo Lembke
Hi Patrik,

Am 27.01.2015 14:06, schrieb Patrik Plank:
 

 ...
 I am really happy, these values above are enough for my little amount of
 vms. Inside the vms I get now for write 80mb/s and read 130mb/s, with
 write-cache enabled.
 
 But there is one little problem.
 
 Are there some tuning parameters for small files?
 
 For 4kb to 50kb files the cluster is very slow.
 

Do you use a higher read-ahead inside the VM?
Like: echo 4096 > /sys/block/vda/queue/read_ahead_kb

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] verifying tiered pool functioning

2015-01-27 Thread Chad William Seys
Hi Zhang,
  Thanks for the pointer.  That page looks like the commands to set up the 
cache, not how to verify that it is working.
  I think I have been able to see objects (not PGs I guess) moving from the 
cache pool to the storage pool using 'rados df' .  (I haven't run long enough 
to verify yet.)

Thanks again!
Chad.



On Tuesday, January 27, 2015 03:47:53 you wrote:
 Do you mean cache tiering?
 You can refer to http://ceph.com/docs/master/rados/operations/cache-tiering/
  for the detailed command lines. PGs won't migrate from pool to pool.
 
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Chad William Seys Sent: Thursday, January 22, 2015 5:40 AM
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] verifying tiered pool functioning
 
 Hello,
    Could anyone provide a howto to verify that a tiered pool is working
  correctly? E.g.
    a command to watch as PGs migrate from one pool to another?  (Or determine
  which pool a PG is currently in.) A command to see how much data is in each
  pool (a global view of the number of PGs, I guess)?
 
 Thanks!
 Chad.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cache pool and storage pool: possible to remove storage pool?

2015-01-27 Thread Chad William Seys
Hi all,
   Documentation explains how to remove the cache pool:
http://ceph.com/docs/master/rados/operations/cache-tiering/
   Does anyone know how to remove the storage pool instead?  (E.g. the storage pool 
has the wrong parameters.)
   I was hoping to push all the objects into the cache pool and then replace 
the storage pool.

Thanks!
Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 85% of the cluster won't start, or how I learned why to use disk UUIDs

2015-01-27 Thread Steve Anthony
Story time. Over the past year or so, our datacenter had been undergoing
the first of a series or renovations designed to add more power and
cooling capacity. As part of these renovations, changes to the emergency
power off system (EPO) necessitated that this system be tested. If
you're unfamiliar, the EPO system is tied into the fire system, and
presented as a angry caged red button next to each exit design to
*immediately* cut all power and backup power to the datacenter. The idea
being that if there's a fire, or someone's being electrocuted, or some
other life-threatening electrical shenanigans occur, power can be
completely cut in one swift action. As this system hadn't been tested in
about 10 years and a whole bunch of changes had been made due to the
renovations, the powers that be scheduled downtime for all services one
Saturday at which time we would test the EPO and cut all power to the room.

On the appointed day, I shut down each of the 21 nodes and the 3
monitors in our cluster. A couple hours later, after testing and some
associated work had been completed, I powered the monitors back up and
began turning on the nodes holding the spinning OSDs and associated SSD
journals. After pushing the power buttons, I sat down at the console and
noticed something odd; only about 15% of the OSDs in the cluster had
come back online. Checking the logs, I noticed that the OSDs which had
failed to start were complaining about not being able to find their
associated journal partitions.

Fortunately, two things were true at this point. First and most
importantly, I had split off 8 nodes which had not been added to the
cluster yet and set up a second separate cluster in another site to
which I had exported/imported the critical images (and diffs) from the
primary cluster over the past few weeks. Second, I had happened to
restart a node a month or so prior which had presented the same
symptoms, so I knew why this had happened. When I first provisioned the
cluster I added the journals using the /dev/sd[a-z]+ identifier. On the
first four nodes, which I had provisioned manually, this was fine. On
subsequent nodes, I had used FAI Linux, Saltstack, and a Python script I
wrote to automatically provision the OS, configuration, and add the OSDs
and journals as they were inserted into the nodes. After a reboot on
these nodes, the devices were reordered, and the OSDs subsequently
couldn't find the journals. I had written a script to trickle
remove/re-insert OSDs one by one with journals using /dev/disk/by-id
(which is a persistent identifier), but hadn't yet run it on the
production cluster.

After some thought, I came up with a potential (if somewhat unpleasant)
solution which would let me get the production cluster back into a
running state quickly, without having to blow away the whole thing,
re-provision, and restore the backups. I theorized that if I shutdown a
node, removed all the hot-swap disks (the OSDs and journals), booted the
node, and then added the journals in the same order as I had when the
node was first provisioned, the OS should give them the same
/dev/sd[a-z}+ identifiers they had had pre-EPO. A quick test determined
I was correct, and could restore the cluster to working order by
applying the same operation to each node. Luckily, I had (mostly) added
drives to each node in the same order and where I hadn't at least one
journal was placed in the correct order and that allowed me to determine
the correct order for the other two, ie. if journal 2 as ok, but 1 and 3
weren't when I had added them in order 1,2,3, I knew the correct order
was 3,2,1. After pulling and re-inserting 336 disks, I had a working
cluster once again, except for one node where one journal had originally
been /dev/sda, which was now half of the OS software RAID mirror.
Breaking that, toggling the /sys/block/sdX/device/delete flag on that
disk, rescanning the bus, re-adding it to the RAID set when it came back
as /dev/sds, and symlinking /dev/sda to the appropriate SSD fixed that
last node.

Needless to say, I started pulling that node, and subsequently the other
nodes out of the cluster and re-adding them with /dev/disk/by-id journals
to prevent this from happening again. So a couple of lessons here. First,
remember when adding OSDs with SSD journals to use a device UUID, not
/dev/sd[a-z]+, so you don't end up needing to spend three hours manually
touching each disk in your cluster and even longer slowly shifting a
couple hundred terabytes around while you fix the root cause. Second,
establish standards early and stick to them. As much of a headache as
pulling all the disks and re-inserting them was, it would have been much
worse if they weren't originally inserted in the same order on (almost)
all the nodes. Finally, backups are important. Having that safety net
helped me focus on the solution, rather than the problem since I knew
that if none of my ideas worked, I'd be able to get the most critical
data back.
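For anyone needing to do the same, the per-OSD journal move itself is the standard flush/re-create dance, roughly (the OSD id and by-id path are examples):

  ceph osd set noout                      # keep CRUSH from rebalancing while the OSD is down
  service ceph stop osd.12                # stop the daemon (sysvinit/upstart era syntax)
  ceph-osd -i 12 --flush-journal          # flush the old journal to the data disk
  ln -sf /dev/disk/by-id/ata-EXAMPLE_SSD-part1 /var/lib/ceph/osd/ceph-12/journal
  ceph-osd -i 12 --mkjournal              # create the journal at the new location
  service ceph start osd.12
  ceph osd unset noout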

Hopefully this saves 

Re: [ceph-users] ceph as a primary storage for owncloud

2015-01-27 Thread Steve Anthony
I tried this a while back. In my setup, I exposed a block device with
rbd on the owncloud host and tried sharing an image to the owncloud host
via NFS. If I recall correctly, both worked fine (I didn't try S3). The
problem I had at the time (maybe 6-12 months ago) was that owncloud
didn't support enough automated management of LDAP group permissions for
me to easily deploy and manage it for 1000+ users. It is on my list of
things to revisit however, so I'd be curious to hear how things go for
you. If it doesn't work out, I'd also recommend checking out Pydio. It
didn't make it into production in my environment (I didn't have time to
focus on it), but I liked its user management better than owncloud's at
the time.

-Steve

On 01/27/2015 05:05 AM, Simone Spinelli wrote:
 Dear all,

 we would like to use ceph as a primary (object) store for owncloud.
 Has anyone already done this? I mean: is that actually possible, or am I
 wrong?
 As I understand it, I have to use radosGW in its swift flavor, but what
 about the s3 flavor?
 I cannot find anything official, hence my question.
 Do you have any advice, or can you point me to some kind of
 documentation/how-to?

 I know that maybe this is not the right place for these questions, but I
 have also asked owncloud's community... in the meantime...

 Every answer is appreciated!

 Thanks

 Simone


-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu




Re: [ceph-users] Ceph, LIO, VMWARE anyone?

2015-01-27 Thread Mike Christie
Oh yeah, I am not completely sure (have not tested myself), but if you
were doing a setup where you were not using a clustering app that uses
PRs (like Windows/Red Hat clustering), did not use VMFS, and were instead
accessing the disks exported by LIO/TGT directly in the VM (either using
the guest's iSCSI client or as a raw ESX device), and were not using ESX
clustering, then you might be safe doing active/passive or active/active
with no modifications needed other than some scripts to distribute the
setup info across LIO/TGT nodes.

Were any of you trying this type of setup when you were describing your
results? If so, were you running Oracle or something like that? Just
wondering.
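
(As a rough illustration of the kind of script meant here: for LIO managed
via targetcli, something like this could be enough. Untested sketch, with a
made-up peer hostname.)

# dump the local target config and replay it on the other gateway
targetcli saveconfig
scp /etc/target/saveconfig.json peer-node:/etc/target/saveconfig.json
ssh peer-node targetcli restoreconfig /etc/target/saveconfig.json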


On 01/27/2015 08:58 PM, Mike Christie wrote:
 I do not know about perf, but here is some info on what is safe and
 general info.
 
 - If you are not using VAAI then it will use older style RESERVE/RELEASE
 commands only.
 
 If you are using VAAI ATS, and doing active/active then you need
 something, like the lock/sync talked about in the slides/hammer doc,
 that would coordinate multiple ATS/COMPARE_AND_WRITEs from executing at
 the same time on the same sectors. You probably do not ever see problems
 today, because it seems ESX normally does this command for only one
 sector and I do not think there are multiple commands for the same
 sectors in flight normally.
 
 For active/passive, ATS is simple since you only have the one LIO/TGT
 node executing commands at a time, so the locking is done locally using
 a normal old mutex.
 
 - tgt and LIO both support SCSI-3 persistent reservations. This is not
 really needed for ESX vmfs though since it uses ATS or older
 RESERVE/RELEASE. If you were using a cluster app like windows
 clustering, red hat cluster, etc in ESX or in normal non vm use, then
 you need something extra to support SCSI-3 PRs in both active/active or
 active/passive.
 
 For AA, you need something like described in that doc/video.
 
 For AP, you would need to copy over the PR state from one node to the
 other when failing over/back across nodes. For LIO this is in /var/target.
 
 Depending on how you do AP (what ALUA states you use if you do ALUA),
 you might also need to always distribute the PR info if you are doing
 windows clustering. Windows wants to see a consistent view of the PR
 info from all ports if you do something like ALUA active-optimized and
 standby states for active/passive.
 
 - I do not completely understand the comment about using LIO as a
 backend for tgt. You would either use tgt or LIO to export a rbd device.
 Not both at the same time like using LIO for some sort of tgt backend.
 Maybe people meant using the RBD backend instead of LIO backend
 
 - There are some other setup complications that you can see here
 http://comments.gmane.org/gmane.linux.scsi.target.devel/7044
 if you are using ALUA. I think tgt does not support ALUA, but LIO does.
 
 
 On 01/23/2015 04:25 PM, Zoltan Arnold Nagy wrote:
 Correct me if I'm wrong, but tgt doesn't have full SCSI-3 persistence
 support when _not_ using the LIO
 backend for it, right?

 AFAIK you can either run tgt with its own iSCSI implementation or you
 can use tgt to manage your LIO targets.

 I assume when you're running tgt with the rbd backend code you're
 skipping all the in-kernel LIO parts (in which case
 the RedHat patches won't help a bit), and you won't have proper
 active-active support, since the initiators
 have no way to synchronize state (and more importantly, no way to 
 synchronize write caching! [I can think
 of some really ugly hacks to get around that, tho...]).

 On 01/23/2015 05:46 PM, Jake Young wrote:
 Thanks for the feedback Nick and Zoltan,

 I have been seeing periodic kernel panics when I used LIO.  It was
 either due to LIO or the kernel rbd mapping.  I have seen this on
 Ubuntu precise with kernel 3.14.14 and again in Ubuntu trusty with the
 utopic kernel (currently 3.16.0-28).  Ironically, this is the primary
 reason I started exploring a redundancy solution for my iSCSI proxy
 node.  So, yes, these crashes have nothing to do with running the
 Active/Active setup.

 I am moving my entire setup from LIO to rbd enabled tgt, which I've
 found to be much more stable and gives equivalent performance.

 I've been testing active/active LIO since July of 2014 with VMWare and
 I've never seen any vmfs corruption.  I am now convinced (thanks Nick)
 that it is possible.  The reason I have not seen any corruption may
 have to do with how VMWare happens to be configured.

 Originally, I had made a point to use round robin path selection in
 the VMware hosts; but as I did performance testing, I found that it
 actually didn't help performance.  When the host switches iSCSI
 targets there is a short spin up time for LIO to get to 100% IO
 capability.  Since round robin switches targets every 30 seconds (60
 seconds? I forget), this seemed to be significant.  A secondary goal
 for me was to end up with a config that required minimal tuning from
 VMWare and the target software.

Re: [ceph-users] Consumer Grade SSD Clusters

2015-01-27 Thread Christian Balzer

Hello,

As others said, it depends on your use case and expected write load.
If you search the ML archives, you will find that there can be SEVERE
write amplification with Ceph, something to very much keep in mind.

You should run tests yourself before deploying things and committing to
hardware that won't cut the mustard.

I did this comparison for a project (involving DRBD, not Ceph) 3 months
ago:
---
Model               Size     Endurance   Cost       USD per TBW
Samsung 845DC EVO   960GB     600 TBW     700 USD   1.16
Intel DC S3700      800GB    7300 TBW    1500 USD   0.20
---

In that particular case the Samsung made the grade, as the expected writes
per year per SSD are less than 60TB.

Make those calculations for the specific SSDs you have in mind.
Cheap initial costs may come back to bite you in the behind later.
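
To make that concrete, the same arithmetic in script form (a sketch only;
the 60TB/year figure and the write amplification factor are illustrative,
substitute your own numbers):

WRITES_TB_YEAR=60   # expected host writes per SSD per year, in TB
WAF=2               # assumed write amplification (journal, replication, FS)
# each entry: name, rated TBW, price in USD -- figures from the table above
for ssd in "845DC_EVO 600 700" "DC_S3700 7300 1500"; do
    set -- $ssd
    echo "$1: $(( $2 / (WRITES_TB_YEAR * WAF) )) years to rated endurance," \
         "$(echo "scale=2; $3 / $2" | bc) USD per TBW"
done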

With Ceph, I'd be _very_ uncomfortable putting data on consumer SSDs.
Aside from the peace of mind that the Intel DC S3700 (or maybe the Samsung
845DC Pro, not tested myself yet) gives you, there's the overall better
performance (and consistently so, no going to sleep for garbage
collection).
On top of the SMART monitoring you'll probably also be forced to use fstrim
on these SSDs to keep their performance (such as it is) from degrading.

Christian 

On Wed, 28 Jan 2015 00:30:04 + Quenten Grasso wrote:

 Hi Nick,
 
 Agreed, I see your point: basically once you're past the 150TBW, or
 whatever that number may be, you're just waiting for failure effectively,
 but aren't we anyway?
 
 I guess it depends on your use case at the end of the day. I wonder what
 the likes of Amazon, Rackspace etc. are doing in the way of SSDs; either
 they are buying them so cheap per GB due to the volume, or they are
 possibly using consumer grade SSDs.
 
 Hmm... using consumer grade SSDs may be an interesting option if you
 have decent monitoring and alerting; using SMART you should be able to
 still see how much spare flash you have available. As suggested by Wido,
 using multiple brands would help remove the possible cascading failure
 effect, which I guess we all should be doing anyway on our spinners.
 
 I guess we have to decide whether it's worth the extra effort in the long
 run vs running enterprise SSDs.
 
 Regards,
 Quenten Grasso
 
 From: Nick Fisk [mailto:n...@fisk.me.uk]
 Sent: Saturday, 24 January 2015 7:33 PM
 To: Quenten Grasso; ceph-users@lists.ceph.com
 Subject: RE: Consumer Grade SSD Clusters
 
 Hi Quenten,
 
 There is no real answer to your question. It really depends on how busy
 your storage will be and particularly if it is mainly reads or writes.
 
 I wouldn't pay too much attention to that SSD endurance test, whilst
 it's great to know that they have a lot more headroom than their
 official specs, you run the risk of having a spectacular multiple disk
 failure if you intend to run them all that high. You can probably
 guarantee that as 1 SSD starts to fail the increase in workload to
 re-balance the cluster will cause failures on the rest.
 
 I guess it really comes down to how important is the availability of
 your data. Whilst an average PC user might balk at paying 4 times more
 per GB for an S3700 SSD, in the enterprise world they are
 still comparatively cheap.
 
 The other thing you need to be aware of is that most consumer SSD's
 don't have power loss protection, again if you are mainly doing reads
 and cost is more important than availability, there may be an argument
 to use them.
 
 Nick
 
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Quenten Grasso Sent: 24 January 2015 09:13
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] Consumer Grade SSD Clusters
 
 Hi Everyone,
 
 Just wondering if anyone has had any experience in using consumer grade
 SSD's for a Ceph cluster?
 
 I came across this article
 http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte/3
 
 They have been testing different SSDs' write endurance, and they have
 been able to write 1PB+ to a Samsung 840 Pro 256GB which is only
 rated at 150TBW; of course other SSDs have failed well before
 1PB written, so definitely worth a read.
 
 So I've been thinking about using consumer grade SSD's for OSD's and
 Enterprise SSD's for journals.
 
 Reasoning is enterprise SSDs are a lot faster at journaling than
 consumer grade drives, plus this would effectively halve the overall write
 requirements on the consumer grade disks.
 
 This could also be a cost effective alternative to using enterprise
 SSDs as OSDs; if you're happy to use 2x replication it seems like
 a pretty good cost saving, with 3x replication not so much.
 
 Cheers,
 Quenten Grasso
 
 
 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   

Re: [ceph-users] Ceph, LIO, VMWARE anyone?

2015-01-27 Thread Mike Christie
I do not know about perf, but here is some info on what is safe and
general info.

- If you are not using VAAI then it will use older style RESERVE/RELEASE
commands only.

If you are using VAAI ATS, and doing active/active then you need
something, like the lock/sync talked about in the slides/hammer doc,
that would coordinate multiple ATS/COMPARE_AND_WRITEs from executing at
the same time on the same sectors. You probably do not ever see problems
today, because it seems ESX normally does this command for only one
sector and I do not think there are multiple commands for the same
sectors in flight normally.

For active/passive, ATS is simple since you only have the one LIO/TGT
node executing commands at a time, so the locking is done locally using
a normal old mutex.

- tgt and LIO both support SCSI-3 persistent reservations. This is not
really needed for ESX vmfs though since it uses ATS or older
RESERVE/RELEASE. If you were using a cluster app like windows
clustering, red hat cluster, etc in ESX or in normal non vm use, then
you need something extra to support SCSI-3 PRs in both active/active or
active/passive.

For AA, you need something like described in that doc/video.

For AP, you would need to copy over the PR state from one node to the
other when failing over/back across nodes. For LIO this is in /var/target.
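
(A sketch of what copying that state over could look like; the peer hostname
is made up, and it assumes both nodes keep their LIO state in the default
/var/target:)

# push the local PR/ALUA state to the standby node before failing over
rsync -a /var/target/ standby-node:/var/target/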

Depending on how you do AP (what ALUA states you use if you do ALUA),
you might also need to always distribute the PR info if you are doing
windows clustering. Windows wants to see a consistent view of the PR
info from all ports if you do something like ALUA active-optimized and
standby states for active/passive.

- I do not completely understand the comment about using LIO as a
backend for tgt. You would either use tgt or LIO to export a rbd device.
Not both at the same time like using LIO for some sort of tgt backend.
Maybe people meant using the RBD backend instead of LIO backend
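
(For reference, the rbd backend for tgt that people mean is configured
roughly like this; a sketch only, with a made-up IQN and image name, and it
assumes a tgt build that includes the bs_rbd backing store:)

# define an rbd-backed target, then have tgt-admin load it
# (assumes targets.conf includes /etc/tgt/conf.d/*.conf)
cat > /etc/tgt/conf.d/rbd-example.conf <<'EOF'
<target iqn.2015-01.com.example:rbd-example>
    driver iscsi
    bs-type rbd
    backing-store rbd/example-image
</target>
EOF
tgt-admin --execute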

- There are some other setup complications that you can see here
http://comments.gmane.org/gmane.linux.scsi.target.devel/7044
if you are using ALUA. I think tgt does not support ALUA, but LIO does.


On 01/23/2015 04:25 PM, Zoltan Arnold Nagy wrote:
 Correct me if I'm wrong, but tgt doesn't have full SCSI-3 persistence
 support when _not_ using the LIO
 backend for it, right?
 
 AFAIK you can either run tgt with its own iSCSI implementation or you
 can use tgt to manage your LIO targets.
 
 I assume when you're running tgt with the rbd backend code you're
 skipping all the in-kernel LIO parts (in which case
 the RedHat patches won't help a bit), and you won't have proper
 active-active support, since the initiators
 have no way to synchronize state (and more importantly, no way to 
 synchronize write caching! [I can think
 of some really ugly hacks to get around that, tho...]).
 
 On 01/23/2015 05:46 PM, Jake Young wrote:
 Thanks for the feedback Nick and Zoltan,

 I have been seeing periodic kernel panics when I used LIO.  It was
 either due to LIO or the kernel rbd mapping.  I have seen this on
 Ubuntu precise with kernel 3.14.14 and again in Ubuntu trusty with the
 utopic kernel (currently 3.16.0-28).  Ironically, this is the primary
 reason I started exploring a redundancy solution for my iSCSI proxy
 node.  So, yes, these crashes have nothing to do with running the
 Active/Active setup.

 I am moving my entire setup from LIO to rbd enabled tgt, which I've
 found to be much more stable and gives equivalent performance.

 I've been testing active/active LIO since July of 2014 with VMWare and
 I've never seen any vmfs corruption.  I am now convinced (thanks Nick)
 that it is possible.  The reason I have not seen any corruption may
 have to do with how VMWare happens to be configured.

 Originally, I had made a point to use round robin path selection in
 the VMware hosts; but as I did performance testing, I found that it
 actually didn't help performance.  When the host switches iSCSI
 targets there is a short spin up time for LIO to get to 100% IO
 capability.  Since round robin switches targets every 30 seconds (60
 seconds? I forget), this seemed to be significant.  A secondary goal
 for me was to end up with a config that required minimal tuning from
 VMWare and the target software; so the obvious choice is to leave
 VMWare's path selection at the default which is Fixed and picks the
 first target in ASCII-betical order.  That means I am actually
 functioning in Active/Passive mode.

 Jake




 On Fri, Jan 23, 2015 at 8:46 AM, Zoltan Arnold Nagy
 zol...@linux.vnet.ibm.com wrote:

 Just to chime in: it will look fine, feel fine, but underneath
 it's quite easy to get VMFS corruption. Happened in our tests.
 Also if you're running LIO, from time to time expect a kernel
 panic (haven't tried with the latest upstream, as I've been using
 Ubuntu 14.04 on my export hosts for the test, so might have
 improved...).

 As of now I would not recommend this