Re: [ceph-users] Cache Tier configuration

2016-07-20 Thread Christian Balzer

Hello,

On Wed, 20 Jul 2016 11:44:15 +0200 Mateusz Skała wrote:

[snip]
> > > > >
> > > > There are a number of other options to control things, especially with
> > Jewel.
> > > > Also setting your cache mode to readforward might be a good idea
> > > > depending on your use case.
> > > >
> > > I'm considering this move, especially since we are also using SSD journals.
> > Journals are for writes; they don't affect reads, which would come from the
> > HDD-based backing pool.
> 
> I know that journals are for writes, but if I understand correctly, a cache
> tier in writeback mode is also used for writes, so each write goes journal
> SSD -> cache tier -> (after some time) cold storage.

Yes, writeback is caching writes (as well), as the name implies.
And no, journals are part of the OSDs, not something separate.

So a write to a pool with a cache tier goes to:
cache-pool (journal), and within a split second to cache-pool (OSD).

It may then eventually go to the backing-pool (journal, then OSD), but if
your object is hot all the time it may NEVER wind up on the backing pool
(cold storage in your terms).
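As a rough sketch (pool names and thresholds below are placeholders, not a
recommendation for your hardware), a writeback tier is wired up like this:

ceph osd tier add backing-pool cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd tier set-overlay backing-pool cache-pool
ceph osd pool set cache-pool hit_set_type bloom
ceph osd pool set cache-pool target_max_bytes 200000000000
ceph osd pool set cache-pool cache_target_dirty_ratio 0.4
ceph osd pool set cache-pool cache_target_full_ratio 0.8

With the overlay set, client writes land on cache-pool first and only get
flushed/evicted to backing-pool according to the dirty/full ratios and
target_max_bytes.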

> 
> > 
> > > Please confirm, can I use cache tier readforward with pool size 1? Is it
> > > safe? Then I will have 3 times more space for the cache tier.
> > >
> > Definitely not.
> > Even with the best, most trusted SSDs you want a replication size of 2,
> > so you can survive an OSD or node failure, etc.
> I thought that in readforward mode for the cache tier, an SSD failure does
> not affect the backing storage, and Ceph should then re-read objects from
> the backing storage. What is the workflow for Ceph if an OSD from the
> cache-tier pool fails in readforward mode?
> 

I think you're confusing this with readonly, which as the Ceph
documentation points out is a mode you most likely don't want to use
anyway.

In theory a readonly cache-tier setup could survive the loss of the cache
pool, but in practice I think it would break horribly, at least until you
removed the broken cache pool manually.

The readforward and readproxy modes will cache writes (and thus reads for
objects that have been written to and are still in the cache).
And as such they contain your most valuable data and can't be allowed to
fail, ever.

Most people will want writes to be cached, as more applications are
allergic to slow writes than reads.
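Changing the mode (and checking the cache pool replication while you are at
it) is a one-liner each; a sketch, with "cache-pool" as a placeholder name
(recent releases may ask for an extra --yes-i-really-mean-it confirmation on
some mode changes):

ceph osd tier cache-mode cache-pool readproxy
ceph osd pool get cache-pool size
ceph osd pool set cache-pool size 2
ceph osd pool set cache-pool min_size 1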

Christian
[snap]
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


[ceph-users] flatten influences performance of parent VM?

2016-07-20 Thread yhpeng

hi,

For a running VM, we have some snapshots on it.
For some snapshots, we protect them, clone them, and then flatten the clones.
If we run flatten frequently, does this influence the performance of the
original running VM?
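To be concrete, the sequence I mean is roughly (image and snapshot names are
just examples):

rbd snap create rbd/vm-disk@snap1
rbd snap protect rbd/vm-disk@snap1
rbd clone rbd/vm-disk@snap1 rbd/vm-disk-clone1
rbd flatten rbd/vm-disk-clone1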



thanks.


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Mike Christie
On 07/20/2016 11:52 AM, Jan Schermer wrote:
> 
>> On 20 Jul 2016, at 18:38, Mike Christie  wrote:
>>
>> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>>
>>> Hi Mike,
>>>
>>> Thanks for the update on the RHCS iSCSI target.
>>>
>>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>>> it too early to say / announce).
>>
>> No HA support for sure. We are looking into non HA support though.
>>
>>>
>>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>>
>>> So we're currently running :
>>>
>>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>>> has all VAAI primitives enabled and run the same configuration.
>>> - RBD images are mapped on each target using the kernel client (so no
>>> RBD cache).
>>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>>> but in a failover manner so that each ESXi always access the same LUN
>>> through one target at a time.
>>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>>> (except UNMAP as per default).
>>>
>>> Do you see anything risky regarding this configuration ?
>>
>> If you use an application that uses scsi persistent reservations then you
>> could run into troubles, because some apps expect the reservation info
>> to be on the failover nodes as well as the active ones.
>>
>> Depending on how you do failover and the issue that caused the
>> failover, IO could be stuck on the old active node and cause data
>> corruption. If the initial active node loses its network connectivity
>> and you failover, you have to make sure that the initial active node is
>> fenced off and IO stuck on that node will never be executed. So do
>> something like add it to the ceph monitor blacklist and make sure IO on
>> that node is flushed and failed before unblacklisting it.
>>
> 
> With iSCSI you can't really do hot failover unless you only use synchronous 
> IO.
> (With any of opensource target softwares available).

That is what we are working on adding.

Why did you only say iSCSI though?

> Flushing the buffers doesn't really help because you don't know what 
> in-flight IO happened before the outage

To be clear, when I wrote flush I did not mean cache buffers. I only
meant the target's list of commands.

And, regarding the unblacklist comment: it is best to unmap images that are
under a blacklist and then remap them. The osd blacklist remove command on
its own would leave some krbd structs in a bad state.
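So on the node being unfenced the order would be roughly (device, image and
blacklist entry below are just examples; list the real entries with
ceph osd blacklist ls):

rbd unmap /dev/rbd0
ceph osd blacklist rm 192.168.1.10:0/1234567
rbd map rbd/lun-image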

> and which didn't. You could end with only part of the "transaction" written 
> on persistent storage.
> 

Maybe I am not sure what you mean by hot failover.

If you are failing over for the case where one node just goes
unreachable, then if you blacklist it before making another node active,
you know that IO that had not been sent will be failed and never execute,
and partially sent IO will be failed and not execute. IO that was sent to
the OSD and is executing will be completed by the OSD before new IO to the
same sectors, so you would not end up with what looks like partial
transactions if you later did a read.

If the OSDs die mid-write you could end up with part of a command
written, but that could happen with any SCSI-based protocol.


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Jake Young
On Wednesday, July 20, 2016, Jan Schermer  wrote:

>
> > On 20 Jul 2016, at 18:38, Mike Christie  > wrote:
> >
> > On 07/20/2016 03:50 AM, Frédéric Nass wrote:
> >>
> >> Hi Mike,
> >>
> >> Thanks for the update on the RHCS iSCSI target.
> >>
> >> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
> >> it too early to say / announce).
> >
> > No HA support for sure. We are looking into non HA support though.
> >
> >>
> >> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
> >> so we'll just have to remap RBDs to RHCS targets when it's available.
> >>
> >> So we're currently running :
> >>
> >> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
> >> has all VAAI primitives enabled and run the same configuration.
> >> - RBD images are mapped on each target using the kernel client (so no
> >> RBD cache).
> >> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
> >> but in a failover manner so that each ESXi always access the same LUN
> >> through one target at a time.
> >> - LUNs are VMFS datastores and VAAI primitives are enabled client side
> >> (except UNMAP as per default).
> >>
> >> Do you see anything risky regarding this configuration ?
> >
> > If you use an application that uses scsi persistent reservations then you
> > could run into troubles, because some apps expect the reservation info
> > to be on the failover nodes as well as the active ones.
> >
> > Depending on how you do failover and the issue that caused the
> > failover, IO could be stuck on the old active node and cause data
> > corruption. If the initial active node loses its network connectivity
> > and you failover, you have to make sure that the initial active node is
> > fenced off and IO stuck on that node will never be executed. So do
> > something like add it to the ceph monitor blacklist and make sure IO on
> > that node is flushed and failed before unblacklisting it.
> >
>
> With iSCSI you can't really do hot failover unless you only use
> synchronous IO.


VMware only uses synchronous IO. Since the hypervisor can't tell what
type of data the VMs are writing, all IO is treated as needing to be
synchronous.

(With any of opensource target softwares available).
> Flushing the buffers doesn't really help because you don't know what
> in-flight IO happened before the outage
> and which didn't. You could end with only part of the "transaction"
> written on persistent storage.
>
> If you only use synchronous IO all the way from client to the persistent
> storage shared between
> iSCSI target then all should be fine, otherwise YMMV - some people run it
> like that without realizing
> the dangers and have never had a problem, so it may be strictly
> theoretical, and it all depends on how often you need to do the
> failover and what data you are storing - corrupting a few images on a
> gallery site could be fine but corrupting
> a large database tablespace is no fun at all.


No, it's not. VMFS corruption is pretty bad too and there is no fsck for
VMFS...


>
> Some (non opensource) solutions exist, Solaris supposedly does this in
> some(?) way, maybe some iSCSI guru
> can chime tell us what magic they do, but I don't think it's possible
> without client support
> (you essentially have to do something like transactions and replay the last
> transaction on failover). Maybe
> something can be enabled in protocol to do the iSCSI IO synchronous or
> make it at least wait for some sort of ACK from the
> server (which would require some sort of cache mirroring between the
> targets) without making it synchronous all the way.


This is why the SAN vendors wrote their own clients and drivers. It is not
possible to dynamically make all OS's do what your iSCSI target expects.

Something like VMware does the right thing pretty much all the time (there
are some iSCSI initiator bugs in earlier ESXi 5.x).  If you have control of
your ESXi hosts then attempting to set up HA iSCSI targets is possible.

If you have a mixed client environment with various versions of Windows
connecting to the target, you may be better off buying some SAN appliances.


> The one time I had to use it I resorted to simply mirroring in via mdraid
> on the client side over two targets sharing the same
> DAS, and this worked fine during testing but never went to production in
> the end.
>
> Jan
>
> >
> >>
> >> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
> >> clients ?
> >
> > I can't say, because I have not used stgt with rbd bs-type support
> enough.


For starters, STGT doesn't implement VAAI properly and you will need to
disable VAAI in ESXi.

LIO does seem to implement VAAI properly, but performance is not nearly as
good as STGT, even with VAAI's benefits. The assumed cause is that LIO
currently uses kernel rbd mapping, and kernel rbd performance is not as good
as librbd.
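As an illustration only, a bare-bones LIO-over-krbd export can be set up
along these lines (image, backstore and IQN names are made up):

rbd map rbd/vmware-lun0
targetcli /backstores/block create name=vmware-lun0 dev=/dev/rbd0
targetcli /iscsi create iqn.2016-07.com.example:ceph-target
targetcli /iscsi/iqn.2016-07.com.example:ceph-target/tpg1/luns create /backstores/block/vmware-lun0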

I recently did a simple test of 

[ceph-users] performance decrease after continuous run

2016-07-20 Thread Kane Kim
Hello,

I was running cosbench for some time and noticed a sharp, consistent
performance decrease at some point.

Image is here: http://take.ms/rorPw


Re: [ceph-users] cephfs change metadata pool?

2016-07-20 Thread Di Zhang
update:

After upgrading to Jewel and changing journaling to SSD, I no longer
have the slow/blocked requests warnings during normal data copying.
Thank you all.

Zhang Di

On Wed, Jul 13, 2016 at 11:04 PM, Christian Balzer  wrote:

>
> Hello,
>
> On Wed, 13 Jul 2016 22:47:05 -0500 Di Zhang wrote:
>
> > Hi,
> >   I changed to only use the infiniband network. For the 4KB write,
> the IOPS doesn’t improve much.
>
> That's mostly going to be bound by latencies (as I just wrote in the other
> thread), both network and internal Ceph ones.
>
> The cluster I described in the other thread has 32 OSDs and does about
> 1050 "IOPS" with "rados -p rbd bench 30 write -t 32 -b 4096".
> So about half with your 15 OSDs isn't all that unexpected.
>
> Once again, to get something more realistic use fio.
>
> >I also logged into the OSD nodes and atop showed the disks are not always
> at 100% busy. Please check a snapshot of one node below:
>
> When you do the 4KB bench (for 60 seconds or so), also watch the CPU
> usage, rados bench is a killer there.
>
> Christian
>
> >
> > DSK | sdc       | busy 72% | read  20/s  | write  86/s | KiB/w 13 | MBr/s 0.16 | MBw/s 1.12 | avio 6.69 ms |
> > DSK | sda       | busy 47% | read   0/s  | write 589/s | KiB/w  4 | MBr/s 0.00 | MBw/s 2.83 | avio 0.79 ms |
> > DSK | sdb       | busy 31% | read  14/s  | write  77/s | KiB/w 10 | MBr/s 0.11 | MBw/s 0.76 | avio 3.42 ms |
> > DSK | sdd       | busy 19% | read   4/s  | write  50/s | KiB/w 11 | MBr/s 0.03 | MBw/s 0.55 | avio 3.40 ms |
> > NET | transport | tcpi 656/s | tcpo 655/s | udpi 0/s     | udpo    0/s  | tcpao 0/s | tcppo 0/s | tcprs 0/s |
> > NET | network   | ipi  657/s | ipo  655/s | ipfrw 0/s    | deliv 657/s  |           | icmpi 0/s | icmpo 0/s |
> > NET | p10p1 0%  | pcki   0/s | pcko   0/s | si    0 Kbps | so    1 Kbps | erri 0/s  | erro  0/s | drpo  0/s |
> > NET | ib0       | pcki 637/s | pcko 636/s | si 8006 Kbps | so 5213 Kbps | erri 0/s  | erro  0/s | drpo  0/s |
> > NET | lo        | pcki  19/s | pcko  19/s | si   14 Kbps | so   14 Kbps | erri 0/s  | erro  0/s | drpo  0/s |
> >
> >   /dev/sda is the OS and journaling SSD. The other three are OSDs.
> >
> >   Am I missing anything?
> >
> >   Thanks,
> >
> >
> >
> >
> > Zhang, Di
> > Postdoctoral Associate
> > Baylor College of Medicine
> >
> > > On Jul 13, 2016, at 6:56 PM, Christian Balzer  wrote:
> > >
> > >
> > > Hello,
> > >
> > > On Wed, 13 Jul 2016 12:01:14 -0500 Di Zhang wrote:
> > >
> > >> I also tried 4K write bench. The IOPS is ~420.
> > >
> > > That's what people usually mean (4KB blocks) when talking about IOPS.
> > > This number is pretty low, my guess would be network latency on your
> 1Gbs
> > > network for the most part.
> > >
> > > You should run atop on your storage nodes while running a test like this
> > > and see if the OSDs (HDDs) are also very busy.
> > >
> > > Lastly the rados bench gives you some basic numbers but it is not the
> same
> > > as real client I/O, for that you want to run fio inside a VM or in your
> > > case on a mounted CephFS.
> > >
> > >> I used to have better
> > >> bandwidth when I use the same network for both the cluster and
> clients. Now
> > >> the bandwidth must be limited by the 1G ethernet.
> > > That's the bandwidth you also see in your 4MB block tests below.
> > > For small I/Os the real killer is latency, though.
> > >
> > >> What would you suggest to
> > >> me to do?
> > >>
> > > That depends on your budget mostly (switch ports, client NICs).
> > >
> > > A uniform, single 10Gb/s network would be better in all aspects than
> the
> > > split network you have now.
> > >
> > > Christian
> > >
> > >> Thanks,
> > >>
> > >> On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang 
> wrote:
> > >>
> > >>> Hello,
> > >>> Sorry for the misunderstanding about IOPS. Here are some summary stats
> > >>> of my benchmark (does 20-30 IOPS seem normal to you?):
> > >>>
> > >>> ceph osd pool create test 512 512
> > >>>
> > >>> rados bench -p test 10 write --no-cleanup
> > >>>
> > >>> Total time run: 10.480383
> > >>> Total writes made:  288
> > >>> Write size: 4194304
> > >>> Object size:4194304
> > >>> Bandwidth (MB/sec): 109.92
> > >>> Stddev Bandwidth:   11.9926
> > >>> Max bandwidth (MB/sec): 124
> > >>> Min bandwidth (MB/sec): 80
> > >>> Average IOPS:   27
> > >>> Stddev IOPS:3
> > >>> Max IOPS:   31
> > >>> Min IOPS:   20
> > >>> Average Latency(s): 0.579105
> > >>> Stddev Latency(s):  0.19902
> > >>> Max latency(s): 1.32831
> > >>> Min latency(s): 0.245505
> > >>>
> > >>> rados bench -p bench -p test 10 seq
> > >>> Total time run:   

[ceph-users] Aug CDM

2016-07-20 Thread Patrick McGarry
Hey cephers,

A reminder that our next Ceph Developer Monthly is coming up on 03 Aug
@ 12:30p EDT.  Please visit the wiki and enter your blueprints for
discussion, even if it’s just a placeholder.

All major work being done on Ceph should have a blueprint at CDM, even
if it’s just a brief outline so we have a chance to get an update on
work and status. If you have any questions please feel free to contact
me. Thanks.

http://tracker.ceph.com/projects/ceph/wiki/CDM_03-AUG-2016


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


[ceph-users] Ceph Tech Talk next week

2016-07-20 Thread Patrick McGarry
Hey cephers,

As of right now I have no one slated for the Ceph Tech Talk next week.
Unless I get someone into that slot by the end of this week or maybe
Monday at the latest, we’ll have to cancel the slot. If you’re
interested in giving a tech talk either this month or in the months to
come, please let me know. Thanks.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Jan Schermer

> On 20 Jul 2016, at 18:38, Mike Christie  wrote:
> 
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>> 
>> Hi Mike,
>> 
>> Thanks for the update on the RHCS iSCSI target.
>> 
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
> 
> No HA support for sure. We are looking into non HA support though.
> 
>> 
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>> 
>> So we're currently running :
>> 
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>> 
>> Do you see anything risky regarding this configuration ?
> 
> If you use an application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
> 
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
> 

With iSCSI you can't really do hot failover unless you only use synchronous
IO (with any of the open source target software available).
Flushing the buffers doesn't really help because you don't know which
in-flight IO happened before the outage and which didn't. You could end up
with only part of the "transaction" written to persistent storage.

If you only use synchronous IO all the way from the client to the persistent
storage shared between the iSCSI targets, then all should be fine; otherwise
YMMV. Some people run it like that without realizing the dangers and have
never had a problem, so the risk may be strictly theoretical, and it all
depends on how often you need to do the failover and what data you are
storing: corrupting a few images on a gallery site could be fine, but
corrupting a large database tablespace is no fun at all.

Some (non open source) solutions exist; Solaris supposedly does this in
some(?) way, and maybe some iSCSI guru can chime in and tell us what magic
they do, but I don't think it's possible without client support (you
essentially have to do something like transactions and replay the last
transaction on failover). Maybe something can be enabled in the protocol to
make the iSCSI IO synchronous, or at least make it wait for some sort of ACK
from the server (which would require some sort of cache mirroring between
the targets) without making it synchronous all the way.

The one time I had to use it, I resorted to simply mirroring via mdraid on
the client side over two targets sharing the same DAS. This worked fine
during testing but never went to production in the end.
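To sketch that client-side setup (with /dev/sdb and /dev/sdc standing in for
the two iSCSI-attached disks):

mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal /dev/sdb /dev/sdc

The write-intent bitmap just keeps resyncs short after one path drops.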

Jan

> 
>> 
>> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
>> clients ?
> 
> I can't say, because I have not used stgt with rbd bs-type support enough.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] ceph + vmware

2016-07-20 Thread Mike Christie
On 07/20/2016 03:50 AM, Frédéric Nass wrote:
> 
> Hi Mike,
> 
> Thanks for the update on the RHCS iSCSI target.
> 
> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
> it too early to say / announce).

No HA support for sure. We are looking into non HA support though.

> 
> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
> so we'll just have to remap RBDs to RHCS targets when it's available.
> 
> So we're currently running :
> 
> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
> has all VAAI primitives enabled and run the same configuration.
> - RBD images are mapped on each target using the kernel client (so no
> RBD cache).
> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
> but in a failover manner so that each ESXi always access the same LUN
> through one target at a time.
> - LUNs are VMFS datastores and VAAI primitives are enabled client side
> (except UNMAP as per default).
> 
> Do you see anything risky regarding this configuration ?

If you use an application that uses scsi persistent reservations then you
could run into troubles, because some apps expect the reservation info
to be on the failover nodes as well as the active ones.

Depending on how you do failover and the issue that caused the
failover, IO could be stuck on the old active node and cause data
corruption. If the initial active node loses its network connectivity
and you failover, you have to make sure that the initial active node is
fenced off and IO stuck on that node will never be executed. So do
something like add it to the ceph monitor blacklist and make sure IO on
that node is flushed and failed before unblacklisting it.
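The blacklisting itself is just the following (the address/nonce is an
example; take the real entry from ceph osd dump or ceph osd blacklist ls):

ceph osd blacklist add 192.168.1.10:0/1234567
ceph osd blacklist ls
ceph osd blacklist rm 192.168.1.10:0/1234567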


> 
> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
> clients ?

I can't say, because I have not used stgt with rbd bs-type support enough.


[ceph-users] rbd export-dif question

2016-07-20 Thread Norman Uittenbogaart
Hi,

I made a backup script to back up the RBD images in the pool by taking
snapshots, exporting the first snapshot and afterwards only the diffs.

I noticed however that creating a diff from one snapshot to the next is
always a certain size, and it's much bigger than just the changes from one
snapshot to the next.

I have shared the script here, https://github.com/normanu/rbd_backup

So you can see what I am doing.
But I wonder if I am doing something wrong, or if there is always an
overhead in backing up diffs.
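To make sure we mean the same thing, the flow I am describing is roughly
(pool, image and snapshot names are just examples):

rbd snap create rbd/vm1@backup-20160719
rbd export rbd/vm1@backup-20160719 vm1-full.img
rbd snap create rbd/vm1@backup-20160720
rbd export-diff --from-snap backup-20160719 rbd/vm1@backup-20160720 vm1-incremental.diff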

Thanks,

Norman


[ceph-users] Ceph : Generic Query : Raw Format of images

2016-07-20 Thread Gaurav Goyal
Dear Ceph User,

I want to ask a very generic query regarding Ceph.

Ceph uses the .raw format, but every single company provides qcow2 images,
and it takes a lot of time to convert those images to raw format.

Is this something everyone else is dealing with, or am I not doing
something right?

Especially when organizations know about Ceph functionality, why don't they
provide raw images along with qcow2?
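The conversion itself is a single (slow) command, and if qemu-img was built
with rbd support you can write straight into the pool (file, pool and image
names below are examples):

qemu-img convert -f qcow2 -O raw image.qcow2 image.raw
qemu-img convert -f qcow2 -O raw image.qcow2 rbd:volumes/image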


Regards
Gaurav Goyal


Re: [ceph-users] CephFS Samba VFS RHEL packages

2016-07-20 Thread Ken Dreyer
The Samba packages in Fedora 22+ do enable the Samba VFS:
https://bugzilla.redhat.com/1174412

From what Ira said downthread, this is pretty experimental, so you
could run your tests on a Fedora system and see how it goes :)
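If you do try it, the interesting part is the share definition; a minimal
sketch (path, user id and share name are assumptions, not a tested config):

[cephfs]
    path = /
    vfs objects = ceph
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba
    read only = no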

- Ken

On Tue, Jul 19, 2016 at 11:45 PM, Blair Bethwaite
 wrote:
> Hi all,
>
> We've started a CephFS Samba PoC on RHEL but just noticed the Samba
> Ceph VFS doesn't seem to be included with Samba on RHEL, or we're not
> looking in the right place. Trying to avoid needing to build Samba
> from source if possible. Any pointers appreciated.
>
> --
> Cheers,
> ~Blairo
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] ceph OSD with 95% full

2016-07-20 Thread M Ranga Swami Reddy
Do we have any tool to monitor OSD usage with the help of a UI?

Thanks
Swami

On Tue, Jul 19, 2016 at 6:44 PM, M Ranga Swami Reddy
 wrote:
> +1 .. I agree
>
> Thanks
> Swami
>
> On Tue, Jul 19, 2016 at 4:57 PM, Lionel Bouton  
> wrote:
>> Hi,
>>
>> On 19/07/2016 13:06, Wido den Hollander wrote:
 On 19 July 2016 at 12:37, M Ranga Swami Reddy wrote:


 Thanks for the correction... so even if one OSD reaches 95% full, the
 total Ceph cluster IO (R/W) will be blocked... Ideally read IO should
 still work...
>>> That should be a config option, since reading while writes still block is 
>>> also a danger. Multiple clients could read the same object, perform a 
>>> in-memory change and their write will block.
>>>
>>> Now, which client will 'win' after the full flag has been removed?
>>>
>>> That could lead to data corruption.
>>
>> If it did, the clients would be broken as normal usage (without writes
>> being blocked) doesn't prevent multiple clients from reading the same
>> data and trying to write at the same time. So if multiple writes (I
>> suppose on the same data blocks) are possibly waiting the order in which
>> they are performed *must not* matter in your system. The alternative is
>> to prevent simultaneous write accesses from multiple clients (this is
>> how non-cluster filesystems must be configured on top of Ceph/RBD, they
>> must even be prevented from read-only accessing an already mounted fs).
>>
>>>
>>> Just make sure you have proper monitoring on your Ceph cluster. At nearfull 
>>> it goes into WARN and you should act on that.
>>
>>
>> +1 : monitoring is not an option.
>>
>> Lionel


[ceph-users] thoughts about Cache Tier Levels

2016-07-20 Thread Götz Reinicke - IT Koordinator
Hi,

Currently there are two levels I know of: storage pool and cache pool. From
our workload I expect a third "level" of data, which currently stays in the
storage pool as well.

Has anyone, like us, been thinking about data which could be moved even
deeper in that tiering, e.g. an SSD cache, lots of fast 2-4 TB SAS HDDs for
the storage pool, and e.g. 8 TB (or whatever) drives for an "archive" level?

From my POV that would be for data which is nice to have in the cluster, but
doesn't have to be as reliable and fast as the storage tier data.

Or maybe that kind of data could be moved to an LTO tape library ... just
thinking! :)

Thanks for feedback and suggestions on how to handle data you "will never
use again but have to keep" :)

Regards . Götz








Re: [ceph-users] How to hide monitoring ip in cephfs mounted clients

2016-07-20 Thread John Spray
On Wed, Jul 20, 2016 at 8:33 AM, gjprabu  wrote:
>
> Hi Team,
>
> We are using CephFS to mount file systems on client machines. When
> mounting we have to provide the monitor IP addresses; is there any option to
> hide the monitor IP addresses in the mounted partition? We are using
> containers on all Ceph clients and they can all see the monitor IPs, which
> could be a security issue for us. Kindly let us know if there is any
> solution for this.

Hmm, so you have a situation where the containers are prevented from
actually communicating with the monitor IPs, but the cephfs mounts are
exposed to the containers in a way that lets them see them when they
run `mount`?

I don't think we've thought about this case before.  Is it normal that
when you have e.g. a docker container with a volume attached, the
container can see the mount information for the filesystem that the
volume lives on?

John

> Regards
> Prabu GJ
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Cache Tier configuration

2016-07-20 Thread Mateusz Skała

Thank you for the quick response.

> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com]
> Sent: Tuesday, July 19, 2016 3:39 PM
> To: ceph-users@lists.ceph.com
> Cc: Mateusz Skała 
> Subject: Re: [ceph-users] Cache Tier configuration
> 
> 
> Hello,
> 
> On Tue, 19 Jul 2016 15:15:55 +0200 Mateusz Skała wrote:
> 
> > Hello,
> >
> > > -Original Message-
> > > From: Christian Balzer [mailto:ch...@gol.com]
> > > Sent: Wednesday, July 13, 2016 4:03 AM
> > > To: ceph-users@lists.ceph.com
> > > Cc: Mateusz Skała 
> > > Subject: Re: [ceph-users] Cache Tier configuration
> > >
> > >
> > > Hello,
> > >
> > > On Tue, 12 Jul 2016 11:01:30 +0200 Mateusz Skała wrote:
> > >
> > > > Thank You for replay. Answers below.
> > > >
> > > > > -Original Message-
> > > > > From: Christian Balzer [mailto:ch...@gol.com]
> > > > > Sent: Tuesday, July 12, 2016 3:37 AM
> > > > > To: ceph-users@lists.ceph.com
> > > > > Cc: Mateusz Skała 
> > > > > Subject: Re: [ceph-users] Cache Tier configuration
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > On Mon, 11 Jul 2016 16:19:58 +0200 Mateusz Skała wrote:
> > > > >
> > > > > > Hello Cephers.
> > > > > >
> > > > > > Can someone help me in my cache tier configuration? I have 4
> > > > > > same SSD drives 176GB (184196208K) in SSD pool, how to
> > > > > > determine
> > > > > target_max_bytes?
> > > > >
> > > > > What exact SSD models are these?
> > > > > What version of Ceph?
> > > >
> > > > Intel DC S3610 (SSDSC2BX200G401), ceph version 9.2.1
> > > > (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> > > >
> > >
> > > Good, these are decent SSDs and at 3DWPD probably durable enough,
> too.
> > > You will want to monitor their wear-out level anyway, though.
> > >
> > > Remember, dead cache pool means unaccessible and/or lost data.
> > >
> > > Jewel has improved cache controls and a different, less aggressive
> > > default behavior, you may want to consider upgrading to it,
> > > especially if you don't want to become a cache tiering specialist.
> > > ^o^
> > >
> > > Also Infernalis is no longer receiving updates.
> >
> > We are planning upgrade in first week of August.
> >
> You might want to wait until the next version of Jewel is out, unless you have
> a test/staging cluster to verify your upgrade procedure on.
> 
> Jewel is a better choice than Infernalis, but with still a number of bugs and
> also a LOT of poorly or not at all documented massive changes it doesn't
> make me all that eager to upgrade right here, right now.
> 

We have a test cluster, but without a cache tier. We will wait for a stable
Jewel release.

> > > > > > I assume
> > > > > > that should be (4 drives* 188616916992 bytes )/ 3 replica =
> > > > > > 251489222656 bytes *85% (because of full disk warning)
> > > > >
> > > > > In theory correct, but you might want to consider (like with all
> > > > > pools) the impact of loosing a single SSD.
> > > > > In short, backfilling and then the remaining 3 getting full anyway.
> > > > >
> > > >
> > > > OK, so better to make lower max target bates than I have space?
> > > > For
> > > example 170GB? Then I will have 1 osd reserve.
> > > >
> > > Something like this, though failures with these SSDs are very unlikely.
> > >
> > > > > > It will be 213765839257 bytes ~200GB. I make this little bit
> > > > > > lower
> > > > > > (160GB) and after some time whole cluster stops on full disk error.
> > > > > > One of SSD drives are full. I see that use of space at the osd is 
> > > > > > not
> equal:
> > > > > >
> > > > > > 32 0.17099  1.0   175G   127G 49514M 72.47 1.77  95
> > > > > >
> > > > > > 42 0.17099  1.0   175G   120G 56154M 68.78 1.68  90
> > > > > >
> > > > > > 37 0.17099  1.0   175G   136G 39670M 77.95 1.90 102
> > > > > >
> > > > > > 47 0.17099  1.0   175G   130G 46599M 74.09 1.80  97
> > > > > >
> > > > >
> > > > > What's the exact error message?
> > > > >
> > > > > None of these are over 85 or 95%, how are they full?
> > > >
> > > > Osd.37 was full on 96%, after error (heath ERR, 1 full osd).Then I
> > > > set
> > > max_target_bytes on 100GB. Flushing reduced used space, now cluster
> > > is working ok, but I want to clarify my configuration.
> > > >
> > > Don't get flushing (copying dirty objects to the backing pool) and
> > > eviction (deleting, really zero-ing, clean objects) mixed up.
> > > Eviction is what frees up space, but it needs flushed (clean)
> > > objects to work with.
> > >
> >
> > OK, I understand that evicting frees space?
> >
> Yes, re-read the relevant documentation.
> 
> > > >
> > > > >
> > > > > If the above is a snapshot of when Ceph thinks something is
> > > > > "full", it may be an indication that you've reached
> > > > > target_max_bytes and Ceph simply has no clean (flushed) objects
> ready to evict.
> > > > > Which means a configuration problem (all ratios, not the
> > > > > defaults, for this pool please) or your cache filling up 

Re: [ceph-users] pgs stuck unclean after reweight

2016-07-20 Thread Christian Balzer

Hello,

On Wed, 20 Jul 2016 13:42:20 +1000 Goncalo Borges wrote:

> Hi All...
> 
> Today we had a warning regarding 8 near full osd. Looking to the osds 
> occupation, 3 of them were above 90%. 

One would hope that this would have been picked up earlier, as in before
it even reaches near-full.
Either by monitoring (nagios, etc) disk usage checks and/or graphing the
usage and taking a look at it at least daily.

Since you seem to have at least 60 OSDs going from below 85% to 90% must

> In order to solve the situation, 
> I've decided to reweigh those first using
> 
>  ceph osd crush reweight osd.1 2.67719
> 
>  ceph osd crush reweight osd.26 2.67719
> 
>  ceph osd crush reweight osd.53 2.67719
> 
What I'd do is to find the least utilized OSDs and give them higher
weights, so data will (hopefully) move there instead of potentially
pushing another OSD to near-full as with the approach above.

You might consider doing that aside from what I'm writing below.

> Please note that I've started with a very conservative step since the 
> original weight for all osds was 2.72710.
> 
> After some rebalancing (which has now stopped) I've seen that the 
> cluster is currently in the following state
> 
> # ceph health detail
> HEALTH_WARN 4 pgs backfill_toofull; 4 pgs stuck unclean; recovery
> 20/39433323 objects degraded (0.000%); recovery 77898/39433323
> objects misplaced (0.198%); 8 near full osd(s); crush map has legacy
> tunables (require bobtail, min is firefly)
>
So there are all your woes in one fell swoop.

Unless you changed the defaults, your mon_osd_nearfull_ratio and 
osd_backfill_full_ratio are the same at 0.85.
So any data movement towards those 8 near full OSDs will not go anywhere.

Thus aside from the tip above, consider upping your
osd_backfill_full_ratio for those OSDs to something like .92 for the time
being until things are good again.
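Both of those can be done at runtime; a sketch (the 0.92 value is just the
example from above, and ceph osd df shows how uneven things currently are):

ceph osd df
ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.92'

Put the setting in ceph.conf as well if you want it to survive OSD restarts.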

Going forward, you will want to: 
a) add more OSDs
b) re-weight things so that your OSDs are within a few % of each other
rather than the often encountered 20%+ variance.

Christian

> pg 6.e2 is stuck unclean for 9578.920997, current state
> active+remapped+backfill_toofull, last acting [49,38,11]
> pg 6.4 is stuck unclean for 9562.054680, current state
> active+remapped+backfill_toofull, last acting [53,6,26]
> pg 5.24 is stuck unclean for 10292.469037, current state
> active+remapped+backfill_toofull, last acting [32,13,51]
> pg 5.306 is stuck unclean for 10292.448364, current state
> active+remapped+backfill_toofull, last acting [44,7,59]
> pg 5.306 is active+remapped+backfill_toofull, acting [44,7,59]
> pg 5.24 is active+remapped+backfill_toofull, acting [32,13,51]
> pg 6.4 is active+remapped+backfill_toofull, acting [53,6,26]
> pg 6.e2 is active+remapped+backfill_toofull, acting [49,38,11]
> recovery 20/39433323 objects degraded (0.000%)
> recovery 77898/39433323 objects misplaced (0.198%)
> osd.1 is near full at 88%
> osd.14 is near full at 87%
> osd.24 is near full at 86%
> osd.26 is near full at 87%
> osd.37 is near full at 87%
> osd.53 is near full at 88%
> osd.56 is near full at 85%
> osd.62 is near full at 87%
> 
> crush map has legacy tunables (require bobtail, min is firefly); 
> see http://ceph.com/docs/master/rados/operations/crush-map/#tunables
> 
> Not sure if it is worthwhile to mention, but after upgrading to Jewel, 
> our cluster shows the warnings regarding tunables. We still have not 
> migrated to the optimal tunables because the cluster will be very 
> actively used during the 3 next weeks ( due to one of the main 
> conference in our area) and we prefer to do that migration after this 
> peak period,
> 
> 
> I am unsure what happened during the rebalancing but the mapping of these 4 
> stuck pgs seems strange, namely the up and acting osds are different.
> 
> # ceph pg dump_stuck unclean
> ok
> pg_stat  state                             up          up_primary  acting      acting_primary
> 6.e2     active+remapped+backfill_toofull  [8,53,38]   8           [49,38,11]  49
> 6.4      active+remapped+backfill_toofull  [53,24,6]   53          [53,6,26]   53
> 5.24     active+remapped+backfill_toofull  [32,13,56]  32          [32,13,51]  32
> 5.306    active+remapped+backfill_toofull  [44,60,26]  44          [44,7,59]   44
> 
> # ceph pg map 6.e2
> osdmap e1054 pg 6.e2 (6.e2) -> up [8,53,38] acting [49,38,11]
> 
> # ceph pg map 6.4
> osdmap e1054 pg 6.4 (6.4) -> up [53,24,6] acting [53,6,26]
> 
> # ceph pg map 5.24
> osdmap e1054 pg 5.24 (5.24) -> up [32,13,56] acting [32,13,51]
> 
> # ceph pg map 5.306
> osdmap e1054 pg 5.306 (5.306) -> up [44,60,26] acting [44,7,59]
> 
> 
> To complete this information, I am also sending the output of pg query 
> for one of these problematic pgs (ceph pg  5.306 query) after this email.
> 
> What should be the procedure to try to recover those PGS before 
> 

Re: [ceph-users] how to use cache tiering with proxy in ceph-10.2.2

2016-07-20 Thread Nick Fisk

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> m13913886...@yahoo.com
> Sent: 20 July 2016 02:09
> To: Christian Balzer ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] how to use cache tiering with proxy in ceph-10.2.2
> 
> But the 0.94 version works fine (in fact, IO was greatly improved).

How are you measuring this? Is it just through micro-benchmarks, or
something more realistic running over a number of hours?

> This problem occurs only in version 10.x.
> Like you said, the IO was going to the cold storage mostly, and IO is
> going slowly.
> What can I do to improve the IO performance of cache tiering in version 10.x?
> How does cache tiering work in version 10.x?
> Is it a bug? Or is the configuration very different from the 0.94 version?
> There is too little information on the official website about this.

There are a number of differences, but they should all have a positive effect
on real-life workloads. It's important to focus more on the word tiering
rather than caching. You don't want to be continually shifting large amounts
of data to and from the cache, only the really hot bits.

The main changes between the two versions would be in the inclusion of proxy 
writes, promotion throttling and recency fixes. All will reduce the amount of 
data that gets promoted.
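The knobs involved, if you want to look at them on your cluster, are the
per-pool recency settings plus the per-OSD promotion throttles; a sketch
(the pool name and the values are only examples, not tuning advice):

ceph osd pool set hot-pool min_read_recency_for_promote 2
ceph osd pool set hot-pool min_write_recency_for_promote 2

and in ceph.conf under [osd]:

osd_tier_promote_max_bytes_sec = 5242880
osd_tier_promote_max_objects_sec = 25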

But please let me know how you are testing.

> 
> 
> 
> On Tuesday, July 19, 2016 9:25 PM, Christian Balzer  
> wrote:
> 
> 
> Hello,
> 
> On Tue, 19 Jul 2016 12:24:01 +0200 Oliver Dzombic wrote:
> 
> > Hi,
> >
> > i have in my ceph.conf under [OSD] Section:
> >
> > osd_tier_promote_max_bytes_sec = 1610612736
> > osd_tier_promote_max_objects_sec = 2
> >
> > #ceph --show-config is showing:
> >
> > osd_tier_promote_max_objects_sec = 5242880
> > osd_tier_promote_max_bytes_sec = 25
> >
> > But in fact its working. Maybe some Bug in showing the correct value.
> >
> > I had Problems too, that the IO was going to the cold storage mostly.
> >
> > After i changed this values ( and restarted >every< node inside the
> > cluster ) the problem was gone.
> >
> > So i assume, that its simply showing the wrong values if you call the
> > show-config. Or there is some other miracle going on.
> >
> > I just checked:
> >
> > #ceph --show-config | grep osd_tier
> >
> > shows:
> >
> > osd_tier_default_cache_hit_set_count = 4
> > osd_tier_default_cache_hit_set_period = 1200
> >
> > while
> >
> > #ceph osd pool get ssd_cache hit_set_count
> > #ceph osd pool get ssd_cache hit_set_period
> >
> > show
> >
> > hit_set_count: 1
> > hit_set_period: 120
> >
> Apples and oranges.
> 
> Your first query is about the config (and thus default, as it says in the
> output) options, the second one is for a specific pool.
> 
> There might be still any sorts of breakage with show-config and having to
> restart OSDs to have changes take effect is inelegant at least, but the
> above is not a bug.
> 
> Christian
> 
> >
> > So you can obviously ignore the ceph --show-config command. Its simply
> > not working correctly.
> >
> >
> 
> 
> --
> Christian BalzerNetwork/Systems Engineer
> mailto:ch...@gol.com  Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> 
> ___
> ceph-users mailing list
> mailto:ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] cephfs - inconsistent nfs and samba directory listings

2016-07-20 Thread Dennis Kramer (DBS)
Hi all,

Just wondering if the original issue has been resolved. I have the same
problems with inconsistent nfs and samba directory listings. I'm running
Hammer.

Is it a confirmed seekdir bug in the kernel client?

On 01/14/2016 04:05 AM, Yan, Zheng wrote:
> On Thu, Jan 14, 2016 at 3:37 AM, Mike Carlson  wrote:
>> Hey Greg,
>>
>> The inconsistent view is only over nfs/smb on top of our /ceph mount.
>>
>> When I look directly on the /ceph mount (which is using the cephfs kernel
>> module), everything looks fine
>>
>> It is possible that this issue just went unnoticed, and it only being a
>> infernalis problem is just a red herring. With that, it is oddly
>> coincidental that we just started seeing issues.
> 
> This seems like seekdir bugs in kernel client, could you try 4.0+ kernel.
> 
> Besides, do you enable "mds bal frag" for ceph-mds
> 
> 
> Regards
> Yan, Zheng
> 
> 
> 
>>
>> On Wed, Jan 13, 2016 at 11:30 AM, Gregory Farnum  wrote:
>>>
>>> On Wed, Jan 13, 2016 at 11:24 AM, Mike Carlson  wrote:
 Hello.

 Since we upgraded to Infernalis last, we have noticed a severe problem
 with
 cephfs when we have it shared over Samba and NFS

 Directory listings are showing an inconsistent view of the files:


 $ ls /lts-mon/BD/xmlExport/ | wc -l
  100
 $ sudo umount /lts-mon
 $ sudo mount /lts-mon
 $ ls /lts-mon/BD/xmlExport/ | wc -l
 3507


 The only work around I have found is un-mounting and re-mounting the nfs
 share, that seems to clear it up
 Same with samba, I'd post it here but its thousands of lines. I can add
 additional details on request.

 This happened after our upgrade to infernalis. Is it possible the MDS is
 in
 an inconsistent state?
>>>
>>> So this didn't happen to you until after you upgraded? Are you seeing
>>> missing files when looking at cephfs directly, or only over the
>>> NFS/Samba re-exports? Are you also sharing Samba by re-exporting the
>>> kernel cephfs mount?
>>>
>>> Zheng, any ideas about kernel issues which might cause this or be more
>>> visible under infernalis?
>>> -Greg
>>>

 We have cephfs mounted on a server using the built in cephfs kernel
 module:

 lts-mon:6789:/ /ceph ceph
 name=admin,secretfile=/etc/ceph/admin.secret,noauto,_netdev


 We are running all of our ceph nodes on ubuntu 14.04 LTS. Samba is up to
 date, 4.1.6, and we export nfsv3 to linux and freebsd systems. All seem
 to
 exhibit the same behavior.

 system info:

 # uname -a
 Linux lts-osd1 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC
 2015 x86_64 x86_64 x86_64 GNU/Linux
 root@lts-osd1:~# lsb
 lsblklsb_release
 root@lts-osd1:~# lsb_release -a
 No LSB modules are available.
 Distributor ID: Ubuntu
 Description: Ubuntu 14.04.3 LTS
 Release: 14.04
 Codename: trusty


 package info:

  # dpkg -l|grep ceph
 ii  ceph 9.2.0-1trusty
 amd64distributed storage and file system
 ii  ceph-common  9.2.0-1trusty
 amd64common utilities to mount and interact with a ceph storage
 cluster
 ii  ceph-fs-common   9.2.0-1trusty
 amd64common utilities to mount and interact with a ceph file
 system
 ii  ceph-mds 9.2.0-1trusty
 amd64metadata server for the ceph distributed file system
 ii  libcephfs1   9.2.0-1trusty
 amd64Ceph distributed file system client library
 ii  python-ceph  9.2.0-1trusty
 amd64Meta-package for python libraries for the Ceph libraries
 ii  python-cephfs9.2.0-1trusty
 amd64Python libraries for the Ceph libcephfs library


 What is interesting, is a directory or file will not show up in a
 listing,
 however, if we directly access the file, it shows up in that instance:


 # ls -al |grep SCHOOL
 # ls -alnd SCHOOL667055
 drwxrwsr-x  1 21695  21183  2962751438 Jan 13 09:33 SCHOOL667055


 Any tips are appreciated!

 Thanks,
 Mike C


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Kramer M.D.
Infrastructure Engineer


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Frédéric Nass


Hi Mike,

Thanks for the update on the RHCS iSCSI target.

Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is 
it too early to say / announce).


Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS 
so we'll just have to remap RBDs to RHCS targets when it's available.


So we're currently running :

- 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target 
has all VAAI primitives enabled and run the same configuration.
- RBD images are mapped on each target using the kernel client (so no 
RBD cache).
- 6 ESXi. Each ESXi can access to the same LUNs through both targets, 
but in a failover manner so that each ESXi always access the same LUN 
through one target at a time.
- LUNs are VMFS datastores and VAAI primitives are enabled client side 
(except UNMAP as per default).


Do you see anything risky regarding this configuration ?

Would you recommend LIO or STGT (with rbd bs-type) target for ESXi clients ?

Best regards,

Frederic.

--

Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

Tél : +33 3 72 74 11 35



Le 11/07/2016 17:45, Mike Christie a écrit :

On 07/08/2016 02:22 PM, Oliver Dzombic wrote:

Hi,

does anyone have experience how to connect vmware with ceph smart ?

iSCSI multipath does not really worked well.

Are you trying to export rbd images from multiple iscsi targets at the
same time or just one target?

For the HA/multiple target setup, I am working on this for Red Hat. We
plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships something already as
someone mentioned.

We just got a large chunk of code in the upstream kernel (it is in the
block layer maintainer's tree for the next kernel) so it should be
simple to add COMPARE_AND_WRITE support now. We should be posting krbd
exclusive lock support in the next couple weeks.



NFS could be, but i think thats just too much layers in between to have
some useable performance.

Systems like ScaleIO have developed a vmware addon to talk with it.

Is there something similar out there for ceph ?

What are you using ?

Thank you !


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[ceph-users] Fwd: ceph-objectstore-tool remove-clone-metadata. How to use?

2016-07-20 Thread Voloshanenko Igor
Hi community. 10 months ago we discovered an issue after removing the cache
tier from a cluster (affecting cluster HEALTH) and started an email thread;
as a result, a new bug was created on the tracker by Samuel Just:
http://tracker.ceph.com/issues/12738

Since then I have been looking for a good moment to upgrade (after the fix
was backported to 0.94.7), and yesterday I did the upgrade on my production
cluster.

From the 28 scrub errors, only 5 remain, so I need to fix them with the
ceph-objectstore-tool remove-clone-metadata subcommand.

I tried to do it, but without real results... Can you please give me advice
on what I'm doing wrong?

My workflow was the following:

1. Identify problem PGs... -  ceph health detail | grep inco | grep -v
HEALTH | cut -d " " -f 2
2. Start repair for them, to collect info about errors into logs - ceph pg
repair 

After this, for example, I received the following records in the logs:

2016-07-20 00:32:10.650061 osd.56 10.12.2.5:6800/1985741 25 : cluster [INF]
2.c4 repair starts

2016-07-20 00:33:06.405136 osd.56 10.12.2.5:6800/1985741 26 : cluster [ERR]
repair 2.c4 2/22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir
expected clone 2/22ca30c4/rbd_data.e846e25a70bf7.0307/14d

2016-07-20 00:33:06.405323 osd.56 10.12.2.5:6800/1985741 27 : cluster [ERR]
repair 2.c4 2/22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir
expected clone 2/22ca30c4/rbd_data.e846e25a70bf7.0307/138

2016-07-20 00:33:06.405385 osd.56 10.12.2.5:6800/1985741 28 : cluster [INF]
repair 2.c4 2/22ca30c4/rbd_data.e846e25a70bf7.0307/snapdir 1
missing clone(s)

2016-07-20 00:40:42.457657 osd.56 10.12.2.5:6800/1985741 29 : cluster [ERR]
2.c4 repair 2 errors, 0 fixed

So I tried to fix it with the following commands:

stop ceph-osd id=56
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-56/ --journal-path
/var/lib/ceph/osd/ceph-56/journal rbd_data.e846e25a70bf7.0307
remove-clone-metadata 138
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-56/ --journal-path
/var/lib/ceph/osd/ceph-56/journal rbd_data.e846e25a70bf7.0307
remove-clone-metadata 14d
start ceph-osd id=56

A strange fact: after I ran these commands I didn't receive messages like
the following (according to the sources...):

cout << "Removal of clone " << cloneid << " complete" << std::endl;
cout << "Use pg repair after OSD restarted to correct stat information" <<
std::endl;

I received silence (no output after the command, and each command took
about 30-35 minutes to execute...).

Sure, I started pg repair again after these actions... but the result is
the same, the errors still exist...

So possibly I misunderstand the input format for ceph-objectstore-tool...
Please help with this. :)

Thank you in advance!


[ceph-users] How to hide monitoring ip in cephfs mounted clients

2016-07-20 Thread gjprabu


Hi Team,



   We are using CephFS to mount file systems on client machines. When
mounting we have to provide the monitor IP addresses; is there any option to
hide the monitor IP addresses in the mounted partition? We are using
containers on all Ceph clients and they can all see the monitor IPs, which
could be a security issue for us. Kindly let us know if there is any
solution for this.



Regards

Prabu GJ



Re: [ceph-users] pgs stuck unclean after reweight

2016-07-20 Thread M Ranga Swami Reddy
OK... try the same with osd.32 and osd.13, one by one (do osd.32 first and
wait to see whether any rebalancing happens; if there are no changes, then
do it on osd.13).

thanks
Swami

On Wed, Jul 20, 2016 at 11:59 AM, Goncalo Borges
 wrote:
> Hi Swami.
>
> Did not make any difference.
>
> Cheers
>
> G.
>
>
>
> On 07/20/2016 03:31 PM, M Ranga Swami Reddy wrote:
>
> can you restart osd.32 and check the status?
>
> Thanks
> Swami
>
> On Wed, Jul 20, 2016 at 9:12 AM, Goncalo Borges
>  wrote:
>
> Hi All...
>
> Today we had a warning regarding 8 near-full OSDs. Looking at the OSD
> occupancy, 3 of them were above 90%. In order to solve the situation, I've
> decided to reweight those first using
>
> ceph osd crush reweight osd.1 2.67719
>
> ceph osd crush reweight osd.26 2.67719
>
> ceph osd crush reweight osd.53 2.67719
>
> Please note that I've started with a very conservative step since the
> original weight for all osds was 2.72710.
>
> After some rebalancing (which has now stopped) I've seen that the cluster is
> currently in the following state
>
> # ceph health detail
> HEALTH_WARN 4 pgs backfill_toofull; 4 pgs stuck unclean; recovery
> 20/39433323 objects degraded (0.000%); recovery 77898/39433323 objects
> misplaced (0.198%); 8 near full osd(s); crush map has legacy tunables
> (require bobtail, min is firefly)
> pg 6.e2 is stuck unclean for 9578.920997, current state
> active+remapped+backfill_toofull, last acting [49,38,11]
> pg 6.4 is stuck unclean for 9562.054680, current state
> active+remapped+backfill_toofull, last acting [53,6,26]
> pg 5.24 is stuck unclean for 10292.469037, current state
> active+remapped+backfill_toofull, last acting [32,13,51]
> pg 5.306 is stuck unclean for 10292.448364, current state
> active+remapped+backfill_toofull, last acting [44,7,59]
> pg 5.306 is active+remapped+backfill_toofull, acting [44,7,59]
> pg 5.24 is active+remapped+backfill_toofull, acting [32,13,51]
> pg 6.4 is active+remapped+backfill_toofull, acting [53,6,26]
> pg 6.e2 is active+remapped+backfill_toofull, acting [49,38,11]
> recovery 20/39433323 objects degraded (0.000%)
> recovery 77898/39433323 objects misplaced (0.198%)
> osd.1 is near full at 88%
> osd.14 is near full at 87%
> osd.24 is near full at 86%
> osd.26 is near full at 87%
> osd.37 is near full at 87%
> osd.53 is near full at 88%
> osd.56 is near full at 85%
> osd.62 is near full at 87%
>
>crush map has legacy tunables (require bobtail, min is firefly); see
> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>
> Not sure if it is worthwhile to mention, but after upgrading to Jewel, our
> cluster shows the warnings regarding tunables. We still have not migrated to
> the optimal tunables because the cluster will be very actively used during
> the next 3 weeks (due to one of the main conferences in our area), and we
> prefer to do that migration after this peak period.
>
>
> I am unsure what happened during the rebalancing, but the mapping of these 4
> stuck PGs seems strange; namely, the up and acting OSDs are different.
>
> # ceph pg dump_stuck unclean
> ok
> pg_stat  state                             up          up_primary  acting      acting_primary
> 6.e2     active+remapped+backfill_toofull  [8,53,38]   8           [49,38,11]  49
> 6.4      active+remapped+backfill_toofull  [53,24,6]   53          [53,6,26]   53
> 5.24     active+remapped+backfill_toofull  [32,13,56]  32          [32,13,51]  32
> 5.306    active+remapped+backfill_toofull  [44,60,26]  44          [44,7,59]   44
>
> # ceph pg map 6.e2
> osdmap e1054 pg 6.e2 (6.e2) -> up [8,53,38] acting [49,38,11]
>
> # ceph pg map 6.4
> osdmap e1054 pg 6.4 (6.4) -> up [53,24,6] acting [53,6,26]
>
> # ceph pg map 5.24
> osdmap e1054 pg 5.24 (5.24) -> up [32,13,56] acting [32,13,51]
>
> # ceph pg map 5.306
> osdmap e1054 pg 5.306 (5.306) -> up [44,60,26] acting [44,7,59]
>
>
> To complete this information, I am also sending the output of a pg query for
> one of these problematic PGs (ceph pg 5.306 query) after this email.
>
> What should be the procedure to try to recover those PGs before continuing
> with the reweighting?
>
> Thank you in advance
> Goncalo
>
>
> # ceph pg  5.306 query
> {
> "state": "active+remapped+backfill_toofull",
> "snap_trimq": "[]",
> "epoch": 1054,
> "up": [
> 44,
> 60,
> 26
> ],
> "acting": [
> 44,
> 7,
> 59
> ],
> "backfill_targets": [
> "26",
> "60"
> ],
> "actingbackfill": [
> "7",
> "26",
> "44",
> "59",
> "60"
> ],
> "info": {
> "pgid": "5.306",
> "last_update": "1005'55174",
> "last_complete": "1005'55174",
> "log_tail": "1005'52106",
> "last_user_version": 55174,
> "last_backfill": "MAX",
> "last_backfill_bitwise": 0,
> "purged_snaps": "[]",
> "history": {
> "epoch_created": 339,
> 

Re: [ceph-users] pgs stuck unclean after reweight

2016-07-20 Thread Goncalo Borges

Hi Swami.

Did not make any difference.

Cheers

G.



On 07/20/2016 03:31 PM, M Ranga Swami Reddy wrote:

can you restart osd.32 and check the status?

Thanks
Swami

On Wed, Jul 20, 2016 at 9:12 AM, Goncalo Borges
 wrote:

Hi All...

Today we had a warning regarding 8 near-full OSDs. Looking at the OSD
occupancy, 3 of them were above 90%. In order to solve the situation, I've
decided to reweight those first using

 ceph osd crush reweight osd.1 2.67719

 ceph osd crush reweight osd.26 2.67719

 ceph osd crush reweight osd.53 2.67719

Please note that I've started with a very conservative step since the
original weight for all osds was 2.72710.
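
(As a hedged aside, not part of the original message: a sketch of how such
small reweight steps can be applied and re-checked in one go; the OSD IDs and
step value are simply the ones quoted above, not a recommendation:

# Nudge the over-full OSDs down by a small CRUSH-weight step, then let the
# cluster settle before deciding on the next step.
for osd in 1 26 53; do
    ceph osd crush reweight osd.${osd} 2.67719
done
ceph -s        # wait for rebalancing/backfill to finish
ceph osd df    # recheck per-OSD utilisation before lowering the weights further

Depending on the release, ceph osd reweight-by-utilization may also be worth a
look for this kind of incremental rebalancing.)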

After some rebalancing (which has now stopped) I've seen that the cluster is
currently in the following state

# ceph health detail
HEALTH_WARN 4 pgs backfill_toofull; 4 pgs stuck unclean; recovery
20/39433323 objects degraded (0.000%); recovery 77898/39433323 objects
misplaced (0.198%); 8 near full osd(s); crush map has legacy tunables
(require bobtail, min is firefly)
pg 6.e2 is stuck unclean for 9578.920997, current state
active+remapped+backfill_toofull, last acting [49,38,11]
pg 6.4 is stuck unclean for 9562.054680, current state
active+remapped+backfill_toofull, last acting [53,6,26]
pg 5.24 is stuck unclean for 10292.469037, current state
active+remapped+backfill_toofull, last acting [32,13,51]
pg 5.306 is stuck unclean for 10292.448364, current state
active+remapped+backfill_toofull, last acting [44,7,59]
pg 5.306 is active+remapped+backfill_toofull, acting [44,7,59]
pg 5.24 is active+remapped+backfill_toofull, acting [32,13,51]
pg 6.4 is active+remapped+backfill_toofull, acting [53,6,26]
pg 6.e2 is active+remapped+backfill_toofull, acting [49,38,11]
recovery 20/39433323 objects degraded (0.000%)
recovery 77898/39433323 objects misplaced (0.198%)
osd.1 is near full at 88%
osd.14 is near full at 87%
osd.24 is near full at 86%
osd.26 is near full at 87%
osd.37 is near full at 87%
osd.53 is near full at 88%
osd.56 is near full at 85%
osd.62 is near full at 87%

crush map has legacy tunables (require bobtail, min is firefly); see
http://ceph.com/docs/master/rados/operations/crush-map/#tunables

Not sure if it is worthwhile to mention, but after upgrading to Jewel, our
cluster shows the warnings regarding tunables. We still have not migrated to
the optimal tunables because the cluster will be very actively used during
the next 3 weeks (due to one of the main conferences in our area), and we
prefer to do that migration after this peak period.


I am unsure what happened during the rebalancing, but the mapping of these 4
stuck PGs seems strange; namely, the up and acting OSDs are different.

# ceph pg dump_stuck unclean
ok
pg_stat  state                             up          up_primary  acting      acting_primary
6.e2     active+remapped+backfill_toofull  [8,53,38]   8           [49,38,11]  49
6.4      active+remapped+backfill_toofull  [53,24,6]   53          [53,6,26]   53
5.24     active+remapped+backfill_toofull  [32,13,56]  32          [32,13,51]  32
5.306    active+remapped+backfill_toofull  [44,60,26]  44          [44,7,59]   44

# ceph pg map 6.e2
osdmap e1054 pg 6.e2 (6.e2) -> up [8,53,38] acting [49,38,11]

# ceph pg map 6.4
osdmap e1054 pg 6.4 (6.4) -> up [53,24,6] acting [53,6,26]

# ceph pg map 5.24
osdmap e1054 pg 5.24 (5.24) -> up [32,13,56] acting [32,13,51]

# ceph pg map 5.306
osdmap e1054 pg 5.306 (5.306) -> up [44,60,26] acting [44,7,59]


To complete this information, I am also sending the output of a pg query for
one of these problematic PGs (ceph pg 5.306 query) after this email.

What should be the procedure to try to recover those PGs before continuing
with the reweighting?

Thank you in advance
Goncalo


# ceph pg  5.306 query
{
 "state": "active+remapped+backfill_toofull",
 "snap_trimq": "[]",
 "epoch": 1054,
 "up": [
 44,
 60,
 26
 ],
 "acting": [
 44,
 7,
 59
 ],
 "backfill_targets": [
 "26",
 "60"
 ],
 "actingbackfill": [
 "7",
 "26",
 "44",
 "59",
 "60"
 ],
 "info": {
 "pgid": "5.306",
 "last_update": "1005'55174",
 "last_complete": "1005'55174",
 "log_tail": "1005'52106",
 "last_user_version": 55174,
 "last_backfill": "MAX",
 "last_backfill_bitwise": 0,
 "purged_snaps": "[]",
 "history": {
 "epoch_created": 339,
 "last_epoch_started": 1016,
 "last_epoch_clean": 996,
 "last_epoch_split": 0,
 "last_epoch_marked_full": 0,
 "same_up_since": 1015,
 "same_interval_since": 1015,
 "same_primary_since": 928,
 "last_scrub": "1005'55169",
 "last_scrub_stamp": "2016-07-19 14:31:45.790871",
 "last_deep_scrub": "1005'55169",
 "last_deep_scrub_stamp": "2016-07-19 14:31:45.790871",