Re: [ceph-users] RBD image format v1 EOL ...

2019-02-22 Thread koukou73gr

On 2019-02-20 17:38, Mykola Golub wrote:


Note, even if rbd supported live (without any downtime) migration, you
would still need to restart the client after the upgrade to a new
librbd with migration support.


You could probably get the client running against a new librbd
version by live migrating the VM to an updated hypervisor.


At least, this is what I have been doing so far when updating Ceph
client libraries with zero downtime.
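For what it's worth, with libvirt that boils down to something like the
following (a minimal sketch; the domain and destination host names are
made up):

# migrate the running guest to a host that already carries the new librbd;
# the qemu process started on the destination links against the updated library
virsh migrate --live --persistent myvm qemu+ssh://new-hypervisor/system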


-K.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG auto repair with BlueStore

2018-11-15 Thread koukou73gr
Are there any means to notify the administrator that an auto-repair has 
taken place?


-K.


On 2018-11-15 20:45, Mark Schouten wrote:

As a user, I’m very surprised that this isn’t a default setting.

Mark Schouten


On 15 Nov 2018, at 18:40, Wido den Hollander wrote:

Hi,

This question is actually still outstanding. Is there any good reason to
keep auto repair for scrub errors disabled with BlueStore?

I couldn't think of a reason when using size=3 and min_size=2, so just
wondering.

Thanks!

Wido


On 8/24/18 8:55 AM, Wido den Hollander wrote:
Hi,

osd_scrub_auto_repair still defaults to false and I was wondering how we
think about enabling this feature by default.

Would we say it's safe to enable this with BlueStore?

Wido
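
(For reference, the setting under discussion can be flipped at runtime; a
sketch, to be made persistent via ceph.conf as noted in the comment:)

# apply to all running OSDs; add "osd scrub auto repair = true" under [osd]
# in ceph.conf to keep it across restarts
ceph tell osd.* injectargs '--osd_scrub_auto_repair=true'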

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-02 Thread koukou73gr
The scenario is actually a bit different, see:

Let's assume size=2, min_size=1
-We are looking at pg "A" acting [1, 2]
-osd 1 goes down
-osd 2 accepts a write for pg "A"
-osd 2 goes down
-osd 1 comes back up, while osd 2 still down
-osd 1 has no way to know osd 2 accepted a write in pg "A"
-osd 1 accepts a new write to pg "A"
-osd 2 comes back up.

Bang! osd 1 and osd 2 now have different views of pg "A", but both claim to
have current data.

-K.

On 2017-11-01 20:27, Denes Dolhay wrote:
> Hello,
> 
> I have a trick question for Mr. Turner's scenario:
> Let's assume size=2, min_size=1
> -We are looking at pg "A" acting [1, 2]
> -osd 1 goes down, OK
> -osd 1 comes back up, backfill of pg "A" commences from osd 2 to osd 1, OK
> -osd 2 goes down (and therefore pg "A" 's backfill to osd 1 is
> incomplete and stopped) not OK, but this is the case...
> --> In this event, why does osd 1 accept IO to pg "A" knowing full well
> that its data is outdated and will cause an inconsistent state?
> Wouldn't it be prudent to deny io to pg "A" until either
> -osd 2 comes back (therefore we have a clean osd in the acting group)
> ... backfill would continue to osd 1 of course
> -or data in pg "A" is manually marked as lost, and then continues
> operation from osd 1 's (outdated) copy?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore with SSD-backed DBs; what if the SSD fails?

2017-10-25 Thread koukou73gr
On 2017-10-25 11:21, Wido den Hollander wrote:
> 
>> On 25 October 2017 at 5:58, Christian Sarrasin wrote:
>>
>> The one thing I'm still wondering about is failure domains.  With
>> Filestore and SSD-backed journals, an SSD failure would kill writes but
>> OSDs were otherwise still whole.  Replacing the failed SSD quickly would
>> get you back on your feet with relatively little data movement.
>>
> 
> Not true. If you lose your OSD's journal with FileStore without a clean 
> shutdown of the OSD, you lose the OSD. You'd have to rebalance the complete 
> OSD.

Could you crosscheck please? Because this
http://ceph.com/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/
suggests otherwise.

-K.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hard disk bad manipulation: journal corruption and stale pgs

2017-06-05 Thread koukou73gr

Is your min_size at least 2? Is it just one OSD that is affected?

If yes, and if it is only the journal that is corrupt but the actual OSD
store is intact (although now lagging behind in writes), and you do have
healthy copies of its PGs elsewhere (hence the min_size requirement), you
could resolve this situation by:

1) ensuring the OSD with the corrupt journal is stopped
2) recreating the journal
3) starting the OSD again.

The OSD should peer its PGs and bring them on par with the other copies
and the cluster should return to healthy state again.

See here (
http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/
) for a more detailed walkthrough. It talks about failed SSD with
journals but the situation is the same with regards to any journal failure.
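
For reference, the core of steps 1-3 boils down to something like this (a
sketch; <ID> is the affected OSD's id, and the OSD's journal symlink must
already point at the new/zeroed journal partition):

service ceph stop osd.<ID>        # or: systemctl stop ceph-osd@<ID>
ceph-osd -i <ID> --mkjournal      # write a fresh, empty journal
service ceph start osd.<ID>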

Now, you mentioned having set the weight to 0 in the meantime; I have no
idea how that will affect the above procedure, so maybe wait for somebody
else to comment on this.

Hope this helps a bit,

-K.


On 2017-06-05 15:32, Zigor Ozamiz wrote:
> Hi everyone,
> 
> Due to two beginner's big mistakes handling and recovering a hard disk,
> we have reached to a situation in which the system tells us that the
> journal of an osd is corrupted.
> 
> 2017-05-30 17:59:21.318644 7fa90757a8c0  1 journal _open
> /dev/disk/by-id/ata-INTEL_SSDSC2BA200G4_BTHV5281013C200MGN-part3 fd 20:
> 2048000
>  bytes, block size 4096 bytes, directio = 1, aio = 1
> 2017-05-30 17:59:21.36 7fa90757a8c0 -1 journal Unable to read past
> sequence 3219747309 but header indicates the journal has committed up
> through 3219750285, journal is corrupt
> 2017-05-30 17:59:21.325946 7fa90757a8c0 -1 os/FileJournal.cc: In
> function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&,
> bool*)' thread 7fa90757a8c0 time 2017-05-30 17:59:21.322296
> os/FileJournal.cc: 1853: FAILED assert(0)
> 
> We think that the only way we can reuse the osd is wiping and starting
> it again. But before doing that, we lowered the weight to 0 and
> waited for the cluster to recover itself. Since that moment, several days
> have passed but some pgs still have the "stale+active+clean" state.
> 
> pg_stat  state               up   up_primary  acting  acting_primary
> 1.b5     stale+active+clean  [0]  0           [0]     0
> 1.22     stale+active+clean  [0]  0           [0]     0
> 1.53     stale+active+clean  [0]  0           [0]     0
> 1.198    stale+active+clean  [0]  0           [0]     0
> 1.199    stale+active+clean  [0]  0           [0]     0
> 1.4e     stale+active+clean  [0]  0           [0]     0
> 1.4f     stale+active+clean  [0]  0           [0]     0
> 1.a7     stale+active+clean  [0]  0           [0]     0
> 1.1ef    stale+active+clean  [0]  0           [0]     0
> 1.160    stale+active+clean  [0]  0           [0]     0
> 18.4     stale+active+clean  [0]  0           [0]     0
> 1.15e    stale+active+clean  [0]  0           [0]     0
> 1.a1     stale+active+clean  [0]  0           [0]     0
> 1.18a    stale+active+clean  [0]  0           [0]     0
> 1.156    stale+active+clean  [0]  0           [0]     0
> 1.6b     stale+active+clean  [0]  0           [0]     0
> 1.c6     stale+active+clean  [0]  0           [0]     0
> 1.1b1    stale+active+clean  [0]  0           [0]     0
> 1.123    stale+active+clean  [0]  0           [0]     0
> 1.17a    stale+active+clean  [0]  0           [0]     0
> 1.bc     stale+active+clean  [0]  0           [0]     0
> 1.179    stale+active+clean  [0]  0           [0]     0
> 1.177    stale+active+clean  [0]  0           [0]     0
> 1.b8     stale+active+clean  [0]  0           [0]     0
> 1.2a     stale+active+clean  [0]  0           [0]     0
> 1.117    stale+active+clean  [0]  0           [0]     0
> 
> When executing a "ceph pg query PGID" or "ceph pg PGID list_missing", we
> get the error "Error ENOENT: I do not have pgid PGID".
> 
> Given that we are using replication 3, there is no data loss, is there?
> How could we proceed to solve the problem?
> 
> - Running: ceph osd lost OSDID; as recommended in some previous
> consultation in this list.
> - Recreating the pgs by hand via: ceph pg force_create PGID
> - Making the wipe directly.
> 
> Thanks in advance,
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD exclusive-lock and lqemu/librbd

2017-06-02 Thread koukou73gr
Coming back to this, with Jason's insight it was quickly revealed that
my problem was in reality a cephx authentication permissions issue.

Specifically, exclusive-lock requires a cephx user with class-write
access to the pool where the image resides. This wasn't clear in the
documentation, and the user I was using only had class-read access.

Once a cephx user with proper permissions was used, the device backed by
the exclusive-lock enabled rbd image started to behave.

I'm really sorry for the red herring and thank you all for helping me
expand my understanding of Ceph.
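
For the record, the fix boiled down to something like this (a sketch; the
client and pool names reflect my setup):

# grant write access to the RBD object classes as well ("rwx" for short)
ceph auth caps client.libvirt \
    mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=libvirt-pool'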

-K.


On 2017-06-02 14:07, Peter Maloney wrote:
> On 06/02/17 12:25, koukou73gr wrote:
>> On 2017-06-02 13:01, Peter Maloney wrote:
>>>> Is it easy for you to reproduce it? I had the same problem, and the same
>>>> solution. But it isn't easy to reproduce... Jason Dillaman asked me for
>>>> a gcore dump of a hung process but I wasn't able to get one. Can you do
>>>> that, and when you reply, CC  Jason Dillaman  ?
>>> I mean a hung qemu process on the vm host (the one that uses librbd).
>>> And I guess that should be TO rather than CC.
>>>
>> Peter,
>>
>> Can it be that my situation is different?
>>
>> In my case the guest/qemu process itself does not hang. The guest root
>> filesystem resides in an rbd image w/o exclusive-lock enabled (the
>> pre-existing kind I described).
> Of course it could be different, but it seems the same so far... same
> solution, and same warnings in the guest, just it takes some time before
> the guest totally hangs.
> 
> Sometimes the OS seems ok but has those warnings...
> then worse is you can see the disk looks busy in iostat like 100% but
> has low activity like 1 w/s...
> and worst is that you can't even get anything to run or any screen
> output or keyboard input at all, and kill on the qemu process won't even
> work at that point, except with -9.
> 
> And sometimes you can get the exact same symptoms with a curable
> problem... like if you stop too many osds and min_size is not reached
> for just one pg that the image uses, then it looks like it works, until
> it hits that bad pg, then the above symptoms happen. And then most of
> the time the VM recovers when the osds are up again, but sometimes not.
> But since you mentioned exclusive lock, I still think it seems the same
> or highly related.
> 
>>
>> The problem surfaced when additional storage was attached to the guest,
>> through a new rbd image created with exclusive-lock as it is the default
>> on Jewel.
>>
>> Problem being when parted/fdisk is run on that device, they hang as
>> reported. On the other hand,
>>
>> dd if=/dev/sdb of=/tmp/lala count=512
>>
>> has no problem completing, While the reverse,
>>
>> dd if=/tmp/lala of=/dev/sdb count=512
>>
>> hangs indefinitely. While in this state, I can still ssh to the guest
>> and work as long as I don't touch the new device. It appears that when a
>> write to the device backed by the exclusive-lock featured image hangs, a
>> read to it will hang as well.
>>
>> -K.
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD exclusive-lock and lqemu/librbd

2017-06-02 Thread koukou73gr
On 2017-06-02 13:22, Peter Maloney wrote:
> On 06/02/17 12:06, koukou73gr wrote:
>> Thanks for the reply.
>>
>> Easy?
>> Sure, it happens reliably every time I boot the guest with
>> exclusive-lock on :)
> If it's that easy, also try with only exclusive-lock, and not object-map
> nor fast-diff. And also with one or the other of those.

Already did, before reporting.

It is the exclusive-lock bit that needs to be turned off for things to
start working. Removing only fast-diff, or object-map and fast-diff,
makes no difference in this regard.


-K.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD exclusive-lock and lqemu/librbd

2017-06-02 Thread koukou73gr

On 2017-06-02 13:01, Peter Maloney wrote:
>> Is it easy for you to reproduce it? I had the same problem, and the same
>> solution. But it isn't easy to reproduce... Jason Dillaman asked me for
>> a gcore dump of a hung process but I wasn't able to get one. Can you do
>> that, and when you reply, CC  Jason Dillaman  ?
> I mean a hung qemu process on the vm host (the one that uses librbd).
> And I guess that should be TO rather than CC.
>

Peter,

Can it be that my situation is different?

In my case the guest/qemu process itself does not hang. The guest root
filesystem resides in an rbd image w/o exclusive-lock enabled (the
pre-existing kind I described).

The problem surfaced when additional storage was attached to the guest,
through a new rbd image created with exclusive-lock as it is the default
on Jewel.

Problem being when parted/fdisk is run on that device, they hang as
reported. On the other hand,

dd if=/dev/sdb of=/tmp/lala count=512

has no problem completing, While the reverse,

dd if=/tmp/lala of=/dev/sdb count=512

hangs indefinitely. While in this state, I can still ssh to the guest
and work as long as I don't touch the new device. It appears that when a
write to the device backed by the exclusive-lock featured image hangs, a
read to it will hang as well.

-K.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD exclusive-lock and lqemu/librbd

2017-06-02 Thread koukou73gr
Thanks for the reply.

Easy?
Sure, it happens reliably every time I boot the guest with
exclusive-lock on :)

I'll need some walkthrough on the gcore part though!

-K.


On 2017-06-02 12:59, Peter Maloney wrote:
> On 06/01/17 17:12, koukou73gr wrote:
>> Hello list,
>>
>> Today I had to create a new image for a VM. This was the first time
>> since our cluster was updated from Hammer to Jewel. So far I had just been
>> copying an existing golden image and resizing it as appropriate. But this
>> time I used rbd create.
>>
>> So I "rbd create"d a 2T image and attached it to an existing VM guest
>> with librbd using:
>> [libvirt <disk> XML stripped by the list archive]
>>
>>
>> Booted the guest and tried to partition the new drive from inside the
>> guest. That's it, parted (and anything else for that matter) that tried
>> to access the new disk would freeze. After 2 minutes the kernel would
>> start complaining:
>>
>> [  360.212391] INFO: task parted:1836 blocked for more than 120 seconds.
>> [  360.216001]   Not tainted 4.4.0-78-generic #99-Ubuntu
>> [  360.218663] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
> Is it easy for you to reproduce it? I had the same problem, and the same
> solution. But it isn't easy to reproduce... Jason Dillaman asked me for
> a gcore dump of a hung process but I wasn't able to get one. Can you do
> that, and when you reply, CC  Jason Dillaman  ?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD exclusive-lock and lqemu/librbd

2017-06-01 Thread koukou73gr

Hello list,

Today I had to create a new image for a VM. This was the first time
since our cluster was updated from Hammer to Jewel. So far I had just been
copying an existing golden image and resizing it as appropriate. But this
time I used rbd create.

So I "rbd create"d a 2T image and attached it to an existing VM guest
with librbd using:

[libvirt <disk> XML stripped by the list archive]


Booted the guest and tried to partition the new drive from inside the
guest. That's it: parted (and anything else, for that matter) that tried
to access the new disk would freeze. After 2 minutes the kernel would
start complaining:

[  360.212391] INFO: task parted:1836 blocked for more than 120 seconds.
[  360.216001]   Not tainted 4.4.0-78-generic #99-Ubuntu
[  360.218663] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

After much head-banging and trial and error, I finally thought of checking
the enabled rbd features of an existing image versus the new one.

pre-existing: layering, striping
new: layering, exclusive-lock, object-map, fast-diff, deep-flatten

Disabling exclusive-lock (and fast-diff and object-map before that)
would allow the new image to become usable in the guest at last.
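
For anyone hitting the same thing, the workaround was along these lines; new
images can also be created without those features in the first place (a
sketch; pool/image names are placeholders):

# drop the offending features from the existing image
rbd feature disable <pool>/<image> object-map fast-diff exclusive-lock

# or create new images with only layering enabled
rbd create --size 2048000 --image-feature layering <pool>/<new-image>
# (alternatively, set "rbd default features = 1" in the [client] section of ceph.conf)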

This is with:

ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
qemu-img version 2.6.0 (qemu-kvm-ev-2.6.0-28.el7_3.3.1), Copyright (c)
2004-2008 Fabrice Bellard

on a host running:
CentOS Linux release 7.3.1611 (Core)
Linux host-10-206-123-184.physics.auth.gr 3.10.0-327.36.2.el7.x86_64 #1
SMP Mon Oct 10 23:08:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

and a guest
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.2 LTS"
Linux srv-10-206-123-87.physics.auth.gr 4.4.0-78-generic #99-Ubuntu SMP
Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I vaguely remember references to problems when exclusive-lock was
enabled on rbd images, but Google didn't reveal much to me.

So what is it with exclusive lock? Why does it fail like this? Could you
please point me to some documentation on this behaviour?

Thanks for any feedback.

-K.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] - permission denied on journal after reboot

2017-02-13 Thread koukou73gr
On 2017-02-13 13:47, Wido den Hollander wrote:

> 
> The udev rules of Ceph should chown the journal to ceph:ceph if it's set to 
> the right partition UUID.
> 
> This blog shows it partially: 
> http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/
> 
> This is done by *95-ceph-osd.rules*, you might want to check the source of 
> that.
> 

Unfortunately the udev rules do not handle non-GPT partitioned disks.
This is because the partition typecode GUID is simply not supported on MBR
partitions.

If your journals live on an MBR disk you'll have to add some custom udev
rules yourself. This is what I did:

[root@ceph-10-206-123-182 ~]# cat
/etc/udev/rules.d/70-persisnent-ceph-journal.rules
KERNEL=="sdc5", SUBSYSTEM=="block", ATTRS{model}=="KINGSTON SV300S3",
OWNER="ceph", GROUP="ceph"
KERNEL=="sdc6", SUBSYSTEM=="block", ATTRS{model}=="KINGSTON SV300S3",
OWNER="ceph", GROUP="ceph"
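
To apply the rules without a reboot, something like this should do (a sketch):

# re-read the rules and re-run them against the existing block devices
udevadm control --reload
udevadm trigger --subsystem-match=block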

Cheers,

-K.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] virt-install into rbd hangs during Anaconda package installation

2017-02-07 Thread koukou73gr
On 2017-02-07 10:11, Tracy Reed wrote:
> Weird. Now the VMs that were hung in interruptable wait state have now
> disappeared. No idea why.

Have you tried the same procedure but with local storage instead?

-K.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.7 Hammer released

2016-05-17 Thread koukou73gr
Same here.

Warnings appeared for the OSDs still running the .6 version each time one of
the others was restarted onto the .7 version.

When the last .6 OSD host was upgraded, there were no more warnings
from the rest.

Cluster seems happy :)

-K.


On 05/17/2016 11:04 AM, Dan van der Ster wrote:
> Hi Sage et al,
>
> I'm updating our pre-prod cluster from 0.94.6 to 0.94.7 and after
> upgrading the ceph-mon's I'm getting loads of warnings like:
>
> 2016-05-17 10:01:29.314785 osd.76 [WRN] failed to encode map e103116
> with expected crc
>
> I've seen that error is whitelisted in the qa-suite:
> https://github.com/ceph/ceph-qa-suite/pull/602/files
>
> Is it really harmless? (This is the first time I've seen such a warning).
>
> Thanks in advance!
>
> Dan
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Advice on OSD upgrades

2016-04-14 Thread koukou73gr
If you have empty drive slots in your OSD hosts, I'd be tempted to
insert the new drive in a slot, set noout, shut down one OSD, unmount the
OSD directory, dd the old drive to the new one, remove the old drive, and
restart the OSD.

No rebalancing and minimal data movement when the OSD rejoins.
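
Per OSD, that is roughly the following (a sketch; <ID> and the device names
are placeholders):

ceph osd set noout                    # keep CRUSH from marking the OSD out and rebalancing
service ceph stop osd.<ID>            # or: systemctl stop ceph-osd@<ID>
umount /var/lib/ceph/osd/ceph-<ID>
dd if=/dev/OLD_DISK of=/dev/NEW_DISK bs=4M conv=noerror,sync
# swap the drives, remount the OSD directory, then:
service ceph start osd.<ID>
ceph osd unset noout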

-K.

On 04/14/2016 04:29 PM, Stephen Mercier wrote:
> Good morning,
> 
> We've been running a medium-sized (88 OSDs - all SSD) ceph cluster for
> the past 20 months. We're very happy with our experience with the
> platform so far.
> 
> Shortly, we will be embarking on an initiative to replace all 88 OSDs
> with new drives (Planned maintenance and lifecycle replacement). Before
> we do so, however, I wanted to confirm with the community as to the
> proper order of operation to perform such a task.
> 
> The OSDs are divided evenly across an even number of hosts which are
> then divided evenly between 2 cabinets in 2 physically separate
> locations. The plan is to replace the OSDs, one host at a time, cycling
> back and forth between cabinets, replacing one host per week, or every 2
> weeks (Depending on the amount of time the crush rebalancing takes).
> 
> For each host, the plan was to mark the OSDs as out, one at a time,
> closely monitoring each of them, moving to the next OSD one the current
> one is balanced out. Once all OSDs are successfully marked as out, we
> will then delete those OSDs from the cluster, shutdown the server,
> replace the physical drives, and once rebooted, add the new drives to
> the cluster as new OSDs using the same method we've used previously,
> doing so one at a time to allow for rebalancing as they rejoin the cluster.
> 
> My questions are…Does this process sound correct? Should I also mark the
> OSDs as down when I mark them as out? Are there any steps I'm
> overlooking in this process?
> 
> Any advice is greatly appreciated.
> 
> Cheers,
> -
> Stephen Mercier | Sr. Systems Architect
> Attainia Capital Planning Solutions (ACPS)
> O: (650)241-0567, 727 | TF: (866)288-2464, 727
> stephen.merc...@attainia.com  |
> www.attainia.com 
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1 pg stuck

2016-03-24 Thread koukou73gr
The space on the hosts in rack2 does not add up to cover the space in rack1.
After enough data is written to the cluster, all the space on rack2 will be
allocated and the cluster won't be able to find room to map the 3rd replica
of new data to.

Bottom line: spread your big disks across all 4 hosts, or add some more
disks/OSDs to the hosts in rack2. As a last resort, you may decrease the
failure domain to 'osd' instead of the default 'host', but that is very
dangerous for a production cluster.
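
If you really do go the failure-domain route, the usual way is to decompile
the crushmap, edit the rule, and inject it back; a rough sketch:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# in the relevant rule, change "step chooseleaf firstn 0 type host"
#                           to "step chooseleaf firstn 0 type osd"
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new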

-K.


On 03/24/2016 04:36 PM, yang sheng wrote:
> Hi all,
> 
> I am testing the ceph right now using 4 servers with 8 OSDs (all OSDs
> are up and in). I have 3 pools in my cluster (image pool, volume pool
> and default rbd pool), both image and volume pool have replication size
> =3. Based on the pg equation, there are 448 pgs in my cluster. 
> 
> $ ceph osd tree
> ID WEIGHT   TYPE NAME                          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 16.07797 root default
> -5 14.38599     rack rack1
> -2  7.17599         host psusnjhhdlc7iosstb001
>  0  3.53899             osd.0                       up  1.0              1.0
>  1  3.63699             osd.1                       up  1.0              1.0
> -3  7.20999         host psusnjhhdlc7iosstb002
>  2  3.63699             osd.2                       up  1.0              1.0
>  3  3.57300             osd.3                       up  1.0              1.0
> -6  1.69199     rack rack2
> -4  0.83600         host psusnjhhdlc7iosstb003
>  5  0.43500             osd.5                       up  1.0              1.0
>  4  0.40099             osd.4                       up  1.0              1.0
> -7  0.85599         host psusnjhhdlc7iosstb004
>  6  0.40099             osd.6                       up  1.0                0
>  7  0.45499             osd.7                       up  1.0                0
> 
> $ ceph osd dump
> pool 0 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 745 flags hashpspool
> stripe_width 0
> pool 3 'imagesliberty' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 777 flags
> hashpspool stripe_width 0
> removed_snaps [1~1,8~c]
> pool 4 'volumesliberty' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 256 pgp_num 256 last_change 776 flags
> hashpspool stripe_width 0
> removed_snaps [1~1,15~14,2a~1,2c~1,2e~24,57~2,5a~18,74~2,78~1,94~5,b7~2]
> 
> 
> Right now, the ceph health is HEALTH_WARN. I use "ceph health detail"
>  to dump the information, and there is a pg stuck.
> 
> $ ceph -s:
> cluster 2e906379-f211-4329-8faf-a8e7600b8418
>  health HEALTH_WARN
> 1 pgs degraded
> 1 pgs stuck degraded
> 1 pgs stuck inactive
> 1 pgs stuck unclean
> 1 pgs stuck undersized
> 1 pgs undersized
> recovery 23/55329 objects degraded (0.042%)
>  monmap e14: 2 mons at
> {psusnjhhdlc7ioscom002=192.168.2.62:6789/0,psusnjhhdlc7ioscon002=192.168.2.12:6789/0
> }
> election epoch 106, quorum 0,1
> psusnjhhdlc7ioscon002,psusnjhhdlc7ioscom002
>  osdmap e776: 8 osds: 8 up, 8 in
> flags sortbitwise
>   pgmap v519644: 448 pgs, 3 pools, 51541 MB data, 18443 objects
> 170 GB used, 16294 GB / 16464 GB avail
> 23/55329 objects degraded (0.042%)
>  447 active+clean
>1 undersized+degraded+peered
> 
> 
> $ ceph health detail
> HEALTH_WARN 1 pgs degraded; 1 pgs stuck unclean; 1 pgs undersized;
> recovery 23/55329 objects degraded (0.042%)
> pg 3.d is stuck unclean for 58161.177025, current state
> active+undersized+degraded, last acting [1,3]
> pg 3.d is active+undersized+degraded, acting [1,3]
> recovery 23/55329 objects degraded (0.042%)
> 
> If I am right, the pg 3.d has only 2 replicas, primary in OSD.1 and
> secondary in OSD.3. There is no 3rd replica in the cluster. That's why
> it gives the unhealthy warning.  
> 
> I tried to decrease the replication size =2 for image pool and the stuck
> pg disappeared. After I change the size back to 3, still the ceph didn't
> create the 3rd replica for pg 3.d.
> 
> I also tried to shutdown Server 0 which has OSD.0 and OSD.1 which let pg
> d.3 has only 1 replica in the cluster. Still it didn't create another
> copy even I set size =3 and min_size=2. Also, there are more pg in
> degraded undersized or unclean mode.
> 
> $ ceph pg map 3.d
> osdmap e796 pg 3.d (3.d) -> up [3] acting [3]
> 
> $ ceph -s
> cluster 2e906379-f211-4329-8faf

Re: [ceph-users] dealing with the full osd / help reweight

2016-03-24 Thread koukou73gr
What is your pool size? 304 pgs sounds awfully small for 20 OSDs.
More pgs will help distribute the data (and full pgs) more evenly.

But with a full or near-full OSD in hand, increasing pgs is a no-no
operation. If you search the list archive, I believe there was a thread a
month or so ago which provided a walkthrough of sorts for dealing
with uneven distribution and a full OSD.
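
In the meantime, checking the per-OSD utilisation and gently reweighting the
full one is the usual first step; a sketch (the OSD id and weight are
placeholders):

ceph osd df tree            # per-OSD utilisation and PG counts
ceph osd reweight 12 0.9    # temporarily push some PGs off the full OSD (here, osd.12)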

-K.


On 03/24/2016 01:54 PM, Jacek Jarosiewicz wrote:
> disk usage on the full osd is as below. What are the *_TEMP directories
> for? How can I make sure which pg directories are safe to remove?
> 
> [root@cf04 current]# du -hs *
> 156G    0.14_head
> 156G    0.21_head
> 155G    0.32_head
> 157G    0.3a_head
> 155G    0.e_head
> 156G    0.f_head
> 40K     10.2_head
> 4.0K    11.3_head
> 218G    14.13_head
> 218G    14.15_head
> 218G    14.1b_head
> 219G    14.1f_head
> 9.1G    14.29_head
> 219G    14.2a_head
> 75G     14.2d_head
> 125G    14.2e_head
> 113G    14.32_head
> 163G    14.33_head
> 218G    14.35_head
> 151G    14.39_head
> 218G    14.3b_head
> 103G    14.3d_head
> 217G    14.3f_head
> 219G    14.a_head
> 773M    17.0_head
> 814M    17.10_head
> 4.0K    17.10_TEMP
> 747M    17.19_head
> 4.0K    17.19_TEMP
> 669M    17.1b_head
> 659M    17.1c_head
> 638M    17.1f_head
> 681M    17.30_head
> 4.0K    17.30_TEMP
> 721M    17.34_head
> 695M    17.3d_head
> 726M    17.3e_head
> 734M    17.3f_head
> 4.0K    17.3f_TEMP
> 670M    17.d_head
> 597M    17.e_head
> 4.0K    17.e_TEMP
> 4.0K    1.7_head
> 34M     5.1_head
> 34M     5.6_head
> 4.0K    9.6_head
> 4.0K    commit_op_seq
> 30M     meta
> 0       nosnap
> 614M    omap
> 
> 
> 
> On 03/24/2016 10:11 AM, Jacek Jarosiewicz wrote:
>> Hi!
>>
>> I have a problem with the osds getting full on our cluster.
>>
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help for PG problem

2016-03-23 Thread koukou73gr
Are you running with the default failure domain of 'host'?

If so, with a pool size of 3 and your 20 OSDs physically on only 2 hosts,
Ceph is unable to find a 3rd host to map the 3rd replica to.

Either add a host and move some OSDs there, or reduce the pool size to 2.
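
The latter is a one-liner (a sketch; substitute your pool's name):

ceph osd pool set <pool> size 2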

-K.

On 03/23/2016 02:17 PM, Zhang Qiang wrote:
> And here's the osd tree if it matters.
> 
> ID WEIGHT   TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 22.39984 root default
> -2 21.39984     host 10
>  0  1.06999         osd.0         up  1.0          1.0
>  1  1.06999         osd.1         up  1.0          1.0
>  2  1.06999         osd.2         up  1.0          1.0
>  3  1.06999         osd.3         up  1.0          1.0
>  4  1.06999         osd.4         up  1.0          1.0
>  5  1.06999         osd.5         up  1.0          1.0
>  6  1.06999         osd.6         up  1.0          1.0
>  7  1.06999         osd.7         up  1.0          1.0
>  8  1.06999         osd.8         up  1.0          1.0
>  9  1.06999         osd.9         up  1.0          1.0
> 10  1.06999         osd.10        up  1.0          1.0
> 11  1.06999         osd.11        up  1.0          1.0
> 12  1.06999         osd.12        up  1.0          1.0
> 13  1.06999         osd.13        up  1.0          1.0
> 14  1.06999         osd.14        up  1.0          1.0
> 15  1.06999         osd.15        up  1.0          1.0
> 16  1.06999         osd.16        up  1.0          1.0
> 17  1.06999         osd.17        up  1.0          1.0
> 18  1.06999         osd.18        up  1.0          1.0
> 19  1.06999         osd.19        up  1.0          1.0
> -3  1.0          host 148_96
>  0  1.0              osd.0         up  1.0          1.0
> 
> On Wed, 23 Mar 2016 at 19:10 Zhang Qiang  > wrote:
> 
> Oliver, Goncalo, 
> 
> Sorry to disturb again, but recreating the pool with a smaller
> pg_num didn't seem to work, now all 666 pgs are degraded + undersized.
> 
> New status:
> cluster d2a69513-ad8e-4b25-8f10-69c4041d624d
>  health HEALTH_WARN
> 666 pgs degraded
> 82 pgs stuck unclean
> 666 pgs undersized
>  monmap e5: 5 mons at
> 
> {1=10.3.138.37:6789/0,2=10.3.138.39:6789/0,3=10.3.138.40:6789/0,4=10.3.138.59:6789/0,GGZ-YG-S0311-PLATFORM-138=10.3.138.36:6789/0
> 
> }
> election epoch 28, quorum 0,1,2,3,4
> GGZ-YG-S0311-PLATFORM-138,1,2,3,4
>  osdmap e705: 20 osds: 20 up, 20 in
>   pgmap v1961: 666 pgs, 1 pools, 0 bytes data, 0 objects
> 13223 MB used, 20861 GB / 21991 GB avail
>  666 active+undersized+degraded
> 
> Only one pool and its size is 3. So I think according to the
> algorithm, (20 * 100) / 3 = 666 pgs is reasonable.
> 
> I updated health detail and also attached a pg query result on
> gist(https://gist.github.com/dotSlashLu/22623b4cefa06a46e0d4).
> 
> On Wed, 23 Mar 2016 at 09:01 Dotslash Lu  > wrote:
> 
> Hello Gonçalo,
> 
> Thanks for your reminding. I was just setting up the cluster for
> test, so don't worry, I can just remove the pool. And I learnt
> that since the replication number and pool number are related to
> pg_num, I'll consider them carefully before deploying any data. 
> 
> On Mar 23, 2016, at 6:58 AM, Goncalo Borges
>  > wrote:
> 
>> Hi Zhang...
>>
>> If I can add some more info, the change of PGs is a heavy
>> operation, and as far as i know, you should NEVER decrease
>> PGs. From the notes in pgcalc (http://ceph.com/pgcalc/):
>>
>> "It's also important to know that the PG count can be
>> increased, but NEVER decreased without destroying / recreating
>> the pool. However, increasing the PG Count of a pool is one of
>> the most impactful events in a Ceph Cluster, and should be
>> avoided for production clusters if possible."
>>
>> So, in your case, I would consider in adding more OSDs.
>>
>> Cheers
>> Goncalo
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help for PG problem

2016-03-23 Thread koukou73gr
You should have settled on the nearest power of 2, which for 666 is
512. Since you created the cluster and IIRC it is a testbed, you may as
well recreate it, however it will be less of a hassle to just
increase the pgs to the next power of two: 1024.

Your 20 OSDs appear to be equally sized in your crushmap, so ~150 pgs/OSD
should still be acceptable.
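
Something like this (a sketch; substitute the pool name, and remember to
raise pgp_num as well so the new PGs actually get remapped):

ceph osd pool set <pool> pg_num 1024
ceph osd pool set <pool> pgp_num 1024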

Hope you nail it this time :)

-K.


On 03/23/2016 01:10 PM, Zhang Qiang wrote:
> Oliver, Goncalo,
>
> Sorry to disturb again, but recreating the pool with a smaller pg_num
> didn't seem to work, now all 666 pgs are degraded + undersized.
>
> New status:
> cluster d2a69513-ad8e-4b25-8f10-69c4041d624d
>  health HEALTH_WARN
> 666 pgs degraded
> 82 pgs stuck unclean
> 666 pgs undersized
>  monmap e5: 5 mons at
>
{1=10.3.138.37:6789/0,2=10.3.138.39:6789/0,3=10.3.138.40:6789/0,4=10.3.138.59:6789/0,GGZ-YG-S0311-PLATFORM-138=10.3.138.36:6789/0
>
}
> election epoch 28, quorum 0,1,2,3,4
> GGZ-YG-S0311-PLATFORM-138,1,2,3,4
>  osdmap e705: 20 osds: 20 up, 20 in
>   pgmap v1961: 666 pgs, 1 pools, 0 bytes data, 0 objects
> 13223 MB used, 20861 GB / 21991 GB avail
>  666 active+undersized+degraded
>
> Only one pool and its size is 3. So I think according to the algorithm,
> (20 * 100) / 3 = 666 pgs is reasonable.
>
> I updated health detail and also attached a pg query result on
> gist(https://gist.github.com/dotSlashLu/22623b4cefa06a46e0d4).
>
> On Wed, 23 Mar 2016 at 09:01 Dotslash Lu  > wrote:
>
> Hello Gonçalo,
>
> Thanks for your reminding. I was just setting up the cluster for
> test, so don't worry, I can just remove the pool. And I learnt that
> since the replication number and pool number are related to pg_num,
> I'll consider them carefully before deploying any data.
>
> On Mar 23, 2016, at 6:58 AM, Goncalo Borges
> mailto:goncalo.bor...@sydney.edu.au>>
> wrote:
>
>> Hi Zhang...
>>
>> If I can add some more info, the change of PGs is a heavy
>> operation, and as far as i know, you should NEVER decrease PGs.
>> From the notes in pgcalc (http://ceph.com/pgcalc/):
>>
>> "It's also important to know that the PG count can be increased,
>> but NEVER decreased without destroying / recreating the pool.
>> However, increasing the PG Count of a pool is one of the most
>> impactful events in a Ceph Cluster, and should be avoided for
>> production clusters if possible."
>>
>> So, in your case, I would consider in adding more OSDs.
>>
>> Cheers
>> Goncalo
>
>
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help: pool not responding

2016-02-14 Thread koukou73gr
Have you tried restarting osd.0?

-K.

On 02/14/2016 09:56 PM, Mario Giammarco wrote:
> Hello,
> I am using ceph hammer under proxmox. 
> I have working cluster it is several month I am using it.
> For reasons yet to discover I am now in this situation:
> 
> HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean; 7 
> requests are blocked > 32 sec; 1 osds have slow requests
> pg 0.0 is stuck inactive for 3541712.92, current state incomplete, last 
> acting [0,1,3]
> pg 0.40 is stuck inactive for 1478467.695684, current state incomplete, 
> last acting [1,0,3]
> pg 0.3f is stuck inactive for 3541852.000546, current state incomplete, 
> last acting [0,3,1]
> pg 0.3b is stuck inactive for 3541865.897979, current state incomplete, 
> last acting [0,3,1]
> pg 0.0 is stuck unclean for 326.301120, current state incomplete, last 
> acting [0,1,3]
> pg 0.40 is stuck unclean for 326.301128, current state incomplete, last 
> acting [1,0,3]
> pg 0.3f is stuck unclean for 345.066879, current state incomplete, last 
> acting [0,3,1]
> pg 0.3b is stuck unclean for 379.201819, current state incomplete, last 
> acting [0,3,1]
> pg 0.40 is incomplete, acting [1,0,3]
> pg 0.3f is incomplete, acting [0,3,1]
> pg 0.3b is incomplete, acting [0,3,1]
> pg 0.0 is incomplete, acting [0,1,3]
> 7 ops are blocked > 2097.15 sec
> 7 ops are blocked > 2097.15 sec on osd.0
> 1 osds have slow requests
> 
> 
> Problem is that when I try to read or write to pool "rbd" (where I have all 
> my virtual machines) ceph starts to log "slow reads" and system hungs.
> If in the same cluster I create another pool and inside it I create an 
> image I can read and write correctly (and fast) so it seems the cluster is 
> working and only the pool is not working.
> 
> Can you help me?
> Thanks,
> Mario
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread koukou73gr
On 01/28/2016 03:44 PM, Simon Ironside wrote:

> Btw, using virtio-scsi devices as above and discard='unmap' above
> enables TRIM support. This means you can use fstrim or mount file
> systems with discard inside the VM to free up unused space in the image.

Doesn't discard require the pc-q35-rhel7 (or equivalent) guest machine
type, which in turn shoves a non-removable SATA AHCI device in the guest
which can't be frozen and thus disables guest live migration?

Has there been any change regarding this in a recent QEMU version?


Thanks,

-K.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-22 Thread koukou73gr
Even the cheapest stuff nowadays has some more or less decent wear-leveling
algorithm built into its controller, so this won't be a problem. Wear-leveling
algorithms cycle the blocks internally, so wear evens out across the whole
disk.

-K.

On 12/22/2015 06:57 PM, Alan Johnson wrote:
> I would also add that the journal activity is write intensive so a small part 
> of the drive would get excessive writes if the journal and data are 
> co-located on an SSD. This would also be the case where an SSD has multiple 
> journals associated with many HDDs.
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple journals and an OSD on one SSD doable?

2015-06-09 Thread koukou73gr
On 06/08/2015 11:54 AM, Jan Schermer wrote:
> 
> This should indicate the real wear:
> 100 Gigabytes_Erased     0x0032   000   000   000   Old_age   Always    -   62936
> Bytes written after compression:
> 233 SandForce_Internal   0x       000   000   000   Old_age   Offline   -   40464
> Written bytes from OS perspective:
> 241 Lifetime_Writes_GiB  0x0032   000   000   000   Old_age   Always    -   53826
> 
> I wonder if it’s “write-mostly” for everyone… :)
> 242 Lifetime_Reads_GiB   0x0032   000   000   000   Old_age   Always    -   13085

LOL...

241 Lifetime_Writes_GiB  -O--CK   000   000   000   -   10782
242 Lifetime_Reads_GiB   -O--CK   000   000   000   -   50

The SSD contains 2x10GB journal partitions for 2x4TB OSDs, plus 1x20GB for the OS.

-K.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] QEMU Venom Vulnerability

2015-05-21 Thread koukou73gr

On 05/21/2015 02:36 PM, Brad Hubbard wrote:


If that's correct then starting from there and building a new RPM 
with RBD support is the proper way of updating. Correct?


I guess there are two ways to approach this.

1. use the existing ceph source rpm here.

http://ceph.com/packages/ceph-extras/rpm/centos6/SRPMS/qemu-kvm-0.12.1.2-2.415.el6.3ceph.src.rpm 




[...]
2. use the latest Centos/Red Hat src rpm and add the ceph patches to 
its source
tree and Patch lines to its spec file as well as the optional 
pkgrelease and

[...]


What I've been doing in my cluster is to pick up the RHEV (version 6 or 7 as
needed) qemu source RPM and use that. It is quite simple really (see the
sketch after the list):

- install & configure rpmbuild. There are numerous howtos for this.
- fetch and install the qemu source rpm from RHEV. Google around and grab the
latest version, e.g. for CentOS 7 (RHEV 7 based) I found them here:
http://springdale.math.ias.edu/data/RedHat/enterprise/7Server/en/RHEV/SRPMS/qemu-kvm-rhev-2.1.2-23.el7_1.3.src.rpm
- rpmbuild -ba the spec file; it will probably complain about missing
development packages. Satisfy them.
- rpmbuild again, wait, grab the resulting binary rpms from the
RPMS/ folder and distribute them as needed.
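
Roughly, the whole thing looks like this (a sketch; the spec file name and
helper packages are illustrative):

yum install rpm-build yum-utils ceph-devel
rpm -ivh qemu-kvm-rhev-2.1.2-23.el7_1.3.src.rpm   # unpacks into ~/rpmbuild
cd ~/rpmbuild/SPECS
yum-builddep qemu-kvm-rhev.spec                   # satisfy the missing -devel packages
rpmbuild -ba qemu-kvm-rhev.spec
ls ~/rpmbuild/RPMS/x86_64/                        # the binary rpms to distribute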


On occasion, you may need to tweak the spec file a bit. For example, the 
above RHEV 7 srpm required librbd1-devel and librados-devel which have 
been consolidated in ceph-devel. Just commenting out the requirement and 
having ceph-devel installed allowed the build to proceed. But these 
fixes are trivial.


Cheers,

-K.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] live migration fails with image on ceph

2015-04-15 Thread koukou73gr

Hello,

Can't really help you with nova, but using plain libvirt-1.1.1
and qemu-1.5.3, live migration of rbd-backed VMs is (almost*) instant on
the client side. We have the rbd write-back cache enabled everywhere and
have no problems at all.


-K.


*There is a 1-2 second hitch at worst; I suppose the switches need
that much time to notice the change.


On 04/16/2015 06:45 AM, Yuming Ma (yumima) wrote:

The issue is reproducible in svl-3 with rbd cache set to false.

On the 5th ping-pong, the instance experienced ping drops and did not
recover for 20+ minutes:



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Force an OSD to try to peer

2015-03-31 Thread koukou73gr

On 03/31/2015 09:23 PM, Sage Weil wrote:


It's nothing specific to peering (or ceph).  The symptom we've seen is
just that byte stop passing across a TCP connection, usually when there is
some largish messages being sent.  The ping/heartbeat messages get through
because they are small and we disable nagle so they never end up in large
frames.


Is there any special route one should take in order to transition a live 
cluster to use jumbo frames and avoid such pitfalls with OSD peering?


-K.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm and cloned rbd image

2015-03-09 Thread koukou73gr

On 03/05/2015 07:19 PM, Josh Durgin wrote:

client.libvirt
 key: 
 caps: [mon] allow r
 caps: [osd] allow class-read object_prefix rbd_children, allow rw
class-read pool=rbd


This includes everything except class-write on the pool you're using.
You'll need that so that a copy_up call (used just for clones) works.
That's what was getting a permissions error. You can use rwx for short.


Josh thanks! That was the problem indeed.

I had removed the class-write capability because I also use this user as the
default for ceph CLI commands. Without class-write this user can't erase
an existing image from the pool, while still being able to create new ones.


I should probably come up with a better scheme if I am to utilize cloned 
images.


Thanks again!

-Kostas

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm and cloned rbd image

2015-03-05 Thread koukou73gr

On 03/05/2015 03:40 AM, Josh Durgin wrote:


It looks like your libvirt rados user doesn't have access to whatever
pool the parent image is in:


librbd::AioRequest: write 0x7f1ec6ad6960 
rbd_data.24413d1b58ba.0186 1523712~4096 should_complete: r 
= -1


-1 is EPERM, for operation not permitted.

Check the libvirt user capabilites shown in ceph auth list - it should
have at least r and class-read access to the pool storing the parent
image. You can update it via the 'ceph auth caps' command.


Josh,

All images (parent, snapshot and clone) reside in the same pool
(libvirt-pool *) and the user (libvirt) seems to have the proper
capabilities. See:


client.libvirt
key: 
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rbd_children, allow rw 
class-read pool=rbd


This same pool contains other (flat) images used to back my production 
VMs. They are all accessed with this same user and there have been no 
problems so far. I just can't seem to be able to use cloned images.


-K.



* In my original email describing the problem I used 'rbd' instead of 
'libvirt-pool' for the pool name for simplicity. As more and more 
configuration items are requested, it makes more sense to use the real 
pool name to avoid causing any misconceptions.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm and cloned rbd image

2015-03-04 Thread koukou73gr

Hi Josh,

Thanks for taking a look at this. I 'm answering your questions inline.

On 03/04/2015 10:01 PM, Josh Durgin wrote:

[...]
And then proceeded to create a qemu-kvm guest with rbd/server as its

backing store. The guest booted but as soon as it got to mount the root
fs, things got weird:


What does the qemu command line look like?


I am using libvirt, so I'll be copy-pasting from the log file:

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 
/usr/libexec/qemu-kvm -name server -S -machine 
rhel6.5.0,accel=kvm,usb=off -cpu 
Penryn,+dca,+pdcm,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme 
-m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 
ee13f9a0-b7eb-93fd-aa8c-18da9e23ba5c -nographic -no-user-config 
-nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/server.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc 
-no-shutdown -boot order=nc,menu=on,strict=on -device 
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device 
virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -drive 
file=rbd:libvirt-pool/server:id=libvirt:key=AQAeDqRTQEknIhAA5Gqfl/CkWIfh+nR01hEgzA==:auth_supported=cephx\;none,if=none,id=drive-scsi0-0-0-0 
-device 
scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 
-netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:73:98:a9,bus=pci.0,addr=0x3 
-chardev pty,id=charserial0 -device 
isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5



[...]
scsi2 : Virtio SCSI HBA
scsi 2:0:0:0: Direct-Access QEMU QEMU HARDDISK1.5. PQ: 0
ANSI: 5
sd 2:0:0:0: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
  sda: sda1 sda2
sd 2:0:0:0: [sda] Attached SCSI disk
dracut: Scanning devices sda2  for LVM logical volumes vg_main/lv_swap
vg_main/lv_root
dracut: inactive '/dev/vg_main/lv_swap' [1.00 GiB] inherit
dracut: inactive '/dev/vg_main/lv_root' [6.50 GiB] inherit
EXT4-fs (dm-1): INFO: recovery required on readonly filesystem


This suggests the disk is being exposed as read-only via QEMU,
perhaps via qemu's snapshot or other options.

You're right, the disk does seem R/O but also corrupt. The disk image 
was cleanly unmounted before creating the snapshot and cloning it.


What is more, if I just flatten the image and start the guest again it 
boots fine and there is no recovery needed on the fs.


There are also a some:

block I/O error in device 'drive-scsi0-0-0-0': Operation not permitted (1)

messages logged in /var/log/libvirt/qemu/server.log


You can use a clone in exactly the same way as any other rbd image.
If you're running QEMU manually, for example, something like:

qemu-kvm -drive file=rbd:rbd/server,format=raw,cache=writeback

is fine for using the clone. QEMU is supposed to be unaware of any
snapshots, parents, etc. at the rbd level.
In a sense, the parameters passed to QEMU from libvirt boil down to your 
suggested command line. I think it should work as well, it is written 
all over the place :)


I'm still a newbie wrt Ceph; maybe I am missing something flat-out
obvious.

Thanks for your time,

-K.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm and cloned rbd image

2015-03-04 Thread koukou73gr

On 03/03/2015 05:53 PM, Jason Dillaman wrote:

Your procedure appears correct to me.  Would you mind re-running your cloned 
image VM with the following ceph.conf properties:

[client]
rbd cache off
debug rbd = 20
log file = /path/writeable/by/qemu.$pid.log

If you recreate the issue, would you mind opening a ticket at 
http://tracker.ceph.com/projects/rbd/issues?

Jason,

Thanks for the reply. Recreating the issue is not a problem, I can 
reproduce it any time.
The log file was getting a bit large, I destroyed the guest after 
letting it thrash for about ~3 minutes, plenty of time to hit the 
problem. I've uploaded it at:


http://paste.scsys.co.uk/468868 (~19MB)

Do you really think this is a bug and not an error on my side?

-K.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] qemu-kvm and cloned rbd image

2015-03-02 Thread koukou73gr


Hello,

Today I thought I'd experiment with snapshots and cloning. So I did:

rbd import --image-format=2 vm-proto.raw rbd/vm-proto
rbd snap create rbd/vm-proto@s1
rbd snap protect rbd/vm-proto@s1
rbd clone rbd/vm-proto@s1 rbd/server

And then proceeded to create a qemu-kvm guest with rbd/server as its
backing store. The guest booted but as soon as it got to mount the root
fs, things got weird:

[...]
scsi2 : Virtio SCSI HBA
scsi 2:0:0:0: Direct-Access QEMU QEMU HARDDISK1.5. PQ: 0 ANSI: 5
sd 2:0:0:0: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA
 sda: sda1 sda2
sd 2:0:0:0: [sda] Attached SCSI disk
dracut: Scanning devices sda2  for LVM logical volumes vg_main/lv_swap 
vg_main/lv_root
dracut: inactive '/dev/vg_main/lv_swap' [1.00 GiB] inherit
dracut: inactive '/dev/vg_main/lv_root' [6.50 GiB] inherit
EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
EXT4-fs (dm-1): write access will be enabled during recovery
sd 2:0:0:0: [sda] abort
sd 2:0:0:0: [sda] abort
sd 2:0:0:0: [sda] abort
sd 2:0:0:0: [sda] abort
sd 2:0:0:0: [sda] abort
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 b0 e0 d8 00 00 08 00
Buffer I/O error on device dm-1, logical block 1058331
lost page write due to I/O error on dm-1
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 6f ba c8 00 00 08 00
[ ... snip ... snip ... more or less the same messages ]
end_request: I/O error, dev sda, sector 3129880
end_request: I/O error, dev sda, sector 11518432
end_request: I/O error, dev sda, sector 3194664
end_request: I/O error, dev sda, sector 3129824
end_request: I/O error, dev sda, sector 3194376
end_request: I/O error, dev sda, sector 11579664
end_request: I/O error, dev sda, sector 3129448
end_request: I/O error, dev sda, sector 3197856
end_request: I/O error, dev sda, sector 3129400
end_request: I/O error, dev sda, sector 7385360
end_request: I/O error, dev sda, sector 11515912
end_request: I/O error, dev sda, sector 11514112
__ratelimit: 12 callbacks suppressed
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 af b0 80 00 00 10 00
__ratelimit: 12 callbacks suppressed
__ratelimit: 13 callbacks suppressed
Buffer I/O error on device dm-1, logical block 1048592
lost page write due to I/O error on dm-1
Buffer I/O error on device dm-1, logical block 1048593
lost page write due to I/O error on dm-1
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 2f bf 00 00 00 08 00
Buffer I/O error on device dm-1, logical block 480
lost page write due to I/O error on dm-1
[... snip... more of the same ... ]
Buffer I/O error on device dm-1, logical block 475
lost page write due to I/O error on dm-1
Buffer I/O error on device dm-1, logical block 476
lost page write due to I/O error on dm-1
Buffer I/O error on device dm-1, logical block 477
lost page write due to I/O error on dm-1
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 2f be 30 00 00 10 00
Buffer I/O error on device dm-1, logical block 454
lost page write due to I/O error on dm-1
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 2f be 10 00 00 18 00
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 2f be 08 00 00 08 00
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 2f bd 88 00 00 08 00
__ratelimit: 5 callbacks suppressed
Buffer I/O error on device dm-1, logical block 433
lost page write due to I/O error on dm-1
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00